Paraphrase any question with T5 (Text-To-Text Transfer Transformer) — Pretrained model and…

Input

The input to our program will be any general question that you can think of –

Which course should I take to get started in data science?

Output

The output will be paraphrased versions of the same question. Paraphrasing a question means creating a new question that expresses the same meaning with a different choice of words.

Paraphrased Questions generated from our T5 Model ::
0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?

As you can see, we generated 10 questions that are paraphrases of the original question: “Which course should I take to get started in data science?”

Today we will see how to train a T5 model from Hugging Face’s transformers library to generate these paraphrased questions. We will also see how to use the pre-trained model that I provide to generate them.

Practical use case

Imagine a middle school teacher preparing a quiz for the class. Instead of giving the same fixed question to every student, the teacher can generate multiple variants of a question and distribute them across students. The school can also use this technique to augment its question bank with several variants of each question.

Let’s get started —

Dataset

I took the Quora Question Pairs dataset, kept only the question pairs marked as duplicates, and split them into training and validation sets. Two questions marked as duplicates of each other are, in effect, a paraphrase pair, which is exactly what we need.

We will discuss in detail how you can –

  1. Use my pre-trained model to generate paraphrased questions for any given question.
  2. Use my training code and dataset to replicate the results on your own GPU machine.

Training Algorithm — T5

T5 is a new transformer model from Google that is trained end-to-end with text as input and modified text as output. You can read more about it here.

At the time of its release, it achieved state-of-the-art results on multiple NLP tasks such as summarization, question answering, and machine translation, using a text-to-text transformer trained on a large text corpus.

I trained T5 with the original sentence as input and the paraphrased sentence (the duplicate question from Quora Question Pairs) as output.

Code

All the code for using the pre-trained model and for training the model on the given data is available at –

Using Pre-trained model

The Jupyter notebook t5-pretrained-question-paraphraser contains the code presented below.

First, install the necessary libraries –

!pip install torch==1.4.0
!pip install transformers==2.9.0
!pip install pytorch_lightning==0.7.5

Download the pre-trained model from S3 and unzip it in the current folder.
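The download cell is not reproduced here, so the following is only a minimal sketch of this step using the Python standard library. The function name download_and_unzip, the local file name model.zip, and the url argument are all hypothetical; pass in whatever archive link the repo provides.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, dest_dir="."):
    """Fetch a zip archive from `url` and extract it into `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "model.zip"  # hypothetical local file name

    # Stream the archive to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)

    # Unpack everything next to the archive.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest
```

Calling download_and_unzip(url) with the default dest_dir extracts the archive into the current folder, which is where the inference step expects to find the model.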

Run inference with any question as input and see the paraphrased results.
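The notebook cell itself is not reproduced here, so here is a rough sketch of what inference could look like with the pinned transformers 2.9.0 API. Several things are assumptions rather than the notebook's actual values: the folder name t5_paraphrase, the t5-base tokenizer, and the sampling settings (top_k, top_p, max_length). Only the "paraphrase: ... </s>" input format comes from the training setup described in this article.

```python
def build_input(question):
    """Wrap a question in the training-time format: 'paraphrase: ... </s>'."""
    return "paraphrase: " + question + " </s>"


def unique_paraphrases(candidates, original):
    """Drop repeats and any candidate identical to the original question."""
    kept = []
    for text in candidates:
        if text.lower() != original.lower() and text not in kept:
            kept.append(text)
    return kept


def paraphrase(question, model_dir="t5_paraphrase", num_return_sequences=10):
    """Sample paraphrases of `question` from a fine-tuned T5 checkpoint."""
    # Heavy imports live here so the helpers above work without torch installed.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("device", device)

    # model_dir is a placeholder for wherever the downloaded model was unzipped.
    model = T5ForConditionalGeneration.from_pretrained(model_dir).to(device)
    tokenizer = T5Tokenizer.from_pretrained("t5-base")  # assumed base checkpoint

    encoding = tokenizer.encode_plus(build_input(question), return_tensors="pt")
    outputs = model.generate(
        input_ids=encoding["input_ids"].to(device),
        do_sample=True,   # sample instead of greedy decoding, for variety
        top_k=120,        # assumed sampling settings, not tuned
        top_p=0.98,
        max_length=256,
        num_return_sequences=num_return_sequences,
    )
    decoded = [
        tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for ids in outputs
    ]
    return unique_paraphrases(decoded, question)
```

Calling paraphrase("Which course should I take to get started in data science?") returns a list of strings. Because the candidates are sampled, the exact paraphrases change from run to run, and fewer than 10 may remain after duplicates are dropped.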

Running the notebook on the example question prints –

device cpu

Original Question ::
Which course should I take to get started in data science?


Paraphrased Questions ::
0: What should I learn to become a data scientist?
1: How do I get started with data science?
2: How would you start a data science career?
3: How can I start learning data science?
4: How do you get started in data science?
5: What's the best course for data science?
6: Which course should I start with for data science?
7: What courses should I follow to get started in data science?
8: What degree should be taken by a data scientist?
9: Which course should I follow to become a Data Scientist?

Training your own model

Again, all the training code and the dataset used for training are available in the GitHub repo mentioned earlier. We will go through the steps that I used to train the model.

1. Data Preparation

First, I downloaded the Quora Question Pairs TSV file (quora_duplicate_questions.tsv) as mentioned in this link.

I extracted only the rows that have is_duplicate = 1, since these are the paraphrased question pairs. I then split the data into train and validation sets and stored them in separate CSV files.

In the end, each CSV file has two columns, “question1” and “question2”, where “question2” is a paraphrased version of “question1”. Since T5 expects text as input, I gave “question1” as the input source and asked it to generate “question2” as the target output.

The code used to generate the train and validation CSV files is shown below. The CSV files are available under the paraphrase_data folder in the GitHub repo.

import pandas as pd
from sklearn.model_selection import train_test_split

filename = "quora_duplicate_questions.tsv"

# Load the tab-separated Quora Question Pairs file.
question_pairs = pd.read_csv(filename, sep='\t')

# Keep only the duplicate (paraphrase) pairs and the two question columns.
# Selecting columns avoids the in-place drop on a filtered slice.
duplicates = question_pairs[question_pairs['is_duplicate'] == 1]
paraphrase_pairs = duplicates[['question1', 'question2']]

# Hold out 10% of the pairs for validation.
train, val = train_test_split(paraphrase_pairs, test_size=0.1)
train.to_csv('Quora_Paraphrasing_train.csv', index=False)
val.to_csv('Quora_Paraphrasing_val.csv', index=False)

2. Training

Thanks to Suraj Patil for the amazing Colab notebook on training T5 for any text-to-text task. I borrowed most of the training code from the Colab notebook, changing only the dataset class and training parameters. I adapted the dataset class to our Quora Question Pair dataset.

The training code is available as train.py in the GitHub repo.

All you need to do is clone the repo on any GPU machine, install the packages in requirements.txt, and run train.py to train the T5 model.

Training this model for 2 epochs (the default) took about 20 hours on a p2.xlarge instance (AWS EC2).

The dataset class looks like below —
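The original dataset class is not reproduced here. As a rough sketch of the idea, assuming the CSV layout from the data preparation step (columns question1 and question2), a map-style dataset could look like the following. The class name ParaphraseDataset and the encode parameter are hypothetical; encode stands in for whatever tokenization call train.py actually uses.

```python
import csv


class ParaphraseDataset:
    """Map-style dataset over a CSV with question1/question2 columns.

    Any object with __len__ and __getitem__ can be passed to
    torch.utils.data.DataLoader, so no torch import is needed here.
    """

    def __init__(self, csv_path, encode, prefix="paraphrase: ", eos=" </s>"):
        with open(csv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))
        self.encode = encode  # hypothetical hook: text -> model-ready tensors
        self.prefix = prefix
        self.eos = eos

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        # Source: task prefix + question1 + end-of-sequence token.
        source = self.prefix + row["question1"] + self.eos
        # Target: question2 + end-of-sequence token.
        target = row["question2"] + self.eos
        return {"source": self.encode(source), "target": self.encode(target)}
```

With transformers 2.9.0, encode could be something like lambda text: tokenizer.encode_plus(text, max_length=256, pad_to_max_length=True, return_tensors="pt"), where the maximum length of 256 is an assumption. Keeping tokenization behind a single hook makes it easy to check the raw source and target strings before any training run.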

The key is how we give our input and output to the T5 model trainer. For any given question pair from the dataset, I gave input (source) and output (target) to the T5 model as shown below –

Input format to T5 for training

paraphrase: What are the ingredients required to make a perfect cake? </s>

Output format to T5 for training

How do you bake a delicious cake? </s>

That’s it! You have a state-of-the-art question paraphraser in your hands.

Perhaps this is the first work of its kind out there to generate paraphrased questions from any given question!

Happy coding!
