This article explains how to develop and train a model to classify product reviews as good or bad.
In today’s world, text data is generated in many different ways in every sector. Industries use this text data to gauge public opinion and improve their overall performance and profit. Think of a company like Amazon that wants to know what its customers think from the comments on its website, emails, Facebook, Twitter, Instagram, or blogs. Would it be efficient to hire people to keep reading all those reviews and feedback? Natural Language Processing (NLP) is a technology that does this job very efficiently. This article explains the process of performing binary classification on a product review dataset.
The focus of this article is to explain the process of text classification using TensorFlow and Python. The project uses a product review dataset that contains reviews of Amazon baby products. It has three columns: the name column is the name of the product, the review column is the text of the review, and the rating column is the rating on a scale of 1 to 5. Our goal is to develop a classifier that takes a review as input and outputs whether the review is good or bad. The model will be trained on our existing review data so that whenever we have new reviews, we can just input them and the model will output whether each review is good or bad.
Before diving into the model, I should mention that a Google Colab notebook is used for this project. Any other notebook should work just as well, but TensorFlow needs to be installed. I used TensorFlow 2.0. Here is the step-by-step guide to developing the model:
1. Import the necessary packages.
import tensorflow as tf
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
2. Please check the version of your TensorFlow. If the version is lower than 2.0, upgrade it before continuing. Otherwise, this model will not work.
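As a rough sketch, the version check can be written in plain Python. The helper name `needs_upgrade` is hypothetical, and in the notebook it would be called with `tf.__version__`. The exact upgrade command used for this project is not shown; `pip install --upgrade tensorflow` is the standard pip invocation and is only assumed here.

```python
import re


def needs_upgrade(version, minimum=(2, 0)):
    """Return True if a version string such as '1.15.0' is older than `minimum`."""
    # Only the leading major.minor numbers are compared.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < minimum


# In the notebook this would be needs_upgrade(tf.__version__).
print(needs_upgrade("1.15.0"))  # True
print(needs_upgrade("2.0.0"))   # False
```

If the check returns True, running `!pip install --upgrade tensorflow` in a notebook cell and restarting the runtime is one way to get a 2.x release.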
3. Import the dataset and take a careful look at the features and output column. As I mentioned before, the review column is the input feature and the rating column is the output column.
products = pd.read_csv('amazon_baby.csv')
products.head()
Here are the first five rows of the DataFrame:
4. The rating column has ratings ranging from 1 to 5. I considered a review bad if the rating is 1 or 2. Otherwise, the review is good. Let’s add a new column, ‘sentiment’, to the products DataFrame that contains 0 for a bad review and 1 for a good review.
products['sentiment'] = products.rating.apply(lambda x: 0 if x in [1, 2] else 1)
products.head()
The products DataFrame now has a fourth column, ‘sentiment’, that holds 0 or 1 for each review.
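To make the labeling rule concrete, here is the same rule as a small standalone function applied to a handful of made-up ratings (they are not taken from the dataset). The names `rating_to_sentiment` and `sample_ratings` are only for illustration.

```python
from collections import Counter


def rating_to_sentiment(rating):
    """Map a 1-5 rating to 0 (bad) for a rating of 1 or 2, and to 1 (good) otherwise."""
    return 0 if rating in (1, 2) else 1


sample_ratings = [5, 1, 3, 2, 4, 5]  # made-up ratings for illustration
sample_labels = [rating_to_sentiment(r) for r in sample_ratings]
label_counts = Counter(sample_labels)

print(sample_labels)  # [1, 0, 1, 0, 1, 1]
print(label_counts)   # Counter({1: 4, 0: 2})
```

Note that a rating of 3 counts as good under this rule. On the real DataFrame, `products['sentiment'].value_counts()` gives the same kind of count and shows how balanced the two classes are.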
5. Split the dataset in a 75/25 proportion: 75% of the dataset for training the model and 25% for testing it. Keeping part of the dataset for testing is important so that the model can be checked against data it did not see during training but whose labels are known. That way, we can check the performance of the model.
split = round(len(products)*0.75)
train_reviews = products['review'][:split]
train_label = products['sentiment'][:split]
test_reviews = products['review'][split:]
test_label = products['sentiment'][split:]
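The slice above keeps the rows in file order. If the file happens to be sorted in some way (by product, for example), shuffling before splitting may give a more representative test set. The sketch below is a variant of the split, not what this project does, and `shuffled_split` and the toy rows are hypothetical.

```python
import random


def shuffled_split(rows, train_fraction=0.75, seed=42):
    """Shuffle (review, label) rows with a fixed seed, then split them into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    split = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:split]]
    test = [rows[i] for i in indices[split:]]
    return train, test


# Made-up rows: each review travels together with its label.
toy_rows = [(f"review {i}", i % 2) for i in range(8)]
train_part, test_part = shuffled_split(toy_rows)
print(len(train_part), len(test_part))  # 6 2
```

Keeping each review and its label in one row before shuffling ensures they stay paired after the split.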
6. Make lists of the training sentences, training labels, testing sentences, and testing labels, and convert the labels to NumPy arrays. Convert each sentence to a string to be safe, because if any of the review data is not stored as a string, the later steps will not work.
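training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []

# Convert every review to a string so that non-string entries do not break tokenization
for row in train_reviews:
    training_sentences.append(str(row))
for row in train_label:
    training_labels.append(row)
for row in test_reviews:
    testing_sentences.append(str(row))
for row in test_label:
    testing_labels.append(row)

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)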
7. Define the essential parameters for the model.
vocab_size = 20000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = '<OOV>'
padding_type = 'post'
‘vocab_size’ is the maximum number of words the tokenizer keeps, chosen by how frequently they occur in the input text.
‘embedding_dim’ is the length of the vector that represents each word. Here, each word gets 16 coefficients that are initialized randomly and learned during training. You can experiment with other values.
‘max_length’ is the length of every input sequence passed to the embedding layer, so each review is turned into an input of the same size.
‘trunc_type’ controls what happens when an input text is longer than ‘max_length’: it is truncated automatically. I am setting ‘trunc_type’ to ‘post’, which means it will be truncated at the end.
‘padding_type’ controls what happens when an input text is shorter than ‘max_length’: zeros are added to make the input length equal to ‘max_length’. I am setting ‘padding_type’ to ‘post’ so that the padding is added at the end and not at the beginning of the sentences.
The tokenizer builds its vocabulary from the words in the training dataset. But in the test dataset there might be some new words that it has not seen yet. Those words are out-of-vocabulary words and will be replaced with ‘<OOV>’.
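To see what ‘post’ padding and truncation do to a sequence of word numbers, here is a small pure-Python stand-in. It only mimics the behavior described above and is not the Keras implementation; the function name `pad_or_truncate` is made up for this sketch.

```python
def pad_or_truncate(sequence, max_length, padding="post", truncating="post"):
    """Return a new list with exactly `max_length` items, padded with zeros if needed."""
    if len(sequence) > max_length:
        if truncating == "post":
            # Drop the extra tokens from the end.
            sequence = sequence[:max_length]
        else:
            # Drop the extra tokens from the start.
            sequence = sequence[len(sequence) - max_length:]
    zeros = [0] * (max_length - len(sequence))
    # 'post' appends the zeros, 'pre' puts them in front.
    return sequence + zeros if padding == "post" else zeros + sequence


print(pad_or_truncate([4, 9, 2], 5))                 # [4, 9, 2, 0, 0]
print(pad_or_truncate([4, 9, 2, 7, 1, 8], 5))        # [4, 9, 2, 7, 1]
print(pad_or_truncate([4, 9, 2], 5, padding="pre"))  # [0, 0, 4, 9, 2]
```

With ‘max_length’ set to 120, a 30-word review ends up as 30 word numbers followed by 90 zeros, and a 300-word review keeps only its first 120 words.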
8. Now, tokenize the sentences using the Tokenizer. The Tokenizer assigns a number to each word, ‘word_index’ holds that word-to-number mapping, and ‘texts_to_sequences’ converts each sentence into a sequence of those numbers instead of words. If these are new to you, I suggest printing the result after every step to see what each of them does.
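# Build the vocabulary from the training sentences only
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# Turn sentences into number sequences, then pad or truncate them to max_length
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# Apply the same tokenizer and the same padding settings to the test sentences
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)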
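For intuition, here is a simplified pure-Python stand-in for what the tokenizer does. It is a sketch and not the Keras implementation: it only lowercases and splits on spaces, while the real Tokenizer also strips punctuation. The helper names `build_word_index` and `texts_to_ids` and the toy sentences are made up.

```python
from collections import Counter


def build_word_index(sentences, oov_token="<OOV>"):
    """Number words by frequency; index 1 is reserved for the out-of-vocabulary token."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    # The most frequent word gets 2, the next one 3, and so on.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_ids(sentences, word_index, num_words=None, oov_token="<OOV>"):
    """Replace each word with its number; unknown or too-rare words become the OOV number."""
    oov = word_index[oov_token]
    result = []
    for s in sentences:
        ids = []
        for word in s.lower().split():
            idx = word_index.get(word, oov)
            if num_words is not None and idx >= num_words:
                idx = oov  # outside the kept vocabulary
            ids.append(idx)
        result.append(ids)
    return result


toy_train = ["great stroller great price", "bad stroller"]
toy_index = build_word_index(toy_train)
print(toy_index)
# {'<OOV>': 1, 'great': 2, 'stroller': 3, 'price': 4, 'bad': 5}
print(texts_to_ids(["great crib"], toy_index))  # [[2, 1]]
```

The word "crib" never appeared in the toy training sentences, so it is mapped to the ‘<OOV>’ number, which is what happens to unseen words in the test reviews.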
9. All the preprocessing is complete. Finally, put together the model. I used four layers. The first is the embedding layer, which provides a dense representation of words and their relative meaning. The second is the GlobalAveragePooling1D layer, which averages the word vectors of a review into a single vector. The third and fourth layers are dense layers. You can experiment with more Dense layers. Please take a look at the model summary. It gives a clear idea of what each layer does.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Here is the output of the model summary:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 120, 16)           320000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0
_________________________________________________________________
dense (Dense)                (None, 6)                 102
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7
=================================================================
Total params: 320,109
Trainable params: 320,109
Non-trainable params: 0
The model summary above shows how each layer shrinks the shape of the output until the last layer outputs a single value between 0 and 1, where values close to 0 mean a bad review and values close to 1 mean a good review.
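The parameter counts in the summary can be reproduced with a little arithmetic. The sketch below recomputes them from the values defined in step 7 and the layer sizes in the model; the variable names are only for illustration.

```python
vocab_size = 20000
embedding_dim = 16
hidden_units = 6  # size of the first Dense layer

# Embedding: one vector of 16 values for each of the 20000 word numbers
embedding_params = vocab_size * embedding_dim
# GlobalAveragePooling1D only averages, so it has nothing to learn
pooling_params = 0
# Dense layers: one weight per input-output pair plus one bias per output
dense_params = embedding_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + pooling_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
# 320000 102 7 320109
```

Almost all of the parameters sit in the embedding layer, so changing ‘vocab_size’ or ‘embedding_dim’ has by far the biggest effect on model size.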
10. Now fit the model to the data. This should output the training and validation accuracy. I ran this algorithm for twenty epochs. For a more complicated dataset, more epochs may be required.
num_epochs = 20
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
Here is the output of the first three epochs:
Train on 7269 samples, validate on 2423 samples
Epoch 1/20
7269/7269 [==============================] - 3s 411us/sample - loss: 0.5665 - acc: 0.8002 - val_loss: 0.4658 - val_acc: 0.8205
Epoch 2/20
7269/7269 [==============================] - 2s 259us/sample - loss: 0.4775 - acc: 0.8056 - val_loss: 0.4501 - val_acc: 0.8205
Epoch 3/20
7269/7269 [==============================] - 2s 265us/sample - loss: 0.4638 - acc: 0.8056 - val_loss: 0.4392 - val_acc: 0.8205
This output shows the training loss, training accuracy, validation loss, and validation accuracy. In a perfect model, training and validation accuracy should go up and training and validation loss should go down with each epoch. For this dataset, training accuracy was already 80% after the first epoch, but most of the time that does not happen. Training accuracy may start as low as 20%. If the model works well, accuracy should keep climbing with each epoch up to a satisfactory level.
In the end, the training set accuracy was 97.96% and the validation set accuracy was 87.21%. There is a little overfitting, which is normal in an NLP project, because no matter how big a dataset you use to train the model, there will always be new words in the testing set.
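One way to put these accuracy figures in context is to compare them with the accuracy of always predicting the most common label. The sketch below uses made-up labels, and `majority_baseline` is a hypothetical helper, not part of this project’s code.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, count = counts.most_common(1)[0]
    return most_common_label, count / len(labels)


toy_labels = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1]  # made-up labels: 8 good, 2 bad
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_label, baseline_accuracy)  # 1 0.8
```

Calling it on `testing_labels` from step 6 would give the baseline for this dataset. A validation accuracy that stays at exactly the same value for several epochs may be a sign that the model is still predicting a single class at that point, so it is worth comparing the two numbers.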
11. It is a good idea to plot the training vs validation accuracy and training vs validation loss to observe the trend.
# The log above reports 'acc'; newer TensorFlow releases name these keys 'accuracy' and 'val_accuracy'
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))

# Accuracy plot
plt.plot(epochs, acc, 'r', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation Accuracy')
plt.title('Training and validation accuracy')
plt.legend()

# Loss plot in a new figure
plt.figure()
plt.plot(epochs, loss, 'r', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
This is what the plots look like:
Here, the red line shows the training data and the blue line shows the validation data. As expected, losses go down with each epoch. But after a point, the validation loss starts going up. If you want, you can run it again with fewer epochs to stop before the validation loss starts rising.
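A simple way to pick that smaller number of epochs is to look for the epoch with the lowest validation loss. The sketch below uses made-up loss values, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch number with the lowest validation loss."""
    lowest = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return lowest + 1


toy_val_loss = [0.47, 0.44, 0.40, 0.38, 0.39, 0.42]  # made-up values
print(best_epoch(toy_val_loss))  # 4
```

With a real run, `best_epoch(history.history['val_loss'])` would suggest how many epochs to train for. Keras also offers the `tf.keras.callbacks.EarlyStopping` callback, which can stop training once the monitored value stops improving.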
Please look at this GitHub page to see the complete code and outputs.