BERT: Sentiment Analysis of App Store Review

Nov 26 · 8 min read

Using a state-of-the-art model to analyze user sentiment in app store reviews

Photo by William Hook on Unsplash

This article walks you through the steps needed to perform sentiment analysis on reviews posted by public users at an app store. In this tutorial, I will be using the BERT-Base, Chinese model to test how BERT performs when applied to a language other than English. The steps for sentiment analysis stay the same regardless of which model you use. If you are unsure which model to pick, check out the information on the pre-trained models provided by the BERT team. If you are new to BERT, kindly check out my previous tutorial on Multi-Classifications Task using BERT. There are 5 sections in this tutorial:

  1. Dataset Preparation
  2. Training
  3. Prediction
  4. Results
  5. Conclusion

1. Dataset Preparation

I will be using the reviews from Taptap, a games app store that caters to the Chinese market. Feel free to use your own dataset. You can even test it out on reviews from the Google Play Store or the Apple App Store. If that is the case, make sure that you are using the English model for BERT. Let’s have a look at the details that we can obtain from a review in Taptap.

Each review contains quite a lot of useful data:

  • Comment posted by the user
  • Rating from 1 to 5
  • Count for thumbs up
  • Count for thumbs down
  • Replies from the other users to this comment

We can use the available data to label the sentiments. Labeling by hand is strongly recommended if you have the time. In this case, I am going to use the rating to determine the label.

  1. Negative: 1–3 stars
  2. Neutral: 4 stars
  3. Positive: 5 stars
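The rating-to-label rule above can be sketched as a small function. This is a minimal sketch, not the exact code used for this article: the function name `rating_to_label` is hypothetical, and the class ids 0, 1 and 2 for Negative, Neutral and Positive are an assumed ordering chosen to match the three label strings used later in training.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a sentiment class id.

    Assumed ordering: 0 = Negative, 1 = Neutral, 2 = Positive.
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {rating!r}")
    if rating <= 3:
        return 0  # Negative: 1-3 stars
    if rating == 4:
        return 1  # Neutral: 4 stars
    return 2      # Positive: 5 stars


ratings = [1, 3, 4, 5]  # made-up ratings for illustration
labels = [rating_to_label(r) for r in ratings]
print(labels)  # [0, 0, 1, 2]
```

Once written to a TSV file, these ids appear as the text "0", "1" and "2".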

I managed to collect quite a lot of data from users’ reviews of several games. I have loaded the data into three dataframes:

  1. Train dataset
  2. Evaluation dataset
  3. Test dataset

Let’s have a look at the content of the train and evaluation data. Both share the same structure:

  • Guid: Id for the comment.
  • Label: Sentiment of the comment. Labeling is based on the rating from the user.
  • Alpha: Throwaway column. I just filled it with the letter a.
  • Text: The actual comment from the user.

If you have trouble creating the dataframe above, feel free to use the following code (modify it accordingly):

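import pandas as pd

# id_list, label_list and text_list are Python lists of equal length
df_bert = pd.DataFrame({'guid': id_list,
    'label': label_list,
    'alpha': ['a'] * len(id_list),  # throwaway column, one 'a' per row
    'text': text_list})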

The test data is slightly different, as it should contain only the guid and text columns.

Once you are done, save each dataframe as a tsv file using the following code (modify the names of the dataframes accordingly):

df_bert_train.to_csv('data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('data/dev.tsv', sep='\t', index=False, header=False)
df_bert_test.to_csv('data/test.tsv', sep='\t', index=False, header=True)

Note that the files are stored in the data folder. Feel free to change the folder based on your use case, but the files must be named as follows:

  • train.tsv
  • dev.tsv
  • test.tsv

In addition, test.tsv must have a header row, unlike train.tsv and dev.tsv. This is why header is set to True for test.tsv only.
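Before training, it can help to confirm that the three files have the layout described above. The following is a minimal sketch using only the standard library; the helper name `check_tsv` and the constants `TRAIN_COLUMNS` and `TEST_COLUMNS` are hypothetical names introduced here for illustration.

```python
import csv

TRAIN_COLUMNS = ["guid", "label", "alpha", "text"]  # train.tsv and dev.tsv
TEST_COLUMNS = ["guid", "text"]                     # test.tsv


def check_tsv(lines, expected_columns, has_header):
    """Validate tab-separated lines and return the number of data rows.

    `lines` is any iterable of strings, for example an open file.
    Raises ValueError if the header or a row does not match the layout.
    """
    rows = list(csv.reader(lines, delimiter="\t"))
    if not rows:
        raise ValueError("no rows found")
    if has_header:
        if rows[0] != expected_columns:
            raise ValueError(f"unexpected header: {rows[0]}")
        rows = rows[1:]
    for number, row in enumerate(rows, start=1):
        if len(row) != len(expected_columns):
            raise ValueError(f"data row {number} has {len(row)} columns")
    return len(rows)


# Usage with the files from this tutorial (paths as used above):
# with open("data/train.tsv", newline="", encoding="utf-8") as f:
#     print(check_tsv(f, TRAIN_COLUMNS, has_header=False))
# with open("data/test.tsv", newline="", encoding="utf-8") as f:
#     print(check_tsv(f, TEST_COLUMNS, has_header=True))
```

A check like this catches a missing header in test.tsv or a stray header in train.tsv before a long training run starts.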

Let’s move on to the next section once you are done with the data preparation.

2. Training

We will now train and fine-tune the model. Make sure that you have cloned the repository from the official site. You should also have the following files and folders located somewhere inside the repository:

  • Data directory: The directory where you stored train.tsv, dev.tsv and test.tsv.
  • Vocab file: The vocab.txt file. It comes with the model that you have downloaded. I created a new model folder and put the file inside it.
  • Config file: The bert_config.json file. It is also included with the model. Likewise, I put it inside the model folder.
  • Initial model: The model to be used for training. You can use the pre-trained base model or resume from an existing model that you have fine-tuned. I am storing it inside the model folder.
  • Output directory: The folder to which the model will be written. You can simply create an empty folder for it.

The next part is to determine the following variables:

  • Max sequence length: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and shorter sequences will be padded. Default is 128 but I will be using 256 in this tutorial.
  • Train batch size: Total batch size for training. Default is 32. I will be using 8 in this tutorial since I am training on just one GeForce RTX 2080.
  • Learning rate: Initial learning rate for Adam. Default is 5e-5. I have set the value to 2e-5.
  • Num train epochs: Total number of training epochs to perform. I will just use the default value of 3.0.
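To get a feel for how batch size and epochs translate into training time, the number of optimizer steps can be estimated up front. The sketch below is written under assumptions: it mirrors how I understand the training script to derive its step count (examples divided by batch size, times epochs) and assumes a warmup proportion of 0.1. The function name `training_schedule` and the dataset size of 10,000 are hypothetical and are not figures from this article.

```python
def training_schedule(num_examples, train_batch_size=32,
                      num_train_epochs=3.0, warmup_proportion=0.1):
    """Estimate total training steps and learning-rate warmup steps."""
    num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Hypothetical dataset of 10,000 training examples with the settings above.
steps, warmup = training_schedule(10_000, train_batch_size=8, num_train_epochs=3.0)
print(steps, warmup)  # 3750 375
```

Halving the batch size doubles the number of steps, which is worth keeping in mind when a smaller batch is chosen to fit GPU memory.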

If you are unsure about which GPU to use, kindly run the following command to find it out:


We need to modify the training script since we have three classes for this use case. Open the Python file and search for the get_labels() function inside the ColaProcessor(DataProcessor) class. Change it to the following and save it:

def get_labels(self):
    """See base class."""
    return ["0", "1", "2"]

Once you are done, activate the virtual environment and change the directory to the root of the repository. Type the following command in the terminal.

CUDA_VISIBLE_DEVICES=0 python --task_name=cola --do_train=true --do_eval=true --data_dir=./data/ --vocab_file=./model/vocab.txt --bert_config_file=./model/bert_config.json --init_checkpoint=./model/bert_model.ckpt --max_seq_length=256 --train_batch_size=8 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=./output/ --do_lower_case=False

Run it. Training may take quite some time, depending on the size of the dataset that you use. The terminal will report once the training has been completed.

Let’s move on to the next step.

3. Prediction

A model will be generated in the output folder. Look for the checkpoint with the highest step number to identify the latest model. If you are unsure which model is the latest, open the checkpoint file to find out. In my case, the last step for the model is 37125.
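Finding the highest step number can also be done programmatically. This is a minimal sketch that assumes the checkpoint files in the output folder are named like model.ckpt-37125.index, which matches the model.ckpt-37125 prefix used in this tutorial; the function name `latest_checkpoint_step` is hypothetical, and the step 36000 in the example is made up.

```python
import re

# Matches names such as "model.ckpt-37125.index" and captures the step number.
CHECKPOINT_PATTERN = re.compile(r"^model\.ckpt-(\d+)")


def latest_checkpoint_step(filenames):
    """Return the highest step number found among checkpoint file names."""
    steps = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        raise ValueError("no checkpoint files found")
    return max(steps)


# Example listing; in practice pass os.listdir("output") instead.
files = ["checkpoint", "model.ckpt-36000.index",
         "model.ckpt-37125.index", "model.ckpt-37125.meta"]
print(latest_checkpoint_step(files))  # 37125
```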

In the same terminal, run the following code (make sure that the max sequence length is the same as what you have used during the training):

CUDA_VISIBLE_DEVICES=0 python --task_name=cola --do_predict=true --data_dir=./data/ --vocab_file=./model/vocab.txt --bert_config_file=./model/bert_config.json --init_checkpoint=./output/model.ckpt-37125 --max_seq_length=256 --output_dir=./output/

The code will generate a test_results.tsv file in the output folder.

Each column holds the probability, or confidence level, for one class. The column with the highest value is the class predicted by the model.
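To make the column-to-class relationship concrete, here is a minimal sketch that turns one line of tab-separated probabilities into a predicted class. The names `LABEL_NAMES`, `parse_row` and `predict_class` are hypothetical, the class order Negative, Neutral, Positive is an assumption about how the labels were assigned, and the probabilities are made up.

```python
LABEL_NAMES = ["Negative", "Neutral", "Positive"]  # assumed class order


def parse_row(line):
    """Split one tab-separated line into a list of float probabilities."""
    return [float(value) for value in line.strip().split("\t")]


def predict_class(probabilities):
    """Return (class index, class name) for the highest probability."""
    if len(probabilities) != len(LABEL_NAMES):
        raise ValueError(f"expected {len(LABEL_NAMES)} probabilities")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, LABEL_NAMES[best]


line = "0.05\t0.15\t0.80\n"  # made-up row for illustration
print(predict_class(parse_row(line)))  # (2, 'Positive')
```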

4. Results

It is time to analyze the results. The first task is to load test_results.tsv and convert it into a dataframe of predicted labels based on the highest predicted probability. Read the file with the following code:

df_result = pd.read_csv('output/test_results.tsv', sep='\t', header=None)

You should have a dataframe for the test data with three columns (I named it df_test_with_label):

  • guid
  • label
  • text

Create a new dataframe and map the result using idxmax.

df_predict = pd.DataFrame({'guid': df_test_with_label['guid'],
    'label': df_result.idxmax(axis=1)})  # column index with the highest probability

Once you are done, let’s import the following metrics function from sklearn to calculate the performance of our model.

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix


You can calculate the accuracy of the model as follows.

accuracy_score(df_test_with_label['label'], df_predict['label'])

I got 0.7033952594490711 as the result.


According to the sklearn documentation, recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. Intuitively, recall is the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0. For a multi-class problem, the function also needs the average parameter, which I am setting to macro.

recall_score(df_test_with_label['label'], df_predict['label'], average='macro')

Running the code resulted in 0.6312777479889565 as the output.


Precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Intuitively, precision is the ability of the classifier not to label a negative sample as positive. The best value is 1 and the worst value is 0. Likewise, the average parameter is set to macro.

precision_score(df_test_with_label['label'], df_predict['label'], average='macro')

The precision is a little lower than recall at just 0.6303571005505256.
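To see what the macro average does, the two scores can be computed by hand: precision and recall are calculated per class, treating that class as "positive", and the per-class values are then averaged with equal weight. The sketch below uses made-up labels, and the function name `macro_precision_recall` is hypothetical; it illustrates the definition and is not a replacement for the sklearn functions.

```python
def macro_precision_recall(y_true, y_pred, labels):
    """Compute macro-averaged precision and recall from two label lists."""
    precisions, recalls = [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # A class with no predictions (or no samples) contributes 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Made-up labels for illustration: two samples per class.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 2]
precision, recall = macro_precision_recall(y_true, y_pred, labels=[0, 1, 2])
print(round(precision, 4), round(recall, 4))  # 0.7222 0.6667
```

Because every class counts equally, a weak class pulls the macro score down even when it has few samples.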

Confusion Matrix

Having just recall and precision is not good enough as we don’t know which class has the best prediction and which class got the worst results. We can use the confusion matrix method to provide us with more insight on this.

confusion_matrix(df_test_with_label['label'], df_predict['label'])

In my output, the model had some difficulty predicting the second label (Neutral). In this case, we might need to make some modifications to our dataset and re-train the model.
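A confusion matrix is easier to read once each row is divided by its total. In sklearn's output, rows are the true classes and columns are the predicted classes, so after normalizing, the diagonal gives the recall of each class. The matrix below is hypothetical and only shaped like a three-class result with a weak middle class; it is not the output from this article, and `row_normalize` is a name introduced here.

```python
def row_normalize(matrix):
    """Divide each row of a confusion matrix by its row total."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Hypothetical counts: rows = true class, columns = predicted class.
matrix = [
    [80, 15, 5],   # true Negative
    [30, 40, 30],  # true Neutral
    [5, 15, 80],   # true Positive
]
normalized = row_normalize(matrix)
per_class_recall = [normalized[i][i] for i in range(len(matrix))]
print(per_class_recall)  # [0.8, 0.4, 0.8]
```

In this made-up example the middle row shows Neutral reviews being split almost evenly across all three predictions, which is the kind of pattern that suggests revisiting how that class is labeled.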

5. Conclusion

Congratulations on completing this tutorial. Let’s recap what we have learned today.

First, we started with preparing the dataset for our sentiment analysis project. This includes obtaining the data and labeling it automatically based on the details provided. In our case, we use the app review rating as the label for the sentiment. We finalized it into three classes, namely Negative, Neutral and Positive. We generated three tsv files from the dataset.

Next, we configured the required parameters such as max sequence length and batch size. We trained the model and used it to do prediction on the test data.

Finally, we loaded the results into a dataframe and analyzed them using the metrics functions from sklearn. The insights provided allow us to determine the performance of our model.

Thanks for reading and hope you enjoyed this tutorial. See you again in the next article. Have a great day ahead! :heart: