We started with open source ‘code’ contribution. Now we are at a stage where we do open source ‘model’ contribution.
But how do we make new language models?
Recently, Hugging Face released a blog post on how to train a language model from scratch. It consists of training a tokenizer, defining the architecture, and training the model.
The other common approach is transfer learning, where we take a pretrained model like AWD-LSTM or ALBERT and then fine-tune it for our task. The catch is that the pretrained tokenizer may not cover domain-specific words: it splits them into very small subword chunks or maps them to <unk>, which can lead to poor performance of the model.
The methodology described here is the sweet spot between using transfer learning and making a model from scratch.
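To see the tokenization problem concretely, here is a quick illustration with a stock bert-base-uncased tokenizer (the long word is chosen just for effect):

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare domain word gets shredded into many tiny subword pieces,
# each carrying little meaning on its own.
print(bert_tok.tokenize("pneumonoultramicroscopicsilicovolcanoconiosis"))
```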
This was first explained nicely in the fastai lectures. Using their convert_weights function, fastai adds zero vectors to the embedding matrix of AWD-LSTM for the new vocab. AWD-LSTM has a vocab of ~33k tokens and a hidden size of 400. If you add 10k new tokens, your total vocab is now 43k, so the embedding matrix changes from (33k, 400) to (43k, 400), where the 10k new rows are just zero vectors of size hidden.
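Fastai’s actual convert_weights lives in its codebase; here is a minimal PyTorch sketch of the same idea (expand_embeddings is a hypothetical name, not fastai’s API):

```python
import torch

def expand_embeddings(old_emb: torch.nn.Embedding, n_new: int) -> torch.nn.Embedding:
    """Append n_new zero rows to an embedding matrix, keeping the pretrained rows."""
    old_vocab, hidden = old_emb.weight.shape          # e.g. (33k, 400) for AWD-LSTM
    new_emb = torch.nn.Embedding(old_vocab + n_new, hidden)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight   # keep the pretrained vectors
        new_emb.weight[old_vocab:] = 0.0              # zero vectors for the new vocab
    return new_emb
```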
Using this methodology, we don’t need to start from scratch for the old vocab, which is a huge advantage! :fire:
The fastai code is superb and handles all of this automatically. The problem is: how do we do this with the transformers library?
On a lonely night with no progress on accuracy for a problem we were working on, this solution struck me. We had tried transfer learning with AWD-LSTM, BERT, ALBERT, and XLM-R, as well as building a new model from scratch.
I thought: why not try the convert_weights approach with transformers?
I tried two models: canwenxu/BERT-of-Theseus-MNLI and TinyBERT. I selected these for a few reasons:

- BERT-of-Theseus has 66M parameters,
- AWD-LSTM has 24M, and
- TinyBERT has 15M parameters :panda_face:
I didn’t select DistilBERT, as Theseus has the same number of parameters but better performance, as shown in the screenshot.
Now a whole range of other small BERT models has also become available.
Basically, we don’t need a model as big as BERT-base all the time, and latency requirements push the need for smaller models.
Overall, the results of these small models are very, very impressive.
So here is what the changes look like. You need to choose the appropriate tokenizer for your model. Theseus is a distilled version of BERT and hence uses BertWordPieceTokenizer.
The method below takes a dataframe containing a ‘text’ column and fits a WordPiece tokenizer with the given vocab_size, breaking words into subwords as needed. Then we export the vocab and load it as a list.
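A sketch of that method could look like this (fit_wordpiece_vocab is a hypothetical name; it assumes the Hugging Face tokenizers library):

```python
from tokenizers import BertWordPieceTokenizer

def fit_wordpiece_vocab(df, vocab_size=10_000, txt_path="corpus.txt"):
    # The trainer reads from files, so dump the 'text' column to disk,
    # one document per line.
    df["text"].to_csv(txt_path, index=False, header=False)

    wp = BertWordPieceTokenizer(lowercase=True)
    wp.train(files=[txt_path], vocab_size=vocab_size)

    # save_model writes a vocab.txt; read it back as a plain list of tokens
    wp.save_model(".")
    with open("vocab.txt", encoding="utf-8") as f:
        return [line.strip() for line in f]
```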
Now we add the new vocab to the original tokenizer and pass the length of the tokenizer to the model to initialise new rows in the embedding matrix for the new vocab.
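In transformers, this maps onto add_tokens and resize_token_embeddings. A minimal sketch, reusing fit_wordpiece_vocab and df from the snippet above:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "canwenxu/BERT-of-Theseus-MNLI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

new_vocab = fit_wordpiece_vocab(df)   # list of tokens from the step above

# add_tokens skips entries already present in the original vocab
num_added = tokenizer.add_tokens(new_vocab)
print(f"added {num_added} new tokens")

# grow the embedding matrix to match; the new rows get fresh vectors
# (randomly initialised by default)
model.resize_token_embeddings(len(tokenizer))
```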
You need to apply this method after the tokenizer and model are loaded in run_language_modeling.py.
Be careful with vocab_size: if you add a huge number of new tokens, the model might become worse. It’s a hyperparameter worth tuning.
Once you have the model, do a before/after analysis of the tokenization with the old and new tokenizers.
The new tokenizer should split a sentence into fewer tokens.
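A quick sanity check might look like this, assuming old_tokenizer is the stock checkpoint’s tokenizer and tokenizer is the extended one from above (the sample sentence is made up):

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("canwenxu/BERT-of-Theseus-MNLI")

sample = "The patient showed signs of pneumoconiosis after the biopsy."

old_tokens = old_tokenizer.tokenize(sample)
new_tokens = tokenizer.tokenize(sample)   # extended tokenizer from above

print(len(old_tokens), old_tokens)   # domain words shredded into ## pieces
print(len(new_tokens), new_tokens)   # should be fewer tokens
```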
I was able to get the same metric with faster task training using TinyBERT. Although this was for a competition, where the goal is a high score, this approach can save a ton of headache and money on inference in production scenarios :heart_eyes:
I highly recommend using TinyBERT and suggest being open to evaluating smaller models before finalising a heavy transformer model :monkey_face: