Multiple approaches have been proposed for language modeling; they can be classified into two main categories.
The encoder is built using a bidirectional LSTM to encode the input text into an internal encoding.
The decoder receives both the generated internal encoding and the reference words; it also contains an LSTM so that it can generate the output one word at a time.
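The LSTM update at the heart of both the encoder and the decoder can be sketched in a few lines of numpy. This is a toy single-cell sketch with made-up dimensions, not the full seq2seq model; the names (`lstm_step`, `d_in`, `d_h`) are our own for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM cell update: gates are computed from the current
    input x and the previous hidden state h_prev."""
    z = W @ x + U @ h_prev + b                    # shape (4 * hidden,)
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
    g = np.tanh(g)                                # candidate cell state
    c = f * c_prev + i * g                        # new cell state
    h = o * np.tanh(c)                            # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = rng.normal(size=(4 * d_h, d_in))
U = rng.normal(size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h = c = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # run a 5-step input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (16,)
```

The cell state `c` is the path along which information is carried across time steps; it is exactly this path that degrades over very long sequences.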
Other research efforts tried to build language models without recurrent networks, to give the system even more power when working with long sentences, as an LSTM struggles to represent long sequences of data, and hence long sentences.
Transformers rely on attention, specifically self-attention: neural network layers built to learn how to attend to specific words in the input sentence. Transformers are also built in an encoder-decoder structure.
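The core of self-attention can be sketched as scaled dot-product attention in numpy (a minimal single-head sketch with random toy weights, assuming small made-up dimensions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape
    (seq_len, d_model): every position attends to every other position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise attention scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # softmax over key positions
    return w @ V                                     # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Each output row is a mixture of all the value vectors, weighted by how strongly that word attends to every word in the sentence.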
The encoder and the decoder each contain a stack of blocks.
Encoder: contains a stack of blocks, each consisting of a self-attention layer and a feed-forward network. The encoder receives the input and, in a bidirectional manner, attends to all of the input text, both the previous and the following words, then passes the result to the feed-forward network. This block is repeated multiple times, according to the number of blocks in the encoder.
Decoder: after encoding is done, the encoder passes this internal encoding to the decoder, which also contains multiple blocks, each consisting of the same self-attention* (with a catch), an encoder-decoder attention, and then a feed-forward network. *The difference in that self-attention is that it only attends to the previous words, not the whole sentence. So the decoder receives both the reference and the internal encoding of the encoder (the same concept as in the seq2seq encoder-decoder recurrent model).
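The "catch" in the decoder's self-attention is a causal mask applied to the attention scores before the softmax; a minimal sketch of the difference:

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax attention weights; a causal mask hides future positions,
    as in the decoder's self-attention."""
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)   # future words get ~zero weight
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                      # uniform scores for 3 positions
enc = attention_weights(scores)                # encoder: attends everywhere
dec = attention_weights(scores, causal=True)   # decoder: only current + previous
print(np.round(enc[0], 2))  # [0.33 0.33 0.33]
print(np.round(dec[0], 2))  # [1. 0. 0.]
```

The first position in the causal case can only attend to itself, which is what lets the decoder generate one word at a time without peeking ahead.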
You can learn more about the Transformer architecture in jalammar’s amazing blog.
It turns out we don’t need the entire Transformer to adopt a fine-tunable language model for NLP tasks. We can work with only the decoder, as OpenAI proposed; however, since it uses the decoder, the model only trains in the forward direction, without looking at both the previous and the following words (hence not bidirectional). This is why BERT was introduced, where we use only the Transformer encoder.
BERT is a modification of the original Transformer that relies only on the encoder structure: we apply the bidirectional manner using only the encoder block. This can seem counter-intuitive, and it is! Bidirectional conditioning would allow each word to indirectly see itself in a multi-layered context (more about it here), so BERT uses the ingenious method of MASKS in training.
BERT is trained on a huge amount of text, masking 15% of the words with a [MASK] token, and is then trained to predict the masked words.
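The masking step can be sketched in plain Python (a simplified sketch: real BERT masking also sometimes keeps or randomly replaces the chosen tokens; `mask_tokens` is our own illustrative helper):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Randomly replace ~15% of tokens with [MASK]; the model is then
    trained to recover the original words at the masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)       # prediction target
        else:
            masked.append(tok)
            labels.append(None)      # no loss at unmasked positions
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

The loss is computed only at the masked positions, which is why the rest of the labels are `None` in this sketch.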
We mainly take a pretrained BERT model and then use it as the cornerstone of our tasks, which fall into two main types.
So let’s build contextualized word embeddings.
There are actually multiple ways to generate embeddings from the BERT encoder blocks (12 blocks in this example).
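A few of these pooling strategies can be sketched with numpy on dummy hidden states (the array below is a stand-in for real BERT outputs: 13 layers = the embedding layer plus 12 encoder blocks, a 10-token sentence, hidden size 768):

```python
import numpy as np

# Dummy hidden states standing in for real BERT outputs.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 10, 768))  # (layers, tokens, hidden)

last_layer   = hidden_states[-1]                            # (10, 768) per-token vectors
sum_last4    = hidden_states[-4:].sum(axis=0)               # (10, 768) sum of last 4 layers
concat_last4 = np.concatenate(hidden_states[-4:], axis=-1)  # (10, 3072) concat of last 4
sentence_vec = last_layer.mean(axis=0)                      # (768,) mean pooling over tokens

print(sum_last4.shape, concat_last4.shape, sentence_vec.shape)
```

Mean pooling over the last layer's token vectors is the strategy used by the `bert-base-nli-mean-tokens` model below; summing or concatenating the last four layers are common alternatives for per-token embeddings.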
In this tutorial we will focus on the task of using pre-trained BERT to build sentence embeddings: we simply pass our sentences to pre-trained BERT to generate our own contextualized embeddings.
The processed dataset can be found here, and the steps for reading and processing the json files can be found here, where we convert the json files to a csv. We use the same process used by maksimeren.
We use the sentence-transformers library provided by UKPLab. This library makes it truly easy to use BERT and other architectures like ALBERT and XLNet for sentence embedding, and it also provides a simple interface to query and cluster data.
!pip install -U sentence-transformers
Then we download the pre-trained BERT model that was fine-tuned on Natural Language Inference (NLI) data ( code section ).
from sentence_transformers import SentenceTransformer
import pickle as pkl
embedder = SentenceTransformer('bert-base-nli-mean-tokens')
Then we encode the list of paragraphs ( the processed data can be found here ).
corpus = df_sentences_list
corpus_embeddings = embedder.encode(corpus, show_progress_bar=True)
The queries are the sentences we need to find answers to; in other words, we search the paragraph dataset for similar paragraphs, and hence similar literature papers.
# Query sentences:
queries = ['What has been published about medical care?',
           'Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest',
           'Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually',
           'Resources to support skilled nursing facilities and long term care facilities.',
           'Mobilization of surge medical staff to address shortages in overwhelmed communities .',
           'Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies .']
query_embeddings = embedder.encode(queries, show_progress_bar=True)
Then we compute the cosine similarity between each embedded query and the previously embedded paragraphs, and return the 5 most similar paragraphs along with the details of their papers.
import scipy.spatial

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
print("\nTop 5 most similar sentences in corpus:")
for query, query_embedding in zip(queries, query_embeddings):
    # cdist returns a (1, n) matrix of cosine distances; take its single row
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # pair each paragraph index with its distance and sort by distance
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    for idx, distance in results[0:closest_n]:
        print("Score: ", "(Score: %.4f)" % (1 - distance), "\n")
        print("Paragraph: ", corpus[idx].strip(), "\n")
        row_dict = df.loc[df.index == corpus[idx]].to_dict()
        print("paper_id: ", row_dict["paper_id"][corpus[idx]], "\n")
        print("Title: ", row_dict["title"][corpus[idx]], "\n")
        print("Abstract: ", row_dict["abstract"][corpus[idx]], "\n")
        print("Abstract_Summary: ", row_dict["abstract_summary"][corpus[idx]], "\n")
Query: What has been published about medical care?

Score: 0.8296
Paragraph: how may state authorities require persons to undergo medical treatment
Title: Chapter 10 Legal Aspects of Biosecurity

Score: 0.8220
Paragraph: to identify how one health has been used recently in the medical literature
Title: One Health and Zoonoses: The Evolution of One Health and Incorporation of Zoonoses

Query: Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest

Score: 0.8139
Paragraph: clinical signs in hcm are explained by leftsided chf complications of arterial thromboembolism ate lv outflow tract obstruction or arrhythmias capable of
Title: Chapter 150 Cardiomyopathy

Score: 0.7966
Paragraph: the term arrhythmogenic cardiomyopathy is a useful expression that refers to recurrent or persistent ventricular or atrial arrhythmias in the setting of a normal echocardiogram the most commonly observed rhythm disturbances are pvcs and ventricular tachycardia vt however atrial rhythm disturbances may be recognized including atrial fibrillation paroxysmal or sustained atrial tachycardia and atrial flutter
Title: Chapter 150 Cardiomyopathy

Query: Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually

Score: 0.8002
Paragraph: conclusion several methods and approaches could be used in the healthcare arena time series is an analytical tool to study diseases and resources management at healthcare institutions the flexibility to follow up and recognize data patterns and provide explanations must not be neglected in studies of healthcare interventions in this study the arima model was introduced without the use of mathematical details or other extensions to the model the investigator or the healthcare organization involved in disease management programs could have great advantages when using analytical methodology in several areas with the ability to perform provisions in many cases despite the analytical possibility by statistical means this approach does not replace investigators common sense and experience in disease interventions
Title: Disease management with ARIMA model in time series

Score: 0.7745
Paragraph: whether the health sector is in fact more skillintensive than all other sectors is an empirical question as is that of whether the incidence of illness and the provision and effectiveness of health care are independent of labour type in a multisectoral model with more than two factors possibly health carespecific and other reallife complexities the foregoing predictions are unlikely to be wholly true nevertheless these effects will still operate in the background and thus give a useful guide to the interpretation of the outcomes of such a model
Title: A comparative analysis of some policy options to reduce rationing in the UK's NHS: Lessons from a general equilibrium model incorporating positive health effects
For the full results, refer to our code notebook.
We were truly impressed by both,