Enhance Solr search with Word2Vec and Acronym resolution.

If you’ve ever worked with open-source search engines like Solr or Elasticsearch, you know one of their basic limitations: by default they provide only keyword search. Solr, however, is flexible enough to let you adapt the search to your requirements.

Solr’s classic scoring model is based on TF-IDF (recent Solr versions default to BM25, a close relative). It rewards a document for how often a query keyword occurs in it and discounts keywords that occur in many documents. The whole search is therefore focused on finding documents that contain the query keywords, and things get trickier when user expectations rise from keyword search to context-based search.
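To make the keyword-only behaviour concrete, here is a minimal sketch of the textbook TF-IDF weight. It is not Solr’s or Lucene’s exact formula, and the function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    tf = doc.count(term)                       # occurrences of the term in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Toy corpus of already-tokenized documents (illustrative only).
corpus = [
    ["trade", "war", "tariff"],
    ["trade", "deal"],
    ["war", "war", "battle"],
]
score_war = tf_idf("war", corpus[2], corpus)      # term occurs twice, in 2 of 3 documents
score_trade = tf_idf("trade", corpus[2], corpus)  # term is absent from this document
```

A document that never mentions the query keyword scores zero for it, however relevant its content is, which is the gap a context-based search tries to close.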

For example, a user might enter an abstract query like “War”.

The search engine should then be capable of returning the most relevant results by recognizing the query’s connections to recession, battle, fear, tension, sanction, etc.

With the help of a Word2Vec model, we try to solve the problem of returning results that are in context with the query.

This enables Solr to return not only the documents that contain the phrase but also those documents that are in context with the phrase.

Acronym resolution can be added alongside the Word2Vec model, so that users can also search with acronyms and still get the results they are looking for.

Get The Data:

The model is trained on financial articles written by various authors on many different topics in finance. We selected this data because each article has a title and the author’s thoughts on that title (the document text), and we wanted the model to pick up features from both the document text and the title. During inference, when we give the model a phrase, it should then recommend the set of words that lie in the context of the phrase and are associated with the documents. While exploring the “document-text” field we observed acronyms such as ROIC, ETF and EBITDA; the full forms of these acronyms were also present within 3 to 5 words of the acronyms.

Prepare The Data:

We passed the data through the transformation steps listed below:

  1. Remove random numerical digits from the text.
  2. Remove special characters from the text, along with extra white-space.
  3. Acronym extraction.
  4. Sentence tokenization.
  5. Stopword removal and lemmatization.
  6. Word tokenization.

Below is an example of what the data looks like after going through the above transformation steps.

document_text = “Claire walks her dog, The quick brown fox jumps over the lazy dog. United States initiated trade war against china, refusing to allow imports of each other’s goods”

After transformation = [ [“claire”, “walk”, “dog”], [“quick”, “brown”, “fox”, “jump”, “over”, “lazy”, “dog”], [“united”, “states”, “initiated”, “trade”, “war”, “against”, “china”, “refusing”, “allow”, “import”, “each”, “other”, “goods”] ]
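The transformation steps can be sketched roughly as follows. This is a simplified, standard-library-only illustration and not the pipeline actually used: the stopword list is a tiny made-up one, sentence splitting is a naive split on punctuation, and acronym extraction and lemmatization are left out.

```python
import re

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"the", "a", "an", "her", "his", "to", "of", "and", "in", "on"}

def preprocess(text):
    """Return a list of token lists, one per sentence."""
    text = re.sub(r"\d+", " ", text)                  # remove numerical digits
    sentences = re.split(r"[.!?]+", text)             # naive sentence tokenization
    result = []
    for sentence in sentences:
        cleaned = re.sub(r"[^A-Za-z\s]", " ", sentence)      # remove special characters
        tokens = cleaned.lower().split()                      # word tokenization
        tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
        if tokens:                                            # skip empty sentences
            result.append(tokens)
    return result

sentences = preprocess("The quick brown fox jumps over the lazy dog. Claire walks her 2 dogs!")
```

The nested list-of-sentences shape matters because Word2Vec trainers typically consume one token list per sentence.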

Selecting a Suitable Model:

We selected Word2Vec over topic-modeling approaches like Latent Dirichlet Allocation (LDA) or a topic classifier. Because the problem is unsupervised, we cannot use topic classifiers. LDA serves a different purpose: it captures which words co-occur in documents but pays no attention to the immediate context of words, so the words can appear anywhere in the document and in any order.

In our case, by contrast, we need a model that can understand the context behind words in the text.

Word2Vec addresses the problem of context by learning the relation of a word to its neighboring words, with the help of word embeddings.

It also allows us to use vector geometry to resolve word analogies, e.g. king − man + woman ≈ queen. This is helpful when the user performs a phrase search like “trade war”. Taken individually, “trade” and “war” have completely different meanings, but used together as “trade war” they project a different context, and the objective of this model is to pick up words related to that context.

Word2Vec Model:

A high-level overview of this model:

The Word2Vec model predicts the neighboring words for a given input word or phrase.

I will not go into deeper details, as this blog focuses on the applications of the Word2Vec model, but I will touch upon some parts of the training process, such as SkipGram with Negative Sampling (SGNS).

SkipGram is an unsupervised learning technique used to learn the relation of a word to its neighboring words.

So for the given input sentence: “The quick brown fox jumps over the lazy dog”

After processing: [ [‘The’, ‘quick’, ‘brown’, ‘fox’, ‘jumps’, ‘over’, ‘the’, ‘lazy’, ‘dog’] ]

SkipGram generates a training dataset by treating the current word as the feature and its neighboring words as the targets.

The image below shows how the training dataset is created by SkipGram with window=5.

sample of training dataset with window=5
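As a rough sketch of how such a dataset can be generated, the function below pairs every word with its neighbors. The name `skipgram_pairs` is made up, and it assumes that `window` means the number of words considered on each side of the current word; the image may use a different convention.

```python
def skipgram_pairs(tokens, window):
    """Return (input_word, neighbor_word, target) triples; the target is always 1."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                  # clamp the window at the sentence start
        hi = min(len(tokens), i + window + 1)    # and at the sentence end
        for j in range(lo, hi):
            if j != i:                           # a word is not its own neighbor
                pairs.append((center, tokens[j], 1))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
```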

Here the target value is one for every sample, because each sample pairs an input word with one of its neighboring words.

Because the target value is one for every sample in the dataset, a model can fit the data perfectly by always returning one. It then learns barely anything from these samples and its predictions are irrelevant. To address this issue we introduce negative samples: pairs in which the second word is not among the neighbors of the input word. Since these are non-neighboring words, the model should return zero for such samples.

The image below shows the training dataset created by SkipGram with Negative Sampling.

sample of training dataset created by SGNS
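A minimal sketch of adding negative samples is shown below. It is a simplification: negatives are drawn uniformly from the non-neighboring words, whereas real Word2Vec implementations usually draw them from a smoothed word-frequency distribution. The name `sgns_samples` and the parameter `k` (negatives per positive) are assumptions for illustration.

```python
import random

def sgns_samples(tokens, window, k, seed=0):
    """Return (input_word, other_word, target) triples with k negatives per positive."""
    rng = random.Random(seed)          # seeded so the sketch is reproducible
    vocab = sorted(set(tokens))
    samples = []
    for i, center in enumerate(tokens):
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        # Words that are neither the input word nor one of its neighbors.
        candidates = [w for w in vocab if w != center and w not in neighbors]
        for neighbor in neighbors:
            samples.append((center, neighbor, 1))                    # positive sample
            for _ in range(k if candidates else 0):
                samples.append((center, rng.choice(candidates), 0))  # negative sample
    return samples

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
samples = sgns_samples(tokens, window=1, k=2)
```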

When this training dataset is fed to the neural network, the prediction is the dot product of the input embedding and the output (context) embedding, squashed into the range 0 to 1 by the sigmoid function. The error is this prediction minus the target value, and it is propagated back to update the weights during backpropagation.
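One update step for a single training sample can be sketched in plain Python as follows. The function name `sgns_step` and the learning rate are assumptions; real implementations work on whole embedding matrices and batches, but the arithmetic per sample is the same idea.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_step(input_vec, context_vec, target, lr=0.1):
    """One gradient step for one (input, context, target) sample."""
    score = sum(a * b for a, b in zip(input_vec, context_vec))   # dot product
    error = sigmoid(score) - target                              # prediction minus target
    # Each embedding moves along the other one, scaled by the error.
    new_input = [a - lr * error * b for a, b in zip(input_vec, context_vec)]
    new_context = [b - lr * error * a for a, b in zip(input_vec, context_vec)]
    return error, new_input, new_context

# A positive sample (target 1) with two orthogonal toy embeddings.
error, new_input, new_context = sgns_step([1.0, 0.0], [0.0, 1.0], target=1)
new_score = sum(a * b for a, b in zip(new_input, new_context))
```

For a positive sample the step pulls the two embeddings towards each other, so their dot product grows; for a negative sample (target 0) the sign of the error flips and they are pushed apart.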

Solving Analogies.

It solves this type of question:

If man => king, then woman => ?

Answer from model = Queen

In the context of the above example: “The quick brown fox jumps over the lazy dog”

If fox => quick, then dog => ?

Answer from model = Lazy
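The vector arithmetic behind such answers can be illustrated with a toy example. The two-dimensional vectors below are hand-made (one axis for “royalty”, one for “gender”) and are not learned embeddings; the helper names `cosine` and `analogy` are likewise made up for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors: [royalty, gender]. Illustrative only.
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}
answer = analogy("man", "king", "woman", toy_vectors)
```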

How could this solve our problem in search?

Answer: for a given keyword or phrase, the model predicts a set of related words that lie in the context of that keyword or phrase, each with a probability. This indicates how the other words in the documents have been used with, or are related to, the given keyword.

E.g. keyword = Fixed Income

Model predicted words: [equity, interest, bond, asset, floating, straitlaced, yield, tsang, sleeve, taxable, municipal, investment, balanced, multiasset, fee]

Using these model-predicted words plus the keyword, we can create a query to search and score the documents that contain these words.

Acronym Support feature:

Note: This logic was built on a pattern observed in the dataset: financial articles often have acronyms preceded by their expansions. If you see a similar pattern in your data, you can use this logic.

We have added a vocabulary for acronym lookup, so that whenever the user searches with acronyms like “ETF” or “TIPS”, alone or with other keywords in the phrase, the acronyms are expanded to their actual values.

For example: ETF => is expanded to => “Exchange Traded Funds”, TIPS => “Treasury Inflation-Protected Securities”

Note: The acronym feature is supported in search only if the acronyms are provided in capital letters (this logic was added to distinguish between normal words and acronyms).
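A lookup of this kind can be sketched as below. The dictionary holds only the two acronyms mentioned in this post, and the function name `expand_acronyms` is hypothetical; the real vocabulary is built from the dataset.

```python
# Illustrative lookup table; a real one would be extracted from the articles.
ACRONYMS = {
    "ETF": "Exchange Traded Funds",
    "TIPS": "Treasury Inflation-Protected Securities",
}

def expand_acronyms(query):
    """Replace upper-case tokens found in the lookup table with their expansion."""
    expanded = []
    for token in query.split():
        if token.isupper() and token in ACRONYMS:   # only capitalized acronyms qualify
            expanded.append(ACRONYMS[token])
        else:
            expanded.append(token)
    return " ".join(expanded)

upper_case = expand_acronyms("TIPS for investors")
lower_case = expand_acronyms("tips for investors")   # treated as the normal word "tips"
```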

While exploring the dataset we found that many of the acronyms in the dataset have their expansion right before them, within an offset of 5 to 6 words.

The image, taken from a regex engine, shows that with the help of a regex we were able to capture the acronyms along with their expansions, plus some additional noise.
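The exact pattern used is only visible in the image, so the following is a stand-in sketch rather than the original regex. It assumes an acronym is any all-capital token of two or more letters and simply collects up to `max_words` words that precede it; the function name and the sample sentence are made up.

```python
import re

def acronym_candidates(text, max_words=6):
    """Return (acronym, preceding_words) pairs for every all-capital token."""
    tokens = re.findall(r"[A-Za-z]+", text)     # keep alphabetic tokens only
    found = []
    for i, token in enumerate(tokens):
        if len(token) >= 2 and token.isupper():
            preceding = tokens[max(0, i - max_words):i]
            found.append((token, preceding))
    return found

candidates = acronym_candidates("We bought Jacobs Engineering Group Inc (JEC) last year")
```

As in the original approach, the captured words include noise (“We”, “bought”), which is why a scoring step is needed afterwards.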

Now that we have a bunch of preceding words for every acronym, we have to find out which set of words is truly the correct expansion of the acronym.

To identify an acronym’s expansion in the text, we make some high-level assumptions about acronyms:

1) Acronyms are usually created from the first letters of the words in a phrase. These words are likely to be nouns, and the first letter of each noun is usually capitalized because it represents a company name, organization, brand, etc.

2) If we put the preceding words in an array, then the first few words should be nouns whose first letter matches the first letter of the acronym (in capitals).

3) The majority of the letters in the acronym should be present (in sequence) in the array, starting from the word where the first assumption is satisfied.

We tried to solve this problem with a dynamic programming approach, in which we reward every preceding word in the array with a score based on how well it satisfies the assumptions listed above.

We then sum the scores to filter the best set of words that is likely to be the expansion of the acronym corresponding to that array.

Calculation Example:

Let’s take the acronym JEC as an example; its expansion is “Jacobs Engineering Group Inc” (you can check this in the above image).

Here we match every word of the preceding array, in a for loop, against the letters J, E and C.

As per the first assumption, if the first letter of a word (index zero) matches “J”, “E” or “C”, we reward that word with 10 points.

If the letter instead matches between index 1 and index (length of the acronym) of the word, we reward the word with 1 point.

J => Jacobs (10 points), J => Engineering (0 points), J => Group (0 points), J => Inc (0 points)

E => Jacobs (0 points), E => Engineering (10 points), E => Group (0 points), E => Inc (0 points)

C => Jacobs (1 point), C => Engineering (0 points), C => Group (0 points), C => Inc (1 point)

In summary: a letter that matches the first letter of any word in [Jacobs, Engineering, Group, Inc] earns 10 points, and a letter that matches between index 1 and index (length of the acronym) of a word earns 1 point.

Now we sum up the above points, i.e. (10 + 1 + 10 + 1) = 22

So the score of this set of words is 22. This shows that the majority of the words in the array are aligned with our assumptions, so based on the score the given set of words is very likely to be the expansion of the acronym JEC.

Let’s take another example (from the above image):

Acronym: ROIC

Preceding array of words: “Top”, “Holding”, “Wtg”

R => Top (0 points), R => Holding (0 points), R => Wtg (0 points)

O => Top (1 point), O => Holding (1 point), O => Wtg (0 points)

I => Top (0 points), I => Holding (1 point), I => Wtg (0 points)

C => Top (0 points), C => Holding (0 points), C => Wtg (0 points)

Score = (1 + 1 + 1) = 3

As you can see, the score of this candidate is very low, which means these words are very unlikely to be the expansion of the acronym ROIC.

Note: The higher a candidate’s score, the more likely it is to be the expansion of its corresponding acronym.
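The scoring rule from the two worked examples can be written as a short function. This is a plain nested loop that reproduces the point values shown above, not the full dynamic programming implementation; the name `score_candidate` is hypothetical and matching is assumed to be case-insensitive.

```python
def score_candidate(acronym, words):
    """Score how well a list of preceding words fits an acronym."""
    total = 0
    n = len(acronym)
    for letter in acronym.lower():
        for word in words:
            w = word.lower()
            if w[:1] == letter:            # letter matches the first letter of the word
                total += 10
            elif letter in w[1:n + 1]:     # letter matches at index 1..len(acronym)
                total += 1
    return total

jec_score = score_candidate("JEC", ["Jacobs", "Engineering", "Group", "Inc"])
roic_score = score_candidate("ROIC", ["Top", "Holding", "Wtg"])
```

With these inputs the function returns 22 for JEC and 3 for ROIC, matching the hand calculations.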

Let’s combine all the pieces together to make it functional.

The workflow diagram expresses the following steps:

User Query Phrase: “TIPS for investors”

Acronym Lookup provides the expansion for “TIPS”: “Treasury Inflation-Protected Securities”

The Word2Vec model is fed all the words in the phrase along with the acronym expansions (word sequence maintained), and it returns a set of predicted words that have been used in the articles in the context of the query phrase.

Create the Solr query params given below:

q_s:TIPS investors +(treasury+inflation+protected+security+investor+bond+tip+yield+duration+fund+short+mortgage)^=0
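Assembling such a query string is plain string formatting. The sketch below only reproduces the shape of the example above (field name `q_s`, terms joined with `+`, and the trailing `^=0`); the function name `build_solr_query` is made up, and nothing here talks to Solr.

```python
def build_solr_query(field, query_terms, related_terms):
    """Combine the user's keywords with the expanded and predicted terms."""
    keywords = " ".join(query_terms)         # the original keywords, stopwords removed
    expansion = "+".join(related_terms)      # acronym expansion plus Word2Vec predictions
    return f"{field}:{keywords} +({expansion})^=0"

related = ["treasury", "inflation", "protected", "security", "investor", "bond",
           "tip", "yield", "duration", "fund", "short", "mortgage"]
query = build_solr_query("q_s", ["TIPS", "investors"], related)
```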

There are various ways to enhance search results, as Solr is flexible and open to customization. Word embedding models like Word2Vec help enhance the search experience by returning the results users anticipate. We hope this blog helps you tune your search results. Thank you.