Embeddings for sentence fragments harvested from a document can serve as extractive summary facets of that document and potentially accelerate its discovery, particularly when the user input is itself a sentence fragment. These fragment embeddings not only yield better quality results than traditional text matching systems, but also circumvent a problem inherent in modern search approaches driven by distributed representations: the challenge of creating an effective document embedding, i.e. a single embedding at the document level that captures all the facets of a document and enables its discovery through search.
Examples of sentence fragments are “bats as a source of coronavirus” and “coronavirus in pangolins”: short sequences with one or more noun phrases connected by prepositions, adjectives, etc. These connective terms, largely ignored by traditional search systems, play a key role in capturing user intent (e.g. “coronavirus in bats” has a distinct search intent from “bats as a source of coronavirus” or “coronavirus not present in bats”), and sentence fragments that preserve them are also valuable index candidates, serving as extractive summary facets (sub-summaries) of a document. By embedding these sentence fragments into an appropriate embedding space (e.g. that of BERT), we can use the search input fragment as a probe into that space to discover relevant documents.
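The probe step can be illustrated with a minimal sketch. The snippet below embeds a handful of harvested fragments and a query fragment with BERT and ranks the fragments by cosine similarity; the model name, the mean-pooling choice, and the tiny fragment list are illustrative assumptions, not the exact setup used by the system described here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(texts):
    # Mean-pool the final hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # (batch, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

fragments = ["bats as a source of coronavirus",
             "coronavirus in pangolins",
             "coronavirus not present in bats"]
query_vec = embed(["bats as a source of coronavirus"])
frag_vecs = embed(fragments)

# Rank indexed fragments by similarity to the probe fragment.
scores = torch.nn.functional.cosine_similarity(query_vec, frag_vecs)
for frag, score in sorted(zip(fragments, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {frag}")
```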
Finding a comprehensive answer supported by literature evidence to questions such as “What are the animal sources of COVID-19?” or “receptors coronavirus binds to” is challenging even on a tiny data set like the recently released COVID-19 data set (~500 MB corpus, ~13k documents, 85+ million words, about a million unique words in normalized text).
Traditional document search methods are quite effective for the typical use case, where the answer is obtainable from a few documents by searching with one or more noun phrases. They also satisfy the following user experience constraint for words and phrases:
“what we see (results) is what we typed (searched for)”
For instance, when we search for words and phrases (contiguous word sequences like New York, Rio De Janeiro), the results typically contain the terms we entered or their synonyms (e.g. a COVID-19 search yields results with Sars-COV-2, novel coronavirus, etc.).
However, result quality tends to degrade as the number of words in the search input increases, particularly with the use of connective terms between noun phrases. This degradation is visibly evident in the terms these search engines highlight in results.
For instance, in the figure below, current search engines selectively highlight mostly the nouns in “bats as a source of coronavirus”, scattering them within and across sentences/paragraphs of the displayed extractive summary of a document, at times not even honoring the order of those words in the input sequence. Even though document relevance ordering often mitigates this to a large degree, we are still left with the task of examining the extractive summary of each document, often having to navigate into the document only to return to the main result set because the document did not satisfy our search intent.
The document search approach described in this article may reduce this cognitive overload, in addition to yielding more relevant results, particularly when searching for sentence fragments. As an illustration, the same query used in the existing search systems above can yield results of the form shown below (the interface is a schematic purely intended to illustrate the search approach). A key point worth noting in the schematic is that the extractive summary facets are actual matches in documents (the numbers in parentheses are the number of documents containing a fragment and the cosine distance of that fragment from the input search fragment), as opposed to the suggested or related search queries displayed in traditional search systems. These summary facets give a panoramic view of the results space, reducing futile document navigation and accelerating convergence to documents of interest.
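To make the parenthesized annotations concrete, here is a hypothetical sketch of how such facet rows could be assembled. The inputs `fragment_vecs` (fragment text to embedding) and `fragment_to_docs` (an inverted index from fragment text to document ids, built offline) are assumptions for illustration, not part of any published interface.

```python
from scipy.spatial.distance import cosine as cosine_distance

def build_facets(query_vec, fragment_vecs, fragment_to_docs, top_k=10):
    # fragment_vecs: {fragment text -> embedding}
    # fragment_to_docs: {fragment text -> set of document ids}
    facets = []
    for fragment, vec in fragment_vecs.items():
        dist = cosine_distance(query_vec, vec)   # 0.0 means identical
        facets.append((fragment, len(fragment_to_docs[fragment]), dist))
    # Closest fragments first; each row renders as
    # "fragment (doc count, distance)" in the schematic above.
    facets.sort(key=lambda row: row[2])
    return facets[:top_k]
```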
The input fragment can be a full or partial sentence with no constraint on its composition or style. For instance, it can be interrogative as opposed to the affirmative query above: we can find the protein receptors coronavirus binds to by searching for “receptors coronavirus binds to”.
The comparison between the search systems above is only meant to illustrate the differences in the underlying approach to document discovery. It would be an unfair comparison otherwise, given the orders-of-magnitude difference in corpus sizes: we are bound to get more relevant results in a tiny corpus.
Distributed representations have become integral to any form of search given their advantages over traditional, purely symbolic search approaches. Modern search systems increasingly leverage their properties to complement symbolic search methods. If we consider document search broadly as a combination of breadth-first and depth-first traversal of the document space, these two forms of traversal require embeddings with characteristics specific to each. For instance, we might start broadly with a search for animals causing coronavirus, then drill down into bats, then broaden again to reptiles, etc.
Distributed representations of document facets, be they words, phrases, or sentence fragments, drawn from the embedding spaces of Word2vec and BERT, have unique and complementary attributes that are useful for performing broad and deep searches. Specifically,
Expanded terms/fragments obtained from word2vec/BERT embeddings of the user input are used to exact-match documents that were already indexed offline with those terms/fragments. Offline, fragments are harvested from the corpus using a combination of a part-of-speech tagger and a chunker, and embeddings are created for them using both models, word2vec and BERT.
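As a rough illustration of the offline harvesting step, the sketch below approximates the part-of-speech tagging and chunking with spaCy. The actual chunking rules are not specified in this article, so the grammar here (noun chunks stitched together across connective terms) is an assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def harvest_fragments(text):
    fragments = []
    doc = nlp(text)
    for sent in doc.sents:
        chunks = list(sent.noun_chunks)
        # Stitch adjacent noun chunks whose gap consists only of
        # connective terms (prepositions, particles, adjectives).
        for left, right in zip(chunks, chunks[1:]):
            gap = doc[left.end:right.start]
            if len(gap) > 0 and all(t.pos_ in ("ADP", "PART", "ADJ") for t in gap):
                fragments.append(doc[left.start:right.end].text)
        fragments.extend(chunk.text for chunk in chunks)
    return fragments

print(harvest_fragments("Pangolins were studied as a source of coronavirus."))
```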
As mentioned earlier, word2vec embeddings expand the breadth of search for words and phrases. They do not expand the breadth of search for fragments: the histogram of the neighborhood quite often lacks a distinct tail (figure 8 below). This is because fragments, by virtue of their length, do not have enough neighboring context to learn quality embeddings. This deficiency could in part be addressed by expanding the training window size and increasing the surrounding context by ignoring sentence boundaries, but in practice it is still inadequate, given the low occurrence counts of fragments (figure 8).
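A simple way to see whether a neighborhood has a distinct tail is to inspect the similarity scores of a term’s word2vec neighbors, as in the sketch below. The gensim model path and the mean-plus-z-standard-deviations cutoff are assumptions for illustration, not the exact heuristic used here.

```python
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("cord19_word2vec.model")   # hypothetical model path

def tail_neighbors(term, topn=50, z=2.0):
    neighbors = model.wv.most_similar(term, topn=topn)
    scores = np.array([score for _, score in neighbors])
    # A distinct tail shows up as scores well above the bulk of the
    # histogram; for fragments this set is frequently empty (figure 8).
    cutoff = scores.mean() + z * scores.std()
    return [(word, score) for word, score in neighbors if score >= cutoff]

print(tail_neighbors("coronavirus"))
```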
BERT embeddings largely increase only the depth of search, particularly for fragments and phrases (depth expansion for single words using BERT embeddings is not useful in practice). While they do increase breadth to some degree (for instance, the query “coronavirus in macaques” broadens to include “coronavirus in palm civets” within the distribution tail of significant results), the breadth is not as much as what word2vec offers for words and phrases. The figure 6 caption below illustrates where it is deficient. The implementation notes have additional examples of the lack of breadth in fragment search and ways to circumvent this limitation.
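The depth expansion step can be sketched as follows, assuming the fragment embeddings have been precomputed offline (e.g. with the embed() sketch earlier). The tail test is the same assumed mean-plus-z-standard-deviations heuristic as above, not necessarily the method used here.

```python
import numpy as np

def expand_fragment(query_vec, fragment_vecs, fragments, z=2.0):
    # fragment_vecs: (num fragments, dim) array of BERT fragment
    # embeddings computed offline; fragments: the matching texts.
    q = query_vec / np.linalg.norm(query_vec)
    f = fragment_vecs / np.linalg.norm(fragment_vecs, axis=1, keepdims=True)
    sims = f @ q
    # Keep only the distribution tail of significant results, e.g.
    # "coronavirus in palm civets" for "coronavirus in macaques".
    cutoff = sims.mean() + z * sims.std()
    hits = [(fragments[i], float(sims[i])) for i in np.flatnonzero(sims >= cutoff)]
    return sorted(hits, key=lambda h: -h[1])
```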
Word2vec was perhaps the first model to clearly establish the power of distributed representations, about seven years ago. The embeddings output by this simple model, with essentially two arrays of vectors for its “architecture”, are still of immense value for downstream applications such as the document search method described above.
Word2vec, in concert with BERT embeddings, offers a solution to document search that potentially improves upon traditional approaches in the quality of results and the time to converge on them (this claim needs to be quantified). The solution circumvents the problem of learning all the important aspects of a document in a single vector representation that could then be used by a search system not only to pick a particular document but also to find documents similar to the one picked.
This circumvention is made possible by the use of embeddings, be they of words, phrases, or sentence fragments, to broaden/deepen search prior to document picking. Word2vec embeddings for words and phrases largely increase the breadth of document search. BERT embeddings for sentence fragments largely increase its depth. BERT embeddings also eliminate the out-of-vocabulary scenario and facilitate searchable extractive summarizations of the different salient facets of a document, which accelerate convergence onto relevant documents.