Document search with fragment embeddings

COVID-19 questions — a use case for improving sentence fragment search

Figure 1. Illustration of embeddings-driven fragment search used to answer specific questions (left panel) as well as broader questions (right panel). The text fragments highlighted in yellow are document matches to the search input, obtained using BERT embeddings. The right panel is a sample of animals with literature evidence for the presence of coronavirus; the font size is a qualitative measure of reference counts in the literature. Bats (in general, and Chinese horseshoe bats specifically) and birds have been mentioned as sources of coronavirus: bats as the gene source of alphacoronaviruses and betacoronaviruses, and birds as the gene source of gammacoronaviruses and deltacoronaviruses. Zoonotic transmission of coronavirus from civet cats and pangolins (betacoronavirus) has also been reported. All the information above was obtained automatically using machine learning models, without human curation. For the broad question in the right panel, a bootstrap list was created by searching for the term “animals” and clustering the results in its Word2vec embedding neighborhood. This list was then filtered for biological entity types using unsupervised NER with BERT, and the filtered list was used to create the final list of animals with literature evidence, captured in fragments that serve as extractive summaries of the corresponding documents. The animal source of COVID-19 is not confirmed to date. Coronavirus illustration created at CDC.


Embeddings for sentence fragments harvested from a document can serve as extractive summary facets of that document and potentially accelerate its discovery, particularly when the user input is a sentence fragment. These fragment embeddings not only yield better quality results than traditional text matching systems, but also circumvent a problem inherent in modern search approaches driven by distributed representations: the challenge of creating effective document embeddings, i.e. a single embedding at the document level that captures all facets of a document and enables its discovery through search.

Examples of sentence fragments are “bats as a source of coronavirus” and “coronavirus in pangolins”: short sequences with one or more noun phrases connected by prepositions, adjectives, etc. These connective terms, which are largely ignored by traditional search systems, can play a key role in capturing user intent (e.g. “coronavirus in bats” expresses a different search intent from “bats as a source of coronavirus” or “coronavirus not present in bats”). Sentence fragments that preserve them can also be valuable index candidates, serving as extractive summary facets (sub-summaries) of a document. By embedding these sentence fragments into an appropriate embedding space (e.g. BERT), we can use the search input fragment as a probe into that embedding space for discovering relevant documents.
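
The probing step can be sketched as a nearest-neighbor lookup by cosine similarity. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for real BERT fragment embeddings, and the names cosine_similarity, nearest_fragments and FRAGMENT_VECTORS are hypothetical rather than taken from the implementation behind this article.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumption: toy 3-d vectors standing in for BERT fragment embeddings.
FRAGMENT_VECTORS = {
    "bats harbor coronavirus": [0.9, 0.1, 0.0],
    "coronaviruses arising from bats": [0.8, 0.2, 0.1],
    "coronavirus in pangolins": [0.1, 0.9, 0.1],
}

def nearest_fragments(query_vector, fragment_vectors, top_k=2):
    """Return the top_k (fragment, similarity) pairs closest to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), fragment)
        for fragment, vector in fragment_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(fragment, score) for score, fragment in scored[:top_k]]

# Assumption: stand-in embedding for the input "bats as a source of coronavirus".
query = [1.0, 0.1, 0.0]
for fragment, score in nearest_fragments(query, FRAGMENT_VECTORS):
    print(f"{score:.3f}  {fragment}")
```

With these toy vectors, the two bat-related fragments rank above the pangolin fragment. A real system would replace the exhaustive scan with an approximate nearest-neighbor index.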

The need to improve search using sentence fragments

Finding a comprehensive answer, supported by literature evidence, to questions such as “What are the animal sources of COVID-19?” or “receptors coronavirus binds to” is challenging even on a tiny data set like the recently released COVID-19 data set (~500 MB corpus size, ~13k documents, 85+ million words, about a million unique words in normalized text).

Traditional document search methods are quite effective for the typical use case where the answer is obtainable from a few documents by using one or more noun phrases in search. Traditional document search methods also satisfy the following user experience constraint for words and phrases:

“what we see (results) is what we typed (searched for)”

For instance, when we search for words and phrases (contiguous word sequences like New York or Rio De Janeiro), the results typically contain the terms we entered or their synonyms (e.g. a search for COVID-19 yields results with SARS-CoV-2, novel coronavirus, etc.).

However, the quality of results tends to degrade as the number of words in the search input increases, particularly when connective terms are used between noun phrases. This degradation is evident even in the terms these search engines highlight in their results.

For instance, in the figure below, current search engines selectively highlight mostly the nouns in “bats as a source of coronavirus”. The highlights are scattered within and across sentences and paragraphs in the displayed extractive summary of a document, at times not even honoring the order of those words in the input sequence. Even though document relevance ordering often mitigates this to a large degree, we are still left with the task of examining the extractive summary of each document, often navigating into a document only to return to the main results set because the document did not satisfy our search intent.

Figure 2. Current search engines, including literature evidence search engines like Google Scholar, are not as effective for fragment searches as they are for word and phrase searches.

The document search approach described in this article may reduce the cognitive overload present in search systems, in addition to yielding more relevant results, particularly when searching for sentence fragments. As an illustration, the same query we used in existing search systems above can yield results of the form shown below (the interface is a schematic intended purely to show the search approach). A key point worth noting in the schematic below is that the extractive summary facets are actual matches in documents, as opposed to the suggested or related search queries displayed in traditional search systems. The numbers in parentheses are the number of documents containing a fragment and the cosine distance between that fragment and the input search fragment. These summary facets give a panoramic view of the results space, reducing futile document navigation and accelerating convergence on documents of interest.

Figure 3. Schematic of fragment search. An extractive summary of fragments matching user input helps reduce the time to find documents of interest

The input fragment can be a full or partial sentence, with no constraint on its composition or style. For instance, it can be interrogative as opposed to the affirmative query above: we can find the protein receptors that coronavirus binds to by searching for “receptors coronavirus binds to”.

Figure 4. Another example schematic of fragment search. Input could be any sequence of words that is a full or partial sentence and could be of any nature (affirmative, interrogative etc.)

The comparison between search systems above is only meant to illustrate the differences in the underlying approach of document discovery. It would be an unfair comparison otherwise given the orders of magnitude difference in corpus sizes — we are bound to get more relevant results in a tiny corpus.

Role of embeddings in document search

Distributed representations have become integral to any form of search, given their advantages over traditional, purely symbolic search approaches. Modern search systems increasingly leverage their properties to complement symbolic search methods. If we consider document search broadly as a combination of breadth-first and depth-first traversal of the document space, these two forms of traversal require embeddings with characteristics specific to each. For instance, we might start off broadly with a search for animals causing coronavirus, then drill down into bats, then broaden again to reptiles, etc.

Distributed representations of document facets (words, phrases, or sentence fragments), drawn from the embedding spaces of Word2vec and BERT, have unique and complementary attributes that are useful for performing broad and deep searches. Specifically,

  • Word2vec embeddings for terms (terms refer to words and phrases, e.g. bats, civet cats) are effective for breadth-first search, with an entity-based clustering applied to the results. A search for the single word “bats” or the phrase “civet cats” would yield other animals like pangolins, camels, etc.
  • BERT embeddings for sentence fragments (“coronavirus in pangolins”, “bats as source of coronavirus”, etc.) are useful for finding fragment variants. For instance, “bats as a source of coronavirus” would yield fragment variants such as “bats harbor coronavirus”, “coronaviruses arising from bats”, etc.
  • These embeddings, while largely complementary, have overlapping properties too: Word2vec embeddings can yield depth-first results, and BERT embeddings yield breadth-first results within the distribution tail of statistically significant results. For instance, a search for bats using Word2vec embeddings would also yield bat species (e.g. fruit bats, rousettus, flying fox, pteropus) in addition to other animals like camels, pangolins, etc. A fragment search for “coronavirus in peacock” using BERT yields “coronavirus disease of cats” and “coronavirus in cheetahs”, even though the results are predominantly about coronavirus in avian species.
  • The BERT model allows the search input (terms or fragments) to be out of vocabulary, so any user input can serve as a candidate for finding relevant documents.
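
The article does not spell out how out-of-vocabulary input is handled. One contributing mechanism is BERT's WordPiece tokenization, which splits an unseen word into known subword pieces using greedy longest-match-first lookup. The sketch below illustrates that lookup for a single word; the tiny vocabulary TOY_VOCAB and the function name wordpiece_tokenize are assumptions for illustration, and a real BERT vocabulary has tens of thousands of entries.

```python
# Assumption: a toy vocabulary. "##" marks a piece that continues a word.
TOY_VOCAB = {"pan", "##go", "##lin", "##s", "bat"}

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece_tokenize("pangolins", TOY_VOCAB))
print(wordpiece_tokenize("bats", TOY_VOCAB))
```

Because every piece has its own embedding, a word that never appeared whole in the vocabulary can still be mapped into the embedding space. Real tokenizers also normalize the text and split on whitespace and punctuation before this step.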

How does this approach work?

Figure 5. Offline and real-time processes/flows involved in fragment search.

The user input is expanded into related terms/fragments using Word2vec/BERT embeddings, and these expanded terms/fragments are then exact-matched against documents that were indexed offline using the same terms/fragments. Offline, fragments are harvested from the corpus using a combination of a part-of-speech tagger and a chunker, and embeddings are created for them using both models, Word2vec and BERT.
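
The offline harvesting step can be sketched as chunking over part-of-speech tags. The sketch assumes the tokens have already been tagged with Penn Treebank tags by some tagger, and it uses a hypothetical chunk pattern (noun phrases joined by prepositions). The article's actual tagger, chunker and grammar are not specified, so TAG_CLASS, FRAGMENT and harvest_fragments are illustrative names only.

```python
import re

# Assumption: collapse Penn Treebank tags into one-letter classes.
# N = noun, J = adjective, D = determiner, P = preposition; all else becomes O.
TAG_CLASS = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "JJ": "J", "DT": "D", "IN": "P",
}

# Hypothetical chunk grammar: a noun phrase is an optional determiner,
# any adjectives, then one or more nouns. A fragment is a noun phrase
# followed by one or more (preposition + noun phrase) groups.
NOUN_PHRASE = r"D?J*N+"
FRAGMENT = re.compile(rf"{NOUN_PHRASE}(?:P{NOUN_PHRASE})+")

def harvest_fragments(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs. Returns fragment strings."""
    # One character per token, so regex offsets map directly to token indices.
    classes = "".join(TAG_CLASS.get(tag, "O") for _, tag in tagged_tokens)
    return [
        " ".join(word for word, _ in tagged_tokens[m.start():m.end()])
        for m in FRAGMENT.finditer(classes)
    ]

tagged = [
    ("Recent", "JJ"), ("studies", "NNS"), ("identified", "VBD"),
    ("bats", "NNS"), ("as", "IN"), ("a", "DT"), ("source", "NN"),
    ("of", "IN"), ("coronavirus", "NN"), (".", "."),
]
print(harvest_fragments(tagged))
```

The connective terms (“as”, “a”, “of”) are kept inside the harvested fragment, which is the property that makes fragments useful as index entries.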

  • The intermediate step of mapping user input to term and fragment embeddings, which then serve to find documents, not only increases the breadth and depth of search, but also circumvents the problem of creating quality document embeddings that match user input. Specifically, fragments play a dual role: they serve as indexes for a document, and they give a single document multiple searchable “extractive sub-summaries”, characterized by the fragments embedded in it. Fragments also increase the chances of finding the “needle in a haystack” document (a single document in a corpus that holds the answer to a question) compared to finding such a document purely using terms or phrases.
  • Using embeddings purely for the intermediate step of finding term/fragment candidates, and leveraging traditional search indexing methods to find the documents matching those terms/fragments, enables us to perform document search at scale.
  • Lastly, while answers to broad questions like “What are the animal sources of COVID-19?” are found automatically and offline, given the large scope and processing time of such a task, the fragment-embeddings-driven search approach described here is applicable to “not so broad” live search use cases like the question “receptors coronavirus binds to”, given sufficient compute resources and efficient hashing approaches to perform embedding space search at scale.
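
The split between embedding-based expansion and traditional indexing can be sketched as follows. The expansion step is assumed to have already produced neighbor fragments with cosine distances; the document IDs, distances and function names below are made up for illustration and do not come from the article's implementation.

```python
from collections import defaultdict

def build_fragment_index(doc_fragments):
    """Offline step: map each harvested fragment to the documents containing it."""
    index = defaultdict(set)
    for doc_id, fragments in doc_fragments.items():
        for fragment in fragments:
            index[fragment].add(doc_id)
    return index

def search(expanded_fragments, index):
    """Real-time step: exact-match expanded fragments against the index.

    expanded_fragments: (fragment, cosine_distance) pairs from the
    embedding neighborhood of the user input.
    Returns facets as (fragment, document count, distance, document ids),
    closest fragments first.
    """
    facets = []
    for fragment, distance in expanded_fragments:
        doc_ids = index.get(fragment, set())
        if doc_ids:
            facets.append((fragment, len(doc_ids), distance, sorted(doc_ids)))
    facets.sort(key=lambda facet: facet[2])
    return facets

# Assumption: toy corpus with made-up document ids.
DOC_FRAGMENTS = {
    "doc1": ["bats harbor coronavirus", "coronavirus in pangolins"],
    "doc2": ["bats harbor coronavirus"],
    "doc3": ["coronaviruses arising from bats"],
}
# Assumption: made-up neighbors and distances for the user input.
EXPANDED = [
    ("coronaviruses arising from bats", 0.12),
    ("bats harbor coronavirus", 0.08),
    ("bats as reservoirs", 0.20),  # not in the index, so it is dropped
]

index = build_fragment_index(DOC_FRAGMENTS)
for fragment, count, distance, doc_ids in search(EXPANDED, index):
    print(f"{fragment} ({count}, {distance:.2f}) -> {doc_ids}")
```

Each returned facet carries the two numbers shown in the search schematics: the count of documents containing the fragment and its cosine distance from the input. The document lookup itself is a plain dictionary access, which is why this part scales like a traditional inverted index.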

Limitation of the current approach

As mentioned earlier, Word2vec embeddings expand the breadth of search for words and phrases. They do not expand the breadth of search for fragments: the histogram of a fragment's neighborhood quite often lacks a distinct tail (figure 8 below). This is because fragments, by virtue of their length, do not have enough neighboring context to learn quality embeddings. This deficiency could be addressed in part by expanding the training window size and by increasing the surrounding context through ignoring sentence boundaries, but this is still inadequate in practice, given the low occurrence counts of fragments (figure 8).

BERT embeddings largely increase only the depth of search, particularly for fragments and phrases (expanding the depth of search for words using BERT embeddings is not useful in practice). They do increase the breadth to some degree; for instance, the query “coronavirus in macaques” broadens to include “coronavirus in palm civets” within the distribution tail of significant results. However, this breadth is not as much as what Word2vec offers for words and phrases. The caption of Figure 6 below illustrates where it is deficient. The implementation notes have additional examples of the lack of breadth in fragment search and ways to circumvent this limitation.

Figure 6. Illustration of the embedding types used based on search input length (word sequences of length less than 5 are terms; ≥5 are fragments) and on the depth/breadth of search. Word2vec embeddings are effective for both depth and breadth (they offer more breadth than depth), whereas BERT embeddings are more effective for depth search using fragments. The BERT neighborhood of a fragment has results that offer breadth too, but not as much as Word2vec. We typically see more breadth in the BERT neighborhood (within the tail of significant results) when the input fragment does not have many matching depth results. For instance, “bats as a source of coronavirus” has sentence fragments almost exclusively about bats. However, a search for “coronavirus in peacocks” would start off broadly with coronavirus in avian species and have breadth results like “coronavirus in cheetah” sprinkled in. In practice, this may be perceived as an advantage rather than a disadvantage, particularly when the user types a fragment to find specific answers. It is, however, inadequate when the user types a fragment for a broad search such as “animals with coronavirus”. Such a question cannot be addressed by a single search; it is inherently a multistep process, as described in Figure 1.

Final thoughts/Summary

Word2vec was perhaps the first model that clearly established the power of distributed representations, about seven years ago. The embeddings output by this simple model, with essentially two arrays of vectors for its “architecture”, are still of immense value for downstream applications such as the document search method described above.

Word2vec, in concert with BERT embeddings, offers a solution to document search that potentially improves upon traditional approaches in the quality of results and the time to converge on them (this claim needs to be quantified). The solution circumvents the problem of learning all the important aspects of a document in a single vector representation that a search system could then use not only to pick a particular document but also to find documents similar to the picked one.

The circumvention is made possible by using embeddings of words, phrases, or sentence fragments to broaden or deepen the search before documents are picked. Word2vec embeddings for words and phrases largely increase the breadth of document search. BERT embeddings for sentence fragments largely increase the depth of search. BERT embeddings also eliminate the out-of-vocabulary scenario and facilitate searchable extractive summaries of the different salient facets of a document, which accelerate convergence on relevant documents.