CoRise Instructor, Search Fundamentals and Search with Machine Learning; Former CTO at Wikimedia

Uplimit instructor, Search Fundamentals and Search with Machine Learning; Machine Learning Consultant

Search is about finding a needle in a haystack. Search engines enable us to find the right information at the right time from a vast collection of documents or data. The foundations of search are indexing, retrieval, and ranking; and all of these can benefit tremendously from machine learning.

Search is all around us. Whenever you see a ranked list or feed of content, there’s a good chance that it’s powered by a search engine determining what content to retrieve and how to order it. If that search engine uses machine learning, it’s probably doing a better job.

In this post, we’ll explore ways to improve search using content understanding. If you want to learn more about this, check out our upcoming course, where we go into these concepts in depth and work on projects that apply them to real-world scenarios.

Indexing is the first step, and thus the foundation, of the search process. Without indexed content, we don’t have search. And without content understanding, we can’t have intelligent indexing and robust search.


In short, content understanding is what makes content findable.

The simplest form of content understanding is creating an inverted index that associates each document with the words or tokens contained in that document. An inverted index is similar to a back-of-the-book index showing which words occur on which pages. In this case, the words map to document ids rather than pages. The words are actually tokens, and the process of tokenizing the document can include steps like stemming, so that different forms of a word (e.g., cat, cats) are normalized to a single canonical representation.
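As a rough illustration of the idea, here is a minimal sketch of an inverted index in Python. The `tokenize` and `build_inverted_index` names, the sample documents, and the toy stemmer (which only strips a trailing "s") are assumptions made for this example; a real search engine would use a proper stemmer and a far more compact index structure.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text, split it into alphanumeric tokens, and stem naively."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Toy stemmer (illustration only): strip a trailing "s" from longer tokens,
    # so that "cats" and "cat" map to the same token.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(doc_ids) for token, doc_ids in index.items()}


docs = {
    1: "The cat sat on the mat",
    2: "Cats chase mice",
    3: "Dogs chase cats",
}
index = build_inverted_index(docs)
print(index["cat"])    # [1, 2, 3]
print(index["chase"])  # [2, 3]
```

Looking up a query token in `index` returns the ids of the matching documents, just as a back-of-the-book index returns page numbers.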

But content understanding should go beyond associating documents with their words. Machine learning can help associate documents with topics and entities. Here are a few techniques to improve on content understanding.

Content classification is a way to obtain a holistic understanding of a document. Content classification maps a piece of content to one or more elements from a predefined set of categories. The set of categories depends on the application; typical examples of categories are product types and topics.

Classifying content makes it easier to find, enabling better retrieval and ranking. The categories can be a flat list, or they can be organized in a hierarchical (tree) taxonomy.

It’s possible to implement content classification using rules, such as regular expressions. For example, an article whose title contains a word ending in “ball” is probably about sports.
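The “ball” rule can be sketched in a few lines of Python. The `RULES` table and the `classify_by_rules` helper are hypothetical names used only for this illustration.

```python
import re

# Hypothetical rule table: each entry pairs a compiled pattern with a category.
RULES = [
    (re.compile(r"\b\w+ball\b", re.IGNORECASE), "Sports"),
]


def classify_by_rules(title, default="Unknown"):
    """Return the category of the first rule whose pattern matches the title."""
    for pattern, category in RULES:
        if pattern.search(title):
            return category
    return default


print(classify_by_rules("Local basketball team wins title"))  # Sports
print(classify_by_rules("Swedish meatball recipes"))          # Sports (a false positive)
print(classify_by_rules("Election results"))                  # Unknown
```

The second example shows how quickly even a simple rule needs exceptions.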


A rule-based approach is conceptually simple, but it can quickly spiral out of control. Creating and maintaining rules requires subject-matter expertise, extreme attention to detail, and continuous monitoring. It can work if the set of rules is small, but a rule-based approach doesn't scale well. As an application ages, the list of rules becomes unwieldy and creates technical debt. Eventually it becomes a multi-headed hydra full of caveats and edge cases, delivering ever-diminishing returns.

The smarter and more scalable approach for content classification is to use machine learning. This approach may require a bit more work up front, but it’s much easier to maintain in the long run.


As with most machine learning approaches, you’ll need to start with a collection of training data. The training data consists of examples of documents with known categories. For example, training data for a product catalog could contain examples like (title: “Apple iPhone 13”, category: “Cell Phones”), (title: “Canon Pixma MG3620”, category: “Printers”), etc. The effectiveness of a classifier depends on the quantity and quality of the training data. But collecting training data can be expensive, especially if you’re requiring people to label the documents. As with most things, there’s a trade-off.
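To make the training-data format concrete, here is a minimal sketch that trains a tiny classifier on (title, category) pairs. It uses a hand-rolled multinomial naive Bayes model with add-one smoothing, chosen only because it fits in a few lines of standard-library Python; it is not a recommendation over other methods. The `train` and `predict` names and the two extra training examples are assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Collect the counts naive Bayes needs from (title, category) pairs."""
    category_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for title, category in examples:
        category_counts[category] += 1
        for token in tokenize(title):
            token_counts[category][token] += 1
            vocabulary.add(token)
    return category_counts, token_counts, vocabulary


def predict(title, model):
    """Return the category with the highest log-probability for the title."""
    category_counts, token_counts, vocabulary = model
    total_docs = sum(category_counts.values())
    best_category, best_score = None, float("-inf")
    for category, doc_count in category_counts.items():
        # Log prior plus log likelihood of each token, with add-one smoothing.
        score = math.log(doc_count / total_docs)
        total_tokens = sum(token_counts[category].values())
        for token in tokenize(title):
            count = token_counts[category][token]
            score += math.log((count + 1) / (total_tokens + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


training_data = [
    ("Apple iPhone 13", "Cell Phones"),
    ("Canon Pixma MG3620", "Printers"),
    ("Samsung Galaxy S22", "Cell Phones"),   # extra example added for illustration
    ("HP OfficeJet Pro 9015", "Printers"),   # extra example added for illustration
]
model = train(training_data)
print(predict("Apple iPhone 14", model))     # Cell Phones
print(predict("Canon Pixma TS3520", model))  # Printers
```

With only four examples the model is just memorizing brand names, which is exactly why the quantity and quality of the training data matter.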

Beyond quantity and quality, it’s important that the training data be representative. For example, if you’re training a classifier for a product catalog in which phones represent 10% of the catalog, then phones should represent 10% of your training data. If the training data contains a significantly larger or smaller fraction of phones, your classifier will be biased. Always watch out for biases in your training data, especially if those biases can lead to real harm affecting people’s lives and livelihoods.
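One simple way to check representativeness is to compare each category’s share of the training data with its share of the catalog. The sketch below assumes you have a category label for every catalog item; the function names, the 5% tolerance, and the sample counts are all hypothetical.

```python
from collections import Counter


def category_shares(labels):
    """Return the fraction of items that fall in each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def representation_gaps(catalog_labels, training_labels, tolerance=0.05):
    """Return categories whose training share is off by more than the tolerance.

    Positive values mean the category is over-represented in the training data.
    """
    catalog = category_shares(catalog_labels)
    training = category_shares(training_labels)
    gaps = {}
    for category in set(catalog) | set(training):
        diff = training.get(category, 0.0) - catalog.get(category, 0.0)
        if abs(diff) > tolerance:
            gaps[category] = round(diff, 3)
    return gaps


# Phones are 10% of the catalog but 40% of the training data.
catalog_labels = ["Cell Phones"] * 10 + ["Printers"] * 20 + ["Laptops"] * 70
training_labels = ["Cell Phones"] * 40 + ["Printers"] * 20 + ["Laptops"] * 40
gaps = representation_gaps(catalog_labels, training_labels)
print(gaps)  # Cell Phones: 0.3, Laptops: -0.3 (key order may vary)
```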

There are lots of ways to implement classification using machine learning. Decision tree methods, such as random forests and gradient-boosted decision trees, work well for categorical, ordinal, or numerical data. For text and images, you probably want to represent the content in a vector space using embeddings. There’s a wealth of pre-trained embeddings already available for text and images. You can use these pre-trained models out of the box, but most applications benefit from fine-tuning using your own data.

Remember that the quantity, quality, and representativeness of your training data are far more critical to your success than the sophistication of your machine learning model. Learn by iterating: it’s better to iterate and learn quickly – even if that means working with less training data – than to try to do everything perfectly in one shot.

The other requirement for robust classification is a good set of categories. Ideally the categories are coherent, distinctive, and exhaustive. A good rule of thumb is that it should be easy for a human to put content in the right category; after all, if a human can’t do it, a machine isn’t likely to do any better. That said, don’t let perfection be the enemy of progress and iteration.

A second form of content understanding is content annotation. While content classification assigns a category to an entire document, content annotation focuses on specific words or phrases within the document. These are also called spans, because they represent spans of consecutive words or tokens.

The common form of content annotation is entity recognition. Entities can be of a particular type (e.g., person names, company names, place names) or they can be untyped (e.g., technical terms). In either case, entities generally comprise a 

As with classification, content annotation can be rule-based. The simplest approach is to match strings against a table of known entities. This approach can be quite effective. For example, if a document contains the span 

, it’s easy to recognize this span and map it to the city. It’s a bit trickier for Phoenix, which could refer to the city or to the mythical beast. And there are at least 40 different US cities named
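Matching strings against a table of known entities can be sketched as follows. The `KNOWN_ENTITIES` table, its entries, and the `annotate` helper are hypothetical, and “New York” is our own example of an unambiguous span. The sketch only surfaces ambiguity by returning every candidate meaning; it does not attempt to resolve it.

```python
# Hypothetical table of known entities: lowercase surface form -> candidate meanings.
KNOWN_ENTITIES = {
    "new york": ["city"],
    "phoenix": ["city", "mythical beast"],
}
MAX_SPAN_TOKENS = 2  # longest entity in the table, measured in tokens


def annotate(text):
    """Return (start, end, span, candidates) for each known span in the text."""
    tokens = text.split()
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so "New York" beats "New".
        for length in range(min(MAX_SPAN_TOKENS, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + length]).strip(".,!?")
            candidates = KNOWN_ENTITIES.get(span.lower())
            if candidates:
                annotations.append((i, i + length, span, candidates))
                i += length
                break
        else:
            i += 1  # no known entity starts at this token
    return annotations


for annotation in annotate("We flew from New York to Phoenix."):
    print(annotation)
# (3, 5, 'New York', ['city'])
# (6, 7, 'Phoenix', ['city', 'mythical beast'])
```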

A more sophisticated rule-based approach for content annotation is to use regular expressions. For example, you could use this regex to recognize a span as a US phone number.


This expression recognizes phone numbers of the form of  

. But it won’t recognize a simple variation like 

. Trying to design a regular expression to catch every possible way to express a US phone number yields a monstrosity like this one:

(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
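To see how a phone-number pattern grows as it tries to cover more variations, here is a small Python sketch. Both patterns and the sample numbers are our own illustrations, not the expressions discussed above.

```python
import re

# A deliberately simple pattern: three digits, hyphen, three digits, hyphen, four digits.
SIMPLE_PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

# A slightly more tolerant pattern: optional parentheses around the area code,
# and an optional space, dot, or hyphen between the groups.
TOLERANT_PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

candidates = ["415-555-0132", "(415) 555-0132", "415.555.0132"]
for text in candidates:
    simple = bool(SIMPLE_PHONE.fullmatch(text))
    tolerant = bool(TOLERANT_PHONE.fullmatch(text))
    print(f"{text!r}: simple={simple}, tolerant={tolerant}")
# '415-555-0132': simple=True, tolerant=True
# '(415) 555-0132': simple=False, tolerant=True
# '415.555.0132': simple=False, tolerant=True
```

Even the tolerant version ignores country codes and extensions, and it accepts malformed input such as an unbalanced parenthesis; closing each of those gaps makes the expression longer and harder to maintain.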

As with classification, you can use machine learning for content annotation. But machine-learned annotation is a bit trickier in practice than machine-learned classification since the annotation model has to make decisions about every token. For example, an annotation of the sentence 

 as the beginning of a place name and the token 

 as the continuation of a place name, while classifying all of the other tokens as unknowns. That’s a lot harder, and creates a lot more room for error, than content classification.
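The per-token decisions are often written using the BIO convention: B for the beginning of an entity, I for its continuation (“inside”), and O for every other token. The sketch below only shows what the target labels look like for a hypothetical sentence; it is not a trained model, and the `bio_labels` helper and the `PLACE` type name are assumptions made for this example.

```python
def bio_labels(tokens, spans):
    """Convert entity spans into one BIO label per token.

    spans is a list of (start, end, entity_type) tuples with end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"
    return labels


tokens = ["I", "live", "in", "New", "York"]
spans = [(3, 5, "PLACE")]  # "New York" covers tokens 3 and 4
for token, label in zip(tokens, bio_labels(tokens, spans)):
    print(token, label)
# I O
# live O
# in O
# New B-PLACE
# York I-PLACE
```

A machine-learned annotator has to predict every one of these labels correctly, which is why a single wrong token can break an entire span.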

That said, there is a long history of using machine learning for entity recognition. Traditionally, people used hidden Markov models (HMM) and conditional random fields (CRF); a more modern approach would use neural networks, such as a long short-term memory (LSTM) network or a sequence-to-sequence (seq2seq) model.

As with all machine learning, you’re unlikely to find a pre-trained model that “just works”. Your success will depend on the quantity and quality of the training data you use to fine-tune a pre-trained model or build one from scratch. You’ll need to invest time, effort, and money to create a properly labeled set of training data.

Content classification and annotation offer two approaches for content understanding: the first determines what a document is about, and the second determines what entities the document mentions. A third approach for content understanding is to focus on content similarity. Content similarity is especially useful for recommendation systems that show results related to a particular document.

A document is more than a category and a bag of entities. In order to measure the similarity between two documents, we need to represent the documents in a geometric space that allows us to perform such measurements mathematically.

A simple way to represent a document is as a bag of words or tokens. A bag of words translates naturally to a vector in a space where every possible word gets its own dimension. The vector for a document has a 1 for each word contained in the document and a 0 for every other dimension. Using this simple representation, we can measure the similarity between two documents by computing how many words the two documents have in common, and then normalizing this number based on the document lengths. This process gives us a similarity measure between 0 (for documents with nothing in common) and 1 (for two identical documents).
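Here is a minimal sketch of that computation, assuming the normalization is the cosine of the angle between the two binary vectors. For 0/1 vectors this reduces to the number of shared words divided by the square root of the product of the two vocabulary sizes. The function names and sample sentences are our own.

```python
import math


def bag_of_words(text):
    """Represent a document as the set of its lowercase tokens (a binary vector)."""
    return set(text.lower().split())


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the binary bag-of-words vectors of two documents."""
    a, b = bag_of_words(doc_a), bag_of_words(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


print(cosine_similarity("the cat sat", "the cat sat"))                 # 1.0
print(cosine_similarity("the cat sat", "a dog barked"))                # 0.0
print(round(cosine_similarity("the cat sat", "the cat ran away"), 3))  # 0.577
```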

We can improve on this approach by upgrading bags of words to a more intelligent vector representation. We can use stemming to normalize different forms of a word (e.g., cat, cats) to a single dimension. We can introduce tf-idf weights, to give more importance to a word that is repeated within a document, and to give less importance to words that occur in lots of documents.
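As a sketch of how such weights can be computed, the snippet below uses one common tf-idf variant: the raw count of a token in the document multiplied by the log of (number of documents / number of documents containing the token). Many other variants exist, and the `tf_idf_vectors` name and sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {token: weight} dict per document, given lists of tokens."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in term_freq.items()
        })
    return vectors


docs = [
    "the cat chased the cat".split(),
    "the dog slept".split(),
    "the bird sang".split(),
]
vectors = tf_idf_vectors(docs)
for token, weight in sorted(vectors[0].items()):
    print(token, round(weight, 3))
# cat 2.197
# chased 1.099
# the 0.0
```

The repeated word “cat” gets the highest weight, while “the”, which occurs in every document, gets a weight of zero.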

But the best way to compute content similarity is to use word embeddings that reduce the space of words to a semantic vector space, typically with several hundred dimensions. These dimensions aren’t directly interpretable, but they’re a very effective and efficient way to represent documents. There are lots of pretrained models, like BERT, that you can use to produce embeddings, so you don’t need to produce embeddings from scratch. But you’ll probably want to fine-tune a model using your own data to make it more robust.

Better Understanding, Better Applications

By using the approaches above to improve content understanding at index time, you’ll make it easier for your users to find the content they’re looking for. Rule-based methods might seem easier, but machine learning approaches are usually better in the long run.

In our next post, we’ll look at the other side of the coin, which is query understanding.

If you’re interested in learning more, check out our upcoming course on