Our previous post focused on using content understanding to make indexed content more findable. This post will explore ways to improve search through query understanding, the process by which a search engine transforms a search query to represent the searcher’s intent.
If you’re curious about how to apply the fundamentals of content understanding and query understanding to improve search applications, check out our upcoming Search with Machine Learning course, where we go into these concepts in depth and build real-world projects that apply them.
Figuring Out What You Want
The most direct way to increase query understanding is query classification. Query classification looks at the query as a whole and attempts to classify it into a category. We can think of it as analogous to content classification but for queries. For example, the query midterm elections probably indicates an interest in politics.
Query classification can be used to boost results that match the query category, or even to filter out results from other categories. That can be useful when the words in a query can appear in multiple contexts, e.g. a query for black dress probably shouldn’t return dress shoes.
As with content classification, you can implement query classification using rules or machine learning.
A rule-based approach to query classification can use a manually maintained set of strings or regular expressions. As we saw with content classification, this approach starts off simple when you have a small set of rules but breaks down if you try to scale it.
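A rule-based classifier can be sketched as a dictionary of regular expressions. The categories and patterns below are illustrative assumptions, not a real taxonomy:

```python
import re

# Hypothetical category rules: each category maps to regex patterns
# matched against the lowercased query (categories and patterns are
# made up for illustration).
CATEGORY_RULES = {
    "politics": [r"\belections?\b", r"\bsenate\b"],
    # Negative lookahead keeps "dress shoes" out of the clothing category.
    "womens_clothing": [r"\bdress(es)?\b(?!.*\bshoes\b)"],
}

def classify_query(query):
    """Return the first category whose pattern matches, else None."""
    for category, patterns in CATEGORY_RULES.items():
        if any(re.search(p, query.lower()) for p in patterns):
            return category
    return None
```

Even this toy example hints at the scaling problem: every new category or edge case (like the dress-shoes lookahead) adds another pattern someone has to maintain.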
Machine learning tends to work especially well for query classification. Even if you don’t have resources to manually label lots of queries, you might be able to derive labeled training data from searcher behavior. This means that if a searcher performs a query and then engages with a document or product, you can infer that the query intent maps to that document’s category. This approach assumes that the content is already classified.
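Deriving labels from behavior amounts to aggregating engagement events per query and keeping only confident majorities. This is a minimal sketch; the event tuples and the `min_share` threshold are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Each event pairs a query with the category of the document the
# searcher engaged with (assumes content is already classified).
events = [
    ("midterm elections", "politics"),
    ("midterm elections", "politics"),
    ("midterm elections", "history"),
    ("black dress", "womens_clothing"),
]

def derive_labels(events, min_share=0.6):
    """Label a query with a category only if that category accounts
    for at least min_share of the query's engagements."""
    by_query = defaultdict(Counter)
    for query, category in events:
        by_query[query][category] += 1
    labels = {}
    for query, counts in by_query.items():
        category, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= min_share:
            labels[query] = category
    return labels
```

The threshold guards against noisy clicks: a query whose engagements are spread across many categories produces no training label at all.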
Since queries are text, at least for a traditional search interface, they tend to work well with embeddings and neural network models. Again, you can use a pre-trained model, but you’re better off fine-tuning it for your data.
Picking Out the Pieces
Another path to query understanding is query annotation. Analogous to content annotation, query annotation identifies spans within a query that represent entities or belong to a controlled vocabulary.
For example, the query nike sneakers can be annotated as Brand:Nike Product_Type:sneakers. As in content annotation, query annotation may not be able to assign labels to all of the query tokens; the unlabeled tokens are typically labeled as Unknown. But since queries are short, they tend to have a higher concentration of entities than longer text documents.
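A simple annotator can look up each token in a controlled vocabulary and fall back to Unknown. The vocabulary entries here are illustrative assumptions:

```python
# Hypothetical controlled vocabulary mapping tokens to entity types
# (entries are made up for illustration).
VOCAB = {
    "nike": "Brand",
    "adidas": "Brand",
    "sneakers": "Product_Type",
    "sandals": "Product_Type",
}

def annotate_query(query):
    """Label each token with its vocabulary entry, or Unknown."""
    return [(tok, VOCAB.get(tok, "Unknown")) for tok in query.lower().split()]
```

A real annotator would also handle multi-token spans and ambiguous entries, which is where the sequence models discussed below come in.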
Recognizing entities within a query can help the search engine improve retrieval or ranking. As with query classification, the search engine can boost or filter results based on whether they contain the entities identified in the query. For example, the search engine should recognize from the query apple phone that Apple is intended as a brand, and thus not return phones that are shaped like apples (yes, they exist!).
Like content annotation, query annotation can use rules or machine learning. But, since queries are short, machine learning tends to work better for query annotation than content annotation. Specifically, models that combine long short-term memory (LSTM) neural networks with conditional random fields (CRF) perform well on entity recognition.
Again, machine learning requires training data. But, as with query classification, it may be possible to derive training data from user behavior, e.g., inferring that someone who searches for nike sneakers and engages with a pair of Nike shoes intended nike as a brand and sneakers as a product type.
Different Question, Same Intent
Finally, just as documents can be similar to one another, so can queries. In fact, it’s often the case that multiple queries can represent equivalent or near-equivalent intent, e.g., mens shoes and shoes for men. Recognizing query similarity can help ensure that you return the same results for searchers regardless of how they express their intent.
There are two general strategies for recognizing query similarity.
The first strategy is to identify surface similarity. Queries that only differ in stemming (e.g., men, mens), word order, or the inclusion of stop words (like for) likely express the same intent. Though watch out for false positives: a shirt dress is not a dress shirt!
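Surface similarity can be sketched as a normalization function: lowercase the query, drop stop words, crudely strip plural endings, and sort the tokens so word order is ignored. Ignoring order is exactly the tradeoff the warning above describes, as the test on shirt dress shows; the stop-word list and stemming rule are illustrative assumptions:

```python
# Illustrative stop-word list; a real one would be larger.
STOP_WORDS = {"a", "an", "for", "the"}

def normalize(query):
    """Lowercase, drop stop words, strip a trailing plural 's',
    and sort tokens so word order is ignored. Note: ignoring order
    makes 'shirt dress' and 'dress shirt' collide (a false positive)."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    stemmed = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return tuple(sorted(stemmed))
```

Whether to include order-insensitivity (or aggressive stemming) is a precision/recall decision that depends on how costly false positives are for your application.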
The second strategy is to recognize similar post-search behavior. Queries that express the same intent tend to be followed by the same behavior, i.e., engagement with the same kinds of results. If the results are represented as vectors using embeddings, then the average vectors of engaged results for two similar queries should have a high cosine similarity, i.e., close to 1.
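The behavioral comparison can be sketched in a few lines: average the embedding vectors of each query's engaged results and compare the averages with cosine similarity. The 3-dimensional vectors below are toy data standing in for real embeddings:

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings of the results engaged after each query
# (queries and vectors are illustrative).
engaged = {
    "mens shoes": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "shoes for men": [[0.85, 0.15, 0.05]],
    "laptops": [[0.0, 0.1, 0.9]],
}

sim = cosine(mean_vector(engaged["mens shoes"]),
             mean_vector(engaged["shoes for men"]))
```

With real embeddings the averages would rarely match this closely, but the principle is the same: queries whose engaged results cluster together in embedding space likely share an intent.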
Measuring query similarity this way is a highly sophisticated application of machine learning to improve search. But it gives you an idea of how powerful machine learning can be as a tool to help understand search intent.
You can combine these approaches to query understanding with the approaches from the previous post for increasing content understanding during indexing. Together, they increase the overall findability of your content. And remember that, while rule-based methods might seem easier at first, machine learning approaches usually scale better and are more maintainable in the long run.
If you’re interested in learning more, check out our upcoming course on Search with Machine Learning.