Our previous post focused on using content understanding to make indexed content more findable. This post will explore ways to improve search through query understanding, the process by which a search engine transforms a search query to represent the searcher’s intent.
If you’re curious about how to apply the fundamentals of content understanding and query understanding to improve search applications, check out our upcoming Search with Machine Learning course, where we go into these concepts in depth and build real-world projects that apply them.
Figuring Out What You Want
The most direct way to increase query understanding is query classification. Query classification looks at the query as a whole and attempts to classify it into a category. We can think of it as analogous to content classification but for queries. For example, the query midterm elections probably indicates an interest in politics.
Query classification can be used to boost results that match the query category, or even to filter out results from other categories. That can be useful when the words in a query can appear in multiple contexts, e.g. a query for black dress probably shouldn’t return dress shoes.
As with content classification, you can implement query classification using rules or machine learning.
A rule-based approach to query classification can use a manually maintained set of strings or regular expressions. As we saw with content classification, this approach starts off simple when you have a small set of rules but breaks down if you try to scale it.
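A rule-based classifier can be sketched as a dictionary of regular expressions. The categories and patterns below are illustrative assumptions, not a real taxonomy:

```python
import re

# Hypothetical category rules: each category maps to regex patterns
# matched against the lowercased query (categories and patterns are
# made up for illustration).
CATEGORY_RULES = {
    "politics": [r"\belections?\b", r"\bsenate\b"],
    # Negative lookahead keeps "dress shoes" out of the clothing category.
    "womens_clothing": [r"\bdress(es)?\b(?!.*\bshoes\b)"],
}

def classify_query(query):
    """Return the first category whose pattern matches, else None."""
    for category, patterns in CATEGORY_RULES.items():
        if any(re.search(p, query.lower()) for p in patterns):
            return category
    return None
```

Even this toy example hints at the scaling problem: every new category or edge case (like the dress-shoes lookahead) adds another pattern someone has to maintain.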
Machine learning tends to work especially well for query classification. Even if you don’t have resources to manually label lots of queries, you might be able to derive labeled training data from searcher behavior. This means that if a searcher performs a query and then engages with a document or product, you can infer that the query intent maps to that document’s category. This approach assumes that the content is already classified.
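Deriving labels from behavior amounts to aggregating engagement events per query and keeping only confident majorities. This is a minimal sketch; the event tuples and the `min_share` threshold are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Each event pairs a query with the category of the document the
# searcher engaged with (assumes content is already classified).
events = [
    ("midterm elections", "politics"),
    ("midterm elections", "politics"),
    ("midterm elections", "history"),
    ("black dress", "womens_clothing"),
]

def derive_labels(events, min_share=0.6):
    """Label a query with a category only if that category accounts
    for at least min_share of the query's engagements."""
    by_query = defaultdict(Counter)
    for query, category in events:
        by_query[query][category] += 1
    labels = {}
    for query, counts in by_query.items():
        category, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= min_share:
            labels[query] = category
    return labels
```

The threshold guards against noisy clicks: a query whose engagements are spread across many categories produces no training label at all.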
Since queries are text, at least for a traditional search interface, they tend to work well with embeddings and neural network models. Again, you can use a pre-trained model, but you’re better off fine-tuning it for your data.
Picking Out the Pieces
Another path to query understanding is query annotation. Analogous to content annotation, query annotation identifies spans within a query that represent entities or belong to a controlled vocabulary.
For example, the query nike sneakers can be annotated as Brand:Nike Product_Type:sneakers. As in content annotation, query annotation may not be able to assign labels to all of the query tokens; the unlabeled tokens are typically labeled as Unknown. But since queries are short, they tend to have a higher concentration of entities than longer text documents.
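A simple annotator can look up each token in a controlled vocabulary and fall back to Unknown. The vocabulary entries here are illustrative assumptions:

```python
# Hypothetical controlled vocabulary mapping tokens to entity types
# (entries are made up for illustration).
VOCAB = {
    "nike": "Brand",
    "adidas": "Brand",
    "sneakers": "Product_Type",
    "sandals": "Product_Type",
}

def annotate_query(query):
    """Label each token with its vocabulary entry, or Unknown."""
    return [(tok, VOCAB.get(tok, "Unknown")) for tok in query.lower().split()]
```

A real annotator would also handle multi-token spans and ambiguous entries, which is where the sequence models discussed below come in.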
Recognizing entities within a query can help the search engine improve retrieval or ranking. As with query classification, the search engine can boost or filter results based on whether they contain the entities identified in the query. For example, the search engine should recognize from the query apple phone that Apple is intended as a brand, and thus not return phones that are shaped like apples (yes, they exist!).
Like content annotation, query annotation can use rules or machine learning. But, since queries are short, machine learning tends to work better for query annotation than content annotation. Specifically, models that combine long short-term memory (LSTM) neural networks with conditional random fields (CRF) perform well on entity recognition.
Again, machine learning requires training data. But, as with query classification, it may be possible to derive training data from user behavior, e.g., inferring that someone who searches for nike sneakers and engages with a pair of Nike shoes intended nike as a brand and sneakers as a product type.
Different Question, Same Intent
Finally, just as documents can be similar to one another, so can queries. In fact, it’s often the case that multiple queries can represent equivalent or near-equivalent intent, e.g., mens shoes and shoes for men. Recognizing query similarity can help ensure that you return the same results for searchers regardless of how they express their intent.
There are two general strategies for recognizing query similarity.
The first strategy is to identify surface similarity. Queries that only differ in stemming (e.g., men, mens), word order, or the inclusion of stop words (like for) likely express the same intent. Though watch out for false positives: a shirt dress is not a dress shirt!
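Surface similarity can be sketched as a normalization function: lowercase the query, drop stop words, crudely strip plural endings, and sort the tokens so word order is ignored. Ignoring order is exactly the tradeoff the warning above describes, as the test on shirt dress shows; the stop-word list and stemming rule are illustrative assumptions:

```python
# Illustrative stop-word list; a real one would be larger.
STOP_WORDS = {"a", "an", "for", "the"}

def normalize(query):
    """Lowercase, drop stop words, strip a trailing plural 's',
    and sort tokens so word order is ignored. Note: ignoring order
    makes 'shirt dress' and 'dress shirt' collide (a false positive)."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    stemmed = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return tuple(sorted(stemmed))
```

Whether to include order-insensitivity (or aggressive stemming) is a precision/recall decision that depends on how costly false positives are for your application.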
The second strategy is to recognize similar post-search behavior. Queries that express the same intent tend to be followed by the same behavior, i.e., engagement with the same kinds of results. If the results are represented as vectors using embeddings, then the average vectors of engaged results for two similar queries should have a high cosine similarity, i.e., close to 1.
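The behavioral comparison can be sketched in a few lines: average the embedding vectors of each query's engaged results and compare the averages with cosine similarity. The 3-dimensional vectors below are toy data standing in for real embeddings:

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings of the results engaged after each query
# (queries and vectors are illustrative).
engaged = {
    "mens shoes": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "shoes for men": [[0.85, 0.15, 0.05]],
    "laptops": [[0.0, 0.1, 0.9]],
}

sim = cosine(mean_vector(engaged["mens shoes"]),
             mean_vector(engaged["shoes for men"]))
```

With real embeddings the averages would rarely match this closely, but the principle is the same: queries whose engaged results cluster together in embedding space likely share an intent.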
Measuring query similarity this way is a highly sophisticated application of machine learning to improve search. But it gives you an idea of how powerful machine learning can be as a tool to help understand search intent.
You can combine these approaches to query understanding with the approaches from the previous post for increasing content understanding during indexing. Together, they increase the overall findability of your content. And remember that, while rule-based methods might seem easier at first, machine learning approaches usually scale better and are more maintainable in the long run.
If you’re interested in learning more, check out our upcoming course on Search with Machine Learning.