Vector search and vector databases were all the rage in 2023, but search techniques have receded into the background a bit as we’ve been bombarded with model releases over the last year. For better or for worse, however, search is still at the center of almost every B2B AI application, so we wanted to take some time to share what we’ve learned about search for AI applications while building RunLLM over the last couple of years.
The impression we get is that being called “just” a RAG application is something of an insult (similar to being called a GPT wrapper 18 months ago). If “all” you’re doing is search, then how hard could it really be to replicate? Pretty hard, it turns out!
Don’t underestimate the complexity of search
As we alluded to above, RAG has become so common that there’s a sense that search-and-generate is a boring/uninteresting application. In a narrow sense, that is true — if all you’re doing is generating an embedding, doing similarity search, and piping documents into GPT-4, you’re not building anything very defensible (and the results probably aren’t great either!).
Doing search properly, however, requires a lot more than just a vector database. A good search process requires a mix of a few different things: an understanding of the data you’re ingesting, an understanding of how that data is going to be used, and an understanding of when different kinds of data are going to be most useful.
In practice, we’ve seen examples of (and built at RunLLM, of course) search pipelines that are significantly more complex than this. The core of the search pipeline we use at RunLLM today was the result of 5-7 engineer-years of work between the fall of 2023 and the summer of 2024. The focus of that work wasn’t just what happens when a query comes in — it was also doing the work to clean diverse forms of text data, to build a good set of indices with all the relevant data and metadata we needed to search effectively, and to implement small optimizations (e.g., detecting and extracting code examples) that helped us incrementally improve the quality of results.
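To make the ingestion side a little more concrete, here’s a minimal sketch of one of those small optimizations: pulling fenced code blocks out of a Markdown page so they can be indexed as their own chunks with metadata, rather than getting buried inside prose embeddings. The Chunk schema and field names are illustrative assumptions, not our actual pipeline.

```python
import re
from dataclasses import dataclass, field

# Matches fenced code blocks (```lang ... ```) in Markdown sources.
FENCE = re.compile(r"```(\w*)\n(.*?)\n```", re.DOTALL)

@dataclass
class Chunk:
    text: str
    source_url: str
    kind: str                                    # "prose" or "code"
    metadata: dict = field(default_factory=dict)

def ingest_markdown(page: str, source_url: str) -> list[Chunk]:
    """Split one Markdown page into separately indexable prose and code chunks."""
    chunks = [
        Chunk(text=body, source_url=source_url, kind="code",
              metadata={"language": lang or "text"})
        for lang, body in FENCE.findall(page)
    ]
    prose = FENCE.sub("", page).strip()
    if prose:
        chunks.append(Chunk(text=prose, source_url=source_url, kind="prose"))
    return chunks
```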
No one traditional technique is good enough
Part of the complexity we described above comes from the fact that there’s no search technique for AI applications that’s good enough on its own. You might be confused by this — Google has been good at search for decades now, so why can’t we just recreate that? Counterintuitively, the answer is a UX limitation rather than a technical one.
Google (until the last few years, first with its knowledge graph and then with AI answers) never purported to give you the answer to what you were looking for; instead, it made it dramatically easier to find the right answer. Most people’s expectations were that Google was good enough to get the right answer into the top 5 or 10 results, and we were used to clicking through multiple results. AI applications, on the other hand, present a single answer — or a couple of alternatives at most. That means finding the right data to answer your question is critical. If you use the wrong data, you’re simply wrong.
We’ve found that techniques like text search (BM25) and vector search are good at getting you in the right zip code — the right document will usually be in the top 20 or 30, but not reliably in the top 5 of any single technique. What we’ve done at RunLLM is throw the kitchen sink at the problem: we use fine-tuned models to analyze and rewrite the question, we do text search and vector search, we use boolean predicates and graph search to filter the data, we use heuristics about which data is most relevant to each question, and we re-rank the results with LLMs, all before we ever try to answer the question. No one technique is good enough on its own — it’s the combination of everything that generates high-quality search results.
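To illustrate what combining techniques can look like at the retrieval layer, here’s a minimal sketch of one common way to fuse text-search and vector-search results before any filtering or re-ranking: reciprocal rank fusion. The ranked ID lists are assumed to come from whatever BM25 and vector indices you already run; this is not a description of our exact pipeline.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) per list it appears in, so documents
    that rank reasonably well everywhere beat ones that top only a single list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of a BM25 index and a vector index for the same query.
bm25_hits = ["doc_12", "doc_07", "doc_31", "doc_02"]
vector_hits = ["doc_07", "doc_44", "doc_12", "doc_09"]

candidates = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(candidates[:5])  # fused candidate set handed off to filtering and re-ranking
```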
Nothing beats reading
While the search techniques above get us in the right zip code, we’ve found that having an LLM read each document in the context of the question we’re answering and give it a relevance score — often called re-ranking — is the step best correlated with our success in solving any given problem.
Intuitively, this makes sense. Searching over text is hard because the dimensionality of text is incredibly high. Most search techniques reduce that dimensionality to make the problem tractable but lose fidelity in the process. While LLMs also rely on embeddings to process text, it’s clear that the extra processing they do enables them to understand and generate text in a way that no other technique can match.
As a result, having an LLM actually read a whole document and analyze its relevance to a question is incredibly valuable. This technique doesn’t just serve as the basis for answering questions but is used for a variety of other features in RunLLM as well — everything from finding documentation gaps to categorizing question topics.
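As a rough illustration, a re-ranking pass can be as simple as asking an LLM to score each candidate document against the question and keeping the top scorers. The prompt, model name, and 0-10 scale below are illustrative choices, not a description of our production re-ranker.

```python
from openai import OpenAI

client = OpenAI()

def relevance_score(question: str, document: str) -> int:
    """Ask the model to read one document and rate its relevance to the question."""
    prompt = (
        "Rate how relevant the document is to the question on a scale of 0-10.\n"
        "Reply with a single integer and nothing else.\n\n"
        f"Question: {question}\n\nDocument:\n{document}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return int(response.choices[0].message.content.strip())
    except ValueError:
        return 0  # treat unparseable replies as irrelevant

def rerank(question: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score every candidate document and keep the top_n highest-scoring ones."""
    scored = [(relevance_score(question, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```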
Domain-specificity matters
Until now, we’ve mostly covered generic techniques that would probably work well for any application. At the same time, there are domain-specific aspects of search that are critical for quality but difficult to generalize.
If you’ve observed how a product like Cursor has gotten better, it’s probably pretty clear that the ability to do code-specific search is super critical. Even pre-LLMs, traditional search techniques worked pretty poorly over large codebases, which is why products like Sourcegraph were able to find so much traction. Code may be an extreme example of a form of text that is difficult to search over, but the same is true for many other domains as well.
In RunLLM, for example, we have to be able to do things like:
extract code examples from documentation, so that they don’t get lost in regular text embeddings
understand how different product features connect to each other using a knowledge graph, so we don’t mistakenly give an answer about a feature on AWS to a user who’s using GCP — the knowledge graph itself is constructed to reflect these concepts
use heuristics (e.g., new features are more relevant than old ones) and user requests (e.g., “show me code examples”) to decide which kinds of data we should draw on when answering a question (roughly sketched below)
keep our knowledge base accurate as features change, bugs are fixed, and workarounds are suggested
All of these play into how we do search in RunLLM, and many of them probably don’t apply to your application — but understanding the specificity required is critical.
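To give a flavor of how the platform and freshness rules above might look in code, here’s a minimal sketch of metadata-based filtering and boosting over retrieved candidates. The field names (platform, kind, last_updated) and the boost factors are illustrative assumptions, not our actual schema or weights.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Candidate:
    doc_id: str
    platform: str        # e.g. "aws", "gcp", or "any"
    kind: str            # e.g. "prose" or "code"
    last_updated: date
    score: float         # score coming out of retrieval/fusion

def filter_and_boost(candidates: list[Candidate], user_platform: str,
                     wants_code: bool) -> list[Candidate]:
    """Drop wrong-platform chunks, then boost fresh and code-bearing ones."""
    kept = [c for c in candidates if c.platform in (user_platform, "any")]
    for c in kept:
        if c.last_updated >= date(2024, 1, 1):   # crude freshness heuristic
            c.score *= 1.2
        if wants_code and c.kind == "code":      # honor "show me code examples"
            c.score *= 1.5
    return sorted(kept, key=lambda c: c.score, reverse=True)
```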
Static techniques are just a starting point
The points above are built on the implicit assumption that all the data you need will be available for you to read, analyze, and index well in advance of when it’s going to be used. It’s great if that’s the case, but there are plenty of cases where that isn’t possible.
The one that we’ve been looking closely at solving with RunLLM is user-specific data — for example, log data, plan/billing information, and configuration parameters. It doesn’t make sense (or in many cases, it isn’t possible) for us to index all of this information up front, so we’re adding support for dynamically finding the relevant data in order to better inform the guidance that we provide.
The main change is that we can’t pre-analyze and pre-read each of the relevant data sources, which makes knowing what data is available and accessing the correct data the most important tasks. The issue with things like logs is that you might get a massive amount of data that needs to be filtered down to the relevant bits before analysis. The core approach of categorization, filtering, and reading will be shared, but there’s likely more for us to learn about how to balance latency against accuracy in these cases. We’re just getting started with this, so we don’t have any war stories — yet. 🙂
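To sketch the “filter before reading” idea for something like logs: narrow a large dump down to a recent, error-or-keyword-focused slice before any model reads it. The log record fields (timestamp, level, message) and the thresholds here are assumptions for illustration; we’re still figuring out what the real version of this looks like.

```python
from datetime import datetime, timedelta, timezone

def prefilter_logs(lines: list[dict], keywords: list[str],
                   window_hours: int = 24, max_lines: int = 200) -> list[dict]:
    """Keep only recent lines that are errors/warnings or mention a query keyword."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    lowered = [k.lower() for k in keywords]
    kept = [
        line for line in lines
        if line["timestamp"] >= cutoff
        and (line["level"] in ("ERROR", "WARN")
             or any(k in line["message"].lower() for k in lowered))
    ]
    # Most recent first, capped so the slice fits comfortably in an LLM's context.
    kept.sort(key=lambda line: line["timestamp"], reverse=True)
    return kept[:max_lines]
```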
While search may have lost some of its shine over the last year, it’s still a critical aspect of almost every LLM application. What we hope this post helps you understand is just how critical search is to the success of an AI application, and also how much you’re going to have to customize your search techniques to your specific application. The techniques outlined in blog posts about generic RAG systems (or even RAG-aaS products) will probably work reasonably well for high-level tasks like an HR helpdesk, but they’re not going to scale well to more complex tasks.
We know we’re also probably missing a trick or two here, so if you have experiences you’d like to share, we’d love to hear from you!