Last week, we wrote about why search is important in the context of AI applications and what we’ve learned about doing search well while building RunLLM. The natural response to that — which we considered including in last week’s post — is that long-context windows will solve all these problems. Every so often, we’ll get a response to a post like that one saying that none of the techniques we’re discussing will matter because in the limit, we’ll just throw all the necessary data into a single prompt.
Especially as the leading model providers have continued to expand context windows (Google will have 2M token context soon), throwing more compute at the problem seems to be the solution. In that world, if you can use vector search (or keyword search or predicate filters) to get into the right zipcode, you can simply dump all of the relevant information into a large context window and hit go… right?
Not quite. While long context windows are useful for a variety of tasks, like large-scale data processing and summarization, most interactive applications aren’t going to be solved by throwing everything you can find into a single LLM call. Ultimately, search (or some form of information retrieval) is still going to matter, and it’s probably worth your time to get used to doing search well.
There are a number of reasons why:
Cost: The most obvious reason long context windows won’t cut it is that they’re expensive. You’re including a lot of data that simply isn’t necessary to solve the problem, and if you’re solving a complex problem like coding or technical support, you’re going to need a large model to generate the ultimate output. Throwing unnecessary data into the prompt because you haven’t sufficiently narrowed down the candidate set of documents simply wastes your money. LLM costs have been coming down, but we’re still talking about $2.50 per 1M input tokens for Gemini 2.5 Pro, and we’d have to see costs drop by another 1-2 orders of magnitude before you can do this for every support ticket or code generation request (we sketch the rough math after this list). Depending on the volume of data you’re searching, 1M input tokens still might not cut it.
Latency: Perhaps even more importantly, throwing unnecessary tokens into your prompts is going to hurt your latency and therefore your UX. While there are probably micro-optimizations in most model providers’ implementations, latency for attention-based models is typically $O(n^2)$ in the number of input tokens. If you 20-100x the size of your input, that means you’re looking at roughly a 400-10,000x increase in latency. That might be fine for a batch processing task that doesn’t have a user waiting on the other side, but it won’t cut it for an interactive application in software engineering or support. Again, LLM inference has been getting more efficient, but over the last couple of years we’ve typically seen that show up as cost reductions rather than performance improvements.
Outsized Data: There are plenty of applications — even interactive ones — that require you to process a relatively large volume of data. For example, some of our customers at RunLLM have recently been sharing use cases that require us to process large log files as part of helping debug a user issue. The log file itself may be 1-2M tokens, which means we need to pare down what’s contained in the log file before we ever get to answering the question. By the nature of log and telemetry data, the pared-down version may still include thousands of tokens relevant to the issue the user is solving, so we can’t blindly fill up the context window with unhelpful data (see the filtering sketch after this list).
Multi-step compute: All of the above problems are exacerbated when you introduce more complex systems. Whether you’re implementing the re-ranking process we described in the previous post or developing an agentic planner that takes the relevant data into account before deciding on a next step, you will struggle with cost and latency if you’re repeatedly bombarding each model with hundreds of thousands of tokens of unnecessary data. Even before we introduce truly agentic systems, we already use 30-50 LLM calls per answer we generate in RunLLM. Not every call requires the full context, but a handful of them do, and including unnecessary data repeatedly would make the system absurdly expensive. Looking ahead, we believe pretty strongly that compound systems and planning-based agents are the direction most applications will evolve in — and it’s where we’re headed with RunLLM as well — so being wasteful with input prompt size is simply not an option.
Quality Variance: While it’s probably the hardest factor to measure, we also typically find that including more data increases the chances that the LLM itself gets confused or provides an unhelpful output. Large context windows are typically evaluated with needle-in-a-haystack and summarization benchmarks. Intuitively, it makes sense that a model can pick one very specific thing out of a large set of disparate data, but that’s very different from finding the right code file when you have 3 similar implementations of a feature. It’s not guaranteed that the LLM will get confused, but including unnecessary data increases the chances of an unhelpful result: you’re simply giving the model more chances to get confused.
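To make the cost and latency arguments concrete, here’s a rough back-of-envelope sketch. It only uses numbers already mentioned in this post (the $2.50 per 1M input tokens, the 30-50 calls per answer) plus the simplifying assumption that prefill work scales roughly quadratically with input length; treat it as an illustration, not a pricing or performance model.

```python
# Back-of-envelope math for the cost and latency points above.
# All constants are illustrative assumptions taken from this post, not measurements.

PRICE_PER_M_INPUT_TOKENS = 2.50   # e.g., the Gemini 2.5 Pro input pricing cited above
CALLS_PER_ANSWER = 40             # roughly the 30-50 LLM calls per answer we mentioned

def input_cost(tokens: int, calls: int = 1) -> float:
    """Dollar cost of the input side alone, ignoring output tokens."""
    return tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS * calls

def latency_blowup(expansion: float) -> float:
    """If attention is roughly O(n^2) in input length, a k-times larger
    prompt means about k^2 times more prefill work."""
    return expansion ** 2

# A well-scoped prompt (~10K tokens) vs. dumping ~1M tokens of context:
print(input_cost(10_000))        # ~$0.025 per call
print(input_cost(1_000_000))     # ~$2.50 per call
# Worst case, if every call in a multi-step pipeline carried the full context
# (in practice only a handful do):
print(input_cost(1_000_000, CALLS_PER_ANSWER))  # ~$100 per answer

# A 20x to 100x larger input implies roughly a 400x to 10,000x latency hit:
print(latency_blowup(20), latency_blowup(100))  # 400 10000
```

Even if the quadratic scaling is pessimistic for a given provider’s implementation, the direction is the same: input size is a multiplier on everything downstream.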
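Similarly, here’s a minimal sketch of the kind of log-paring step described above: filter and deduplicate lines before anything reaches the model, instead of dumping the raw file into the prompt. The error patterns, token budget, and helper name are all hypothetical; in a real system the relevance signal would come from the user’s question and from retrieval, not a hard-coded list.

```python
import re

# Hypothetical pre-filtering for a large log file, standing in for the
# paring-down step described above. Patterns and budget are illustrative only.
ERROR_PATTERNS = re.compile(r"ERROR|FATAL|Traceback|timeout|refused", re.IGNORECASE)
TOKEN_BUDGET = 20_000          # rough cap on what we're willing to put in the prompt
APPROX_CHARS_PER_TOKEN = 4     # crude heuristic, good enough for budgeting

def pare_down_log(raw_log: str, query_terms: list[str]) -> str:
    """Keep only log lines likely to matter: error-like lines plus lines that
    mention terms from the user's question, deduplicated, up to a token budget."""
    kept, seen, used = [], set(), 0
    terms = [t.lower() for t in query_terms]
    for line in raw_log.splitlines():
        relevant = ERROR_PATTERNS.search(line) or any(t in line.lower() for t in terms)
        if not relevant or line in seen:
            continue
        cost = len(line) // APPROX_CHARS_PER_TOKEN + 1
        if used + cost > TOKEN_BUDGET:
            break
        seen.add(line)
        kept.append(line)
        used += cost
    return "\n".join(kept)

# Usage (hypothetical file and question):
# pared = pare_down_log(open("app.log").read(), ["checkout", "payment service"])
```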
This isn’t to say that there aren’t useful applications of long context windows. Being able to summarize or process a large volume of text (e.g., the log files we mentioned above) simply wasn’t possible before. There are many applications that will benefit from being able to munge large amounts of data.
However, the reality is that search is still an important part of every AI application, and that likely isn’t going to change anytime soon. Finding the right information at the right time enables you to build applications that are ultimately smarter, because you can use the same time and compute to do more with narrower data rather than less by throwing all the data in the world at the problem. Applications are only going to get smarter — and therefore more complex — from here on out, which means we all need to keep refining our search story.