Don't build your application on a single LLM call
One of the most consistent trends in new foundation models over the last year has been increasing context window length. As of today, GPT-4 is up to 128K tokens, Claude 3 has support for up to 1 million tokens, and Gemini claims it will soon have support for 10 million tokens. Millions of tokens theoretically enable indiscriminately stuffing data sources into a single LLM call. While we’ve been skeptical of the usefulness of long context windows in the past, we have to admit that recent advances are impressive.
In this context, we’ve seen a lot of LLM-powered applications built on a single, large API call to GPT-4 (or an equivalent model). While the models are capable of ingesting that much context, we consider this an anti-pattern in LLM app development. This might be obvious to some of you, in which case you probably don’t need to read the rest of this post — but we’ve seen it enough to be worth calling out.
Typically, when an application is built on a single LLM call, you will have an extensively detailed prompt that includes examples for in-context learning and instructions for the model to follow. Your first thought might have been, “That’s silly! I don’t use a single LLM call but use my application logic to pick between a handful of different cases.” We’re sorry, but this blog post is for you too — if all the LLM magic for each case in your app comes from a single call, you can most likely do better. (Of course, there are always exceptions, but again, we’ve seen enough examples to spot a clear trend.)
Instead, we believe you should break your problem apart into multiple steps: classifying the request, detecting out-of-scope inputs, evaluating the relevance of data sources, handling edge cases, generating responses, and so on.
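To make that concrete, here is a minimal sketch of what such a decomposition might look like. `call_llm` is a hypothetical helper standing in for whatever model API you actually use, and the prompts, category names, and model name are illustrative rather than prescriptive:

```python
# Hypothetical helper: wraps whatever model/provider API you actually use.
def call_llm(prompt: str, model: str = "gpt-4") -> str:
    ...

def answer(user_input: str, data_sources: list[str]) -> str:
    # Step 1: classify the request so every downstream prompt stays narrow.
    category = call_llm(
        f"Classify this request as 'question', 'task', or 'out-of-scope':\n{user_input}"
    ).strip().lower()

    # Step 2: handle out-of-scope inputs explicitly instead of burying that rule
    # inside one giant prompt.
    if category == "out-of-scope":
        return "Sorry, that request is outside what this assistant supports."

    # Step 3: keep only the data sources the model judges relevant.
    relevant = [
        source for source in data_sources
        if "yes" in call_llm(
            f"Does this source help answer the request? Reply yes or no.\n"
            f"Request: {user_input}\nSource: {source}"
        ).lower()
    ]

    # Step 4: generate the final response from a small, focused prompt.
    context = "\n\n".join(relevant)
    return call_llm(
        f"Answer the request using only the context below.\n\n"
        f"Context:\n{context}\n\nRequest: {user_input}"
    )
```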
There are a few key pitfalls in having large, nuanced prompts:
Instruction following is not information processing. Many of the benchmarks that test long context windows focus on recall quality — whether the model can find bits of information inserted throughout a long prompt. Models have gotten significantly better at this, which bodes well for large-scale information processing, but that is distinct from following a large number of instructions in a prompt. In our own experiments, we’ve seen cases where a long list of instructions for handling different scenarios can confuse even a powerful model like GPT-4, especially when an input falls at the edge of multiple cases. Instruction following will likely improve over time, but today’s state-of-the-art models are better at executing narrow tasks than at broad, multi-step projects.
It’s bad software engineering. Debugging LLM-powered applications is hard — no matter how many new tools track prompts and visualize lineage, many of us still spend time debugging by trying out tweaks to our prompts. Pinpointing where exactly an error occurred in a large prompt is a nightmare — we don’t know whether the model ignored certain instructions, misunderstood the request, or misinterpreted the sources it was provided. That makes debugging complicated at best, and it makes shipping fixes with confidence very difficult.
It leads to a clunky UX. Processing a large input takes a significant amount of time, especially when models like GPT-4 can have wide latency variance based on how much load they’re under. Waiting for a model to work through a long, nuanced input before it even starts generating an output can leave the user feeling like they are waiting forever. This is probably the biggest pitfall of using a single LLM call. Breaking the problem into multiple steps allows you to surface results back to the user more quickly. Perplexity.AI, for example, parallelizes its data source retrieval with answer generation, which means it shows users which websites it’s using before it starts generating an answer. This significantly improves UX, as users can begin processing the output before the LLM has finished its work.
Different models are better at different tasks. While GPT-4 and Claude 3 Opus are incredibly powerful models, neither is the best fit for every task. Each model has different strengths, and simpler tasks might not need the heft of a top-quality model. Breaking your application into different tasks allows you to use the best tool (read: model) for each task rather than forcing one model to do everything. This leads to better application quality and can reduce both latency and cost.
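As a rough illustration, a simple routing table can map each sub-task to a model sized for it. The model names below are placeholders, and `call_llm` is the same hypothetical helper as in the earlier sketch:

```python
# Illustrative routing: cheap, fast models for simple judgments,
# a heavyweight model only where its quality is actually needed.
MODEL_FOR_TASK = {
    "classify": "small-fast-model",   # placeholder name for a cheap, low-latency model
    "relevance": "small-fast-model",
    "generate": "gpt-4",              # reserve the top-quality model for final generation
}

def run_task(task: str, prompt: str) -> str:
    return call_llm(prompt, model=MODEL_FOR_TASK[task])
```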
There are, of course, drawbacks to breaking your application into multiple LLM calls — the most notable is that multiple serial LLM calls can increase end-to-end latency (especially tail latency) under high load. In practice, however, this can be masked with parallelization and speculative execution, which leads to better end-to-end performance. Generally, we believe the engineering overhead is worth the effort.
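For instance, steps that do not depend on each other, like classifying the request and retrieving data sources, can often run concurrently, in the spirit of the Perplexity example above. Here is a sketch under the same assumptions as before, with hypothetical async helpers:

```python
import asyncio

# Hypothetical async variants of the helpers used in the earlier sketches.
async def call_llm_async(prompt: str, model: str = "gpt-4") -> str: ...
async def retrieve_sources(query: str) -> list[str]: ...

async def answer(user_input: str) -> str:
    # Run classification and retrieval concurrently so the extra LLM call
    # does not simply add to the critical path.
    category, sources = await asyncio.gather(
        call_llm_async(
            f"Classify this request as 'in-scope' or 'out-of-scope':\n{user_input}",
            model="small-fast-model",  # placeholder model name
        ),
        retrieve_sources(user_input),
    )

    if "out-of-scope" in category.lower():
        return "Sorry, that request is outside what this assistant supports."

    # Generate the final answer only after the cheap, parallel steps finish.
    context = "\n\n".join(sources)
    return await call_llm_async(
        f"Answer the request using only the context below.\n\n"
        f"Context:\n{context}\n\nRequest: {user_input}"
    )
```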
The state of the art in LLMs is always changing, so we fully expect that some of these criticisms (instruction following, perhaps even decoding latency) will become less valid over time. However, as everyone building products with LLMs today knows, getting traction requires working around existing constraints. Waiting for the technology to improve is not a great startup-building strategy. 🙂 As such, building applications on a single LLM call significantly limits what you can accomplish.