Our post from last week about throwing more AI at your problems turned out to (unexpectedly!) be our second most popular blog post of all time with over 10k views. It generated more strong opinions than we expected — if you’re looking for a laugh, we recommend reading the HackerNews comments (though we wouldn’t recommend trying to make too much sense of them 😉).
Central to what we discussed was the idea of compound AI systems, which has become increasingly popular in the last few months. We measure the popularity of an idea by the most reliable metric, of course: how frequently it shows up in the headlines on startups’ websites. By that measure, compound AI has positively taken the world by storm of late.
Composing LLMs with business logic is the natural maturation path of AI applications — it’s critical to AI systems’ ability to perform job functions. The productivity of these applications is also bolstered by the fact that LLMs are getting steadily better at reasoning over complex problems.
These reasoning capabilities have come as a result of significant investment in “test-time compute.” Models like OpenAI’s o1-preview have made it so that you can get significantly better answers (though not always!) by waiting longer than we’ve come to expect from LLMs. In our experience at RunLLM, we’ve found that a pipeline customized to o1 can significantly improve reasoning for hard problems but adds up to 10 seconds of latency. Some customers are more than willing to sacrifice a few seconds for higher-quality answers — others are not interested in this tradeoff at all.
To some extent, this latency cost will decrease over time, as LLM inference latencies generally have in the last 2 years, but the fundamental paradigm is to do more work for each prompt, which means that there’s only so much latency you can shave off. This reminds us of Jevons’ Paradox: as we improve inference efficiency, we use the time saved to increase test-time compute.
When you put these trends together — adding more foundation model calls to your application and adding more latency to each of those calls — what you’re left with is applications that are probably going to get increasingly slow and frustrating for your users. As much as we’d like to give ourselves the benefit of the doubt by saying that an extra minute of latency is still better than what you’d get from a person, it’s clear that end users’ expectations of AI systems don’t align with their expectations of humans. It’s likely that in the not-so-distant future, users will prefer AI systems because they’re accurate enough and significantly faster than a person. We already see this at RunLLM, but that’s a topic for a separate post.
Thankfully, none of these problems are new.
While the idea of compound AI is a useful framing for anyone thinking about AI applications, it’s not a particularly new concept. We’ve spent most of the last decade thinking about how to productionize AI (née ML) systems, so the idea that you’d want to compose multiple models and intersperse AI and business logic is in many ways table stakes. The power of the implementation, though, is obviously dramatically higher given how much more powerful language models are compared to even the previous generation of deep learning models.
The core systems challenge, however, is largely analogous: All the way back in 2016, teams at Facebook and Google were investing in the efficiency and scale of production ML systems to minimize latency, often to sub-50ms. In those days, the holy grail was 200ms for interactive websites, and while we’re unlikely to hit those targets anytime soon with LLMs, we can look to the techniques that were popular back then (many of which we worked on!) in order to make modern AI applications successful. This is going to be a mix of systems work, thoughtful UX decisions, and careful product design.
Here are a few principles that we think are critical to carry over from the last generation of ML systems and into modern AI applications.
Compound AI requires many different models, not a single all-powerful one. Throwing the most powerful model at every stage of your compound AI system is going to waste money, and it’s going to waste your users’ time. Rather than waiting for a person to answer a question or update a document, users will now be waiting for AI systems to walk through intricate reasoning processes in order to reach a perhaps obvious conclusion that a smaller model could have reached faster.
As AI applications mature, we’ll need to become more thoughtful consumers of different models. As we’ve argued in the past, different models will have different areas of strength, which will allow us to pick the best tool for the job at each stage of a compound system. In other words, there’s absolutely a critical role that small and medium-sized models will play in every AI application, and there’s tons of room for innovation in how those models are built and used.
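As a simplified illustration, stage-to-model routing can start out as nothing more than a lookup table. The stage names and model identifiers below are hypothetical placeholders, not recommendations for any particular provider or application:

```python
# A minimal sketch of per-stage model selection in a compound AI pipeline.
# Every stage name and model identifier here is an illustrative placeholder.

STAGE_MODELS = {
    "intent_classification": "small-fast-model",      # cheap, low latency
    "retrieval_reranking": "medium-model",             # moderate reasoning
    "answer_generation": "large-reasoning-model",      # highest quality, slowest
}

def model_for(stage: str) -> str:
    """Return the model assigned to a given pipeline stage."""
    return STAGE_MODELS[stage]
```

In practice the routing decision can of course be dynamic — based on query difficulty, latency budget, or cost — but the principle is the same: no single model needs to serve every stage.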
Model ensembles and parallel inference can be powerful. For us, this also evokes techniques from ML systems that already addressed many of these problems. Perhaps the most obviously applicable one is the use of model cascades, where cheaper and faster models attempt to solve a problem but are allowed to say “I don’t know.” If a cheaper model can’t solve the problem, it’s passed on to a more capable model. In the worst case, some time and compute are wasted on the failed attempt, but in the average case, users get cheaper and faster results.
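Here’s a rough sketch of the cascade idea, assuming a generic `complete()` call that returns an answer along with a self-reported confidence score (how that score is produced varies widely in practice); the model names and threshold are illustrative, not prescriptions:

```python
# A minimal sketch of a model cascade: cheap models go first and are allowed
# to "give up" by reporting low confidence, escalating to bigger models.
from dataclasses import dataclass

@dataclass
class Result:
    answer: str
    confidence: float  # in [0, 1]; how a model estimates this is up to you

def complete(model: str, prompt: str) -> Result:
    """Placeholder for a real LLM call returning an answer and confidence."""
    raise NotImplementedError

# Ordered from cheapest/fastest to most capable/slowest (hypothetical names).
CASCADE = ["small-model", "medium-model", "large-reasoning-model"]
CONFIDENCE_THRESHOLD = 0.8

def cascade_answer(prompt: str) -> str:
    for model in CASCADE:
        result = complete(model, prompt)
        # A cheap model effectively says "I don't know" by reporting low
        # confidence; only then do we escalate to the next model.
        if result.confidence >= CONFIDENCE_THRESHOLD:
            return result.answer
    # If nothing cleared the bar, fall back to the most capable model's answer.
    return result.answer
```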
Similarly, you might be better served by having many small LLMs attempt a task and taking a vote over their answers in order to make a decision. Even if one or two models are unexpectedly slow on a particular call, your median latency will be dramatically lower than if you were waiting for o1 to finish the task.
This type of approach would have sounded crazy just 12 or 24 months ago, but the rate at which LLMs continue to get smaller, faster, and cheaper has made these techniques possible.
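A minimal sketch of the parallel voting approach, using only the Python standard library, might look like the following; `complete()` stands in for a real LLM call, and the model names and timeout are illustrative assumptions:

```python
# A minimal sketch of parallel majority voting across several small models.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, wait

SMALL_MODELS = ["small-model-a", "small-model-b", "small-model-c"]  # placeholders

def complete(model: str, prompt: str) -> str:
    """Placeholder for a real LLM call returning a short answer."""
    raise NotImplementedError

def vote(prompt: str, timeout_s: float = 2.0) -> str:
    pool = ThreadPoolExecutor(max_workers=len(SMALL_MODELS))
    futures = [pool.submit(complete, m, prompt) for m in SMALL_MODELS]
    # Don't let one slow model hold up the ensemble: take whatever finished
    # within the timeout and vote over those answers; stragglers are ignored.
    done, _ = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False, cancel_futures=True)
    answers = [f.result() for f in done if f.exception() is None]
    if not answers:
        raise TimeoutError("no model answered within the timeout")
    return Counter(answers).most_common(1)[0][0]
```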
Return something quickly and get better. A common practice in the mid-2010s was to return some data within the target 200ms for a web application and then update the page incrementally as the ML model returned results that were more customized to the user. The idea was that the user would have something to look at and would feel engaged, but all the high-quality results would be there by the time they processed the whole page — the real latency of inference was hidden. There’s no reason we shouldn’t be doing the same thing with LLMs. We think of this paradigm as AsyncAI.
If inference costs continue to plummet, it will become the obvious choice to have multiple models solve a problem simultaneously. A small model’s solution can be surfaced to the user with very low latency if it meets a confidence threshold, and it can then be edited or updated by a more powerful model that adds more nuance. This parallel computation achieves the same latency hiding as above.
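A minimal sketch of that parallel, latency-hiding pattern with asyncio might look like this; `complete()`, the model names, and the confidence threshold are all hypothetical placeholders:

```python
# A minimal sketch of "return something quickly, then improve it": surface the
# small model's draft immediately if it clears a confidence bar, then push the
# large model's revision when it arrives.
import asyncio
from typing import AsyncIterator

async def complete(model: str, prompt: str) -> tuple[str, float]:
    """Placeholder for an async LLM call returning (answer, confidence)."""
    raise NotImplementedError

async def answer_stream(prompt: str) -> AsyncIterator[str]:
    # Kick off both models in parallel.
    small = asyncio.create_task(complete("small-model", prompt))
    large = asyncio.create_task(complete("large-reasoning-model", prompt))

    draft, confidence = await small
    if confidence >= 0.7:           # illustrative threshold
        yield draft                  # show the user something right away

    revised, _ = await large
    yield revised                    # replace or annotate the draft in the UI
```

A caller would iterate the stream (e.g., `async for update in answer_stream(prompt)`) and decide how to render each update — which is exactly where the UX question comes in.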
The open question here is how exactly these updates are surfaced to the user — reordering a list or incrementally surfacing a feed was easy, but editing a large chunk of text while a user is reading it could be jarring. While it’s early, we’re interested in ideas like showing an annotation based on model updates or letting the user toggle between multiple versions with a git diff-style view. We’re excited to see what others come up with here!
AI systems have a long way to go to reach full maturity and scale, but compound AI and so-called test-time compute are both headed in the right direction. Along the way, they’re reintroducing many challenges that ML systems solved in the past decade, and while history doesn’t always repeat itself, it often rhymes. Learning lessons from what’s worked — and what hasn’t — will help us get there faster and with fewer GPU crises along the way.