Last week, we published our longest-ever post outlining how we think the AI market will evolve. The feedback on that post has been super positive so far — we’d love to hear what you think of it! This week, we’re going in the exact opposite direction with a much more tactical observation about how you should build AI applications.
You’re probably familiar with the fundamental theorem of software engineering, which states that any problem can be solved by introducing an extra level of indirection. At RunLLM, our fundamental theorem of AI applications is that any problem can be solved by an extra LLM call. In other words, one of our favorite solutions is to throw more AI (LLM calls) at our problems. This may sound a little silly, but we’ve found that it’s a remarkably good heuristic for us.
Before we dive into explaining why and how this works, a little bit of background is in order. Way back in 2023, when everyone was still figuring out what to do with LLMs, there was a lot of discussion about the best technique for building AI applications. The most common debate was whether RAG or fine-tuning was the better approach, and long context worked its way into the conversation when models like Gemini were released. That whole conversation has more or less evaporated in the last year, and we think that’s a good thing. (Interestingly, long context windows specifically have almost entirely disappeared from the discussion, though we’re not completely sure why.)
We weren’t innocent in this now-pointless debate, of course: we were generally in the pro-RAG camp, but our opinions changed quickly as our experience grew. What we all should have seen at the time (and what seems to be consensus today) is that no single approach is going to be dominant. Realistically, it’s a combination of different techniques (RAG, fine-tuning, and so on) that results in the best possible LLM applications.
The maturation of LLMs and the surrounding ecosystem has made all of this much easier. RAG was always a fairly straightforward paradigm to implement, and there are plenty of blog posts on the internet that can help you optimize the quality of a RAG pipeline. At the same time, the proliferation of small, high-quality models like Llama 3 and platforms like Together AI or Fireworks AI has made it significantly easier and cheaper to fine-tune a model. That means a combination of all these approaches is available to most AI applications.
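To make the “fairly straightforward” point concrete, here’s a minimal retrieve-then-generate sketch. The keyword-overlap scoring is a toy stand-in for a real embedding model and vector store, and the model name is just an example, but the overall shape of a basic RAG pipeline really is about this simple.

```python
# Minimal RAG sketch: toy keyword-overlap retrieval standing in for a real
# embedding model + vector store, followed by a single generation call.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any OpenAI-compatible endpoint works

DOCS = [
    "To rotate an API key, go to Settings > API Keys and click Rotate.",
    "Deployments are triggered automatically on every merge to main.",
    "Rate limits default to 100 requests per minute per workspace.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy relevance score: number of shared lowercase terms.
    terms = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(terms & set(d.lower().split())))[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; pick whatever model fits your stack
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(answer("How do I rotate my API key?"))
```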
What all of this has enabled us to do — and what we’ve learned is incredibly valuable at RunLLM — is to break the problem down into more manageable components. This pattern is now more commonly referred to as building “compound AI systems.” While the branding sounds very fancy, the benefits are somewhat obvious and have been codified in mature AI systems from the beginning. Composing multiple LLM calls with supporting infrastructure means you can narrow the scope of each inference step, which in turn means you can fine-tune where possible and generally use smaller, faster models for many steps. This leads to better reliability, lower cost, and higher quality. Generally, we think this technique is dramatically under-used in most AI applications.
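To sketch what that decomposition can look like (the step breakdown and the `call_llm` placeholder below are purely illustrative, not a description of RunLLM’s actual pipeline), each step gets its own narrow prompt and its own choice of model:

```python
# Illustrative compound pipeline: three narrowly scoped LLM calls instead of
# one giant prompt. call_llm() is a placeholder for your chat-completion client.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("swap in your provider's client here")

def classify_question(question: str) -> str:
    # Narrow classification step: a small, cheap model is usually enough.
    return call_llm("small-fast-model",
                    f"Label this question as 'how-to', 'bug', or 'other'. Reply with the label only.\n{question}")

def draft_answer(question: str, docs: list[str]) -> str:
    # Open-ended generation: this is where a larger model earns its cost.
    context = "\n".join(docs)
    return call_llm("large-model",
                    f"Answer the question using these docs:\n{context}\n\nQuestion: {question}")

def verify_answer(question: str, answer: str) -> str:
    # A separate verification call instead of trusting one monolithic prompt.
    return call_llm("small-fast-model",
                    f"Does this answer address the question? Reply 'yes' or 'no'.\nQ: {question}\nA: {answer}")
```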
The biggest concerns are typically cost and latency, and these are valid concerns: you don’t want to blindly throw GPT-4o or Claude 3.5 at every problem you have, because you will quickly run up your costs. This is where compound systems are a valuable framework, because you can break the problem down into bite-sized chunks that smaller LLMs can solve. Oftentimes, a call to Llama-3 8B might be enough if you need to do a simple classification step or analyze a small piece of text.
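For example, a classification step like that can run against a small open model on any OpenAI-compatible endpoint. The base URL, environment variable, and model identifier below are illustrative, so check your provider’s catalog for the exact names:

```python
# Routing a simple classification step to a small, cheap model via an
# OpenAI-compatible endpoint. Base URL and model name are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # or Fireworks, or your own vLLM server
    api_key=os.environ["TOGETHER_API_KEY"],
)

def classify_ticket(text: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # check your provider's exact model id
        messages=[{
            "role": "user",
            "content": "Classify this support ticket as 'bug', 'question', or "
                       f"'feature-request'. Reply with the label only.\n\n{text}",
        }],
        temperature=0,
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip().lower()
```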
Latency is a little harder to hide. If you chain large models naively, users end up waiting for each additional AI call, which is the modern equivalent of bureaucracy. We’ve found that parallelization and asynchronous workflows have both been valuable in building RunLLM (we’ve unfortunately ended up reinventing workflow orchestration along the way). In the interest of building AI-powered workers, we’ve also found that returning incrementally useful results helps make your UX feel closer to how a person would behave: rather than trying to get exactly the right answer up front, you can return basic information followed by more detailed analysis or even clarifications. Of course, you’ll have to figure out what’s acceptable for your own application, but don’t let basic assumptions about cost or latency stop you from making your application smarter.
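As one concrete way to claw back latency, independent calls can run concurrently. The sketch below fans a few sub-questions out in parallel with asyncio; the model choice and the sub-questions themselves are just for illustration:

```python
# Fan independent LLM calls out in parallel so added steps don't stack up serially.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def answer_in_parallel(question: str) -> list[str]:
    # These three calls don't depend on each other, so run them concurrently.
    subtasks = [
        f"Summarize the question in one sentence: {question}",
        f"List the technical terms in this question: {question}",
        f"Suggest one clarifying question to ask back: {question}",
    ]
    return await asyncio.gather(*(ask(p) for p in subtasks))

results = asyncio.run(answer_in_parallel("Why is my deployment stuck in a pending state?"))
```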
As a caveat, you probably don’t want this to be your first solution to every problem, because it will still run up your costs even if you’re using small models. Typically, our heuristic at RunLLM starts with data and prompt engineering, then looks at application components, and only then adds an LLM call. We don’t shy away from it, but we certainly don’t use it to solve every problem.
A secondary benefit of this approach is that it increases resilience to the ridiculous prompt-hacking attacks that end users still insist on trying. When you have a pipeline of LLM calls, you can enforce much stricter limits on the outputs of each stage, which in turn means that prompt hacking one step will simply cause the pipeline to fail (rather than returning a malicious output). In the enterprise context, giving your customers confidence that your application isn’t going to be manipulated into saying something problematic turns out to be incredibly important.
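A sketch of what those per-stage limits can look like: treat each intermediate output as untrusted, check it against the small set of values that stage is allowed to produce, and fail the pipeline on anything else. The error type and labels here are made up for illustration:

```python
# Strict validation between pipeline stages: anything outside the expected
# label set fails the pipeline instead of flowing into later prompts.
ALLOWED_LABELS = {"bug", "question", "feature-request"}

class PipelineError(Exception):
    """Raised when a stage produces output outside its allowed range."""

def validate_label(raw_output: str) -> str:
    label = raw_output.strip().lower()
    if label not in ALLOWED_LABELS:
        # A prompt-injected or malformed response dies here rather than
        # being echoed back to the user by a downstream step.
        raise PipelineError(f"unexpected classifier output: {raw_output!r}")
    return label
```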
This approach also allows us to mature our components incrementally over time. The first cut at a newly introduced LLM call might just use a generic model, but as you gather more data, you can use that data to improve quality by fine-tuning that component to be more precise and reliable. That, in turn, allows you to replace a more powerful model with a smaller one that’s purpose-built for the task, which helps reduce cost and latency. (This is funnily reminiscent of Google’s old 43 Rules of Machine Learning, which encourage data scientists to collect basic data before trying to build custom models.)
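One cheap way to set this up from day one is to log every prompt/response pair (plus any user feedback) from the generic model, so the log can later seed a fine-tuning dataset for a smaller replacement. The JSONL chat format below is illustrative, though it roughly matches what common fine-tuning APIs expect:

```python
# Log each generic-model interaction as a JSONL record so it can later seed a
# fine-tuning dataset for a smaller, purpose-built replacement model.
import json
from pathlib import Path

LOG_PATH = Path("finetune_candidates.jsonl")

def log_interaction(prompt: str, response: str, accepted: bool) -> None:
    record = {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ],
        "accepted": accepted,  # e.g. a thumbs-up from the user; filter on this later
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
```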
If you take away one thing from this post, it should be that you shouldn’t be shy about adding more LLM calls to your application. We believe this approach is only going to get more obvious as LLMs become more powerful and cheaper — a trend that has been more or less inexorable for the last year. Relying on a handful of incredibly powerful LLMs isn’t going to cut it because compound systems will (again) be faster, more reliable, and higher quality.