Twitter is not real life, but it’s not too far away in the world of AI. With how quickly things are changing, Twitter is the place to go to find out the news — the latest Llama releases (or advance leaks), the coolest new demos (which may or may not be real), and of course the most recent RAG optimization techniques.
In the context of these announcements, someone is always eager to declare the news to be a game-changer — this is the thing that’s going to dramatically change how we build AI applications, while also erasing OpenAI’s moat in a single day and disrupting every incumbent software provider in the world. That’s a lot for a single model to accomplish!
We hate to rain on everyone’s parade, but most of these predictions are frankly unrealistic. Don’t get us wrong: Llama-3.1, Claude 3.5, and GPT-4o Mini were all big deals. Each one of them opened up new points in the design space for LLM applications.
The thing with AI, though, is that there is no silver bullet! No one improvement is going to change the world as we know it. Of course, LLMs did change the world — but even the GPT-3.5 API and ChatGPT were preceded by a steady stream of technical improvements to GPT-2, GPT-3, and so on. Those wins were about distribution and awareness, not about a sudden shift in the technology. Since then, we’ve continued to see consistent (and impressive!) technical advances, but the reality is that building good AI applications is still going to be about compounding lots of small wins.
If you’ve been reading us for a while, this is really what we’ve been writing about on this blog for the better part of a year. We’ve written about how data matters (again and again), how prompts don’t matter that much, how model choice matters, how evals matter, and how product definition matters (again). All of the lessons in those posts were directly derived from our experiences at RunLLM, as we put the pieces together to build a high-quality technical support assistant. None of these ideas are new!
But what’s become obvious to us is that no single thing that we can do — LLM fine-tuning, data cleaning & evaluation, data engineering, optimized search, etc. — is sufficient to build a good LLM application. It’s really the combination of all these things that creates high enough quality to make something production-ready.
We believe that should affect the way that you build AI applications. Building a moat is not about finding some insurmountable advantage; it’s about compounding small, valuable wins. When looking to make improvements, there’s not just one possible solution. You need to understand the whole tooling stack in order to make consistent progress.
Data is king. We won’t belabor the point because we’ve made it so many times before, but AI systems are still garbage in, garbage out. If you don’t have good data and don’t have a good understanding of your data, you’re going to get bad results. That’s why we always start here!
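To make “understanding your data” a bit more concrete, here’s a rough sketch of the kind of pre-ingestion checks we have in mind: dropping empty, duplicate, and stale documents before they ever reach your index. The Document shape and the staleness cutoff below are illustrative placeholders, not a description of our actual pipeline.

```python
# A rough sketch of pre-ingestion data checks: drop empty and duplicate
# documents and skip stale ones before they ever reach the index.
# The Document shape and staleness cutoff are illustrative placeholders.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Document:
    id: str
    text: str
    updated_at: datetime


def clean_corpus(docs: list[Document], max_age_days: int = 365) -> list[Document]:
    cutoff = datetime.now() - timedelta(days=max_age_days)
    seen_texts: set[str] = set()
    cleaned = []
    for doc in docs:
        text = doc.text.strip()
        if not text:  # empty documents add noise to retrieval
            continue
        if text in seen_texts:  # exact duplicates skew search results
            continue
        if doc.updated_at < cutoff:  # stale docs are a common source of wrong answers
            continue
        seen_texts.add(text)
        cleaned.append(doc)
    return cleaned
```

None of this is glamorous, but checks like these are usually where quality problems actually come from.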
More LLMs is often better. We like to joke that an extra LLM call solves every problem, in the same way that all problems in computer science can be solved by another level of indirection. This is of course dangerous: You can quickly run up costs if you’re not careful. That said, thoughtfully breaking down your problem into multiple LLM calls can yield huge quality improvements.
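As a loose illustration (not a prescription), here’s what such a decomposition might look like for a support-style question: one call to rewrite the query, one to draft a grounded answer, and one cheap call to sanity-check the result. The call_llm and search callables are stand-ins for whatever client and retrieval layer you already use, and the prompts are placeholders.

```python
from typing import Callable


def answer_question(
    question: str,
    call_llm: Callable[[str], str],      # your LLM client (e.g., a chat-completion wrapper)
    search: Callable[[str], list[str]],  # your retrieval layer
) -> str:
    # Call 1: rewrite the user's question into a standalone search query.
    query = call_llm(f"Rewrite this as a standalone search query:\n{question}")
    context = "\n\n".join(search(query))

    # Call 2: draft an answer grounded only in the retrieved context.
    draft = call_llm(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )

    # Call 3: a cheap sanity check on the draft before returning it.
    verdict = call_llm(
        f"Does this answer the question? Reply yes or no.\nQ: {question}\nA: {draft}"
    )
    if verdict.strip().lower().startswith("no"):
        return "I'm not confident enough to answer this one."
    return draft
```

Each call is simpler (and easier to evaluate) than one monolithic prompt, which is exactly why the extra cost is often worth it.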
Know how to measure. If you’re doing this right, you’re going to be experimenting with many different possible improvements to your application. Deploying them without a clear understanding of the impact of your changes is extremely dangerous. Unfortunately, generic LLM evaluations aren’t going to help you out here. We’ve built our own evaluation framework at RunLLM to address this exact problem. (We’re hoping to open source this soon, so keep an eye out!)
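Until we can share ours, here’s a minimal sketch of the shape such an evaluation can take: run a fixed set of test questions through the current pipeline and the candidate change, score both, and look for regressions rather than just the average. The scoring function and the test set are things you’d have to supply yourself; this is not our framework, just the general pattern.

```python
from statistics import mean
from typing import Callable


def compare_pipelines(
    test_questions: list[str],
    baseline: Callable[[str], str],      # current pipeline
    candidate: Callable[[str], str],     # pipeline with the proposed change
    score: Callable[[str, str], float],  # e.g., an LLM judge or rubric, returning 0.0-1.0
) -> None:
    baseline_scores = [score(q, baseline(q)) for q in test_questions]
    candidate_scores = [score(q, candidate(q)) for q in test_questions]

    print(f"baseline mean score:  {mean(baseline_scores):.3f}")
    print(f"candidate mean score: {mean(candidate_scores):.3f}")

    # Look at per-question regressions, not just the averages: a change that
    # helps on average can still break answers that used to work.
    regressions = [
        q
        for q, b, c in zip(test_questions, baseline_scores, candidate_scores)
        if c < b
    ]
    print(f"regressed on {len(regressions)} of {len(test_questions)} questions")
```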
Model choice matters. Looking at all the impressive models that have come out, you might be tempted to say that all models are the same. To some extent, this is true. The reality is that as you add more LLM calls, you’ll find that there are key differences in what models are good at, and those differences compound as you build more complex applications. Knowing which model to use for which task is part of why measurement is so important.
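One lightweight way to act on those differences is a simple routing table that maps each kind of LLM call to the model that has measured best for it. The sketch below is illustrative only: the model names and task labels are placeholders, and the right assignments come out of your own evaluations.

```python
# An illustrative routing table: cheap, fast models for short, structured calls
# and a stronger model for the final answer. Model names and task labels are
# examples only; the right assignment is whatever your evaluations show.
MODEL_FOR_TASK = {
    "rewrite_query": "gpt-4o-mini",       # short, structured output
    "classify_intent": "gpt-4o-mini",
    "draft_answer": "claude-3-5-sonnet",  # long-form, grounded generation
    "validate_answer": "gpt-4o-mini",
}

DEFAULT_MODEL = "claude-3-5-sonnet"


def pick_model(task: str) -> str:
    # Fall back to the strongest model for tasks we haven't measured yet.
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)
```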
This list almost certainly isn’t comprehensive. These are lessons we’ve learned from our own experience at RunLLM, which means that the choice of application and the nuances of our technical approach inevitably shape them.
As a meta point, what’s clear is that you need to understand what options are on the table as you build something. For example, at RunLLM, we consider data engineering to be easy (if a little tedious), whereas changing our fine-tuning pipeline or adding LLM calls is like using a big hammer — we’ll do it, but we only turn to those solutions when simpler ones aren’t sufficient.
Perhaps counterintuitively, we believe this makes our quality advantages more defensible, not less. Having a single magical advantage is great until someone else figures out your magic. With LLMs, quality is all about execution — focusing relentlessly on providing the best answers and iterating quickly. If any one advantage gets competed away, it doesn’t make all that much of a difference.
If you only take one thing away from this post, it should be this: Don’t hang your hat on any one technology or any one approach. Things are changing so fast in this space that you’ll need to be nimble and adapt quickly. If you don’t build that muscle in the quieter times, it’s going to be difficult for you to keep up when things are crazy (which is always).