AI applications are more than an LLM
Programming note: We’ll be off for a couple weeks because we have a team onsite next week, and the week after is July 4th. We’ll be back on 7/11 with a new post!
One of the most remarkable substantive changes over the last couple of years is that AI has gotten orders of magnitude cheaper and more accessible. In 2022, you had to hire a data scientist to clean & analyze your data, build a model, and test it, and then you'd probably need an MLOps or DevOps person to productionize and deploy it. That meant multiple months of work and tens of thousands of dollars in salary, without even accounting for infrastructure.
GPT changed that overnight. With an API key, a Python script, and a few minutes, you could get an app up and running, and it would only cost a few cents. Of course, in the last 18 months, the quality of models has increased while cost per token has continued to decrease.
All of this is probably obvious; it's what's created all the hype over the past year. Unfortunately, it's also created unrealistic expectations about how AI applications should work. These expectations have two key parts: (1) all AI applications should be cheap because LLMs are so cheap; and (2) all AI applications rely on the same foundation models, so performance is comparable across applications.
Both of these misunderstandings are grounded in the belief that the choice of LLM is all that matters for AI. In reality, an LLM no more determines an AI application's performance than a choice of database or programming language determines the quality of a traditional software system.
Let’s deconstruct each of these misunderstandings. You might find yourself thinking that this all sounds like it’d be true of traditional software as well. You’d be right, but as with many of our recent arguments, we find traditional objections being turned up to 11 with AI.
LLM costs aren't the only costs. Building high-quality AI applications takes time, effort, and money. On one end of the spectrum, you can toss GPT-4 and a vector DB together and answer some questions, in the same way that you can nail 5 pieces of wood together and call it a table. Unfortunately, there are plenty of low-quality products out on the market that have this exact architecture without any of the guardrails that would ensure quality.
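To make the low end concrete, here's a minimal sketch of that architecture: embed the question, fetch the nearest chunks, and hand everything to the model. The `index.query` call is a stand-in for whatever vector DB you use, and the prompt is invented for illustration; the point is everything this sketch doesn't do.

```python
# A deliberately naive RAG pipeline: no relevance checks, no citation
# verification, no fallback when retrieval comes back empty or off-topic.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def naive_answer(question: str, index) -> str:
    # `index` is a placeholder for any vector DB client (FAISS, Pinecone, etc.)
    q_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    chunks = index.query(q_embedding, top_k=5)  # hypothetical query API

    prompt = "Answer the question using the context below.\n\n"
    prompt += "\n\n".join(c.text for c in chunks)
    prompt += f"\n\nQuestion: {question}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Whatever comes back goes straight to the user, hallucinated or not.
    return response.choices[0].message.content
```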
On the other end of the spectrum, RunLLM uses 25+ LLM calls to 4 different LLMs to answer each question we get. That's how we're able to achieve the best result quality. This is just one example; we've talked about many of the things we think are required to build good AI applications over the last year.
The applications that are built thoughtfully and carefully are naturally going to be more expensive, and there are two reasons for this. First, as we hinted at above, building an application with a clear focus on quality requires doing more work; we don't just use LLMs to generate answers but also to analyze & tag data, to identify irrelevant questions, and so on (see the sketch below). Each of those operations incurs costs in order to maximize quality and enforce guardrails. Second, the complexity of our inference pipeline requires constant experimentation and iteration (new models, tweaked prompts, reworked data engineering), which takes significant engineering time.
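To illustrate why those extra calls add up, here's a hedged sketch of one such guardrail: a pre-check that decides whether a question is even in scope before the more expensive answering pipeline runs. The prompt, model choice, and function name are invented for this example, not a description of any particular product's internals.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_in_scope(question: str, product_name: str) -> bool:
    # One extra LLM call purely for quality control: it never produces
    # user-facing text, but it costs tokens on every single question.
    check = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheaper model is often enough for classification
        messages=[{
            "role": "user",
            "content": (
                f"Is the following question about {product_name}? "
                f"Answer only YES or NO.\n\nQuestion: {question}"
            ),
        }],
    )
    return check.choices[0].message.content.strip().upper().startswith("YES")
```

Multiply checks like this (tagging, retrieval filtering, answer validation) across a pipeline, and it's easy to see how a single question fans out into dozens of model calls.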
The difference in price between high-quality and cheap solutions might be surprising at first, but we've actually found that some of our best and most engaged customers at RunLLM are those who have: (1) built or tried other solutions; (2) found them lacking in quality; and (3) decided to prioritize the highest-quality solution available.
Not all AI applications are built the same. An application that combines a vector DB with GPT-4 ostensibly does the same thing as a carefully engineered product, but the two obviously work in very different ways. Unfortunately, the prevalence of quickly-built solutions has led to bad customer experiences, which in turn leads customers to paint AI with a broad brush: AI simply isn't ready for production because this particular application didn't work well.
LLMs are of course incredibly powerful, but they're not omnipotent (yet?). We all know just how easy it is to trick an LLM into saying silly things. As we described above, thoughtful products are the result of significant investments of time & money into ensuring that interactions are high-quality and enterprise-ready. Just because two applications use GPT-4 (or Claude, or Gemini, or…) doesn't mean you'll get the same performance out of both of them. In reality, data engineering, search, prompting, fine-tuning, and a number of other configuration choices make a huge difference.
LLMs are incredibly powerful tools, but they're just that: tools. Those tools can be used well to build high-quality, reliable applications, and they can also be used to build slapdash applications that generate low-quality results. To torture the analogy, a hammer can be used to build a beautiful piece of furniture or something shoddy & messy (or even to destroy something). LLMs are no different.
At the end of the day, you get what you pay for. As companies move beyond checking the AI box and look for something that's high-quality, they'll need to think carefully about what options exist and how to evaluate them.