LLMs are the variable interest credit card of tech debt
Silicon Valley companies have historically been famous for “Not Invented Here” syndrome — companies prioritize building tools in-house rather than using an off-the-shelf solution. Traditionally, the argument against “Not Invented Here” goes something like this:
It will take (something like) 3 engineers about 3 months to build the product. After that, you’ll require a quarter of 1 engineer’s time to fix, update, and improve the product.
Assuming each engineer costs the company about $200K per year, that’s $150K to build the product and $50K to maintain it over time.
As a service provider, we’ll give you the product today (instead of 3 months from now), it’ll work immediately, and we’ll do it for a fraction of the cost (say, $20K). Of course, with a whole team improving the product, it’ll get better faster.
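To put rough numbers on that argument, here’s the back-of-envelope math in code form (the figures are the illustrative assumptions above, not real quotes):

```python
# Back-of-envelope version of the build-vs-buy math above.
# All figures are the illustrative assumptions from the argument, not real quotes.
ENGINEER_COST_PER_YEAR = 200_000  # fully loaded cost per engineer

# Build: ~3 engineers for ~3 months up front, then ~1/4 of an engineer ongoing.
build_cost = 3 * (3 / 12) * ENGINEER_COST_PER_YEAR      # ~$150K up front
maintenance_per_year = 0.25 * ENGINEER_COST_PER_YEAR    # ~$50K per year

# Buy: the vendor delivers today for a flat fee (illustrative).
buy_cost = 20_000

print(f"Build: ${build_cost:,.0f} up front + ${maintenance_per_year:,.0f}/year")
print(f"Buy:   ${buy_cost:,.0f} today")
```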
Traditionally, machine learning and AI tools haven’t run into the build-vs-buy debate as often as traditional software. Machine learning has been something of a black box, more math and alchemy than software engineering, and the prospect of recreating a packaged machine learning tool would be daunting.
LLMs have changed the game. AI is now just an API call away! This has broadly been a positive trend — AI is orders of magnitude more accessible, and tons of developers have entered the space. However, it’s also reignited the build vs. buy debate. In reality, LLMs are the variable-interest credit card of tech debt.
LLMs are easy to get started with and quick to show results, but building production-ready software is much more challenging than it seems. With experimentation costs, supporting infrastructure, and a constantly shifting landscape, you’ll be paying the cost of building and maintaining an app far longer than you might think. Here’s why you shouldn’t just build most AI products yourself.
Time is not linear in AI
In pre-LLM machine learning, model builders would spend days cleaning data and tweaking parameters. The results of these efforts wouldn’t always be immediately obvious. Getting a model right would take many iterations of data exploration & cleaning followed by many iterations over parameter combinations, especially as models got larger.
LLMs aren’t quite so finicky: you’re mainly focused on prompts, along with hyperparameters like temperature. Nonetheless, anyone who’s built with LLMs knows that there’s still plenty of experimentation involved. Getting the prompt just right, making sure the model doesn’t hallucinate, and avoiding prompt injection all take real trial and error. As your product matures, you’ll likely create different prompts for different settings. AI projects will naturally have to account for this uncertainty, which makes planning and shipping a project from scratch difficult.
This type of experimentation requires a new kind of discipline and time commitment. You’ll need to track your prompt experiments and have a set of tests to evaluate the quality of your experiments. Most of all, you’ll need to have the patience (and creativity) to experiment your way into high-quality results.
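As a rough illustration, here’s a minimal sketch of what tracking prompt experiments against a fixed evaluation set might look like. The call_llm stub, the test cases, and the substring-based scoring are all placeholders you’d swap for your own client and evaluation criteria.

```python
import json
from datetime import datetime, timezone

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder: swap in your actual model client (OpenAI, Anthropic, local, ...)."""
    return "stubbed response"

# A tiny, illustrative eval set; real suites are larger and domain-specific.
EVAL_CASES = [
    {"question": "What is our refund window?", "must_contain": "30 days"},
    {"question": "Do we support SSO?", "must_contain": "SAML"},
]

def run_experiment(prompt_template: str, temperature: float) -> dict:
    """Run one prompt variant over the eval set and log the result for comparison."""
    results = []
    for case in EVAL_CASES:
        output = call_llm(prompt_template.format(question=case["question"]), temperature)
        results.append({"question": case["question"], "passed": case["must_contain"] in output})

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_template": prompt_template,
        "temperature": temperature,
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "results": results,
    }
    # Append every run to a log file so experiments stay comparable over time.
    with open("prompt_experiments.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

print(run_experiment("Answer using only our docs: {question}", temperature=0.0))
```

Even a toy harness like this is extra code to build and maintain before you’ve shipped a single user-facing feature.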
It’s more than just the model
LLM applications are more than just an API call to OpenAI. LLMs are obviously incredibly powerful, but they’re not enough on their own, and a single API call — no matter how well-crafted — is not going to power your application. Different models are well-suited for different tasks. To ship production-ready features, you’ll need to combine multiple LLMs, vector databases & other search indices, and data management.
For example, our production inference pipeline at RunLLM includes calls to 3 different fine-tuned LLMs combined with 2 different search calls to retrieve the right data. Depending on customers’ needs, this pipeline can be configured with different parameters or to return results in different formats.
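For illustration only, here’s a heavily simplified sketch of what a pipeline shaped like that looks like structurally; the model names, search functions, and prompts are hypothetical stand-ins, not RunLLM’s actual code.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    rewrite_model: str = "finetuned-query-rewriter"    # hypothetical model names
    answer_model: str = "finetuned-answer-generator"
    top_k: int = 5
    output_format: str = "markdown"

def search_vector_index(query: str, top_k: int) -> list[str]:
    """Stand-in for a similarity search against a vector database."""
    return []

def search_keyword_index(query: str, top_k: int) -> list[str]:
    """Stand-in for a traditional keyword / BM25 index."""
    return []

def call_model(model: str, prompt: str) -> str:
    """Stand-in for whatever LLM client you use."""
    return ""

def answer(question: str, cfg: PipelineConfig) -> str:
    # Step 1: one model rewrites the user question into a better search query.
    query = call_model(cfg.rewrite_model, f"Rewrite as a search query: {question}")

    # Step 2: retrieve context from two different indices and merge the results.
    context = search_vector_index(query, cfg.top_k) + search_keyword_index(query, cfg.top_k)

    # Step 3: a second model drafts the answer from the retrieved context,
    # in whatever format this customer's configuration asks for.
    prompt = (
        "Context:\n" + "\n".join(context) +
        f"\n\nQuestion: {question}\nRespond in {cfg.output_format}."
    )
    return call_model(cfg.answer_model, prompt)
```

Even in this stripped-down form, every stand-in function hides its own infrastructure, tuning, and failure modes.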
Chaining models like this compounds the experimentation costs described above: you’re now contending with the interactions between different models in addition to getting each model’s output just right. In practice, there’s more: You’ll have to manage data ingestion & consistency, testing across multiple APIs, and tracking cost and performance.
Again, none of these challenges are insurmountable, but it’ll further slow down your ability to ship.
The world is always changing
It’s safe to say that the state of the art in LLMs is not static. There are new research papers, blog posts, model releases, and best practices every time we open Twitter (and we open Twitter a lot!). It’s tough for even us to keep up with, and we’re doing this full-time.
The benefit of drinking from the firehose, however, is that we’re able to constantly find ways to improve the quality & consistency of our models while adding new features and stronger guardrails.
If you’re building and shipping your own AI features from scratch, you’ll have to manage these changes. The base models themselves are constantly changing, which means that even maintaining the status quo requires work. A prompt that worked on an older version of the model might have drastically different results after an update. Keeping up with the state of the art requires more than tracking model updates — reading research papers, trying out new techniques, and of course, more experimentation.
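One concrete habit that helps (a sketch, not a complete answer) is to pin an explicit model snapshot and re-run a small regression suite before adopting a newer one. The snapshot names and checks below are illustrative:

```python
# Pin a dated model snapshot rather than a moving alias, so updates are deliberate.
PINNED_MODEL = "gpt-4o-2024-08-06"      # illustrative; use whatever snapshot you've validated
CANDIDATE_MODEL = "gpt-4o-2024-11-20"   # hypothetical newer snapshot under evaluation

# A tiny regression suite; real ones cover your actual prompts and edge cases.
REGRESSION_CASES = [
    {"prompt": "Extract the total from: 'Amount due: $1,234.56'", "expect_substring": "1,234.56"},
    {"prompt": "Answer yes or no: is 17 prime?", "expect_substring": "yes"},
]

def call_model(model: str, prompt: str) -> str:
    """Stand-in for your LLM client call."""
    return ""

def pass_rate(model: str) -> float:
    """Fraction of regression cases whose output contains the expected substring."""
    passed = sum(
        case["expect_substring"].lower() in call_model(model, case["prompt"]).lower()
        for case in REGRESSION_CASES
    )
    return passed / len(REGRESSION_CASES)

if __name__ == "__main__":
    # Only switch PINNED_MODEL once the candidate matches or beats it on the suite.
    print(f"pinned:    {pass_rate(PINNED_MODEL):.0%}")
    print(f"candidate: {pass_rate(CANDIDATE_MODEL):.0%}")
```

Every one of those checks is more code to write and maintain just to stand still.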
This is where the variable interest rate is perhaps the most painful: Keeping track of model changes and new techniques is not simple and can require fundamental changes to your applications. In practice, the time your team spends maintaining these features is only going to increase, and the likelihood that you unintentionally introduce bugs grows along with it.
Running a startup building an LLM-powered product, we’re obviously biased. Taking a step back, there are of course a number of situations where teams should be building (or have no choice but to build) AI features from scratch. But on balance, enterprises that are looking for AI-driven solutions should be looking to third-party services: they will ship faster & more reliably and improve significantly more over time. If it’s not your moat, you probably shouldn’t be spending time on it!