LLMs are becoming commodities
In a world of uniform models, it's increasingly hard to differentiate.
As LLM releases have accelerated this year, we’ve all become a little inured to the news. It’s still true that long-hyped model releases like Llama 3 generate a decent amount of press. But changes that would’ve felt massive six months ago, like GPT-4o, have received relatively little attention for their modeling advancements (controversy notwithstanding).
On the relative timescale of previous technology advancements, this feels a little wild: GPT-4o is genuinely a huge step up from a technology perspective. For reference, Moore’s Law said that chips would double in power every 18 months. Assuming GPT-4o is equivalent in text generation quality to GPT-4 Turbo, 4o halved cost and latency in only six months. Software iteration is always faster than hardware iteration, but the rate of acceleration is still incredible, especially given that we were in a GPU crunch six months ago!
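To put rough numbers on that comparison, here’s a back-of-the-envelope calculation of the implied annualized improvement rates; the only inputs are the 18-month Moore’s Law cadence and the 2x-in-6-months figure above:

```python
# Back-of-the-envelope: annualize a "2x improvement every N months" cadence.
def annualized_factor(doubling_months: float) -> float:
    """Improvement factor per 12 months, given one doubling every `doubling_months`."""
    return 2 ** (12 / doubling_months)

moores_law = annualized_factor(18)  # chips: 2x every 18 months -> ~1.59x per year
gpt_4o = annualized_factor(6)       # GPT-4 Turbo -> 4o: 2x in 6 months -> 4x per year

print(f"Moore's Law: ~{moores_law:.2f}x per year")
print(f"GPT-4 Turbo to GPT-4o: ~{gpt_4o:.2f}x per year")
```

If that cadence held, serving cost and latency would halve twice a year, versus a chip doubling once every 18 months.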
The rate of acceleration has huge implications for LLM providers. Despite the apparent lead that OpenAI, Anthropic, Meta, and Google have, there are still new entrants raising unbelievable amounts of money. The pressure is now on all these companies — the leaders and the disruptors alike — to differentiate themselves.
As we’ve touched on briefly, model quality is no longer a clear differentiator. Over the course of this year, we’ve seen Claude-3 Opus catch GPT-4 Turbo in terms of Elo, we’ve seen Llama 3 get within ~4% of GPT-4 Turbo, and most recently, we’ve seen Gemini-1.5 Pro get within 1.5% of GPT-4o. We’re not even halfway through the year!
Judging from this rate of progress, it’s a matter of when, not if, Claude and Llama catch up to the current state of the art. This has been reflected in our own experience at RunLLM: We’ve switched portions of our production inference pipeline to use Llama-3 (both 8B and 70B), getting lower latency, lower cost, and quality equal to (or better than) the combination of GPT-3.5 and 4 we were using earlier. We still rely on GPT-4o for the final step in our pipeline, but that’s no longer a given.
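For a sense of what that kind of switch looks like mechanically, here’s a minimal sketch of per-step model routing. The step names, model assignments, and the `call_model` helper are illustrative placeholders, not our actual pipeline:

```python
# A minimal sketch of per-step model routing. The steps, model assignments,
# and call_model helper are illustrative placeholders, not RunLLM's pipeline.

STEP_TO_MODEL = {
    "classify_question": "llama-3-8b",   # cheap and fast is good enough here
    "retrieve_and_rank": "llama-3-70b",  # needs more reasoning headroom
    "draft_answer":      "llama-3-70b",
    "final_answer":      "gpt-4o",       # the one step still on a frontier model
}

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a provider-specific client call (OpenAI, Together.AI, etc.)."""
    return f"<{model} output for: {prompt[:40]}...>"

def run_pipeline(question: str) -> str:
    """Thread the question through each step, each on its assigned model."""
    context = question
    for step, model in STEP_TO_MODEL.items():
        context = call_model(model, f"[{step}] {context}")
    return context

print(run_pipeline("How do I rotate my API keys?"))
```

Once several models clear the quality bar for a given step, the choice for that step becomes a cost and latency decision rather than a capability one.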
That means model providers must look elsewhere for differentiation.
At first glance, the obvious answer is multimodality. GPT-4o’s headline improvement was the seamless multimodality integrated into a single model. But Gemini had already demoed its multimodality (however dubiously at first), and with its Android integrations, Google has surfaced this functionality to end users better than OpenAI has. It’s likely that other models will follow suit with similar techniques (even if the packaging is different), meaning further commodification of these features.
For consumers, that puts the product experience front and center. As our CTO, Chenggang, said about a bad chatbot answer recently, “If I didn’t know any better, it looks like yet-another-chatbot answer.” That’s a major concern for anyone building with LLMs, but especially for model providers. Depending on where you started and what your preferences are, you might find Claude’s or GPT’s style more appealing, but it’s hard to quantify the substantive differences between the models. You can find data pointing to specific task differences, like Claude-3’s lower coding Elo score compared to GPT-4o’s, but the differences are small.
One thing that’s kept us with OpenAI — beyond the force of habit — is access to Dall-E 3 (which we use for this blog’s images), but this isn’t a huge use case for us. OpenAI’s tried to extend this advantage with the GPTs store, but despite our predictions from earlier this year, this hasn’t really established itself. Given where we are today, it’s hard to imagine that the GPTs store is a huge draw for consumers.
For enterprises consuming LLMs, the only argument that matters is providing the best quality at the lowest price. From that perspective, increasing commodification is great — it means more opportunity for cost arbitrage. From the perspective of the model providers, focusing on cost and speed creates a race to the bottom. Competing on cost means that someone will always undercut you, but whoever has the biggest war chest (see: Uber vs. Lyft) will last the longest. Techniques like over-provisioning and speculative or delayed execution also allow providers to convert money into improved speed — another expensive race to the bottom.
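To make the arbitrage concrete, here’s a heavily simplified sketch that picks the cheapest model clearing a quality bar. All the numbers are made-up placeholders; real prices and eval scores shift month to month:

```python
# Sketch of cost arbitrage across commodified models: pick the cheapest
# option that clears a quality bar. All numbers are made-up placeholders;
# real prices and benchmark scores shift constantly.
OFFERINGS = [
    # (model, $ per 1M output tokens, score on your own eval suite)
    ("frontier-model-a", 15.00, 0.92),
    ("frontier-model-b", 15.00, 0.91),
    ("open-model-70b",    0.90, 0.89),
    ("open-model-8b",     0.20, 0.81),
]

def cheapest_above_bar(offerings, quality_bar):
    """Return the lowest-price offering meeting the bar, or None."""
    viable = [o for o in offerings if o[2] >= quality_bar]
    return min(viable, key=lambda o: o[1]) if viable else None

# If a task only needs a 0.85, the open 70B model wins on price.
print(cheapest_above_bar(OFFERINGS, quality_bar=0.85))
```

Every price cut by one provider re-runs this calculation for every customer, which is exactly what makes pure cost competition a race to the bottom.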
All this is to say that LLMs are becoming increasingly commodified, and that puts the leading model providers in the strongest position. It’s becoming harder to understand the differences between model provider offerings; even standard system metrics like latency and throughput are poorly defined and hard to measure. If everything is the same, then perhaps the only remaining difference is the perceived safety of going with an established provider.
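As one example of why these metrics are slippery: for a streaming model, “latency” could mean time to first token or total generation time, and the two can diverge wildly. A quick sketch, where `stream_tokens` is a simulated stand-in for any provider’s streaming API:

```python
import time

def stream_tokens(prompt: str):
    """Simulated stand-in for any provider's streaming API."""
    time.sleep(0.3)           # simulated time until the first token arrives
    for token in ["Hello", ",", " world", "!"]:
        yield token
        time.sleep(0.05)      # simulated inter-token delay

start = time.monotonic()
first_token_at = None
count = 0
for token in stream_tokens("..."):
    if first_token_at is None:
        first_token_at = time.monotonic() - start
    count += 1
total = time.monotonic() - start

# Two different "latency" numbers for the same request; which one gets
# reported (and how it's measured) changes any comparison.
print(f"time to first token: {first_token_at:.2f}s")
print(f"total time: {total:.2f}s, throughput: {count / total:.1f} tokens/s")
```

A provider can look great on one number and mediocre on the other, so comparisons are only meaningful when everyone agrees on the definition.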
For consumers and enterprises alike, there are fewer reasons to make the not-safe choice. In turn, that puts newer entrants in a difficult position. There are smarter people than us thinking about how to generate adoption, so we won’t go so far as to say that these challenges are unsolvable, but they certainly put newer model providers in an unenviable position. The focus for most of these orgs will continue to be on model quality, but as LLMs get better, model quality will be even harder to discern than it already is. All of this also assumes that every model provider can continue to improve at the same rate, which isn’t guaranteed: the latest legislation from the state of California, for example, smells a little bit like regulatory capture to us.
Meta’s commitment to an open-weight approach is an interesting twist that we haven’t yet discussed. Smaller providers obviously won’t be able to compete with Meta on resources, but Llama-3 seems to be almost single-handedly supporting the market for third-party hosting (companies like Together.AI, Fireworks.AI, etc.). If it weren’t for Llama-3, it’s not clear that open-weight models would have nearly the usage they do, especially with Mistral’s wavering commitment to openness.
We’re not sure exactly what the long-term effect of Meta’s strategy will be. On one hand, Meta seems to have little intention of being a model provider, so this may be a red herring: infrastructure providers will benefit from the open Llama models without there being a pattern for other providers to follow. On the other hand, fostering open model usage may lead to broader ecosystem adoption that benefits smaller providers.
It’s worth emphasizing again that this puts the leading providers (OpenAI, Anthropic, Google, and Meta) in an incredibly strong position. Given the rate of change of LLMs, it’s hard to say anything with certainty. But as a general principle, commodification and too much choice mean that most people will stick with what they know. This, in turn, leads to all the traditional arguments around creating data & feedback flywheels that we won’t recapitulate here.
For everyone else, this means finding a niche and differentiating from the defaults. Whether it’s size, use cases, or expertise, we all need a great reason to switch away from what we know. Not only is it okay to build non-general purpose LLMs, it’s probably the right move.