Since the beginning of the internet, data flywheels have created giant companies — first Google, quickly followed by social media companies, and now, OpenAI and other LLM providers.
OpenAI alone likely has more usage than all the other model providers combined, with Google and Anthropic making up most of the rest. These companies are collecting enormous amounts of data — not only can they see user prompts, they also get explicit feedback (thumbs up/thumbs down) as well as implicit feedback (e.g., asking more questions if you didn’t get the answer you wanted). Better yet, they are also at the forefront of customer conversations, understanding where LLM users are pushing the boundaries and where the models fail.
All of this is grist for the mill of future model training, and investment is only accelerating: Anthropic CEO Dario Amodei recently predicted that within 2 years we'll have models that cost $10B to train.
Model quality is definitely a big advantage, but it’s only a part of the story. The more impressive moat these companies have is the scalability of their infrastructure and the quality of their service. Let’s look at fine-tuning APIs as an illustrative example.
Our team at RunLLM has been running experiments recently with the GPT fine-tuning API. A single fine-tuning run on GPT-3.5 costs us anywhere from $4 to $12 and takes about 1-1.5 hours to fine-tune over roughly 1 million tokens.
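For readers who haven't run one of these jobs, here's a minimal sketch of what a run looks like with the OpenAI Python SDK (v1+); the training file name is a placeholder, not our actual dataset:

```python
# Minimal sketch: launch a GPT-3.5 fine-tuning job with the OpenAI
# Python SDK (v1+). "train.jsonl" is a placeholder file containing
# one chat-format example per line.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training data.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off the run; OpenAI handles the GPUs from here.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

# Poll for status; over ~1M tokens, our runs took 1-1.5 hours.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```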
Meanwhile, a single p4d.24xlarge on AWS costs $32.77 per hour on-demand or $19.22 per hour with a 1-year reservation. Each machine comes with 8 NVIDIA A100 GPUs. Assuming OpenAI uses only 8 GPUs to fine-tune GPT-3.5, it's 3-8x cheaper to use OpenAI than to rent a p4d.24xlarge from Amazon, before even accounting for the technical expertise required to deploy and run the jobs.
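A quick sanity check on that multiple, using our observed run times; the exact ratio depends on on-demand vs. reserved pricing and which ends of the ranges you compare:

```python
# Back-of-the-envelope: one fine-tuning run on a rented p4d.24xlarge
# (8x A100) vs. the $4-12 OpenAI charges us per run.
ON_DEMAND = 32.77       # $/hour, p4d.24xlarge on-demand
RESERVED = 19.22        # $/hour, 1-year reservation
RUN_HOURS = (1.0, 1.5)  # our observed fine-tuning run times
OPENAI = (4.0, 12.0)    # $ per run via the fine-tuning API

for label, rate in [("on-demand", ON_DEMAND), ("reserved", RESERVED)]:
    low, high = RUN_HOURS[0] * rate, RUN_HOURS[1] * rate
    # Cheapest AWS run vs. priciest OpenAI run, and vice versa.
    print(f"{label}: ${low:.2f}-${high:.2f} per run, "
          f"{low / OPENAI[1]:.1f}-{high / OPENAI[0]:.1f}x OpenAI's price")
```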
AWS is obviously charging a markup on its EC2 instances, but OpenAI's costs include training and storing the model weights (likely reasonably cheap with LoRA), building & maintaining the fine-tuning infrastructure, and the expertise needed to rack & stack thousands of GPUs internally.¹
If we had a dense enough workload, perhaps we could justify renting a p4d.24xlarge at the 1-year reserved rate. At $19.22 per hour, we'd pay about $168K per year.
Let's assume again that we're using LoRA to fine-tune a model on 8 A100s, at perhaps 2 hours per run. That's 12 fine-tuning runs per day on these GPUs, or 4,380 fine-tuning runs per year. We'll also allocate one engineer to deploy, check, and validate fine-tuning runs full-time (we don't envy them!), at perhaps $200K per year. (And let's assume we have plenty of data on hand to keep fine-tuning jobs going constantly.)
At $368K ($168K AWS + $200K talent), we're paying around $84 per fine-tuning run, roughly 7-21x more than what we pay OpenAI!
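Here's the full scenario as a script, in case you want to plug in your own rates (the $200K engineer salary is our assumption):

```python
# Amortized cost per run if we keep a reserved p4d.24xlarge saturated
# all year with 2-hour LoRA fine-tuning runs, plus one full-time
# engineer to babysit the jobs (the $200K salary is an assumption).
RESERVED = 19.22       # $/hour, 1-year reservation
ENGINEER = 200_000     # $/year, fully loaded
HOURS_PER_RUN = 2

aws_annual = RESERVED * 24 * 365              # ~$168K
runs_per_year = (24 // HOURS_PER_RUN) * 365   # 4,380 runs
cost_per_run = (aws_annual + ENGINEER) / runs_per_year

print(f"AWS: ${aws_annual:,.0f}/year; total: ${aws_annual + ENGINEER:,.0f}/year")
print(f"${cost_per_run:,.0f} per run vs. $4-12 via OpenAI")
```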
And this is just to fine-tune a model. While per-token inference for fine-tuned GPT-3.5 is 10x more expensive than base GPT-3.5, it's still about 3x cheaper than GPT-4 ($0.012 per 1K tokens vs. $0.03 per 1K tokens). Serving a model on your own hardware is significantly more expensive unless you can reach a large enough scale to fully utilize your serving hardware or scale elastically (impossible when GPU availability is limited).
We'll give the back-of-the-envelope math a rest, but it proves a critical point: the major LLM providers' advantage doesn't just lie in the quality of their models but in their ability to serve models at extreme economies of scale. It simply doesn't make sense for most organizations to chase their own open-source LLM deployments. They'll be sinking needless time, talent, and money into an unsolvable optimization problem, while competitors move faster and likely achieve better quality by layering on top of OpenAI.
Of course, that doesn’t mean that open-source models have no future. We touched on this last week, and our friend Nathan Lambert at Interconnects recently wrote about the future of open-source models as well. Open-source models must get smaller over time to reduce the cost, complexity, and time required to customize and run them.
For everything else, the major LLM providers will dominate.
You might be wondering if OpenAI is eating the cost of fine-tuning and serving in order to build market share, much as Uber and Lyft famously did in the rideshare market for years. The ridesharing companies were never able to stamp out competition the way many predicted, but the switching costs for software infrastructure are significantly higher than the switching costs between two apps on your phone. Even if prices eventually go up, these companies will dominate the market, and they can raise prices substantially before they approach the cost of hand-rolled models.
It's also worth noting that we're comparing off-the-shelf AWS GPU prices to OpenAI's likely heavily subsidized GPU pricing on Azure, but the scale of OpenAI's usage only cements its advantage here.
¹ Pre-empting the haters: OpenAI is positive-margin. It's expensive to serve inference, but you can still make money.