Model inference, model products, and AI applications
Some thoughts on where model providers are headed
GPT-5 is the first major model release from OpenAI that we didn’t immediately jump to put in production at RunLLM. To be clear, we tried, but the models that were released simply didn’t make sense for us to prioritize amongst the many other things that we could spend our time on, and the models that we would have normally prioritized didn’t show meaningful improvements on our benchmarks.
Reflecting on why, we have a few key takeaways about the evolution of LLM providers over the last couple of years and where the API-first products are headed. Our headline takeaway is that there are early signs of a divergence between pure LLM inference and smarter LLM-based “products.” As application builders, we have a distinct need for control over what the underlying models are doing, so LLM inference is more appealing to us than a more tightly coupled product. GPT-5 is the first major model release that feels like a product rather than a pure LLM, which is part of the reason why we feel it doesn’t quite meet our needs.
Consumer apps vs. developer APIs. We — like many others — complained about OpenAI’s product UX in the days of the model picker with 7+ models. Giving a user that many options when they wanted to run a simple query didn’t make much sense. The new UX is dramatically better, even if it had a rocky rollout (and even if GPT-5 has some room for improvement compared to previous iterations like o3). Developer APIs, on the other hand, are different: application builders make thoughtful, informed, often empirical choices about which options to pick, so having a variety of options is good. The options let an application builder trade off cost, latency, and quality depending on the complexity of each task. That said, the previous model lineups were more intuitive — how the reasoning effort in gpt-5 compares to a smaller model size’s output is not immediately obvious to us, and reasoning models in general seem to be higher variance (more on that below). The migration guide in the OpenAI docs helps, but we didn’t find the suggested analogs to be drop-in replacements in our experiments. As you can tell, there’s no simple answer here, but there’s definitely room for improvement.
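To make that tradeoff concrete, here is a minimal sketch of the two knobs an application builder now has to reason about. It assumes the OpenAI Python SDK, the Responses API’s reasoning.effort parameter, and illustrative model names and prompts:

```python
# A minimal sketch of the two knobs described above, using the OpenAI
# Python SDK. Model names, prompts, and the reasoning.effort values are
# assumptions for illustration, not a benchmarked recommendation.
from openai import OpenAI

client = OpenAI()

# Old-style knob: pick a model size that matches the task's complexity.
small = client.chat.completions.create(
    model="gpt-4.1-mini",  # cheaper and faster; fine for narrow tasks
    messages=[{
        "role": "user",
        "content": "Answer yes or no: is this question about billing? ...",
    }],
)
print(small.choices[0].message.content)

# New-style knob: pick a reasoning effort on a single model family. Which
# effort level corresponds to which old model size is not obvious upfront.
low_effort = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "low"},  # assumed values: minimal/low/medium/high
    input="Answer yes or no: is this question about billing? ...",
)
print(low_effort.output_text)
```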
Reasoning applications vs. agentic applications. We noticed a wave of GPT-5-based product announcements after the launch, though fewer than before. What sticks out to us is that there are applications that benefit from throwing more compute at a single LLM call for a particular task — examples like the ones Box CEO Aaron Levie posted around document processing fall into this bucket. For these use cases, switching to a more powerful reasoning model makes perfect sense. However, for agentic applications like RunLLM, improving quality often comes from composing multiple LLM calls rather than making one big call to a powerful model. In fact, we’ve found that reasoning models — dating back to o1 — tend to increase variance in our performance in a way that our customers (and therefore our team) find unnerving. As a result, we’ve yet to put a reasoning model in our production inference pipeline; instead, we primarily rely on non-reasoning LLMs that are scoped to very narrow tasks and composed into more powerful pipelines (see the sketch after the next paragraph). Some of this might be fixed with better prompt tuning, but the variance in latency and output has led us to be more conservative than before with these models.

Open-weight vs. closed-weight. With all of the quality open-weight models released over the last 6 months, it feels increasingly viable to construct high-quality applications that rely on open models. The main motivation is the sense that with open weights, there isn’t any confusing logic hiding under the hood. Again, this wasn’t a concern 1-2 years ago, but as the model providers try to build more verticalized model products, it’s a concern that might begin to emerge over the next few model releases. This will feel especially true if OpenAI moves to deprecate the GPT-4.1 family of models, which we still rely on heavily and would be quite sad to lose. Being forced to switch to a reasoning model for a simple task like filtering out irrelevant questions is not our first choice, and the non-reasoning version of GPT-5 is too one-note to replace everything we do, especially since the smaller model is not available via API. For the reasons described above, we find that reasoning models aren’t (yet) the best fit for us. In that world, relying on open-source models from unbiased infrastructure feels like a safer bet — or at least leaning more heavily on model providers like Google, which seem closer to exposing “just the model.” We’re not necessarily confident that this will be an issue for us, but we’re cognizant that it’s a risk area for our product today. If Google and Anthropic feel the need to follow suit with OpenAI’s latest releases, there will be a strong motivation to switch to more decentralized inference services that are transparent about model behavior rather than more abstracted services.
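As referenced above, here is a minimal sketch of that composition pattern: several narrowly scoped, non-reasoning calls chained into a pipeline, with a relevance filter as the cheap first step. It assumes the OpenAI Python SDK’s chat completions interface; the model name, prompts, and helper function are illustrative, not our actual pipeline.

```python
# A sketch of the composition pattern described above: several narrowly
# scoped, non-reasoning LLM calls chained into a pipeline, instead of one
# big call to a reasoning model. Model name, prompts, and helpers are
# illustrative assumptions, not RunLLM's actual pipeline.
from openai import OpenAI

client = OpenAI()

# A non-reasoning model; an open-weight model served behind an
# OpenAI-compatible endpoint would slot in the same way.
MODEL = "gpt-4.1"


def ask(system: str, user: str) -> str:
    """One narrowly scoped LLM call: fixed system prompt, single user turn."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content.strip()


def answer_pipeline(question: str, docs: list[str]) -> str | None:
    # Step 1: a cheap relevance filter -- the kind of simple task where a
    # reasoning model is overkill.
    relevant = ask(
        "Answer only 'yes' or 'no': is this a support question about our product?",
        question,
    )
    if relevant.lower().startswith("no"):
        return None  # irrelevant question; stop early

    # Step 2: pick which documents are worth reading for this question.
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    picks = ask(
        "Reply with the comma-separated numbers of the documents relevant "
        "to the question, and nothing else.",
        f"Question: {question}\n\nDocuments:\n{numbered}",
    )
    indices = [int(p) for p in picks.split(",") if p.strip().isdigit()]
    context = "\n".join(docs[i] for i in indices if 0 <= i < len(docs))

    # Step 3: draft an answer grounded only in the selected documents.
    return ask(
        "Answer the question using only the provided context.",
        f"Question: {question}\n\nContext:\n{context}",
    )
```

Each step is simple enough for a small, fast model, and the pipeline’s behavior stays predictable because every call has one narrow job.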
Model inference vs. model products. Our bigger-picture observation on this front is that pure inference seems to be diverging from the model as a “product.” In 2023 and 2024, an open-source inference product provided similar output to a vertically integrated model provider API — just with different weights. It’s starting to feel like that is changing, with the model providers trying to own more of the stack and fold more functionality into the system itself. Reasoning and tool use are the two most obvious examples. Both are great from a consumer perspective, but they don’t fit our needs as application builders, and that is going to affect the direction of future model development. If the major model providers are in an arms race to develop the best model for consumer applications, that model might start to diverge from what a useful, composable building block for applications looks like. We don’t know enough about the development roadmaps at the frontier labs to say for certain, but we’ve certainly heard engineers from these teams say things like “a good enough model will solve enterprise problems” (which we strongly disagree with). Products becoming increasingly vertically integrated and opinionated would undoubtedly push us toward open models.
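To make that divergence concrete, here is a hedged sketch contrasting pure inference, where the application owns every step, with a product-style call where the provider runs reasoning and tools inside the system. The local endpoint, model names, and the built-in web search tool are assumptions for illustration:

```python
from openai import OpenAI

# Pure inference: an open-weight model behind an OpenAI-compatible server
# (e.g., vLLM). The application controls every step; the endpoint just
# runs the weights. URL and model name are illustrative assumptions.
local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
raw = local.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(raw.choices[0].message.content)

# Model as a product: the provider decides when to reason and when to
# search the web inside a single call; the application sees the result
# but not the intermediate steps.
cloud = OpenAI()
packaged = cloud.responses.create(
    model="gpt-5",
    tools=[{"type": "web_search_preview"}],  # provider-run built-in tool
    input="Summarize the latest release notes for our product.",
)
print(packaged.output_text)
```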
As with most things in AI over the last few years, we’re speedrunning the hype cycle of past platform shifts. What’s unique about this situation is that OpenAI has managed to build both a consumer and enterprise business over just a few years, and unlike Google or Amazon as they built out their cloud businesses, both sides of OpenAI’s business rely on the same foundation.
OpenAI might very well see its consumer product as the end-all-be-all that will replace other applications. We disagree, and so, most likely, does everyone building AI applications. The question is whether OpenAI enables its enterprise customers to continue to rely on its API products or pushes them away with too much abstraction.
Good post. Although I think reasoning models are going to get lower variance, so I’m not sure it’ll be long before that point changes.
More investment is also going into reasoning models, so they will keep getting smarter.
Some good predictions here. OpenAI’s productisation strategy signals they’re doubling down on eyeballs and understanding the wetware behind them. Memory, search, and detailed buyer profiling will be packaged as seamless social experiences rather than developer tools. Some existing big tech lunch will be eaten.
I expect moves in the next 12 months to socialise interactions: integrated forums, shared chats, AI-mediated social functions (eg model-mediated dispute resolution, meeting facilitation…). This prioritises consumer stickiness over enterprise flexibility. Are OpenAI betting that owning the social layer compensates for losing developers who want compositional control? Versus Anthropic?
This might create a bifurcated market. Consumer platforms (old Google, old Meta) will face genuine disruption from OpenAI’s integrated features. But enterprise/developer segments may fragment differently. As noted, OpenAI’s abstraction layers push application builders toward open models precisely because they need “just the model.”
Open-model economics will require revenue to come from providing specialised inference services, enterprise APIs, and compliance-friendly alternatives.
If product integration pushes away enterprise customers, open alternatives may gain faster adoption. Especially if decent small open models can make their way onto phones and laptops.
Lunch will be eaten selectively, creating new opportunities in segments OpenAI implicitly abandons.