Last week, ChatGPT added support for GPT-4.1 (originally only available via the API), which reminded us that we hadn’t had a chance to write about model selector hell yet. There are now seven (7) models to choose from every time you want to ask ChatGPT something. If you’re like us, you probably used o4-mini-high once when it came out and then reverted to o3 for complex tasks and tool use and 4o for everything else. It’s a little baffling to us that OpenAI continues down this path — even for power users, this is a dizzying array of overlapping options: Does my coding task count as quick or complex? If quick, use 4.1; otherwise, use o4-mini-high. And by the way, what levels of advanced reasoning do I have access to, how fast are they, and how does speed relate to how advanced the reasoning is?
From the outside, this feels like an obvious own goal. We believe OpenAI should either use their editorial judgment to pick 2-3 key options (probably just o3 and 4o), or use their extremely strong AI team to build a triage system that automatically routes each user’s request to the best model for the task (though that’s probably something the technology isn’t cheap enough to do yet). It’s worth noting that from a developer perspective, having more options is certainly good (as we’ve talked about before). Our criticism here is from a consumer perspective, with our product hats on.
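To make the triage idea concrete, here’s a minimal sketch of what such a routing layer could look like, assuming the official OpenAI Python SDK. The tiers, model names, and classification prompt are illustrative assumptions on our part, not anything OpenAI has shipped:

```python
# Hypothetical triage layer: a cheap model classifies the request,
# then the request is forwarded to whichever model fits the task.
# Tiers, model names, and the classification prompt are illustrative.
from openai import OpenAI

client = OpenAI()

ROUTES = {
    "simple": "gpt-4o-mini",   # quick lookups, rewrites, short answers
    "general": "gpt-4o",       # everyday chat and drafting
    "complex": "o3",           # multi-step reasoning and tool use
}

def pick_model(prompt: str) -> str:
    """Classify the request with a cheap model and return a model name."""
    label = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the user's request as exactly one word: simple, general, or complex."},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content.strip().lower()
    return ROUTES.get(label, "gpt-4o")  # fall back to the everyday model

def answer(prompt: str) -> str:
    """Route the request, then answer it with the chosen model."""
    response = client.chat.completions.create(
        model=pick_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("Plan a week in Tokyo on a $2,000 budget."))
```

Even in this toy version, the trade-off is obvious: every request pays for an extra classification call before any real work starts, which is why we suspect the economics don’t quite work yet at ChatGPT’s scale.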
To us, leaving it as is feels like a very telling cultural decision. OpenAI — and Anthropic (see “Claude 3.5 Sonnet (New)”) to a lesser extent — seem to believe that the state-of-the-art frontier models they’re building are full-fledged products. The expectation seems to be that users will make complex decisions about which model they need for every single task, just like a user picks which iPhone to buy. The difference is that you buy an iPhone once every 2-4 years (and mostly just decide on screen size and camera), whereas you pick an LLM every day.
Fundamentally, most AI tools today are about automating tedious things that humans could do on their own but don’t have the time or interest to complete. Doing it with AI not only offloads the work from a person; it also delivers a better experience, because the work gets done in a fraction of the time it would take a human. What a user cares about is getting the work done, not whether the model uses reasoning or whether it’s a “Research Preview” of GPT-4.5 — just like you don’t care whether the Uber you get in has a V6 or an inline-four engine.
There’s an extent to which the model’s capabilities are fundamental to the experience. o3’s advanced tool use feels like a genuine step up from anything that came before, and increasing model capabilities obviously make users more productive. But everything from per-user memory to better web search to the canvas is a critical part of the ChatGPT experience as well, and there’s likely opportunity for much more. Adding seven models on top of all that muddles the picture.
This “model is the product” mindset is, from our point of view, too narrow a way to look at OpenAI’s product. Users don’t need or want this level of complexity, and even the most informed users don’t know when to use which model. It’s hard for us to analyze exactly why the frontier model labs might be choosing this, but it seems to be driven by a focus on the coolness of the model — not the work done — as the ultimate end goal. That misses the point and delivers a poor user experience.
In some ways, this is not surprising. After all, OpenAI is a research lab that’s focused on building better models, so of course they’re excited about all the new shiny toys they’ve built and how those toys can best be shown off. Who wouldn’t be? Historically, companies that excel in one area aren’t the best in others. Apple has been a phenomenal hardware designer and reasonably good at building operating systems (which ship on the cadence of hardware), but as software cycle times have shortened, the quality of their software has correspondingly suffered. Their most recent struggles with Apple Intelligence features highlight exactly this — they aren’t able to keep up with the pace of innovation required.
While cycle times for LLMs are getting shorter, these are still projects that take months or quarters to complete, and OpenAI doesn’t seem to ship and iterate the way a traditional SaaS product would. The fact that ChatGPT looks mostly the same as it did in 2023 is a testament to that culture. Even the product launch live streams from OpenAI focus primarily on “what can the newest model do” rather than “how can this help me in my day-to-day life.”
Of course, OpenAI is trying to change this via acquisition, as we touched on a few weeks ago. You can read the full post for the details on how we think about this, but we have some healthy skepticism about changing culture via acquisitions. There has to be a concerted effort to change the way that product planning and engineering decisions are done, and bringing in another product is only a part of that process.
What strikes us is that Google is the “dark horse” here — which is a funny thing to say about one of the five most highly valued companies in the world. In 2023, everything we heard about Google was that the culture was in shambles: Bard was a disaster, they were worried about the search business, and their products didn’t have any AI in them. They’ve solved half the problem, as the latest Gemini models seem to be some of the highest-quality LLMs on the market. Somehow, they’ve still failed to integrate these models well into any of the business applications they own. Gmail’s compose features are mediocre at best, and none of the Google Drive apps have great AI integration. It’s not clear to us why a model like Gemini can’t power an application similar to beautiful.ai or Gamma. It feels like the left hand isn’t talking to the right. That said, Google has shown the capability of building widely used applications in the recent past, so of all the frontier model labs, they’re the ones most likely to (re?)develop the muscle quickly.
Regardless of whether and when any of these companies figure out how to build applications, this is why we’re bullish about AI startups. Building foundational infrastructure — whether data centers, phones, or foundation models — is an incredibly lucrative business, but culturally, it looks very different from using those components well. Being good at one doesn’t mean you’ll be good at the other — in fact, there seems to be a negative correlation. Until superintelligent LLMs start harvesting our kidneys, we’re pretty confident there will be plenty of areas for startups to innovate.