We passed the blog’s second birthday last month, and this is our 99th post! We wanted to save post #100 for a look ahead at where the blog is headed and where we’re headed with our own learnings in AI, so we thought we’d use this post to look back at the (almost) 100 posts so far.
By its nature, this post will be broad and avoid going into too much detail to keep us from running into novella territory. There are a lot of small lessons we’ve picked up along the way — things we were right about, and tons of things we were wrong about — but for today, we’re focused on the bigger themes that have emerged over the last two years. Where relevant, we’ll link back to old posts. Let’s dive in!
The economics of AI are still uncertain
Our most popular post of all time was titled OpenAI is too cheap to beat, written almost two years ago (October 2023). The idea behind that post — intentionally provocatively titled! — was that OpenAI’s investment in making inference cheaper would pay dividends over time as others struggled to catch up. In broad strokes, this has held true: even if GPT-5 were no better than GPT-4.1, it cut per-token costs almost in half — though there are other reasons not to use GPT-5.
What’s more interesting is that token volume has become more of a driver of growing costs than per-token prices have. As LLMs have gotten more advanced, we’re all using them dramatically more in our applications than we were in 2023. We’ve heard through the grapevine that some products have 7,000-line system prompts and <10% gross margins because of how high their token volumes are. This is by no means a solved problem.
The coding tools are the poster children for this challenge. Given how easy it is to run up token costs working in a medium-to-large codebase, Cursor and Claude Code have pushed towards more consumption-based billing models. The rumors we’ve heard are that, even still, Cursor is bleeding money on token costs. It’s probably still a good use of resources for an engineer to spend hundreds or even thousands of dollars per month on these tools, but enterprise adoption is going to be severely hampered if a $1k × 500 engineers bill shows up unexpectedly one month. This might all change, of course, as inference gets even cheaper, but there’s clearly a long way to go on economics.
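To make the volume point concrete, here’s a rough back-of-envelope sketch of how a long system prompt plus heavy usage can blow up monthly spend. Every number in it (the per-token prices, request counts, and prompt sizes) is an illustrative assumption, not a real figure from any of the products mentioned above.

```python
# Back-of-envelope monthly inference cost. All prices and usage numbers below
# are illustrative assumptions, not actual figures from any provider or product.
PRICE_PER_1M_INPUT_TOKENS = 1.25    # hypothetical $ per 1M input tokens
PRICE_PER_1M_OUTPUT_TOKENS = 10.00  # hypothetical $ per 1M output tokens

def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int) -> float:
    """Estimate one user's monthly spend given average tokens per request."""
    daily = (
        requests_per_day * input_tokens / 1e6 * PRICE_PER_1M_INPUT_TOKENS
        + requests_per_day * output_tokens / 1e6 * PRICE_PER_1M_OUTPUT_TOKENS
    )
    return daily * 30

# A very long system prompt (tens of thousands of tokens) resent on every
# request means input volume, not price, dominates the bill.
per_engineer = monthly_cost(requests_per_day=200, input_tokens=80_000, output_tokens=2_000)
print(f"~${per_engineer:,.0f}/month per engineer; x500 engineers ~= ${per_engineer * 500:,.0f}")
```

Under these made-up numbers, each engineer lands in the hundreds of dollars per month, and a 500-engineer org is well into six figures, which is exactly the kind of surprise bill that makes enterprise buyers nervous.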
Hype evolved into substance more than we expected
Personality-wise, we hate getting on hype trains. Of course, we’re working in AI, so there’s a limit to that — but we try to stay on the substantive side of things as much as possible. In 2023, all the hype was around the holy war between RAG and fine-tuning. We even had a draft blog post early on about exactly that topic that never saw the light of day. Our approach to all of the follow-ons here (text search is Good, Actually™, graph RAG, re-ranking, etc.) was yes-and. We threw the kitchen sink at the problem, and basically all of these approaches, which felt noisy at first, have turned out to be quite useful. None of them were panaceas, but all of them provided incremental gains.
The area where we’re perhaps the most surprised is agents. We had a conversation with a founding engineer at another startup in mid-2023 who told us in excruciating detail just how much of a headache building open-loop decision systems had been. They were so dispirited that they ended up pivoting the whole product to a data extraction system. Our bet at the time was that open-loop agents were probably ~5 years away.
Boy, were we wrong, and here we are two years later… building agents. There are two reasons why this turned out differently than we expected. The obvious one is that LLMs have just gotten way better, way faster, than we envisioned even then. More interestingly, we underestimated how quickly we’d learn as a community what to do and what not to do. Even in our post from last week on Artists vs. Engineers, the base assumption was about how you build and structure your agents — not whether building agents was a good idea at all.
We have no idea what to do with evals
Early on in the life of this blog, we wrote a lot about evals. This was selfish: we knew that our product at RunLLM was better than all the generic tools we were competing with, but we didn’t have any empirical evidence to back that up. We experimented with using some existing eval frameworks and even with building our own, but none of it actually led to any change in perception. We eventually concluded that vibes-based evals were the dominant method, at least for the time being.
For a while, the Chatbot Arena Elo scores were helpful in evaluating new models, but it’s been at least six months since we’ve looked at those scores when evaluating a new model. Especially as consumer and developer requirements have diverged, we’ve found that the wisdom of the crowds doesn’t particularly apply to our decision-making at RunLLM.
What most people we’ve talked to have fallen back on is an internal testing framework for sanity and correctness, mostly built on Python scripts rather than on evaluation frameworks. We can’t tell if this is because the pain isn’t significant enough to bother investing in anything more robust or if the existing tools just don’t quite hit the mark yet. Perhaps most AI applications just aren’t operating at enough scale to justify further investment in this tooling. Either way, every conversation we have that touches on evals seems to end in something of a shrug.
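For what it’s worth, here’s a minimal sketch of the kind of homegrown sanity check we mean: a plain Python script over a handful of known cases rather than a full evaluation framework. The function and test cases are hypothetical placeholders, not something pulled from our (or anyone else’s) actual pipeline.

```python
# A minimal, homegrown sanity-check "eval": a plain Python script over a few
# known cases. answer_question and the cases below are hypothetical placeholders.
CASES = [
    {"question": "How do I rotate an API key?", "must_contain": ["rotate", "key"]},
    {"question": "Which regions do you support?", "must_contain": ["region"]},
]

def answer_question(question: str) -> str:
    # Wire this up to your own LLM pipeline.
    raise NotImplementedError

def run_sanity_checks() -> None:
    failures = []
    for case in CASES:
        answer = answer_question(case["question"]).lower()
        missing = [kw for kw in case["must_contain"] if kw not in answer]
        if missing:
            failures.append((case["question"], missing))
    for question, missing in failures:
        print(f"FAIL: {question!r} is missing keywords {missing}")
    print(f"{len(CASES) - len(failures)}/{len(CASES)} checks passed")

if __name__ == "__main__":
    run_sanity_checks()
```

It’s crude (keyword checks and a pass count), but it catches obvious regressions, which may be why the pain never feels big enough to justify something heavier.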
AI is still garbage-in, garbage-out
This is probably the least interesting and the most important takeaway for us. Having been data people before everyone became an AI person, we’ve written about data many times over the last two years. We find ourselves saying garbage-in, garbage-out internally, to customers, and in blog posts far more often than we’d have expected.
Intuitively, this is the most obvious point in the world. LLMs are (very, very good) text-processing machines, so if you’re not putting the right information into the machine at the right time, you’re simply not going to get the right answers. This reminds us of our favorite quote from Charles Babbage, who designed early mechanical computing engines and sought funding for them from the British Parliament:
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
— Charles Babbage
Not giving an LLM the right context is the equivalent of expecting a calculator to get the right answer when you give it the wrong inputs. Humans, because we learn from each interaction we have, can (over time) detect when we’re being given the wrong inputs. LLMs don’t have that advantage today. Whether that’s an argument for continual learning or for better context management remains to be seen, but for the foreseeable future, context management is going to be your most valuable skill.
No one cares about AGI anymore
The debate around AGI seems to have subsided, and we’re pretty happy about it. In our opinion, it was always something of a pointless debate, but since it had become such a buzzword in the aftermath of the ChatGPT launch, we felt the need to comment on it. Increasingly, the mainstream consensus is that we’ve already achieved AGI, and the question isn’t how we get there but what we do with it. ASI (artificial superintelligence) is, of course, another story, but that remains more of a philosophical debate.
What’s emerged in the space left by the AGI debate seems to be a conversation around what AI can actually do, and we think that’s a much more productive one. Andrej Karpathy recently commented on what the latest LLMs can do that surprises him given his previous experience, and we feel the same way. We’ve been impressed with how far we can push LLMs and how well they adapt to the more complex tasks we give them — especially when managed well.
The natural takeaway remains that no matter how much better LLMs get from here, we have a lot of innovation left to do in the applications built on top of them. This also shows up in the recent increased focus on the right UX for AI applications — something we’re fascinated by without feeling like we’ve fully solved the challenge.
There’s much more we could have included here, but we wanted to keep this post a manageable length. We’ve immensely enjoyed writing these posts over the last couple of years, and we’re grateful for all the support and feedback you’ve given us. As always, if you have any thoughts, comments, or questions, please feel free to reach out. More soon!