Hallucinations were a hot topic back in 2023, but the interest in and concern about hallucinations seems to have waned over the last year. This is partially because the core technology has gotten better, but it’s also driven by the fact that most users now have a better sense for how to interact with LLMs (and when to be skeptical of their answers). What we’ve also noticed is that many people seem to be convinced that good application builders (even if not always the core models) are pretty good at keeping LLMs on the rails — as a simple heuristic, we get asked about RunLLM’s “hallucination rate” half as often today as compared to a year ago.
High-quality applications have absolutely been able to better manage LLMs, but has the problem universally gone away? The answer is more complicated than you might think.
It’s worth noting that the hallucination problem was likely a little overblown in the first place. Humans hallucinate plenty (ever catch yourself saying, “Am I crazy, or…”?). We even write many of those hallucinations down on the internet (sometimes on Substack) — and yet we’ve survived the scourge of mass disinformation caused by Google. People also tend to start testing LLMs on topics they know best and are disappointed when the answers are less than perfect. Meanwhile, if they’d started with a less precious topic, they would have been perfectly satisfied — a form of Gell-Mann Amnesia. But beyond that, what does it really mean to prevent hallucinations?
What we’ve found in building RunLLM is that transformer-based models have a helpfulness problem. The reason why is somewhat obvious when you think about how these models are built: They provide answers by predicting the next token, and the next token isn’t very likely to be, “I don’t know.” Their training data simply doesn’t have that many examples of those kinds of interactions. (This is the same reason why early LLMs tended towards such verbose responses, and still do to an extent.) In the space of all possible responses, the model is more likely to produce some coherent-sounding set of words patterned on examples it’s seen before, even if those words are totally wrong. To an extent, adding more guardrails to your prompts helps, but not always, and adding more input data for the LLM to read can often make things more muddled.
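If you want to see this bias for yourself, here’s a toy sketch using the Hugging Face transformers library with GPT-2 as a stand-in for a larger model. The prompt and the fictional “FooDB” product are purely illustrative; the point is that you can measure how much probability mass a model puts on an honest refusal versus a confident-sounding answer.

```python
# A minimal sketch, assuming the Hugging Face transformers library and GPT-2
# as a stand-in for a production model. "FooDB" is a made-up product name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Q: What is the default port for the FooDB server?\nA:"

def continuation_log_prob(prompt: str, continuation: str) -> float:
    """Sum the log-probabilities the model assigns to a continuation."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score each continuation token given everything that came before it.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

# Compare a confident-sounding (possibly wrong) answer to an honest refusal.
print(continuation_log_prob(prompt, " The default port is 8080."))
print(continuation_log_prob(prompt, " I don't know."))
```

The exact numbers will vary by model and prompt, but this is the mechanism underneath the helpfulness problem: the model is scoring token sequences, not deciding whether it actually knows the answer.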
While there is plenty of work on teaching models to say “I don’t know” or to be more cautious about providing help through post-training (we did this a while ago with IDK Cascades), there are significant issues with this approach. The main one is that models trained to be more conservative often become too conservative: they’re too eager to say “I don’t know” or the more commercially acceptable “I am a helpful chatbot and I am unable to answer questions of that form.” You may have experienced this in earlier generations of “safer” models from Google and Meta. We want this behavior in some situations, but it’s challenging to tune it to the right level using post-training techniques like SFT and RL.
Another analogy that illustrates the flaw is to think of a model like GPT-4o as a smart college graduate: If you give a smart college grad 8,000 words and 47 instructions to read, they’re also probably going to give you the wrong answer half the time. Unfortunately, this kind of omnibus single LLM call is still what most LLM applications do today — and is the reason why we see so many low-quality chat interfaces out in the world. You’re simply asking LLMs to do far too much for you, and you’re getting bad results.
The implication is obvious: You can’t trust LLMs to directly solve your problems in the way that a human can, because you can’t trust LLMs to give you negative responses.
This might sound a little familiar to you. If you’ve been reading us for long enough, you’ll remember we previously wrote about using compositions of LLM calls to build applications, which eventually started to be called compound AI. The reason we’re bringing this topic back up is that the helpfulness problem is a useful framing for anyone building an application, and it’s significantly influenced our thinking about how to plan and implement good features in RunLLM. There are two main principles that we follow.
First, you have to be careful what you ask for. To use RunLLM as an example, if you provide 5+ possibly helpful documents along with a variety of instructions about answering a technical support question, chances are (unacceptably) high that you’re going to get a hallucination. Instead, we run through many different LLM-powered steps before we ever get to the point of answering a question. In most steps, we give the LLM a simple question that it can helpfully answer without risking the possibility of an out-of-scope answer. For example, the first question we always ask is, “Is this question even relevant to the product we’re supporting?” There’s no risk of an LLM trying to do something that it shouldn’t do — provide an answer that’s not warranted, or infer facts that aren’t real — because all it has to do is compare the question to a description of the product.
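To make the idea concrete, a gating step like this can be a very narrow LLM call. Here’s a hypothetical sketch (not RunLLM’s actual implementation) using the OpenAI Python client; the model name, prompt, and product description are placeholders.

```python
# A hypothetical relevance gate: one narrow question, one unambiguous answer.
# Assumes the OpenAI Python client; all names below are illustrative only.
from openai import OpenAI

client = OpenAI()

PRODUCT_DESCRIPTION = "Acme Vector DB, an open-source vector database."  # placeholder

def is_relevant(question: str) -> bool:
    """Ask one simple question: is this about the product we support at all?"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a classifier. Answer with exactly 'yes' or 'no': "
                    "is the user's question about the following product?\n"
                    f"Product: {PRODUCT_DESCRIPTION}"
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

Because the model is only being asked to compare a question against a product description, there’s very little room for it to be “helpful” in a way you didn’t ask for.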
Second, your business logic matters a lot, and it needs strong guardrails. While the focus in AI applications is of course on the cool AI calls, the code you write to connect the dots matters just as much. To continue the example from above, if we determine that a question is not relevant to the product we’re supporting, we never even get to the point of trying to answer it. Not only does that immediately cut off any possibility of a hallucinated answer, but the guardrails also give us strong confidence that malicious actors trying to prompt hack our system won’t have any luck. They may manage to break the initial check, but the broken result won’t be useful to them, and our system will simply short circuit. We have similar break points built into many different parts of our pipeline, which both build trust by enforcing discipline and prevent abuse.
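The short circuit itself is just ordinary code wrapped around that gate. Continuing the hypothetical sketch from above, answer_from_docs is a stand-in for the rest of the pipeline and isn’t shown here.

```python
# A sketch of the short-circuit idea in plain business logic, building on the
# hypothetical is_relevant() gate above. answer_from_docs() stands in for the
# rest of the answering pipeline.
def handle_question(question: str) -> str:
    # If the gate says the question is out of scope, we never reach the
    # answering step: nothing to hallucinate, and nothing downstream for a
    # prompt-injection attempt to exploit.
    if not is_relevant(question):
        return "Sorry, that question doesn't look related to this product."
    return answer_from_docs(question)
```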
As we say almost every week, LLMs are incredibly powerful tools, but as always, they have to be used in the right way. The helpfulness problem isn’t something that should prevent you from using LLMs. Rather, knowing how to frame the problem (the fact that an LLM will always try to “solve” any sufficiently open-ended problem, whether or not it should) helps you build your application in a way that avoids those pitfalls. Just as you shouldn’t be using a saw without the proper protections, you shouldn’t be using LLMs without the proper helpfulness guardrails.