The zero-shot trap
Rethinking expectation management for agents
A lot of our thinking – and so our writing – over the last couple of months has been dominated by trying to understand what makes certain AI products extremely sticky, while others can feel extremely difficult to get started with. Cursor’s been the poster child for this line of thinking. Their secret sauce was the code completion problem framing, which enabled them to gather massive amounts of human feedback while providing incremental but noticeable value. The result was a UX that was both familiar (and so easy to use) and also graceful. In turn, this created a data flywheel that allowed Cursor to dramatically improve the quality of its agent over time.
In thinking about how users engage with Cursor, we realized there is another advantage their UX has created that other problem spaces (including ours!) seem to lack. We’ll admit this comes from a place of a little jealousy: Users never seem to expect an agent like Cursor or Claude Code to get an answer right on the first try. It is almost an expectation that you start a session, provide some detail, see what comes back, and then determine how to update your instructions to get what you want. Very often, you might think about what you failed to mention in your original instructions, which led to an undesirable output.
Our experience building an AI SRE at RunLLM has been a little different. Users often look at the first cut of a root cause analysis report that our agent generates and grade it on a binary: correct or incorrect. Certainly, there is value in doing a few minutes of work and coming back with a correct RCA — and we do that a non-trivial percentage of the time. However, just like you wouldn’t expect Cursor to zero-shot your code, you shouldn’t expect an AI SRE to zero-shot an incident.
We’ve spent a lot of time trying to understand why this is the case. Naturally, our customers are going to get more value out of our product if they use it the way they use Cursor: interactively debugging with it, treating it as a co-pilot, and guiding it. Our conclusion is that if a product isn’t getting the types of interactions it wants, that is usually a sign that the product design needs improvement. To fix the design, you have to start with why users expect a certain kind of output.
Why users want zero-shot results
Why do users expect a zero-shot perfect result from an AI SRE when they don’t do that with Cursor? We actually struggled with this question quite a bit before we were able to come to a satisfactory answer.
First, product positioning plays a huge role. Cursor and Claude Code are referred to as coding agents, not AI software engineers. As we all know from the extensive discourse on Twitter over the last couple years, software engineers do a lot more than code. The expectation is not an end-to-end completed product that magically solves everything; it’s to generate code and iterate. The role of the agent is to generate code quickly, and the role of the user is to guide the agent. Neither one is fully capable without the other.
On the other hand, an AI SRE, an AI SDR, or AI support agent is labeled in such a way that it implies it is doing the full job that a person would do. If I’m buying one of these products, you’re setting the expectation that it will deliver the full value that a person in that role would deliver – resolved incidents, qualified sales leads, or closed-out support tickets. If you only deliver a part of that promise up front, people are going to be disappointed. (For the record, we find the AI SRE moniker to be a poor label for what most agents in this space do, but Gartner has spoken.)
The second factor is driven by human psychology. In chatting with an advisor about this dynamic recently, the point they made was that users interact with coding agents as a part of the creative process – they’re open to batting around ideas, trying different solutions, and seeing what works. It’s okay if progress isn’t linear because even without agents, progress is often not linear in the creative process. Stakes are lower.
When you’re doing an operational task – especially a high-pressure one like handling an unhappy customer or debugging a production outage – it feels like there’s less room for leeway. Even if the agent will have the net effect of accelerating your work, the natural reaction is to take a much more tactical view of what’s okay and what’s not. If you get an incorrect RCA, you might feel like a junior intern is coming to bother you with some crazy hypothesis. It’s important for a product like ours to present its work in a way that’s understandable and productive even if it’s not complete.
Humans aren’t perfect – but they know their limitations
When we’re feeling a little more frustrated than usual, we keep coming back to the fact that humans rarely zero-shot issues. Even if you hire the smartest engineer in the world, on the day they onboard, they aren’t going to debug a complicated issue in seconds. They spend the first few hours figuring out what the code does, which observability tools have the data, and who changed what. From a capability perspective, agents are no different. You can’t just give an agent access to your tools and hope it figures things out.
The key difference between agents and humans in this case is how they present their work. A new engineer isn’t going to join a team and provide confident-sounding opinions on debugging production outages on day one. If you ask an agent to provide an opinion, it likely will – and you might even be annoyed if it said, “Hm, I’m not sure yet.”
Your agent needs to be useful enough to show value on day 1, but it can’t be so confident that it rushes into making a claim it doesn’t have enough information to back. This is a bit of a Catch-22, and we’ll admit it’s not one that we’ve fully solved yet – but we have some ideas for how to start.
Designing products to be multi-shot
To reiterate, our belief is that the burden is on the product builder to encourage the right usage patterns from users. If you aren’t building your product in a way that encourages people to give you the chances to succeed, then your product is the problem, not your users.
We have three key focus areas we’ve either already implemented or are actively working on at RunLLM:
1. Learn from feedback. When you onboard a new employee, other employees give them feedback and explain concepts to them because they know the new person will get better with time. They are investing in the future. If that person doesn’t end up learning after a few months, you’d probably fire them. The same principle applies to agents: if the agent you’re using isn’t built to learn from experience, users will rightly wonder why they should waste their time giving it feedback. Learning is a hard requirement for a complex agent; otherwise, providing feedback feels like a chore with no ROI.
2. Granularity of work. We’ve talked before about how a smaller unit of work is easier to get feedback on than a larger piece of work that is full of assumptions. The core concept is simple. When there are lots of incorrect or confusing assumptions baked into a piece of work, it’s hard to give feedback because you have to understand everything that went wrong before you can redirect the agent (or human). On the other hand, if you’re able to give feedback early on, it’s much easier to redirect the effort and feel confident in the output. You need to give users opportunities to give feedback at a frequency and granularity that can actually be easily reasoned about.
3. Self-directed growth. In addition to taking feedback, humans can learn on their own. Even the most junior employee should be encouraged to try out a few possibilities in solving a problem before asking for feedback. The experience they gain from those attempts will inform how they approach things in the future. If your agent waits for human feedback on every action it takes, it’s going to get minimal feedback and improve minimally. Agents have to be able to learn from their own mistakes – when they demonstrate that growth, they’ll get more investment from their users.
From tools to teammate
We are still in the early innings of understanding how to design interaction models for agents that don’t just do a task but fill a role. The lesson from Cursor is that users will tolerate a lot of imperfection if the feedback loop is tight and the stakes of a single failure are low. For those of us building in high-stakes operational categories, the challenge is harder. We have to design products that are humble enough to ask for help, smart enough to learn from it, and autonomous enough to not be a nuisance.
Ultimately, the goal isn’t to build an agent that is never wrong. The goal is to build an agent that is worth the investment of onboarding. If you can get your user to think “I am training my next great teammate,” the data flywheel will finally start to spin for the rest of the enterprise.
We’re thinking about this as a trend towards building collaborative agents (perhaps co-agents). What exactly that looks like in practice will change dramatically over the next few years – but if you can get the user experience right for your domain, you’re going to be in a very strong position.