Agents can't check their own work
How agents made creation cheap and validation impossible
Imagine you ask Claude to build you a financial model in Excel. The spreadsheet you get back probably has the right structure, reasonable assumptions, and formulas that link together in a way that looks correct. But now you have to check it. Do you open every cell and inspect every formula? If you choose to do that, you might as well have built the model yourself. If you ship it as-is, you’re putting your trust in a junior employee who works at superhuman speed but might have mistakenly encoded some very strange assumptions that didn’t stand out at first glance.
The core challenge we face is how to efficiently validate the work that an agent has done. This problem is lost amongst all the discourse around agents curing cancer and taking jobs. Independent of any future advances in model and agent quality, it is already true today that coding agents, document generators, and AI co-workers have made producing work cheap. What they haven’t done is make it any easier to know whether the thing that was created is actually right. We’re not talking about code quality or best practices — we’re talking about the basic question of whether the output does what you intended. And right now, the answer is that most people don’t have a good way to check.
The bottleneck has moved. It used to be creation; now it’s validation. And if you’re not careful, you’re just replacing one form of slow, expensive work with another.
Why validation is so hard
The obvious explanation is that validation takes time — someone has to review the output, and at the volume agents produce, there simply aren’t enough hours in the day to do that. The rule of thumb used to be that one person might manage 5-7 employees – now you can manage as many as you can keep track of. Except those employees start from a blank slate with every task and only get better when new model updates are released.
That’s painful enough, but it’s perhaps not the deepest problem with validation. The more insidious challenge is that most people can’t validate agent output even in principle, because they don’t have enough clarity about what they wanted in the first place.
Think about how UI work used to happen before LLMs. A designer would spend days thinking through a feature — mapping out interactions, edge cases, and state transitions. They’d produce detailed mocks and hand them to an engineer. When the engineer delivered an implementation, validation was straightforward: does this match the mocks? The mocks were a checklist. If the designer had missed something, the engineer would probably realize that during implementation. The designer had done much of the hard thinking about UX upfront, and that clarity made it possible to evaluate the result quickly and confidently.
Now imagine you’re building the same feature, but instead of going through that process, someone types a loose description with missing details into a coding agent and gets a working UI back in minutes. There are no mocks to check against. The agent made dozens of decisions about interaction patterns, edge cases, and visual hierarchy that nobody specified. The output looks complete — it runs, it renders, you can click through it — but the completeness is an illusion. The agent didn’t resolve any ambiguity thoroughly; it just papered over it with plausible defaults. And now you’re stuck trying to figure out whether those defaults were good ones, without any reference point for what “good” was supposed to look like.
Before agents, ambiguity in your thinking got resolved during the work. When you mocked features or wrote code, you encountered edge cases and made decisions about them as you went. When you built a financial model, you were forced to confront your own assumptions cell by cell. The work itself was a forcing function for clarity. Agents have removed that forcing function, and nothing has replaced it yet.
This is why clarity of intent has become so critical — both because it makes agents produce better output and because without it, you can’t evaluate the quality of what you get back. If you know exactly what you want, validation becomes something like a checklist: did the agent match these assumptions, handle these edge cases, produce these outputs? You can move fast. But if you went in with a vague sense of direction and let the agent make a hundred small decisions on your behalf, you’re now stuck trying to reverse-engineer whether each of those decisions was a good one. If you don’t start with clear thinking, validating the output can often be harder than doing the work yourself.
The analogy we keep coming back to is delegating to an extremely intelligent college grad with opaque judgment. Whether an agent has good or bad judgment is subjective, but what matters is ensuring that the agent understands exactly what it’s supposed to do – and executes on it. When it’s only humans working on something, the back-and-forth between a group of people helps discover and test assumptions in the specification. Now, the model is doing all of that internally: the end result might be great, but not great in the way that you need.
The tools aren’t ready
Even for people who do have clarity of intent, the tools and workflows we rely on aren’t designed for agent-speed validation. Think about the financial model example again. You know what assumptions the model should encode. You know what the outputs should look like for a few test cases. But spreadsheets don’t give you a way to express those expectations and check them programmatically, or frankly even check them quickly. You’re stuck eyeballing formulas or plugging in test numbers manually, essentially doing the same work you always did.
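To make this concrete, here is a minimal sketch of what “expressing expectations and checking them” could look like if the model’s logic lived in code instead of cells. The model, its parameters, and all figures are hypothetical stand-ins, not anything a real spreadsheet tool provides today:

```python
# Hypothetical stand-in for a spreadsheet financial model:
# a 12-month revenue projection with simple monthly churn.
def revenue_model(customers, price, monthly_churn):
    total = 0.0
    for _ in range(12):
        total += customers * price          # revenue this month
        customers *= (1 - monthly_churn)    # customers lost to churn
    return total

# Expectations you held *before* the agent built anything, written as checks.
# With zero churn, twelve months of revenue is exactly 12 * customers * price.
assert revenue_model(1000, 50, 0.0) == 12 * 1000 * 50

# Higher churn should strictly reduce revenue, but never push it below zero.
base = revenue_model(1000, 50, 0.05)
stressed = revenue_model(1000, 50, 0.10)
assert 0 < stressed < base
```

The point isn’t the toy model itself; it’s that each assertion encodes an expectation you could state before seeing the output — exactly the thing a spreadsheet gives you no native way to write down.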
The same pattern shows up in software engineering, where it’s even more acute. Previously, you thought about edge cases and validated your assumptions while writing code. That consideration happened incrementally, spread across hours or days of work. Now an agent collapses all of that into a single output, which means all of the validation that used to be distributed across the creation process gets compressed into the review phase. The result is that code review — which used to be a manageable check on largely human-produced work — has become the load-bearing wall of software quality. There’s no chance it lasts.
There’s no incremental testing when an agent does a task, because it delivers everything at once, almost instantaneously. If you skip the validation step or do it halfheartedly, you’re consigning yourself to shoddy results that you won’t discover until something breaks downstream. And if you do the validation step thoroughly, you’ve burned most of the time you saved by using the agent in the first place.
The interaction mode has to change
We’ve established that validation is more necessary than ever and also that it’s harder than ever. We’ve also established that existing modes of validation won’t hold up as we dramatically scale how much there is to check. This creates a question we haven’t seen anyone answer well yet: How should validation actually work?
Consider the financial model example again. You know what assumptions the model should encode, and you know what the outputs should look like for a handful of test cases. But the way you interact with a spreadsheet today is by opening cells and mucking around with values — a mode designed for a world where a person carefully built the model and you’re spot-checking their work. That mode collapses when an agent built the entire thing at once and every cell is a potential surprise.
What you actually want to do is interrogate the model. Ask it questions – what happens to revenue if churn doubles? What’s driving the margin assumption in Q3? If the answers match your expectations, you build confidence. If they don’t, you’ve found the problem without having to reverse-engineer deeply linked formulas. The validation step becomes a conversation with the output, not an inspection of its internals.
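A scenario query like “what happens to revenue if churn doubles?” could be sketched as code against a toy projection. Everything here is hypothetical — the model, the numbers, and the query function are illustrations of the interaction mode, not an existing tool:

```python
# Hypothetical toy model: 12-month revenue projection with monthly churn.
def annual_revenue(customers, price, monthly_churn):
    total = 0.0
    for _ in range(12):
        total += customers * price
        customers *= (1 - monthly_churn)
    return total

# A "question" posed to the model, rather than an inspection of its formulas:
# what is the relative change in revenue if churn doubles?
def churn_doubles_impact(customers, price, churn):
    base = annual_revenue(customers, price, churn)
    stressed = annual_revenue(customers, price, 2 * churn)
    return (stressed - base) / base

change = churn_doubles_impact(customers=1000, price=50, churn=0.05)
print(f"Revenue change if churn doubles: {change:.1%}")
```

If the answer roughly matches your intuition, you build confidence; if it doesn’t, you’ve localized the problem to one assumption without reading a single formula.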
Not every validation step is going to look like a conversation, but similar principles apply to code, documents, or designs. When an agent generates a complete feature, you need to see whether the UX that was implemented matches what you had in mind. When an agent writes a report, you need to understand what the thesis of the document was and how that argument was validated. The UX of agent-assisted work needs to be built around this kind of fast, targeted verification, not around the line-by-line review workflows we inherited from a world where humans did the creating.
Today, almost none of our tools support this. Spreadsheets don’t let you express expectations and test them. Code review tools are built for diffing human-authored changes, not for interrogating agent-generated systems (where PRs are quickly ballooning in size). Document editors assume you’ll read and redline, not ask and verify. The interaction mode hasn’t caught up to the speed of creation, and until it does, validation is going to remain the thing that eats all the time agents save.
What this means
We think this is among the most important unsolved problems in AI tooling right now – and it’s barely even being discussed. The industry has invested enormously in making generation faster and cheaper, and it’s worked. But generation without validation is just moving the bottleneck, not eliminating it. People are going to be overwhelmed by agent output, and they’re either going to be disappointed by the quality of what ships or they’re going to stop checking and be surprised when things break.
The UX of AI-assisted work has to change to match this reality. It can’t just be about faster creation; it has to be about faster verification. That might mean agents that validate their own output against explicit expectations. It might mean production tooling that watches for the consequences of bad agent work in real time. It will almost certainly mean both, and probably things we haven’t imagined yet.
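One hedged sketch of “agents that validate their own output against explicit expectations”: expectations written down as data, with a small harness that reports which ones an output fails. The `Expectation` record, the `validate` helper, and the example report are all hypothetical illustrations of the idea, not a real product’s API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Expectation:
    """A human-stated expectation, paired with a machine-checkable predicate."""
    description: str
    check: Callable[[Any], bool]

def validate(output, expectations):
    """Return the descriptions of every expectation the output fails."""
    return [e.description for e in expectations if not e.check(output)]

# Example: validating a generated report summary against explicit expectations.
expectations = [
    Expectation("has a thesis", lambda o: bool(o.get("thesis"))),
    Expectation("cites at least one source", lambda o: len(o.get("sources", [])) >= 1),
]

draft = {"thesis": "Validation is the new bottleneck", "sources": []}
failures = validate(draft, expectations)
print(failures)
```

The value of the pattern is that the expectations are stated up front, in the human’s terms, so the validation step is checking intent rather than reverse-engineering output.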
But one thing is clear: Without solving validation, we haven’t actually solved the productivity problem. We’ve just moved it.