
The agent said the job was done. It scored 53%.

We asked an AI agent to do a job and it reported back that it had done it perfectly. In fact it was wrong on nearly half the work, and its own summary gave no sign of that. Only the recorded log of what it actually did caught the failure.

PCTX Editorial · May 13, 2026 · 5 min

The self-report is not evidence

When an agent says the job is done, you should not take its word for it. We saw why in our open benchmark for AI agents, the Agent Voyager Project (AVP). In one run, an agent rebuilt a hard table and reported that every value matched, yet it scored 53 percent. Nothing in the agent's own summary hinted at the trouble, and only the recorded trace of what it actually did caught the error.

When the sole record you keep is the agent's summary of its own work, you are leaning on the one account that fails exactly when you most need it to hold.

The failed run was the fast, cheap one

The agent had to read a dense page from a 2012 econometrics paper (two correlation matrices stacked on one sheet, with half the cells blank) and rebuild it as a structured table. The model was Claude Haiku 4.5 in both runs. One run worked from text alone, and the other was given a vision tool.

Run              Turns  Cost   Accuracy
Text-only        5      $0.05  53%
Vision-equipped  24     $0.33  100%

The fast, cheap run finished in five turns for five cents and reported back in a single clean line: all values matched perfectly. It had in fact scored 53 percent.

Close to half the cells were wrong, and nothing in the run raised a flag. The whole record is in Captain's Log #2.
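The gap between the claim and the score can be checked mechanically once a reference table exists. The following is a minimal sketch, assuming both the rebuilt table and the reference are grids of cells with None standing for a blank. The function names, the tolerance, and the tiny example grids are hypothetical; this is not AVP's actual scorer and the numbers are not data from the run.

```python
from typing import Optional, Sequence

Cell = Optional[float]
Grid = Sequence[Sequence[Cell]]


def cell_accuracy(predicted: Grid, reference: Grid, tol: float = 1e-9) -> float:
    """Fraction of reference cells the prediction reproduces, blanks included."""
    total = 0
    correct = 0
    for r, ref_row in enumerate(reference):
        for c, ref_val in enumerate(ref_row):
            total += 1
            try:
                pred_val = predicted[r][c]
            except IndexError:
                continue  # a missing cell counts as wrong
            if ref_val is None or pred_val is None:
                # A blank only matches another blank.
                correct += ref_val is None and pred_val is None
            elif abs(pred_val - ref_val) <= tol:
                correct += 1
    return correct / total if total else 0.0


def is_silent_failure(claimed_success: bool, measured: float,
                      threshold: float = 1.0) -> bool:
    """True when the agent claimed success but the measured score falls short."""
    return claimed_success and measured < threshold


# Hypothetical 2x2 example, invented for illustration.
reference = [[1.0, None], [0.42, 1.0]]
predicted = [[1.0, 0.10], [0.42, 1.0]]  # a blank cell was filled with a value
score = cell_accuracy(predicted, reference)
flagged = is_silent_failure(claimed_success=True, measured=score)
```

The point of the design is that the agent's claim is only one input to the check and never its result: the flag comes from comparing the claim against a measurement the agent did not produce.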

Confidence has nothing to do with correctness

A stronger model would have scored higher here. That is true, and it is also beside the harder point: a model's stated confidence does not track whether it was actually right.

The run that failed reported success in the same even tone a passing run would use, with no hedge and no warning. If the self-report is your only record, that record is least trustworthy on the runs that went wrong, which is exactly where the agent's account and the reality come apart.

Why this is an evaluation and observability problem

Most monitoring grew up watching outputs. A system returns a result, you log it, you alert on the ones that look wrong. Agents do not fit that shape.

An agent does not hand back a single result to inspect. It takes a sequence of actions, and the same instruction can follow a different path on the next run, so there is no one artifact to check, only a trajectory. The part of that trajectory you most want to trust, the agent's own narration, is the part that holds up worst.

AI agent evaluation and observability have to drop a level, from what the agent said it did down to what it actually did, step by step.

Can the agent grade its own work? Can an LLM judge it?

You could have the agent check itself, or hand its output to a second model as a judge. Both carry the same flaw. Self-checking is what failed here, since the agent inspected its own work and pronounced it perfect. A second model judging does no better, because it still scores the output instead of the process that produced it.

The account that holds up is the recorded execution, every step and every tool call, what got read and written, which model made each decision, and what it cost, captured as it ran and complete enough to replay. That kind of record does not depend on the agent being honest about itself, or even being right.
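One way to picture such a record is as an append-only list of steps, written as the run proceeds. The sketch below is illustrative only: the field names (tool, inputs_read, outputs_written, model, cost_usd, ok) mirror the items listed above but are assumptions, not the schema of AVP or of any particular tracing product.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Step:
    index: int                   # position in the run
    model: str                   # which model made this decision
    tool: Optional[str]          # tool called, or None for a pure model turn
    inputs_read: List[str]       # what the step read
    outputs_written: List[str]   # what the step wrote
    cost_usd: float              # cost attributed to this step
    ok: bool                     # whether the step succeeded
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trace:
    run_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, **kwargs) -> Step:
        """Append a step as it happens; earlier steps are never rewritten."""
        step = Step(index=len(self.steps), **kwargs)
        self.steps.append(step)
        return step

    def total_cost(self) -> float:
        return round(sum(s.cost_usd for s in self.steps), 6)

    def to_jsonl(self) -> str:
        """One JSON object per step, suitable for storage and later replay."""
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self.steps)


# Hypothetical two-step run; names and costs are made up.
trace = Trace(run_id="example-run")
trace.record(model="example-model", tool="read_file", inputs_read=["page.txt"],
             outputs_written=[], cost_usd=0.01, ok=True)
trace.record(model="example-model", tool=None, inputs_read=[],
             outputs_written=["table.csv"], cost_usd=0.02, ok=True)
```

Nothing in this structure is supplied by the agent's narration. Each entry is written by the harness around the agent, which is why the record stays usable when the agent's own summary is wrong.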

The trace caught what the score missed

The passing run makes the same case from the other direction. It scored 100 percent and looked just as clean from the outside. The recording told a different story.

The agent succeeded without ever using the vision tool we gave it for the job. After eight dead ends, it got there on a text fallback nobody had designed for it. The score confirmed the work got done, but only the trace explained how, and whether the approach would survive the next input that looked a little different.
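Questions such as "did it use the tool we gave it?" and "how many dead ends came first?" are queries over the trace, not over the score. Here is a hedged sketch, assuming each recorded step is a dict with hypothetical "tool" and "ok" keys. The step list and tool names are invented for illustration and are not the actual log of the run.

```python
from collections import Counter
from typing import Dict, Iterable, List


def tool_usage(steps: Iterable[Dict]) -> Counter:
    """Count how often each tool was called."""
    return Counter(s["tool"] for s in steps if s.get("tool"))


def dead_ends(steps: Iterable[Dict]) -> int:
    """Count steps that were recorded as failed."""
    return sum(1 for s in steps if not s.get("ok", True))


def audit(steps: List[Dict], expected_tool: str) -> Dict[str, object]:
    """Summarize how a run got its result, independent of its score."""
    usage = tool_usage(steps)
    return {
        "expected_tool_used": usage[expected_tool] > 0,
        "dead_ends": dead_ends(steps),
        "tools_seen": sorted(usage),
    }


# Invented example: two failed attempts, then a fallback that worked.
steps = [
    {"tool": "pdf_to_text", "ok": False},
    {"tool": "pdf_to_text", "ok": False},
    {"tool": "text_fallback", "ok": True},
]
report = audit(steps, expected_tool="vision")
```

A run can pass its accuracy check and still come back with expected_tool_used set to False, which is the kind of finding a score alone cannot surface.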

When you have to prove what the agent did

If you run agents where the result has to be correct, the agent's own report is the wrong thing to trust. A confident summary tells you nothing reliable about what actually ran.

The thing you can lean on is the execution itself, the full record of what the agent did, captured where it runs and kept by you, rather than narrated after the fact or pieced together later from whatever happened to land in the logs. For a regulated team, that record is also what turns 'prove what it did' from an uncomfortable question into one with an answer, which is often what separates a pilot that ships from one that stalls.

Common questions

Can you trust an AI agent's report that it finished a task?
No. In our open agent benchmark, an agent reported that every value matched when it had in fact scored 53 percent. A model's stated confidence does not track whether it was right, so the self-report is not evidence the work was done.
What is a silent failure in an AI agent?
A silent failure is a run that reports success while producing wrong output, with nothing in the agent's own account flagging the problem. The only reliable way to catch it is the recorded execution trace.
Is using an LLM as a judge reliable?
It carries the same weakness as letting the agent grade itself, because it scores the output rather than the process that produced it. The account that holds up is the recorded execution, not a second model's verdict on the answer.
What should AI agent evaluation and observability capture?
The full execution trace: every step, every tool call, what was read and written, which model made each decision, and what it cost, captured as it ran and complete enough to replay.