Model Watch

Does the model matter for AI agents? Less than how you build the agent.

We built the same model into an agent six different ways. It scored 67 percent the worst way and 95 percent the best. The model never changed, so the gap is all in how the agent was built.

PCTX Editorial · May 20, 2026 · 5 min

One model, six builds

These runs come from our open benchmark for AI agents, the Agent Voyager Project (AVP). Every build ran the same cheap frontier model, Claude Haiku 4.5, on the same task: reading a dense PDF page and rebuilding it as a structured HTML table. The only thing that changed from one run to the next was the build around the model.

Build	Accuracy	Cost/run	Pass rate
Plan + a self-check step	95%	$0.33	10/10
Plain prompt (baseline)	82%	$0.35	9/10
Plain prompt + packaged Skill	81%	$0.31	8/10
Plain prompt + external tool	70%	$0.20	9/10
Terser prompt	68%	$0.59	7/10
Worked example (few-shot)	67%	$0.82	8/10

The full run is in Captain's Log #1. The best build beat the worst by 28 points, and since the model was identical in every row, that whole gap is the work of the build.
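
The 28-point figure can be checked directly from the table. The snippet below is a small Python sketch that does the arithmetic; the names RESULTS and accuracy_spread are illustrative and are not part of the AVP tooling.

```python
# Rows copied from the table above:
# (build, accuracy in percent, cost per run in USD, passes out of 10).
RESULTS = [
    ("Plan + a self-check step", 95, 0.33, 10),
    ("Plain prompt (baseline)", 82, 0.35, 9),
    ("Plain prompt + packaged Skill", 81, 0.31, 8),
    ("Plain prompt + external tool", 70, 0.20, 9),
    ("Terser prompt", 68, 0.59, 7),
    ("Worked example (few-shot)", 67, 0.82, 8),
]


def accuracy_spread(rows):
    """Return (best build, worst build, gap in percentage points)."""
    best = max(rows, key=lambda row: row[1])
    worst = min(rows, key=lambda row: row[1])
    return best[0], worst[0], best[1] - worst[1]


best, worst, gap = accuracy_spread(RESULTS)
print(f"{best} beat {worst} by {gap} points")
# prints "Plan + a self-check step beat Worked example (few-shot) by 28 points"
```

The same rows show that the most accurate build was not the most expensive one: it cost $0.33 per run against $0.82 for the least accurate.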

Why a model leaderboard can't answer this

Leaderboards like Galileo's, Artificial Analysis, and the arena rankings are built to compare models. To produce a ranking they vary the model. That makes them good at showing which model scores highest, but unable to show how much the model mattered relative to everything around it, because the model is the only variable they move.

Our run does the opposite. We fixed the model and changed the build, so the differences in score come from the build rather than the model. Holding one factor fixed while varying the other is what lets you separate what the model contributes from what the build does.
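
The shape of that experiment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AVP harness itself: the Build type, the compare_builds function, and the stub builds are all hypothetical, and the stub scores are placeholders rather than results.

```python
from statistics import mean
from typing import Callable, Dict

# Hypothetical type: a "build" is any callable that takes (model, task)
# and returns a score between 0 and 1 for one run.
Build = Callable[[str, str], float]


def compare_builds(model: str, task: str, builds: Dict[str, Build],
                   runs: int = 10) -> Dict[str, float]:
    """Run every build against the same model and task; return mean score per build."""
    scores: Dict[str, float] = {}
    for name, build in builds.items():
        # `model` and `task` never change inside this loop, so differences
        # between builds come from the build (plus run-to-run noise).
        scores[name] = mean(build(model, task) for _ in range(runs))
    return scores


# Usage with stub builds. The returned values are placeholders, not benchmark data.
stub_builds: Dict[str, Build] = {
    "baseline": lambda model, task: 0.5,
    "plan_and_check": lambda model, task: 1.0,
}
print(compare_builds("fixed-model", "pdf-to-html-table", stub_builds, runs=3))
```

A model leaderboard would run the mirror image of this loop: one fixed build, iterated over a dictionary of models.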

How this sits with the wider research

Our result lines up with what others have found. Ethan Mollick wrote in February that the gaps between models have grown small enough that, for most people, the app and harness around the model matter more. Research on agent scaffolding points the same way: loading in every available component degrades performance, and a leaner build beats the fully loaded one (More Is Not Always Better).

What this run adds is a controlled measurement on a single fixed model, so the result rests on a number you can check rather than a claim you have to take on faith.

When the model really is the bottleneck

This only holds above a certain level of model capability. Below it, the model is the bottleneck and no build will save it, because a weak model wrapped in a careful harness is still a weak model.

That is what makes the result worth weighing. Haiku 4.5 is a capable, cheap model that sits well above that floor. That makes it the hardest case for the argument, since a strong model is exactly where you would expect the build to stop mattering. The build still changed the score by 28 points.

How to choose a model for an agent

Start with a capable model you can actually run under your constraints. That choice sets the ceiling on what the agent can reach, and a leaderboard is a fair guide to where that ceiling sits. Everything after it is the build, and the build is where the agent gets good. Give it a clear plan for the task, only the tools it needs, and a step that makes it check its own work before finishing.
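
As a rough illustration of those three ingredients, here is a minimal Python sketch of a plan-then-check loop. It is not the build that scored 95 percent in our run. The call_model argument is a hypothetical stand-in for whatever client you use to reach the model (prompt string in, response string out), and the prompts and the PASS convention are assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical stand-in for a model client: prompt in, response text out.
ModelFn = Callable[[str], str]


def run_agent(task: str, call_model: ModelFn, tools: List[str],
              max_revisions: int = 2) -> str:
    """Plan first, draft a result, then have the model check and revise its own work."""
    tool_list = ", ".join(tools) if tools else "none"

    # 1. A clear plan for the task, naming only the tools it needs.
    plan = call_model(
        f"Task: {task}\nAvailable tools: {tool_list}\n"
        "Write a short step-by-step plan."
    )

    # 2. A first attempt that follows the plan.
    draft = call_model(
        f"Task: {task}\nPlan:\n{plan}\nCarry out the plan and give the result."
    )

    # 3. A self-check before finishing, with a bounded number of revisions.
    for _ in range(max_revisions):
        verdict = call_model(
            f"Task: {task}\nResult:\n{draft}\n"
            "Check the result against the task. Reply PASS if it is correct, "
            "otherwise list the problems."
        )
        if verdict.strip().upper().startswith("PASS"):
            break
        draft = call_model(
            f"Task: {task}\nResult:\n{draft}\nProblems:\n{verdict}\n"
            "Fix the problems and give the corrected result."
        )
    return draft
```

Capping the revisions keeps the cost per run predictable, and passing the tools in as an explicit list makes it easy to hand the agent only what the task needs.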

You pick the model once and then spend months on the build. On this benchmark the build was worth 28 points on a model that never changed, which is a plain answer to where your effort should go.

Common questions

Does the model matter for AI agents?

It matters up to a point. In our open agent benchmark, holding the model fixed and changing only how the agent was built took the same model from 67 to 95 percent. Above a basic capability bar, most of the difference between a fragile agent and a reliable one comes from construction, not model choice.

What is an agent harness?

The harness is everything around the model that turns it into an agent: the plan it follows, the tools it can call, and the logic that decides when it is done. On a fixed model, the harness was worth 28 points of accuracy in our benchmark.

How do you choose an LLM for an agent?

Pick a capable model you can run under your constraints, then put the real effort into the build. Leaderboards show you roughly where the ceiling sits, but not how close your agent will get to it.

Is the model or the harness more important?

Above a capability floor, the harness. A weak model cannot be rescued by a good build, but among capable models the build decides most of the result.