Does the model matter for AI agents? Less than how you build the agent.
We built the same model into an agent six different ways. It scored 67 percent with the worst build and 95 percent with the best. The model never changed, so the 28-point gap comes entirely from how the agent was built.
One model, six builds
These runs come from our open benchmark for AI agents, the Agent Voyager Project (AVP). Every build ran the same cheap frontier model, Claude Haiku 4.5, on the same task: reading a dense PDF page and rebuilding it as a structured HTML table. The only thing that changed from one run to the next was the build around the model.
| Build | Accuracy | Cost/run | Pass rate |
|---|---|---|---|
| Plan + a self-check step | 95% | $0.33 | 10/10 |
| Plain prompt (baseline) | 82% | $0.35 | 9/10 |
| Plain prompt + packaged Skill | 81% | $0.31 | 8/10 |
| Plain prompt + external tool | 70% | $0.20 | 9/10 |
| Terser prompt | 68% | $0.59 | 7/10 |
| Worked example (few-shot) | 67% | $0.82 | 8/10 |
The full run is in Captain's Log #1. The best build beat the worst by 28 points, and since the model was identical in every row, that whole gap is the work of the build.
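The arithmetic behind that gap is easy to check. The sketch below copies the figures from the table and computes the spread and each build's distance from the plain-prompt baseline. The names `RUNS`, `accuracy_spread`, and `delta_vs_baseline` are ours, chosen for illustration, and are not part of AVP.

```python
# Figures copied from the table above:
# (build, accuracy in percent, cost per run in USD, passes out of 10).
RUNS = [
    ("Plan + a self-check step", 95, 0.33, 10),
    ("Plain prompt (baseline)", 82, 0.35, 9),
    ("Plain prompt + packaged Skill", 81, 0.31, 8),
    ("Plain prompt + external tool", 70, 0.20, 9),
    ("Terser prompt", 68, 0.59, 7),
    ("Worked example (few-shot)", 67, 0.82, 8),
]

def accuracy_spread(runs):
    """Best accuracy minus worst accuracy, in percentage points."""
    accuracies = [acc for _, acc, _, _ in runs]
    return max(accuracies) - min(accuracies)

def delta_vs_baseline(runs, baseline_name):
    """Each build's accuracy relative to the named baseline build."""
    baseline = next(acc for name, acc, _, _ in runs if name == baseline_name)
    return {name: acc - baseline for name, acc, _, _ in runs}

spread = accuracy_spread(RUNS)
deltas = delta_vs_baseline(RUNS, "Plain prompt (baseline)")
beat_baseline = [name for name, d in deltas.items() if d > 0]

print(f"spread: {spread} points")
for name, d in deltas.items():
    print(f"{name}: {d:+d}")
```

Read this way, the table shows that only the plan-and-self-check build landed above the baseline. Every other change to the build lowered the score, and the most expensive build was also the least accurate.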
Why a model leaderboard can't answer this
Leaderboards like Galileo's, Artificial Analysis, and the arena rankings are built to compare models. To produce a ranking they change the model, which makes them good at showing which model scores highest but unable to show how much the model matters relative to the build, since the build is not something they vary.
Our run does the opposite. We fixed the model and changed the build, so any difference in the score has to come from the build. Holding one side fixed while varying the other is what lets you separate what the model contributes from what the build does.
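The shape of that experiment can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the AVP harness: `compare_builds`, the two stand-in builds, and the `closes_table` scorer are all hypothetical, and the stand-ins return canned strings so the example runs without calling any model.

```python
from statistics import mean
from typing import Callable, Dict

# A "build" is anything that turns (model name, task) into an output string.
Build = Callable[[str, str], str]

def compare_builds(model: str, builds: Dict[str, Build], task: str,
                   score: Callable[[str], float], trials: int = 10) -> Dict[str, float]:
    """Hold the model and the task fixed; vary only the build."""
    results = {}
    for name, build in builds.items():
        results[name] = mean(score(build(model, task)) for _ in range(trials))
    return results

# Stand-in builds so the sketch runs offline. A real build would call the model.
def plain_build(model: str, task: str) -> str:
    return "<table><tr><td>a</td></tr></table>"

def careless_build(model: str, task: str) -> str:
    return "<table><tr><td>a</td></tr>"  # forgets the closing </table> tag

# Toy scorer (an assumption): 1.0 if the output ends with a closed table.
def closes_table(html: str) -> float:
    return 1.0 if html.strip().endswith("</table>") else 0.0

scores = compare_builds(
    model="fixed-model",  # placeholder name, the same for every build
    builds={"plain": plain_build, "careless": careless_build},
    task="rebuild the page as an HTML table",
    score=closes_table,
    trials=3,
)
print(scores)
```

Because `model` and `task` are passed unchanged to every build, any difference in `scores` can only come from the build functions themselves.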
How this sits with the wider research
Our result lines up with what others have found. Ethan Mollick wrote in February that the gaps between models have grown small enough that, for most people, the app and harness around the model matter more. Research on agent scaffolding keeps reaching the same conclusion: loading in every available component degrades performance, and a leaner build beats the fully loaded one (More Is Not Always Better).
What this run adds is a controlled measurement on a single fixed model, so the result rests on a number you can check rather than a claim you have to take on faith.
When the model really is the bottleneck
This only holds above a certain level of model capability. Below it, the model is the bottleneck and no build will save it, because a weak model wrapped in a careful harness is still a weak model.
That is what makes the result worth weighing. Haiku 4.5 is a capable, cheap model that sits well above that floor, which is the hardest case for the argument, since a strong model is exactly where you would expect the build to stop mattering. The build still changed the score by 28 points.
How to choose a model for an agent
Start with a capable model you can actually run under your constraints. That choice sets the ceiling on what the agent can reach, and a leaderboard is a fair guide to where that ceiling sits. Everything after it is the build, and the build is where the agent gets good. Give it a clear plan for the task, only the tools it needs, and a step that makes it check its own work before finishing.
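As a rough picture of what "a plan plus a self-check step" can look like in code, here is a minimal sketch. It is our own illustration, not the build used in the benchmark: `run_with_self_check`, `check_table`, and the fake model are hypothetical, and `call_model` stands in for whatever client you use to reach the model.

```python
from typing import Callable, List

def run_with_self_check(task: str,
                        call_model: Callable[[str], str],
                        check: Callable[[str], List[str]],
                        max_attempts: int = 3) -> str:
    """Ask for a plan, then draft, check, and redraft until the check passes."""
    plan = call_model(f"Write a short step-by-step plan for this task:\n{task}")
    feedback = ""
    draft = ""
    for _ in range(max_attempts):
        draft = call_model(f"Task:\n{task}\n\nPlan:\n{plan}\n{feedback}")
        problems = check(draft)
        if not problems:
            return draft
        # Feed the concrete problems back so the next draft can fix them.
        feedback = "\nFix these problems:\n" + "\n".join(problems)
    return draft  # best effort after max_attempts

def check_table(html: str) -> List[str]:
    """Toy structural check for an HTML table (an assumption, not AVP's scorer)."""
    problems = []
    if html.count("<tr>") != html.count("</tr>"):
        problems.append("unbalanced <tr> tags")
    if "</table>" not in html:
        problems.append("missing </table>")
    return problems

def make_fake_model() -> Callable[[str], str]:
    """Canned replies so the sketch runs offline: a plan, a bad draft, a fix."""
    replies = iter([
        "1. read the rows 2. emit the table",
        "<table><tr><td>1</td></table>",
        "<table><tr><td>1</td></tr></table>",
    ])
    return lambda prompt: next(replies)

result = run_with_self_check("rebuild the page as an HTML table",
                             make_fake_model(), check_table)
print(result)
```

The design choice worth noting is that the check returns specific problems rather than a pass or fail flag, so the retry prompt can tell the model exactly what to repair.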
You pick the model once and then spend months on the build. On this benchmark the build was worth 28 points on a model that never changed, which is a plain answer to where your effort should go.