CLI vs. MCP vs. Code Mode

We benchmarked CLI tools vs. raw MCP vs. Code Mode MCP across 12 Stripe tasks. Code Mode is 56% cheaper and uses 58% fewer tokens than the CLI. Here's why.

Patrick Kelly · Mar 16, 2026

Summary

We ran 12 real Stripe tasks across three agent configurations: CLI tool, raw MCP, and Code Mode MCP. All three passed every task, but the CLI used 2.4x as many tokens as Code Mode MCP. The MCP vs. CLI debate is the wrong frame. What matters is how your client uses the primitive.

---

There's a debate playing out across AI engineering forums, Discord servers, and Twitter threads: should you give your AI agent a CLI tool or an MCP server? Teams building on top of APIs like Stripe, GitHub, and Linear are genuinely asking this. The MCP camp says tool-calling is the future. The CLI camp says structured commands are more reliable. As with most debates in the AI world, almost nobody is bringing real numbers.

We ran a benchmark to settle it, and the results show that the CLI vs. MCP question is mostly a distraction. What actually matters is how your client uses MCP.

---

The Benchmark

We ran 12 real Stripe tasks across three configurations, using the same model (Claude Sonnet 4.6) and the same benchmark suite each time. The tasks ranged from simple reads (get account balance, list customers) to multi-step writes (create an invoice with line items, create a payment link with a new product).

Three configurations:

1. CLI tool: a single shell tool wrapping the stripe CLI. The agent calls it like stripe balance retrieve or stripe invoices create.
2. Raw MCP: the official @stripe/mcp server via npx, used as-is. The agent calls individual MCP tools one at a time.
3. Code Mode MCP: same @stripe/mcp server, but the agent writes TypeScript code to orchestrate the calls instead of invoking them one by one.

All three passed 12/12 tasks. The differences show up in how much context the agent burns through to get there, and whether it stays focused on the task or gets lost managing state between steps.

---

The Numbers

| Approach | Passed | Total Cost (USD) | Total Tokens | Avg Tokens/Task |
| --- | --- | --- | --- | --- |
| CLI | 12/12 | $2.22 | 711,555 | 59,296 |
| Raw MCP | 12/12 | $1.60 | 506,970 | 42,248 |
| Code Mode MCP | 12/12 | $0.98 | 294,924 | 24,577 |

Compared with Code Mode MCP, the CLI burns through 2.4x as many tokens and raw MCP uses 1.7x as many.
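For readers who want to check the headline ratios, here is a small sketch that recomputes them from the totals in the table. The helper names (tokenRatio, costSaving) are made up for this example; only the input figures come from the benchmark.

```typescript
// Totals copied from the results table above.
interface RunTotals {
  costUsd: number;
  tokens: number;
}

const totals = {
  cli: { costUsd: 2.22, tokens: 711555 },
  rawMcp: { costUsd: 1.6, tokens: 506970 },
  codeMode: { costUsd: 0.98, tokens: 294924 },
};

// How many times as many tokens `a` used compared with `b`.
function tokenRatio(a: RunTotals, b: RunTotals): number {
  return a.tokens / b.tokens;
}

// Fractional cost saving of `a` relative to `b` (0.5 means half the price).
function costSaving(a: RunTotals, b: RunTotals): number {
  return 1 - a.costUsd / b.costUsd;
}

const cliVsCode = tokenRatio(totals.cli, totals.codeMode);      // about 2.41
const rawVsCode = tokenRatio(totals.rawMcp, totals.codeMode);   // about 1.72
const savingVsCli = costSaving(totals.codeMode, totals.cli);    // about 0.56
const savingVsRaw = costSaving(totals.codeMode, totals.rawMcp); // about 0.39

console.log(cliVsCode.toFixed(2), rawVsCode.toFixed(2));
console.log(savingVsCli.toFixed(2), savingVsRaw.toFixed(2));
```

The 56% figure is the cost saving against the CLI, and the roughly 39% figure is the cost saving against raw MCP on the same server.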

Total tokens across 12 tasks

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'xyChart': {'width': 700, 'height': 350, 'plotColorPalette': '#184289'}}}}%%
xychart-beta
    x-axis ["CLI", "Raw MCP", "Code Mode MCP"]
    y-axis "Tokens" 0 --> 900000
    bar [711555, 506970, 294924]
```

[Chart: Total cost across 12 tasks (USD)]

---

Why CLI Looks Deceptively Good

On simple tasks, the CLI approach is remarkably token-efficient. Getting the account balance with the CLI costs just 3,001 tokens. The same task with raw MCP costs 19,172 tokens, more than six times as many.

Why? An MCP server advertises the full schema for every tool it offers, and a raw MCP client passes all of them to the model. The Stripe MCP server has dozens of tools. Every single LLM turn includes that entire schema in the context window, whether the agent needs it or not. The CLI tool has one schema entry: args: string. It's tiny.
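To make the schema-size point concrete, here is a rough sketch. Everything in it is an assumption for illustration: the tool name stripe_cli, the placeholder catalog of 24 tools standing in for "dozens", and the 4-characters-per-token heuristic. It does not reproduce the real Stripe MCP schemas or the benchmark's token counts.

```typescript
// A tool definition as it might be sent to the model alongside each request.
interface ToolSchema {
  name: string;
  description: string;
  parameters: Record<string, string>; // parameter name -> type (simplified)
}

// The CLI wrapper exposes one tool with a single string argument.
// The name "stripe_cli" is hypothetical.
const cliTools: ToolSchema[] = [
  { name: "stripe_cli", description: "Run a stripe CLI command", parameters: { args: "string" } },
];

// Hypothetical stand-in for an MCP server's catalog: `count` placeholder tools.
function fakeMcpCatalog(count: number): ToolSchema[] {
  return Array.from({ length: count }, (_, i) => ({
    name: `tool_${i}`,
    description: `Placeholder description for tool number ${i}`,
    parameters: { id: "string", limit: "number", metadata: "object" },
  }));
}

// Rough heuristic (an assumption): about 4 characters of JSON per token.
function estimateSchemaTokens(tools: ToolSchema[]): number {
  return Math.ceil(JSON.stringify(tools).length / 4);
}

// The schema is resent on every LLM turn, so the overhead scales with turn count.
function schemaOverhead(tools: ToolSchema[], turns: number): number {
  return estimateSchemaTokens(tools) * turns;
}

const cliPerTurn = estimateSchemaTokens(cliTools);
const mcpPerTurn = estimateSchemaTokens(fakeMcpCatalog(24));
console.log({ cliPerTurn, mcpPerTurn });
```

Real schemas carry longer descriptions and nested parameter objects, so the actual gap is likely larger than this toy catalog suggests.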

So if you're only doing simple, single-step API calls, a CLI wrapper is genuinely more efficient. The MCP evangelists are wrong to dismiss it entirely.

But the CLI advantage evaporates completely once tasks get complex.

---

Where MCP Wins, and Where It Still Falls Short

Look at the create_invoice task. This requires creating an invoice, adding line items, then finalizing it. Multiple dependent API calls need to share state.

| Approach (task: create_invoice) | Steps | Cost | Tokens |
| --- | --- | --- | --- |
| CLI | 19 | $1.52 | 497,556 |
| Raw MCP | 12 | $0.53 | 168,480 |
| Code Mode MCP | 4 | $0.13 | 38,847 |

The CLI agent had to call the tool, parse the output in natural language, figure out the next step, and repeat, 19 times. Each round trip through the LLM adds the growing conversation history to the context window. By step 10, the agent is carrying the weight of everything that came before it.

Raw MCP does better: 12 steps instead of 19. Structured tool responses are easier to reason about than CLI text output. But the same problem exists: every step is a full LLM round trip, and the token count compounds.
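The compounding can be sketched with a toy model. The assumptions are ours, not measurements: every turn resends a fixed overhead (system prompt plus tool schemas) and the whole conversation history, and every turn adds the same number of tokens to that history. The parameter values below are illustrative only.

```typescript
// Toy model of cumulative tokens when every step is a full LLM round trip.
// Assumption: each turn resends `fixedOverhead` plus the full history so far,
// and each turn appends `perTurn` tokens (tool call + result) to the history.
function totalTokens(turns: number, fixedOverhead: number, perTurn: number): number {
  let total = 0;
  let history = 0;
  for (let t = 0; t < turns; t++) {
    history += perTurn;               // this turn's tool call and its result
    total += fixedOverhead + history; // the model processes all of it again
  }
  return total;
}

// Closed form of the same sum: n * fixed + perTurn * n * (n + 1) / 2.
function totalTokensClosedForm(n: number, fixedOverhead: number, perTurn: number): number {
  return n * fixedOverhead + (perTurn * n * (n + 1)) / 2;
}

// Illustrative parameters, not benchmark measurements.
const FIXED = 1000;
const PER_TURN = 500;

console.log(totalTokens(4, FIXED, PER_TURN));  // 9000
console.log(totalTokens(12, FIXED, PER_TURN)); // 51000
console.log(totalTokens(19, FIXED, PER_TURN)); // 114000
```

The history term grows with the square of the turn count, which is why going from 4 turns to 19 costs far more than 4.75x in this model.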

Code Mode collapses 12 steps into 4. The agent writes a short TypeScript program, executes it, and the program handles the looping and state management internally, without an LLM turn for each iteration. The context window stays small.

CLI: 19 LLM turns. Raw MCP: 12. Code Mode: 4.

```mermaid
graph TD
    subgraph CLI["CLI"]
        direction TB
        C1[LLM: call stripe CLI] --> C2[Parse text output]
        C2 --> C3[LLM: call stripe CLI]
        C3 --> C4[Parse text output]
        C4 --> C5[LLM: call stripe CLI]
        C5 --> C6["... 14 more turns"]
    end

    subgraph RAW["Raw MCP"]
        direction TB
        R1[LLM: call MCP tool] --> R2[Structured response]
        R2 --> R3[LLM: call MCP tool]
        R3 --> R4[Structured response]
        R4 --> R5["... 8 more turns"]
    end

    subgraph CODE["Code Mode"]
        direction TB
        M1[LLM: write TypeScript program] --> M2[Execute in sandbox]
        M2 --> M3[Program calls MCP tools internally]
        M3 --> M4[Return result to LLM]
    end

    style CLI fill:#184289,color:#fff,stroke:#002B56
    style RAW fill:#1E6969,color:#fff,stroke:#002B56
    style CODE fill:#002B56,color:#fff,stroke:#184289
```

---

MCP Is Infrastructure, Not Strategy

MCP is a transport primitive, like HTTP. HTTP doesn't make web apps fast. Architecture does.

The MCP vs. CLI debate is the wrong frame. The right question is: how does your client use the primitive?

Raw MCP gives the agent a box of individual tools and says "good luck." The agent has to reason about sequencing, state, and error handling through natural language, one tool call at a time. That's expensive, especially when the model has to re-derive context at every step.

Code Mode changes the interaction model. Instead of the LLM calling tools, the LLM writes code that calls tools. That code runs in a tight loop without LLM involvement. The model's job shrinks to "write the right program" rather than "make the right sequence of decisions." On multi-step tasks, this is a fundamentally different and much cheaper computational pattern.
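Here is a minimal sketch of the kind of program a Code Mode agent might write for an invoice task. The callTool interface, the tool names, and the in-memory mock are all assumptions made so the example runs on its own; they are not the actual Code Mode sandbox API, and the tool names have not been checked against the Stripe server.

```typescript
// Hypothetical tool-calling interface exposed to generated code in the sandbox.
type ToolCall = (name: string, args: Record<string, unknown>) => Promise<Record<string, unknown>>;

interface LineItem {
  description: string;
  amount: number; // smallest currency unit, e.g. cents
}

// One generated program: several dependent tool calls, no LLM turn in between.
async function createInvoice(callTool: ToolCall, customer: string, items: LineItem[]) {
  const invoice = await callTool("create_invoice", { customer });
  const invoiceId = String(invoice.id); // state lives in a variable, not in the context window
  for (const item of items) {
    await callTool("create_invoice_item", { invoice: invoiceId, customer, ...item });
  }
  const finalized = await callTool("finalize_invoice", { invoice: invoiceId });
  return { id: invoiceId, status: String(finalized.status) };
}

// In-memory stand-in for the MCP server so the sketch runs without network access.
function makeMockTools(): { callTool: ToolCall; calls: string[] } {
  const calls: string[] = [];
  const callTool: ToolCall = async (name, args) => {
    calls.push(name); // record each tool invocation
    if (name === "create_invoice") return { id: "in_test_1", status: "draft" };
    if (name === "finalize_invoice") return { id: args.invoice, status: "open" };
    return { ok: true };
  };
  return { callTool, calls };
}

const { callTool, calls } = makeMockTools();
createInvoice(callTool, "cus_test_1", [
  { description: "Design work", amount: 5000 },
  { description: "Hosting", amount: 1200 },
  { description: "Support", amount: 800 },
]).then((result) => {
  console.log(result, `${calls.length} tool calls from one generated program`);
});
```

The loop over line items is the part that would otherwise cost one LLM round trip per item. Here the model only sees the final result.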

Raw MCP calls one tool per LLM turn. Code Mode writes a program that calls many.


---

What This Means in Practice

Real agent workflows look like create_invoice, not a balance lookup. They involve loops, conditionals, error recovery, and chained writes. If your agent is doing any of that through raw tool calls, you're paying the compounding context window tax on every iteration.

A few practical takeaways:

If you're building an agent that calls APIs: Give it an MCP server, not a CLI wrapper. Structured schemas make tool responses easier to process and reduce hallucination on output parsing.

If your tasks involve loops or multi-step writes: Don't rely on the LLM to manually sequence tool calls. Use a client that lets the agent write code to orchestrate the calls. The token savings on complex tasks are not marginal. They're 4-10x. On create_invoice, for example, Code Mode used about 4.3x fewer tokens than raw MCP (38,847 vs. 168,480).

MCP server quality matters, but less than you think: A well-designed MCP server helps, but the biggest wins come from how the client consumes it. Code Mode MCP with the out-of-the-box Stripe server is 39% cheaper than raw MCP, using exactly the same server.

---

Closing Thought

The people debating MCP vs. CLI are mostly right about the symptoms and wrong about the cause. CLI tool efficiency comes down to schema size, not architecture. MCP's advantage over CLI comes from structured responses, not the protocol itself.

The client layer is the variable. Give the agent code execution and the cost curve changes. Keep it calling tools one at a time and the context window tax compounds with every step.

The debate should be about client architecture, not transport format.