Context engineering

The discipline of deciding what enters each model call — and why it now controls both the cost and the quality of your AI agents.

Prompt engineering was about wording a single request well. Context engineering is about everything else the model sees: the system prompt, the tool definitions, the documents, and — in an agent — the entire running transcript of earlier steps. As soon as a model stops answering one question and starts doing multi-step work, the prompt becomes a rounding error. What you put around it, on every call, is what decides whether the agent is fast, accurate, and affordable.

That shift matters because the economics changed. Per-token prices keep falling, yet teams keep getting larger bills. The reason is structural, and it is the heart of context engineering.

Why the context, not the prompt, is the cost center

A model re-reads its entire context on every call. It has no memory between calls, so each step re-sends everything that came before. An agent that takes a dozen steps does not pay for twelve equal calls; it pays for the first step's context, then the first two, then the first three, and so on. Cumulative cost follows a triangular curve, growing roughly with the square of the step count, which is why agents routinely come in three to five times over budget.
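The triangular shape can be checked with a few lines of arithmetic. This is a minimal sketch, not a billing model: the function name and the token figures (a 2,000-token starting context, 1,000 tokens added per step) are illustrative assumptions, and output tokens are ignored.

```python
def cumulative_input_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total input tokens billed when every call re-sends the whole transcript."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # this call reads the entire context so far
        context += tokens_per_step  # the step's output joins the transcript
    return total


# Assumed figures, for illustration only.
naive_estimate = 12 * (2_000 + 1_000)                  # "twelve calls" thinking
actual = cumulative_input_tokens(2_000, 1_000, 12)     # what is actually re-read

print(naive_estimate)  # 36000
print(actual)          # 90000
```

Under these assumptions the real figure is 2.5 times the naive one, and the gap widens as steps are added.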

You can see the shape of this for your own workload in the token cost calculator: add a few steps and the monthly figure climbs far faster than the step count does.

The two kinds of bloat

Context grows from two distinct sources, and it helps to name them.

Definition bloat is the cost you pay before any work begins. Every tool the agent might use is described to the model up front: names, parameters, types, descriptions. A handful of MCP (Model Context Protocol) servers can load tens of thousands of tokens of schemas into the window before the first real instruction. You pay for all of it, on every call, whether the tool is used or not.
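A rough way to size definition bloat is to serialize the tool schemas and multiply by the number of calls. The sketch below assumes a crude four-characters-per-token heuristic rather than a real tokenizer, and the two schemas are hypothetical examples, not taken from any actual server.

```python
import json


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def definition_overhead(tool_schemas: list[dict], calls: int) -> int:
    """Tokens spent re-sending tool definitions across a run of `calls` model calls."""
    per_call = sum(estimate_tokens(json.dumps(schema)) for schema in tool_schemas)
    return per_call * calls


# Hypothetical tool definitions, for illustration only.
example_schemas = [
    {
        "name": "search_files",
        "description": "Search files by glob pattern.",
        "parameters": {"pattern": {"type": "string"}, "limit": {"type": "integer"}},
    },
    {
        "name": "run_query",
        "description": "Run a read-only SQL query.",
        "parameters": {"sql": {"type": "string"}},
    },
]

print(definition_overhead(example_schemas, calls=12))
```

The overhead scales linearly with the number of calls, regardless of whether any of the tools is ever invoked.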

Result bloat is the cost that accumulates during the work. Each tool result — a file, a query response, a page of a document — enters the context and then rides along on every subsequent step. By the end of a long task, the bulk of what you are paying to reprocess is intermediate output the model has already seen.
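To see how much of a run is spent re-reading old results, count how many later steps each result rides along on. This sketch assumes one tool result per step and that a result is re-sent on every step after the one that produced it; the sizes are made up.

```python
def rebilled_result_tokens(result_sizes: list[int]) -> int:
    """Tokens billed for re-sending results the model has already seen.

    The result from step i (0-based) is re-sent on each of the remaining
    len(result_sizes) - i - 1 steps.
    """
    steps = len(result_sizes)
    return sum(size * (steps - index - 1) for index, size in enumerate(result_sizes))


# Four steps, each returning a 5,000-token result (assumed figures).
sizes = [5_000, 5_000, 5_000, 5_000]
first_read = sum(sizes)                   # tokens read once, when each result arrives
re_read = rebilled_result_tokens(sizes)   # tokens read again on later steps

print(first_read)  # 20000
print(re_read)     # 30000
```

Even in this short run, re-reading already outweighs first reads, and the earliest results cost the most.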

Why a bigger context window will not save you

The instinct is to reach for a model with a larger context window. It is the wrong fix. A bigger window does not make the tokens cheaper; it just raises the ceiling you can bloat up to. And capacity is not the only problem — quality is.

Beyond a point, more context makes answers worse, not better. Attention spreads thin, the model loses the signal in the noise, and tool-call accuracy degrades as the window fills — the effect often called context rot. A larger window lets you defer the reckoning; it does not remove it. Context engineering treats the window as something to keep clean, not something to fill.

The usual levers — and where they stop

Most cost-reduction advice is a stack of partial measures. Each helps; none addresses the root.

Prompt caching bills the unchanged, re-sent portion of the context at a fraction of the normal rate. It is the highest-ROI first move, and you should use it — but it does not remove the intermediate results that dominate late-step cost, and it does not remove the round trips.
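The effect of caching on the bill can be sketched by pricing the re-sent prefix at a discount. The 0.1 multiplier below is an assumed rate for illustration; actual discounts, cache-write surcharges, and cache expiry vary by provider and are ignored here.

```python
def billed_token_equivalents(
    base_tokens: int,
    tokens_per_step: int,
    steps: int,
    cached_multiplier: float = 1.0,
) -> float:
    """Input cost of a run, in full-price token equivalents.

    Tokens already sent on the previous call are billed at `cached_multiplier`
    times the normal rate; tokens new to this call are billed in full.
    """
    cost = 0.0
    cached_prefix = 0
    context = base_tokens
    for _ in range(steps):
        fresh = context - cached_prefix
        cost += fresh + cached_prefix * cached_multiplier
        cached_prefix = context      # everything sent now is cached for the next call
        context += tokens_per_step   # the step's output joins the transcript
    return cost


without_cache = billed_token_equivalents(2_000, 1_000, 12)
with_cache = billed_token_equivalents(2_000, 1_000, 12, cached_multiplier=0.1)
print(round(without_cache), round(with_cache))
```

Under these assumptions the bill falls sharply, but the model still processes the same growing context on every call and the number of round trips is unchanged.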

Model routing sends easy steps to a cheaper model. Useful, but the bloated context travels with the task regardless of which model receives it.
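A minimal routing sketch makes the limitation visible: the price per token changes, but the context size does not. The model names, prices, and step categories below are hypothetical placeholders, not real products or rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    price_per_million_input: float  # hypothetical price, in dollars


CHEAP = Model("small-model", 0.50)
STRONG = Model("large-model", 5.00)

# Assumed set of step kinds simple enough for the cheaper model.
EASY_STEPS = {"extract", "classify", "format"}


def route(step_kind: str) -> Model:
    """Pick a model by step kind; anything not known to be easy goes to the strong model."""
    return CHEAP if step_kind in EASY_STEPS else STRONG


def step_cost(step_kind: str, context_tokens: int) -> float:
    """Input cost of one step: the full context is billed at the routed model's rate."""
    model = route(step_kind)
    return context_tokens / 1_000_000 * model.price_per_million_input


# The same 100,000-token context is billed in full either way.
print(step_cost("classify", 100_000))
print(step_cost("plan", 100_000))
```

Routing lowers the rate on some steps, yet every step still pays for the whole accumulated context.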

Retrieval and compaction trim what goes in and summarize what piles up. They help, at the cost of complexity and the risk of dropping something that mattered.
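Compaction, in its simplest form, keeps the most recent steps verbatim and replaces everything older with a summary. The sketch below assumes the transcript is a list of strings and takes the summarizer as a parameter; in practice that would usually be another model call, and the stand-in used here is only a placeholder.

```python
from typing import Callable


def compact(
    transcript: list[str],
    keep_last: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace all but the last `keep_last` entries with a single summary entry."""
    cut = len(transcript) - keep_last
    if cut <= 0:
        return list(transcript)  # nothing old enough to summarize
    older, recent = transcript[:cut], transcript[cut:]
    return [summarize(older)] + recent


def placeholder_summary(older: list[str]) -> str:
    """Stand-in summarizer: records only how many steps were dropped."""
    return f"[{len(older)} earlier steps summarized]"


history = ["read config", "list files", "open report.csv", "compute totals"]
print(compact(history, keep_last=2, summarize=placeholder_summary))
# ['[2 earlier steps summarized]', 'open report.csv', 'compute totals']
```

The risk named above lives in the summarizer: whatever it leaves out is gone for every later step.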

All of these manage the symptom — a context that is too big — after the fact. The structural move is to stop the work from entering the context in the first place.

The structural fix: keep the work out of the context

Instead of making many discrete tool calls, the model writes one program against a compact, typed interface and runs it. The tools are described once, in a form far smaller than full schemas. The intermediate results stay in the execution runtime, where the program can filter and combine them — they never re-enter the model's context. Only the final answer comes back.
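The pattern can be sketched in a few lines. Everything here is hypothetical: `list_orders` stands in for a tool with a large result, `model_written_program` stands in for code the model would generate, and `run` is a toy stand-in for an execution runtime with no isolation at all.

```python
import json
from types import SimpleNamespace
from typing import Callable, TypedDict


class Order(TypedDict):
    id: int
    customer: str
    total: float


def list_orders() -> list[Order]:
    """Hypothetical tool with a large result: 10,000 synthetic orders."""
    return [
        {"id": i, "customer": f"c{i % 50}", "total": float(i % 500)}
        for i in range(10_000)
    ]


# The compact interface the program is written against.
tools = SimpleNamespace(list_orders=list_orders)


def model_written_program(tools: SimpleNamespace) -> str:
    """Stand-in for model-generated code: filter and aggregate inside the runtime."""
    orders = tools.list_orders()  # all rows stay in runtime memory
    large = [order for order in orders if order["total"] >= 499]
    revenue = sum(order["total"] for order in large)
    return f"{len(large)} orders at or above 499, totalling {revenue:.0f}"


def run(program: Callable[[SimpleNamespace], str]) -> str:
    """Toy runtime: execute the program and hand back only its final answer."""
    return program(tools)


answer = run(model_written_program)
raw_size = len(json.dumps(list_orders()))  # what a direct tool call would return

print(answer)  # 20 orders at or above 499, totalling 9980
print(raw_size, len(answer))
```

With a direct tool call, the full serialized result would enter the context and be re-sent on every later step; here only the one-line answer does.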

This is the approach now called Code Mode. On multi-tool, large-result work it is where the documented reductions come from: Anthropic reported a single task falling from roughly 150,000 tokens to about 2,000, a drop of about 98.7%, by letting code handle the intermediate data. (On a single, simple tool call there is little to save; the gains concentrate where the bloat does.)

Done well, context engineering is not a tuning exercise you repeat forever. It is a structural decision about where the work happens — and the same structure that cuts the bill also keeps attention sharp and stays clear of rate-limit ceilings. At Port of Context, that runtime runs on infrastructure you own, isolated and contained by default, with every run observable.

See it on your own numbers.

Put your workload into the calculator, or see how Code Mode runs the same work on infrastructure you control.