Did cheaper AI inference make AI agents cheaper to run?

Not the way the price chart implies. Per-token inference prices fell about 430× between 2021 and 2026, from $60 per million tokens for GPT-3 Davinci to $0.14 for DeepSeek V3. But agentic workloads consume far more tokens: Goldman Sachs estimates a resident agent burns over 100,000 tokens a day versus roughly 1,000 for a 2021 chatbot query, a 100× increase. Net savings on a resident-agent workload land near 4.3×, not 430×, and heavy real-world use erases even that.

How does the Jevons paradox apply to AI inference costs?

William Stanley Jevons observed in 1865 that more efficient steam engines made England burn more coal, not less, because efficiency made coal worth using for new tasks. AI inference follows the same pattern. Cheaper tokens did not become a discount; they became a budget, spent on retries, reflection, tool calls, and agents that run all day. The price per token keeps falling while total token consumption, and the bill, climbs.

Inference got 430× cheaper. Your agent didn't.

June 3, 2026

I run agents constantly. One is working in another window while I write this; others ran overnight while I slept. Between them they go through more tokens in a day than I used to spend in months of asking a chatbot questions, and the bill is small enough that I keep forgetting to look at it.

A month ago I wrote that the cost of intelligence hit zero. GPT-3 Davinci cost $60 per million tokens in November 2021. DeepSeek V3 does the same work for $0.14 per million. That’s a 430× cut on the raw price, and a16z’s performance-adjusted number runs closer to 1,000×. Open-weight models set a floor no lab can raise. None of that has changed.

What stayed with me afterward was the part I left out: where the money actually went.

In 2021 a chatbot query ran on about a thousand tokens. You typed a question, the model answered, the meter stopped. In 2026 the model doesn’t stop. Goldman Sachs, in a May report called “Decoding the Agentic Economy,” puts a background AI copilot at roughly 5,000 tokens a day and a resident agent (software that works on your behalf instead of waiting to be prompted) at over 100,000. The query has grown into a worker that runs all day, and the token count grew with it. Whatever each token costs, there are now a hundred times as many of them.

Do the division and the headline falls apart. Price dropped 430×. Token count rose 100×. Net savings on a resident-agent workload: 4.3×.
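For anyone who wants to redo the division with their own numbers, here is a minimal sketch. The prices and token counts are the ones quoted in this piece; the function and variable names are made up for illustration, and Goldman's "over 100,000" is treated as exactly 100,000, so the resident-agent result is an upper bound on the savings.

```python
# Prices in USD per million tokens, as quoted in this piece.
PRICE_2021 = 60.00   # GPT-3 Davinci, November 2021
PRICE_2026 = 0.14    # DeepSeek V3

# Token counts: the 2021 figure is per chatbot query; the 2026 figures are
# Goldman's per-day estimates ("over 100,000" is taken as exactly 100,000).
BASELINE_TOKENS = 1_000
WORKLOADS = {
    "background copilot": 5_000,
    "resident agent": 100_000,
}


def net_savings(tokens: int) -> float:
    """Price drop divided by token growth relative to the 2021 query."""
    price_drop = PRICE_2021 / PRICE_2026
    token_growth = tokens / BASELINE_TOKENS
    return price_drop / token_growth


if __name__ == "__main__":
    print(f"raw price drop: {PRICE_2021 / PRICE_2026:.0f}x")
    for name, tokens in WORKLOADS.items():
        print(f"{name}: {net_savings(tokens):.1f}x cheaper")
```

The unrounded price ratio comes out near 429, which this piece rounds to 430; the resident-agent line lands at 4.3 either way.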

The gap between four-point-three and four hundred and thirty is the whole story. It isn’t a rounding error or a bug better engineering will close. It’s what happens when you make a resource cheap. William Stanley Jevons noticed it in 1865: when steam engines got more efficient, England burned more coal, not less, because efficiency made coal worth using for things it had never touched.

The 430× became a budget, not a discount. We spent it on capability: retries, reflection, tool calls, multi-agent pipelines, agents that run all day instead of answering once and going quiet. The price fell, the ambition rose to meet it, and the bill barely moved.

The trap cuts both ways. A resident agent burning 100,000 tokens a day costs about a penny and a half at today’s prices. The 2021 chatbot query it replaced cost six cents: one answer, a thousand tokens, sixty dollars a million. The agent does vastly more and still costs less than the thing that did almost nothing. Intelligence really did get cheap. It just stopped showing up as a smaller number on the invoice and started showing up as a bigger number in the “what it can do” column.
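The same comparison in dollars, as a rough sketch. The helper name `cost_usd` is hypothetical, and the calculation assumes a single flat price per million tokens, ignoring any split between input and output pricing.

```python
def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a token count at a flat USD price per million tokens."""
    return tokens * price_per_million / 1_000_000


# Figures quoted in this piece.
agent_day_2026 = cost_usd(100_000, 0.14)   # resident agent, one day
chat_query_2021 = cost_usd(1_000, 60.00)   # one chatbot query

if __name__ == "__main__":
    print(f"2026 resident agent, per day: {agent_day_2026 * 100:.1f} cents")
    print(f"2021 chatbot query:           {chat_query_2021 * 100:.1f} cents")
```

That works out to about 1.4 cents for the agent's full day against 6 cents for the single 2021 query.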

So the 430× is real and the 4.3× is real, and you have to know which one you’re holding.

And 4.3× is the generous end of it. Goldman’s 100,000-token agent is a light user; the ones I run burn millions a day. At that volume the 430× doesn’t shrink to 4.3×; it disappears entirely, because I did what everyone does with something cheap and used far more of it.
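A short sketch of where the savings run out, using the prices quoted in this piece. The 2,000,000-token figure below is a hypothetical stand-in for "millions a day," not a measured number, and the function names are illustrative.

```python
PRICE_2026 = 0.14                              # USD per million tokens (DeepSeek V3)
QUERY_2021_COST = 1_000 * 60.00 / 1_000_000    # one 2021 chatbot query, in USD


def break_even_tokens() -> float:
    """Daily token count at which the agent costs as much as one 2021 query."""
    return QUERY_2021_COST / PRICE_2026 * 1_000_000


def bill_ratio(tokens_per_day: int) -> float:
    """Agent's daily cost as a multiple of the 2021 query's cost."""
    agent_cost = tokens_per_day * PRICE_2026 / 1_000_000
    return agent_cost / QUERY_2021_COST


if __name__ == "__main__":
    print(f"break-even: {break_even_tokens():,.0f} tokens/day")
    # Hypothetical heavy user; the article only says "millions a day".
    print(f"2,000,000 tokens/day: {bill_ratio(2_000_000):.1f}x the 2021 bill")
```

Under these assumptions the crossover sits a little under 430,000 tokens a day. Past that point the agent's daily bill is larger than the 2021 query it replaced, even at the lower price.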

I wrote in March, in “The real tokenomics,” that SaaS pricing has no word for what an agent costs. This is the other half of that. The companies and investors who anchored on “inference is going to zero” built their models on the 430×. The per-token price chart points down and to the right, and it’s easy to extrapolate it into free. But nobody runs a token. They run a task, and the task got hungrier at almost the exact rate the token got cheaper. Margin math done on the price curve is margin math done on the wrong number.

You can see it in the decks. “Costs fall every quarter” is true and beside the point if your product is an agent doing ten times more work per quarter to stay competitive. The job is the unit that matters, and the job’s appetite for tokens climbs with capability.

The labs know this. It’s why the frontier keeps shipping models that think longer, call more tools, and run in longer loops. More capable means more tokens, and more tokens at a lower price is still more revenue. The price war and the capability war are the same war, and they net out to a bill that doesn’t fall the way the chart promises.

None of this is an argument against cheap inference. Cheap inference is the best thing that has happened to software in a decade. It’s an argument against reading one number and thinking you’ve read the other. The cost of intelligence hit zero. The cost of an intelligent system did not, because we keep building bigger systems with the savings.

Price the token at 430×. Price the work at 4.3×. The difference is your business.

I built “Where the 430× goes,” an interactive companion to this piece.
