Cheap AI Isn't Ending. It's Forking.

Three things happened in the last three months. Most people are reading them as one story.

In March, Anthropic quietly tightened Claude Max limits during US morning hours. People on the $200-a-month plan started hitting walls earlier than they used to, and on March 31 the company said the quiet part out loud: demand was outrunning capacity. They eased the throttle in May, after locking in a deal with SpaceX for 300 megawatts of new compute at the Colossus 1 datacenter. The pattern across those weeks was tighten, signal, ease - the kind of cycle that tells you the system is running close to a ceiling.

On June 1, GitHub Copilot moved every plan onto usage-based billing. Sticker prices stayed the same. Pro is still $10, Pro+ is still $39, Business is still $19 a seat. What changed was the billing basis. Token consumption is now metered against GitHub’s listed API rates, with a monthly credit allotment included and overages billed on top. Within a week, the heavy users posted their bills. One developer went from $29 a month to $750. Another posted a $3,000 month. Same plan name, same product page, very different invoice.
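The mechanics are easy to see as arithmetic. Here is a minimal sketch, assuming a hypothetical included credit and hypothetical usage figures. The function name and every number below are illustrative, not GitHub's published allotments or rates.

```python
def monthly_bill(plan_price: float, included_credit: float, metered_usage: float) -> float:
    """Sticker price plus any metered usage beyond the included credit.

    All values are in dollars. `included_credit` and `metered_usage` are
    illustrative placeholders, not GitHub's actual numbers.
    """
    overage = max(0.0, metered_usage - included_credit)
    return plan_price + overage


# A light user stays inside the allotment and pays only the sticker price.
light = monthly_bill(plan_price=10.0, included_credit=10.0, metered_usage=4.0)

# A heavy agentic user on the same plan pays the sticker price plus overage.
heavy = monthly_bill(plan_price=10.0, included_credit=10.0, metered_usage=500.0)
```

The plan name and sticker price are identical in both calls. Only the metered usage differs, and that is where the invoice diverges.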

On June 9, Anthropic shipped Claude Fable 5. It’s the most capable model the company has ever released, and the API price is $10 input and $50 output per million tokens - double Opus 4.8 on both sides. Subscribers get it free until June 22. After that, it moves out of the subscription tier and into usage credits. “Once capacity grows,” they said, the standard subscription access might come back. The timeline isn’t given.

As it turned out, the trial window was even shorter than announced - Fable 5 was pulled from broad subscription access before June 22 arrived. The access closed quickly, but the pricing precedent told the industry which way the wind is blowing for frontier models, and that signal isn’t going anywhere.

The headline reading on all of this is: AI is getting expensive again. The cheap-tokens era is over. Time to budget for it.

I think that reading is half right in a way that hides the actual shift.

The Two Curves

What’s actually happening is that AI pricing is splitting in two.

One curve is going down fast. The price to do a given task at a given quality level keeps falling - Epoch AI puts the median rate of decline at about 10x per year for equivalent intelligence, and on some benchmarks the rate is faster. a16z called this “LLMflation” two years ago and it hasn’t slowed. GPT-4 quality, which cost $20 per million tokens in late 2022, runs at $0.40 today. Gemini Flash, Haiku 4.5, Sonnet 4.6 - the commodity shelf gets cheaper every quarter.

The other curve is stable, or rising. Frontier reasoning models hold their price. Opus 4.8 didn’t get cheaper at launch the way mid-tier models do. Fable 5 came in at double the input cost of the model it nominally replaces. The frontier shelf doesn’t follow LLMflation - it sets its own price based on what the market will pay for the best available intelligence right now.

If you only look at one curve, the story is incoherent. The commodity curve says cheap tokens are getting cheaper. The frontier curve says the most useful tokens are getting more expensive. Both are true at the same time, and the gap between them is widening.
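To make the widening gap concrete, here is a rough sketch. It assumes the frontier input price stays flat at the $10 figure quoted above and that a commodity-tier price falls 10x per year from a hypothetical $1 starting point. Both assumptions are mine, chosen for illustration.

```python
def price_after(start_price: float, annual_decline_factor: float, years: float) -> float:
    """Price after `years` if it falls by `annual_decline_factor`x every year."""
    return start_price / (annual_decline_factor ** years)


FRONTIER_INPUT = 10.0   # $/M input tokens; assumed flat over the whole period
COMMODITY_START = 1.0   # $/M tokens; hypothetical starting point


def frontier_to_commodity_ratio(years: float) -> float:
    """How many times more expensive the frontier shelf is than the commodity shelf."""
    commodity = price_after(COMMODITY_START, annual_decline_factor=10.0, years=years)
    return FRONTIER_INPUT / commodity
```

Under these assumptions the ratio grows by 10x each year: a flat frontier price is enough to open the gap, even if it never rises.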

The easiest way to think about it is a consulting firm. A Principal partner runs $1,000 an hour, books months in advance, and you don’t put one on a routine deliverable. An Associate runs $300 an hour, has open time, and handles most of the work that doesn’t need a Principal. The price gap is the firm telling you who to use for which task. The same shape is showing up in AI. Fable 5 and Opus 4.8 are the Principals. Sonnet 4.6 and Haiku 4.5 are the Associates. The right question for any task isn’t “which model is best” - it’s “which model is appropriate,” and the answer is mostly going to be the cheaper one.

The three events I started with are not about AI getting more expensive. They are about the industry recognizing that these two curves are different products, and starting to price them differently.

The Copilot transition recognized it first. A flat $10 plan can subsidize chat-style usage, but it cannot subsidize a developer running parallel agent sessions for eight hours. The math only works if usage clusters around a typical level with no heavy tail. Agentic coding added the tail. Usage-based billing isn’t a price increase - it’s an admission that the flat fee was the wrong instrument for the actual product.

The Fable 5 carve-out recognized it more explicitly. Anthropic said, in effect: this model is not in the deal. You can have access during the trial window, and then frontier capacity moves to metered pricing while the commodity tier stays in your subscription. That’s a product split, not a subscription squeeze.

The Anthropic Max tightening was the warning shot. They had a flat-rate plan priced for what they thought heavy use would look like. Heavy use turned out to be agents running overnight. They throttled, signaled, raised limits when capacity allowed, and built the playbook in public for how to manage frontier compute under flat-rate. The playbook only goes so far before you change the pricing model.

Why It’s Happening

Compute is the binding constraint, not algorithms. The bottleneck moved from “we don’t know how to build a smarter model” to “we cannot get the hardware to run it at the volume we need.” High-bandwidth memory, the stacked DRAM that frontier inference depends on, is made by three companies - SK Hynix, Samsung, Micron. Lead times on data-center GPUs run 36 to 52 weeks. The power-delivery infrastructure - transformers, switchgear, substations - is in shortage in the regions where the largest training and inference clusters live. The SpaceX deal that gave Anthropic 220,000 GPUs in May is meaningful, but it bought roughly one quarter’s worth of relief on a constraint that takes 12 to 24 months to substantively change. Compute pricing for SOTA is rationing right now, not market pricing.

Flat-rate subscriptions are a marketing layer on top of usage-based products. Every consumer SaaS pricing model assumes that the cost to serve any given customer is small and predictable. AI breaks both assumptions. Cost-to-serve is high and variable, and a single customer running agents at scale costs the provider real money. Flat-rate works when 80% of users have similar usage and the marginal cost is rounding error. Agentic coding broke both: a small fraction of users started consuming a large fraction of capacity, and the cost-per-user at that fraction exceeded the price. Once that gap shows up, you don’t get to keep flat-rate as your pricing - you get to keep flat-rate as your marketing while you re-price underneath.
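A toy model of that gap, with every figure hypothetical: 100 users on a $20 flat plan, first with uniform chat-style cost-to-serve, then with five of them running agents.

```python
def flat_rate_margin(price: float, costs_to_serve: list[float]) -> float:
    """Total monthly margin across all users who pay the same flat price."""
    return sum(price - cost for cost in costs_to_serve)


# Hypothetical chat-style population: 100 users, each costing $4 a month to serve.
chat_only = [4.0] * 100

# Same population, except 5 users now run agents and cost $400 each to serve.
with_agents = [4.0] * 95 + [400.0] * 5

margin_before = flat_rate_margin(20.0, chat_only)    # 100 * 16
margin_after = flat_rate_margin(20.0, with_agents)   # 95 * 16 - 5 * 380
```

In this sketch, five percent of users are enough to turn a healthy margin negative. The other ninety-five are still profitable, which is why the flat plan can survive for them.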

Jevons keeps total spend rising even as per-token price falls. This is the part I’ve watched closely at Finsi, running cascading agents in production. Per-token prices dropped roughly 80% year over year. Our total spend went up, not down. Cheaper tokens enabled larger context windows, which enabled more tools, which enabled longer chains, which enabled more parallel agents. The bottom of the price curve unlocks workload that didn’t exist before. This is exactly the Jevons paradox at the infrastructure layer - the more efficient the resource gets, the more total resource gets consumed. The pattern has shown up in steam, in electricity, in compute. There’s no reason to expect AI inference to be different.
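The arithmetic behind that is just spend = price x volume. A small sketch, using the 80% price drop mentioned above and a hypothetical 8x volume increase (the 8x is an illustrative figure, not a Finsi measurement):

```python
def breakeven_volume_multiplier(price_drop: float) -> float:
    """Volume growth at which total spend is unchanged after a price drop.

    `price_drop` is a fraction between 0 and 1, e.g. 0.80 for an 80% drop.
    """
    if not 0.0 <= price_drop < 1.0:
        raise ValueError("price_drop must be in [0, 1)")
    return 1.0 / (1.0 - price_drop)


def spend_ratio(price_drop: float, volume_multiplier: float) -> float:
    """New total spend divided by old total spend."""
    return (1.0 - price_drop) * volume_multiplier


# With an 80% price drop, spend rises as soon as volume grows more than 5x.
breakeven = breakeven_volume_multiplier(0.80)

# Hypothetical: token volume grows 8x over the same year.
ratio = spend_ratio(0.80, 8.0)
```

An 80% drop sounds like a guaranteed saving, but it only takes a bit more than 5x the tokens to erase it. Longer chains and more parallel agents get there quickly.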

The middle gets squeezed. Two curves diverging from each other leaves a gap. The gap is occupied by yesterday’s frontier model, still selling at a premium price, increasingly outperformed by the commodity tier on simple tasks and outperformed by the new frontier on hard tasks. The economic logic for the middle disappears - if you have a task that the commodity tier handles, you use the commodity tier. If you have a task that the commodity tier doesn’t handle, you skip the middle and go straight to frontier. Products positioned in the middle - and there will be many - have a narrow window before their pricing becomes indefensible.

AI spend stops looking like SaaS and starts looking like COGS. Software companies have spent twenty years training CFOs to read SaaS as a line item: predictable, per-seat, gross-margin-friendly. AI inference, billed by token, behaves more like AWS than like Salesforce. It scales with usage, fluctuates with feature engagement, and shows up in COGS rather than OPEX. The next twelve months will see a lot of CFOs discover that their AI line item is moving from $20-a-seat to a variable-cost item that needs the same FinOps discipline they apply to cloud. Most companies don’t have the muscle for that yet.

Four Bets

SOTA token prices stay flat or drift up, not down. The 10x-per-year decline applies to the model class one tier below the current frontier. Fable 5 at $10 input is, I think, the new ceiling that holds for a while. If Mythos 5 ships at $15 or $20 input next, it’ll feel surprising in the way Fable 5’s pricing felt surprising last week - and it’ll set a new ceiling that holds for a year or two.

Flat-rate dies on the frontier and survives on commodity. The Copilot move is the template, and Anthropic just demonstrated the upgrade path: subscription holds the commodity tier, frontier moves to metered. I expect Cursor, Windsurf, Cognition, and Replit to follow the same template within six months. The marketing story will be “premium minutes” or “frontier credits” - same instrument, friendlier name. The legacy flat-rate “unlimited Sonnet” or “unlimited Pro” tier will continue, because the unit economics there actually work.

Most usage shifts down the model stack while total agent count goes up. Engineering teams are about to discover model routing the way they discovered caching ten years ago. Eighty percent of agent calls don’t need Fable 5 or Opus 4.8 - they need a model that can read a spec and emit a function. Sonnet 4.6 and Haiku 4.5 handle that at a fraction of the cost. The teams that do this right will see per-task cost fall faster than headline pricing suggests. The teams that don’t will see their AI bill grow until it triggers a CFO conversation. Net-net, total tokens consumed per engineer goes up, because the number of parallel agents per engineer goes up. Lower per-task cost, more tasks - Jevons again.
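A back-of-the-envelope version of that routing effect, assuming hypothetical per-call costs of $0.50 on the frontier shelf and $0.05 on the commodity shelf:

```python
def blended_cost_per_call(commodity_share: float,
                          commodity_cost: float,
                          frontier_cost: float) -> float:
    """Average cost per call when `commodity_share` of calls go to the cheap model."""
    if not 0.0 <= commodity_share <= 1.0:
        raise ValueError("commodity_share must be between 0 and 1")
    return commodity_share * commodity_cost + (1.0 - commodity_share) * frontier_cost


# Hypothetical per-call costs in dollars.
FRONTIER_COST = 0.50
COMMODITY_COST = 0.05

# Everything on the frontier model versus 80% routed down the stack.
all_frontier = blended_cost_per_call(0.0, COMMODITY_COST, FRONTIER_COST)
routed = blended_cost_per_call(0.8, COMMODITY_COST, FRONTIER_COST)
```

With these made-up numbers the blended cost per call drops from $0.50 to about $0.14, without any change in headline pricing. That saving is what funds the extra agents.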

Company AI spend grows 30 to 50% year over year, and the budget line shifts. Writer’s enterprise survey put 65% of companies increasing AI budgets in 2026 by a median of 22%. Gartner expects 40% of enterprise applications to embed task-specific agents by end of 2026. The growth is real. What changes is where it lands. Procurement starts moving the line from “subscription seats” to “infrastructure spend.” Finance starts asking the same allocation questions they ask about Snowflake and AWS. TokenOps becomes a discipline the way FinOps did over the last decade - with allocation, alerting, drift detection, and quarterly reviews. The companies that already have FinOps maturity will adapt fast. The companies that don’t will spend the first half of 2027 building it.

What This Means If You Build

If you build product that depends on AI, the next twelve months will reward two specific moves.

The first move is to know which shelf you’re building on. If your product is good-enough quality at low cost - automated email categorization, basic code completion, internal data lookup - you’re on the commodity shelf, and your input costs will keep falling. Your pricing model should mirror that. Per-user, predictable, subscription-style billing works. The commodity shelf is the right place to build durable margin, because input costs work in your favor over time.

If your product depends on frontier capability - the hardest reasoning, the longest contexts, the agents that actually finish work - you’re on the frontier shelf, and your input costs will stay stable or drift up. Your pricing model needs to mirror that too. Usage-based, tiered, or premium-credit billing matches what’s actually happening to your COGS. If you sell a flat-rate plan on top of frontier inference, you’re underwriting Jevons at your own expense.

The middle is the dangerous shelf. If your product is positioned as “as good as the frontier was a year ago, at a discount to the frontier today” - you’re going to spend 2027 explaining to customers why they shouldn’t just use the commodity tier or pay for the new frontier. The market for “almost-frontier-at-a-discount” gets very small once the gap between the two real curves opens up.

The second move is to build the muscle for model routing now, before the bill forces you to. The current generation of agent frameworks treats model selection as a constant. The next generation will treat it as a parameter - a routing layer that picks the cheapest model that can handle the task, falls back to a more capable model when it can’t, and keeps a per-customer or per-feature budget. This isn’t novel architecture. It’s the same pattern as multi-cloud routing for compute, multi-region routing for latency, multi-CDN routing for delivery. It’s coming for inference next.
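Here is a minimal sketch of what such a routing layer could look like. It is not any framework's API: `Model`, `Router`, and `BudgetExceeded` are hypothetical names, the flat per-call cost is a simplification, and a `run` callable that returns None stands in for whatever check tells you the model could not handle the task (a validator, a test suite, a grader).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Model:
    name: str
    cost_per_call: float                  # dollars; hypothetical flat cost per attempt
    run: Callable[[str], Optional[str]]   # returns None when the task was not handled


class BudgetExceeded(Exception):
    """Raised when the next attempt would overspend the budget."""


class Router:
    def __init__(self, models: list[Model], budget: float):
        # Cheapest first, so escalation only happens after a cheaper model fails.
        self.models = sorted(models, key=lambda m: m.cost_per_call)
        self.remaining = budget           # per-customer or per-feature budget

    def route(self, task: str) -> tuple[str, str]:
        """Return (model name, result) from the cheapest model that handles the task."""
        for model in self.models:
            if model.cost_per_call > self.remaining:
                raise BudgetExceeded(f"cannot afford {model.name}")
            # Failed attempts still cost money, so charge before running.
            self.remaining -= model.cost_per_call
            result = model.run(task)
            if result is not None:
                return model.name, result
        raise RuntimeError("no model could handle the task")
```

One design choice worth noting: the sketch charges for failed attempts, because the provider bills them too. That makes the cost of a bad routing guess visible in the budget instead of hiding it.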

The long-term win goes to the teams that learn to pick the right model for each kind of work. At Finsi we route the bulk of mid-difficulty work to Sonnet, anything that needs a large context window or expensive reasoning to Opus, review steps to GPT, and text generation to Gemini. Routing is the easy part. The hard part is the evals - not “did the agent return the correct result,” but “what’s the cost per correct result, and would the cheaper model handle this class of task.” Evals turn routing from a guess into a knob you can actually adjust on the price-quality curve.
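The metric itself is simple to compute once you have graded eval runs. A sketch with hypothetical numbers (the per-attempt costs and pass counts below are invented for illustration, not Finsi data):

```python
def cost_per_correct_result(cost_per_attempt: float, attempts: int, correct: int) -> float:
    """Total spend on an eval set divided by the number of correct results."""
    if correct <= 0:
        return float("inf")   # the model never succeeds on this class of task
    return (cost_per_attempt * attempts) / correct


# Hypothetical eval on one class of task, 100 attempts per model.
cheap_model = cost_per_correct_result(cost_per_attempt=0.05, attempts=100, correct=70)
frontier_model = cost_per_correct_result(cost_per_attempt=0.50, attempts=100, correct=95)
```

In this example the cheaper model is wrong more often and still costs far less per correct result. Whether that trade is acceptable depends on what a wrong answer costs downstream, and this sketch does not price that in.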

The team that owns the routing layer in a company will be in the same position the SRE team occupied in 2015 - the people who can explain why the bill went up, why it should go down, and what trade-offs the business is making by running on the current configuration.

The Choice

The week that Fable 5 shipped, I was on a call with a founder building an agent product for a vertical market. He was looking at the new pricing and trying to decide whether to lock his stack to frontier or to default to commodity with frontier as a fallback. He’d been assuming the question was about quality. After the week’s news, he started asking it as a question about unit economics.

I think that’s the actual signal underneath the three events. The question wasn’t whether AI was getting more expensive. The question was which AI you were building on, and what the unit economics of that shelf would look like in twelve months.

The cheap-AI era isn’t ending. It’s still there, on the commodity shelf, getting cheaper. What’s ending is the period where you didn’t have to choose, because everything came in one bundle at a flat price.

You have to choose now.

I don’t know yet which shelf Finsi ends up on for every workload. We run cascading agents in production today, and we’ll probably end up using both - frontier on the hardest planning steps, commodity on the cheap, parallel work. The decision is more interesting than it was six months ago, because the answer used to be “whichever is the best available” and now it’s “whichever has the right unit economics for the task.”

The week of June 9 was when that question got real.