The Cache-Aware Pricing Revolution: Why LLM 'Sticker Prices' Are Now Meaningless

The $/M token number plastered across every LLM pricing page has become a distraction. Two models with identical sticker prices can differ in effective cost by a factor of ten or more — and the cheaper-on-paper model is often the more expensive one in practice. The variable that determines real cost is not the model itself but the cache architecture of the provider serving it.

Cache-aware pricing is the most consequential and least discussed shift in LLM economics since the API model was invented.

The Hy3 Precedent That Isn’t

The evidence is hiding in plain sight on OpenRouter’s model rankings. In May 2026, Tencent’s Hy3 preview, a model with mediocre benchmark scores and a restrictive license, climbed to the top of the usage charts, rivaling DeepSeek V4 Flash and surpassing Claude. The surface explanation was price: Hy3 preview lists at $0.063/1M input tokens, cheaper than the $0.10/1M for DeepSeek V4 Flash on OpenRouter’s listing page.

But the sticker price is a mirage.

DeepSeek V4 Flash served directly by DeepSeek’s own API carries an effective input price of $0.018/1M tokens. The same model served by other providers on OpenRouter costs between $0.05 and $0.10/1M effective. Hy3 preview, despite its lower listed price, has an effective cost of $0.034/1M — nearly double DeepSeek V4 Flash via DeepSeek’s infrastructure.

The Hy3 story was never about model quality. It was about caching economics and a single large application tuned to SiliconFlow’s specific cache hit profile. The model that appears cheaper on the ranking page is actually more expensive in the one number that matters: the effective price after cache hits are accounted for.

Why 98% of the Cost Is Now Input Tokens

The stateless API call is a fiction that hasn’t kept pace with how LLMs are actually used. Every conversation turn, every agent tool call, every context window extension resends the entire conversation history. In agentic workflows — which now dominate LLM usage — a single session can accumulate hundreds of thousands of tokens of prior context.

The data from OpenRouter confirms this: 98% of all tokens processed across their platform are input tokens. Only 2% are output.

This ratio transforms the economics of inference. When 98% of your tokens are input, and most of them are the same conversational prefix reprocessed over and over, the price that matters is not the generic per-token rate but the cache read cost: the discounted per-token rate you pay when the provider reuses previously computed KV cache states instead of recomputing them from scratch.
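To make the resend pattern concrete, here is a minimal sketch of how input tokens accumulate over a session. It assumes a simplified model that is not taken from OpenRouter's data: every turn appends a fixed number of new tokens (user message, tool results, and the previous reply combined) and resends the full history. The function names and the token counts are illustrative.

```python
def session_input_tokens(system_tokens, tokens_per_turn, turns):
    """Total input tokens sent when every turn resends the full history.

    Assumption: each turn appends `tokens_per_turn` new tokens to the
    history, then the whole history is submitted as input.
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # new content appended this turn
        total += history            # the entire history is sent again
    return total


def repeated_fraction(system_tokens, tokens_per_turn, turns):
    """Share of input tokens that had already been sent on an earlier turn."""
    total = session_input_tokens(system_tokens, tokens_per_turn, turns)
    fresh = system_tokens + tokens_per_turn * turns  # tokens sent for the first time
    return (total - fresh) / total


if __name__ == "__main__":
    # Hypothetical session: 2,000-token system prompt, 1,000 new tokens per turn.
    total = session_input_tokens(2_000, 1_000, 50)
    share = repeated_fraction(2_000, 1_000, 50)
    print(f"input tokens sent: {total:,}")
    print(f"already-seen share: {share:.1%}")
```

Under these assumptions, a 50-turn session sends about 1.4M input tokens, and roughly 96% of them are repeats of an earlier prefix. That repeated share is the portion a prefix cache can serve at the cache read rate.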

The industry standard for cache reads is 10% of the base input price. OpenAI, Anthropic, and Google all converge on this figure. Anthropic’s prompt caching documentation shows Claude Opus 4.7 charging $5/MTok for base input and $0.50/MTok for cache reads — exactly 10%. OpenAI’s latest models follow the same pattern.

On DeepSeek’s API pricing page, V4 Flash charges $0.0028/MTok for cache hits. That is 2% of its base input price of $0.14/MTok. The same architecture-driven discount applies to DeepSeek V4 Pro, where cache reads cost just 0.83% of the base rate.

The gap between 10% and 2% is not a pricing decision. It’s a structural consequence of how the models handle attention.

What Hybrid Attention Does to the Cost Curve

DeepSeek V4’s innovation is a hybrid attention mechanism that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), as detailed in the DeepSeek V4 technical report. CSA compresses KV entries by 4x along the sequence dimension and uses an FP4 lightning indexer to select the top-k compressed blocks per query. HCA compresses by 128x and runs dense attention over the compressed stream. The layers alternate between these two mechanisms.

The results are dramatic. At 1M tokens, DeepSeek V4-Pro requires 27% of the per-token FLOPs of V3.2 and 10% of the KV cache memory. V4-Flash drops these to 10% of FLOPs and 7% of KV cache. Compared with a standard grouped-query attention architecture, the total KV cache is roughly 2% of the size.

This is not an incremental efficiency gain. It’s a structural shift in what it costs to serve a cached token. DeepSeek’s cache read pricing is not a loss leader. It is the natural price floor of a model designed from the ground up to minimize KV cache overhead.

Every other provider, even those serving the same open-weight model, must run V4-Flash on their own hardware with their own attention kernels. They inherit the CSA/HCA architecture, which is embedded in the model itself, but not DeepSeek’s tuned serving stack, so each provider’s kernels, cache hit rate, and KV storage strategy determine its effective cost.

The Effective Price Table Changes Everything

OpenRouter’s effective pricing tables now account for cache hit rates alongside cache read costs. The numbers reveal a market that looks nothing like the sticker-price comparison:

Provider	Stated Input Price	Cache Read Cost (% of base input)	Effective Input Price
DeepSeek (direct)	$0.14/MTok	2%	$0.018/MTok
Provider A (V4 Flash)	$0.10/MTok	20%	$0.05/MTok
Provider B (V4 Flash)	$0.10/MTok	50%	~$0.08/MTok
SiliconFlow (Hy3)	$0.063/MTok	44%	$0.034/MTok

For V4 Flash, that is the same model with the same benchmark scores and the same output quality, yet the effective price spans a roughly 4.4x range (from $0.018 to about $0.08 per MTok) depending on which provider’s cache infrastructure you are routed through.
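The table's columns can be tied together with a simple blended-rate formula. The sketch below assumes the effective price is a weighted average of the miss price (the base input rate) and the hit price (base rate times the cache read percentage). How OpenRouter actually computes its effective figures is not stated here, so treat the formula, the function names, and the implied hit rates as assumptions rather than published values.

```python
def effective_input_price(base_price, cache_read_ratio, hit_rate):
    """Blended $/MTok: misses pay the base price, hits pay base * ratio."""
    miss_part = (1.0 - hit_rate) * base_price
    hit_part = hit_rate * base_price * cache_read_ratio
    return miss_part + hit_part


def implied_hit_rate(base_price, cache_read_ratio, effective_price):
    """Invert the blended formula to recover the hit rate it implies."""
    return (1.0 - effective_price / base_price) / (1.0 - cache_read_ratio)


# Rows from the table above: (stated $/MTok, cache read ratio, effective $/MTok).
TABLE = {
    "DeepSeek (direct)": (0.14, 0.02, 0.018),
    "Provider A (V4 Flash)": (0.10, 0.20, 0.05),
    "Provider B (V4 Flash)": (0.10, 0.50, 0.08),
    "SiliconFlow (Hy3)": (0.063, 0.44, 0.034),
}

if __name__ == "__main__":
    for name, (base, ratio, effective) in TABLE.items():
        rate = implied_hit_rate(base, ratio, effective)
        print(f"{name}: implied cache hit rate {rate:.1%}")
```

If the weighted-average assumption holds, the table implies hit rates of roughly 89% for DeepSeek direct, 62.5% for Provider A, 40% for Provider B, and 82% for SiliconFlow. The effective price therefore reflects two separate levers: how cheaply a provider prices a hit, and how often a hit occurs.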

This changes the optimization problem entirely. The common advice to “pick the cheapest model for your task” is now incomplete. You need to pick the cheapest provider for your cache hit profile. A model with a higher sticker price but a lower cache read cost can be dramatically cheaper than a superficially cheaper model with poor caching economics.
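A rough way to act on this is to compare offers at your own expected hit rate and to find the hit rate at which two offers cost the same. The sketch below uses the stated prices and cache read percentages from the table, and it assumes, unrealistically, that you would see the same hit rate at every provider. The helper names are illustrative, not part of any provider's API.

```python
def effective_price(base_price, cache_read_ratio, hit_rate):
    """Blended $/MTok for a given cache hit rate."""
    return base_price * ((1.0 - hit_rate) + hit_rate * cache_read_ratio)


def cheapest_offer(offers, hit_rate):
    """Name of the offer with the lowest blended price at this hit rate."""
    return min(offers, key=lambda name: effective_price(*offers[name], hit_rate))


def break_even_hit_rate(base_a, ratio_a, base_b, ratio_b):
    """Hit rate at which two offers cost the same, or None if they never cross."""
    slope_a = base_a * (1.0 - ratio_a)  # savings per unit of hit rate, offer A
    slope_b = base_b * (1.0 - ratio_b)  # savings per unit of hit rate, offer B
    if slope_a == slope_b:
        return None
    rate = (base_a - base_b) / (slope_a - slope_b)
    return rate if 0.0 <= rate <= 1.0 else None


# (stated $/MTok, cache read ratio) taken from the table above.
OFFERS = {
    "DeepSeek (direct)": (0.14, 0.02),
    "Provider A (V4 Flash)": (0.10, 0.20),
    "Provider B (V4 Flash)": (0.10, 0.50),
}

if __name__ == "__main__":
    for rate in (0.5, 0.9):
        print(f"hit rate {rate:.0%}: cheapest is {cheapest_offer(OFFERS, rate)}")
    crossover = break_even_hit_rate(0.14, 0.02, 0.10, 0.20)
    print(f"DeepSeek direct beats Provider A above a hit rate of about {crossover:.0%}")
```

With these numbers, the higher sticker price wins once the hit rate passes roughly 70%. Below that, the cheaper sticker price is still the better deal, which is why the hit profile has to be part of the comparison.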

Three Numbers That Matter More Than the Sticker

When evaluating LLM costs in 2026, three metrics replace the obsolete $/M token comparison:

Cache read cost as a percentage of base input. DeepSeek’s 2% is the new floor. The industry standard of 10% is the ceiling for well-optimized providers. Any provider charging more than 10% for cache reads is either running inefficient infrastructure or marking up cache hits as a profit center.
Cache hit rate for your workload pattern. Agentic workloads that resend large system prompts and conversation histories achieve the highest hit rates. Batch processing that rotates through many distinct prompts achieves lower hit rates. The effective price is a weighted average of cache hit and cache miss costs — and the weights depend entirely on your usage pattern.
Provider lock-in for cache continuity. The cache only persists within a single provider’s infrastructure. If OpenRouter routes your request to a different provider mid-conversation — which happens during failover or load balancing — the cache is invalidated and the next request pays the full miss price. This makes provider stability a pricing variable.
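The cost of losing cache continuity can be sketched with a small simulation. It assumes a simplified cache model that is not drawn from any provider's documentation: the whole previous prompt is a cache hit if the previous turn went to the same provider, and nothing is cached otherwise. Provider names, prices, and token counts are hypothetical.

```python
def session_cost(turn_providers, prices, system_tokens, tokens_per_turn):
    """Dollar cost of a session given which provider served each turn.

    prices maps provider name -> (base $/MTok, cache read ratio).
    Assumption: a provider's cache is warm only if it served the previous turn.
    """
    cost = 0.0
    history = system_tokens
    previous_provider = None
    cached_tokens = 0
    for provider in turn_providers:
        base, ratio = prices[provider]
        prompt = history + tokens_per_turn
        hits = cached_tokens if provider == previous_provider else 0
        misses = prompt - hits
        cost += (misses * base + hits * base * ratio) / 1_000_000
        cached_tokens = prompt  # this provider now holds the full prefix
        previous_provider = provider
        history = prompt
    return cost


# Two hypothetical providers with identical pricing, to isolate the switch itself.
PRICES = {"provider_a": (0.10, 0.20), "provider_b": (0.10, 0.20)}

if __name__ == "__main__":
    stable = session_cost(["provider_a"] * 10, PRICES, 2_000, 1_000)
    rerouted = session_cost(["provider_a"] * 5 + ["provider_b"] * 5, PRICES, 2_000, 1_000)
    print(f"stable routing:   ${stable:.5f}")
    print(f"one mid-session switch: ${rerouted:.5f}")
    print(f"penalty: {rerouted / stable - 1:.0%}")
```

In this toy session a single switch at turn six raises the total by a little over a fifth, even though both providers charge identical rates. The penalty grows with the length of the prefix that has to be recomputed.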

The Counter-Narrative That Everyone Is Missing

The dominant story about DeepSeek’s low prices is that Chinese labor and compute costs create a structural advantage. This framing is wrong.

DeepSeek’s pricing advantage is architectural, not geographic. The CSA/HCA attention mechanism is documented in their technical report and available in the open-weight model. Any provider with enough engineering talent could optimize their serving stack to approach DeepSeek’s cache economics. The fact that nobody has done so — that 13 providers serving V4-Flash all charge between 20% and 50% for cache reads while DeepSeek charges 2% — suggests the gap is not about labor arbitrage.

It is about the difference between the model creator and the model reseller. DeepSeek knows exactly where every efficiency lever is because they designed the architecture. Third-party providers are running a model they did not build, on hardware they did not tune it for, with attention kernels that cannot exploit the full CSA/HCA pipeline.

The same dynamic explains why Hy3 preview’s effective price is higher than DeepSeek V4 Flash despite a lower sticker price. SiliconFlow, the sole provider of Hy3 on OpenRouter, charges 44% of its base input price for cache reads. Whether this reflects infrastructure costs, margin strategy, or low cache hit rates is unknowable from outside, but the result is that a model designed to be cheap ends up costing more than a more capable model served by its creator.

What This Means for Practitioners

The immediate takeaway is practical: if you are using OpenRouter’s automatic routing for DeepSeek V4 Flash, you may be paying 2x to 5x more than necessary without knowing it. The effective price varies by provider, and the default routing may not send your requests to the cheapest one. If you are a heavy agentic user — resending the same system prompt and conversation history across many turns — the savings from routing exclusively through DeepSeek’s own API can be substantial.
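To put a number on the potential overpayment, the sketch below applies the effective V4 Flash prices quoted earlier to a monthly input volume. The volume of 5 billion tokens is a hypothetical figure chosen for illustration, and the calculation assumes your workload would see the same effective prices as the ones listed, which depends on your own cache hit rate.

```python
# Effective $/MTok for V4 Flash, as quoted in the effective price table.
EFFECTIVE_PRICES = {
    "DeepSeek (direct)": 0.018,
    "Provider A (V4 Flash)": 0.05,
    "Provider B (V4 Flash)": 0.08,
}


def monthly_input_cost(tokens_per_month, price_per_mtok):
    """Dollar cost of a month's input tokens at a given $/MTok."""
    return tokens_per_month / 1_000_000 * price_per_mtok


def overpayment_multiple(price_per_mtok, cheapest_price_per_mtok):
    """How many times more you pay than at the cheapest listed price."""
    return price_per_mtok / cheapest_price_per_mtok


if __name__ == "__main__":
    volume = 5_000_000_000  # hypothetical: 5B input tokens per month
    cheapest = min(EFFECTIVE_PRICES.values())
    for name, price in EFFECTIVE_PRICES.items():
        cost = monthly_input_cost(volume, price)
        multiple = overpayment_multiple(price, cheapest)
        print(f"{name}: ${cost:,.2f}/month ({multiple:.1f}x the cheapest)")
```

At that volume the gap is about $90 versus $250 to $400 per month for input tokens alone, a multiple of roughly 2.8x to 4.4x.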

The LLM market is fragmenting along cache economics, not model quality. Two providers running the same model can have different effective costs because their cache hit rates and KV cache architectures differ. The cheapest model on a leaderboard may not be the cheapest model for your workload. And the most important pricing innovation of 2026 is not a new model — it is a new attention mechanism that compresses KV cache to 2% of the standard.

The bottom line: The era of comparing LLMs by a single per-token number is over. The number that matters is the one your bank account sees after the cache hits are tallied. And that number depends less on the model you choose than on the infrastructure of the provider you route through.
