Inference's Subsidy Hangover: How Xiaomi's 1000 TPS Exposes the Cost Fiction at the Heart of US AI

This week delivered two stories that cannot be understood in isolation. On Monday, Xiaomi and TileRT announced that their MiMo-V2.5-Pro-UltraSpeed, a trillion-parameter model, achieves over 1000 tokens per second on a single 8-GPU commodity node. On Tuesday, GitHub finalized its switch to token-based billing, with developers reporting their Copilot costs jumping from $29/month to $750/month or more.

The industry treats these as separate facts: one is a Chinese engineering achievement, the other a necessary pricing correction for a service that was never profitable. The story goes that inference is inherently expensive, token billing is simply the market finding its level, and US labs are right to pass costs to customers.

That story is not just incomplete. It is a deliberate inversion of cause and effect.

Inference Costs Are Inevitable

The conventional wisdom, repeated in every earnings call and pricing announcement, goes like this: serving frontier models is extraordinarily expensive. Each query burns through GPU compute measured in seconds. The capex required to train these models runs into the hundreds of billions. Token-based billing is the only sustainable path forward. GitHub Copilot’s flat-rate days were an unsustainable subsidy, and the correction hurts but is honest.

This framing makes the current pricing wave feel like a law of physics. OpenAI charges for GPT-5-tier models by the token. Anthropic does the same. The message is unified: this stuff costs what it costs, and you were getting a deal before.

But the physics argument collapses the moment you look at what Xiaomi actually shipped.

What Xiaomi and TileRT Actually Proved

MiMo-V2.5-Pro-UltraSpeed is a 1.02-trillion-parameter MoE model with 42 billion active parameters. On a single node of 8 commodity GPUs — not Cerebras wafer-scale, not Groq custom silicon, not H100 clusters — it generates text at 1000 tokens per second. Xiaomi is offering access at 3x the cost of its standard model for roughly 10x the speed.

That is a pricing structure that says “a 10x speedup costs only 3x,” not “inference is inherently expensive.”
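The arithmetic behind that pricing claim can be sketched in a few lines. The 3x and 10x multipliers come from the figures above; the job size and the 100 tokens-per-second baseline are hypothetical values chosen only to make the trade-off concrete, and the sketch does not assume anything about whether Xiaomi bills per token or per request.

```python
# Figures from the article: the fast tier costs about 3x the standard tier
# and runs at roughly 10x the speed.
PRICE_MULTIPLIER = 3.0
SPEED_MULTIPLIER = 10.0

# Hypothetical workload and baseline speed, chosen only for illustration.
JOB_TOKENS = 1_000_000
STANDARD_TPS = 100.0  # assumption: one tenth of the 1000 tps headline figure


def price_per_unit_speed(price_multiplier: float, speed_multiplier: float) -> float:
    """Relative price paid for each unit of relative speed."""
    return price_multiplier / speed_multiplier


def wall_clock_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_second


ratio = price_per_unit_speed(PRICE_MULTIPLIER, SPEED_MULTIPLIER)
standard_time = wall_clock_seconds(JOB_TOKENS, STANDARD_TPS)
fast_time = wall_clock_seconds(JOB_TOKENS, STANDARD_TPS * SPEED_MULTIPLIER)

print(f"price per unit of speed: {ratio:.2f}x")
print(f"standard tier: {standard_time:.0f} s, fast tier: {fast_time:.0f} s")
```

Under these assumptions the buyer pays three times as much and waits one tenth as long, which is the sense in which speed is priced well below proportionally.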

The technical stack reveals where the real cost leverage lives.

FP4 quantization on MoE experts only. The MiMo team applied MXFP4 quantization exclusively to the mixture-of-experts layers, which hold the vast majority of parameters, while keeping attention projections and other modules at higher precision. This is not a blanket compression. It is a surgical intervention: benchmark scores stay within 2% of the FP8 baseline. On SWE-Bench Verified, for example, the FP4 model scores 77.4% compared to 78.9% for FP8, a 1.5-point drop for a 2x reduction in expert bit width. The model size drops dramatically, and memory-bandwidth pressure, the real bottleneck at inference time, shrinks along with it.
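A rough sketch of what selective quantization does to weight storage follows. The 1.02-trillion parameter count is from the article. The 95% expert share is a hypothetical stand-in for "the vast majority", and treating the non-expert modules as 8-bit is also an assumption, since the article only says they stay at higher precision. The 4.25 bits per element reflects the MXFP4 layout of 4-bit values with one shared 8-bit scale per block of 32 elements.

```python
TOTAL_PARAMS = 1.02e12      # from the article
EXPERT_FRACTION = 0.95      # hypothetical: share of parameters in MoE experts
FP8_BITS = 8.0
MXFP4_BITS = 4.0 + 8.0 / 32.0  # 4-bit elements plus one 8-bit scale per 32 elements


def weight_bytes(total_params: float, expert_fraction: float,
                 expert_bits: float, other_bits: float) -> float:
    """Bytes needed to store weights when experts and other modules use different widths."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return (expert_params * expert_bits + other_params * other_bits) / 8.0


# Baseline: everything stored at 8 bits.
fp8_total = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, FP8_BITS, FP8_BITS)
# Selective: experts at MXFP4, everything else assumed to stay at 8 bits.
mixed_total = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, MXFP4_BITS, FP8_BITS)

print(f"all FP8:        {fp8_total / 1e9:.0f} GB")
print(f"experts MXFP4:  {mixed_total / 1e9:.0f} GB")
print(f"ratio:          {mixed_total / fp8_total:.3f}")
```

Because decode is largely bound by how many weight bytes must be read per step, a smaller footprint translates fairly directly into less memory traffic, although the exact saving depends on the real expert share and on which experts are activated.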

Memory pressure is one bottleneck. Throughput is another, and DFlash attacks it from a completely different angle.

DFlash block-diffusion speculative decoding. Conventional speculative decoding generates one draft token at a time, then verifies. DFlash fills an entire block of 8 masked positions in a single forward pass, then lets the backbone verify them in one step. Average acceptance length in coding scenarios hits 6.30 tokens — meaning the backbone confirms over 6 of every 8 draft tokens per verification round. The draft model uses sliding window attention, so its compute is constant regardless of context length. The result is not a marginal speedup. It is a structural change in how many tokens the model produces per unit of compute.
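To make "acceptance length" concrete, here is a minimal sketch of greedy block verification. It is a simplified illustration, not DFlash's actual algorithm: the draft proposes a block of tokens, the backbone's own predictions for the same positions are compared against it, and the longest matching prefix is accepted. The token IDs are toy data, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the backbone's own predictions."""
    accepted = 0
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            break
        accepted += 1
    return accepted


def mean_acceptance(rounds: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average number of draft tokens accepted per verification round."""
    lengths = [accepted_prefix_len(draft, target) for draft, target in rounds]
    return sum(lengths) / len(lengths)


def backbone_passes_per_token(mean_accepted: float) -> float:
    """Backbone forward passes needed per accepted token (draft cost ignored)."""
    return 1.0 / mean_accepted


# Toy rounds with a block size of 8: one full match, one mismatch at position 5.
toy_rounds = [
    ([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8]),
    ([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 9, 6, 7, 8]),
]
toy_mean = mean_acceptance(toy_rounds)
print(f"toy mean acceptance length: {toy_mean:.2f}")

# Using the 6.30 figure reported for coding scenarios.
reported = backbone_passes_per_token(6.30)
print(f"backbone passes per token at 6.30: {reported:.3f}")
```

Plain autoregressive decoding needs one backbone pass per token, so an acceptance length of 6.30 implies roughly one sixth as many backbone passes per token, before accounting for the cost of the draft model itself.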

TileRT’s persistent execution model. This is the systems-side breakthrough that deserves more attention than it has received. Traditional inference frameworks decompose a model into isolated operators and launch them one by one. Every launch carries host-side overhead, synchronization costs, and global-memory round-trips. At 1000 tps, where each operator’s lifecycle is microseconds, these execution gaps dominate. TileRT’s persistent engine keeps the entire compute pipeline resident on the GPU, uses warp specialization to decouple data movement from computation, and prefetches continuously so the GPU is never idle waiting for the next tile.
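A back-of-the-envelope sketch shows why per-launch gaps matter at this speed. The 1000 tokens-per-second target and the 6.30 tokens-per-step figure come from the article (treating acceptance length as tokens produced per backbone step is an assumption). The launch count and per-launch overhead are hypothetical placeholders, not measurements of TileRT or of any other framework.

```python
TARGET_TPS = 1000.0          # from the article
TOKENS_PER_STEP = 6.30       # assumption: acceptance length equals tokens per backbone step
LAUNCHES_PER_STEP = 1000     # hypothetical: separate operator launches per step
OVERHEAD_US_PER_LAUNCH = 3.0 # hypothetical: host and sync gap per launch, in microseconds


def step_budget_us(tokens_per_second: float, tokens_per_step: float) -> float:
    """Microseconds available for one backbone step at a target decode rate."""
    return tokens_per_step / tokens_per_second * 1_000_000


def launch_overhead_fraction(budget_us: float, launches_per_step: int,
                             overhead_us_per_launch: float) -> float:
    """Share of the per-step time budget spent in launch gaps rather than compute."""
    return launches_per_step * overhead_us_per_launch / budget_us


budget = step_budget_us(TARGET_TPS, TOKENS_PER_STEP)
fraction = launch_overhead_fraction(budget, LAUNCHES_PER_STEP, OVERHEAD_US_PER_LAUNCH)

print(f"per-step budget: {budget:.0f} us")
print(f"launch gaps consume {fraction:.0%} of the budget")
```

With these placeholder numbers, nearly half of each step would go to gaps between operators. A persistent engine that stays resident on the GPU aims to drive the launch count toward one, which removes most of that term regardless of the exact per-launch cost.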

The combined result: a 1T model delivering 1000+ tps on hardware you can buy today, without special interconnects or custom silicon. The 8-GPU node is the key detail. Cerebras and Groq achieve extreme speeds through hardware specialization. Xiaomi and TileRT achieved comparable speed through software co-design.

Two Business Models Collide

The dominant US AI labs made a specific bet: that the path to better models requires massive capex on infrastructure, and that inference costs can be recovered through pricing power. They raised hundreds of billions, built clusters of 100,000+ GPUs, and structured their business models around converting that capex into per-token revenue. From that vantage point, token billing is essential. GitHub Copilot was never going to sustain $29/month for unlimited premium access when each agentic session could burn through millions of tokens. The $750/month developer is not an anomaly — it is the logical endpoint of the business model.

Chinese competitors made a different bet. DeepSeek optimized Mixture-of-Experts architectures for efficiency from the ground up. Xiaomi and TileRT pushed model-system co-design to its logical extreme, achieving at the system level what others try to achieve at the hardware level. Their cost structure is fundamentally different because their engineering approach starts from a different constraint: commodity hardware, maximum efficiency.

The gap is not in model quality. MiMo-V2.5-Pro scores competitively with frontier Western models on SWE-Bench Verified, Humanity’s Last Exam, and agent benchmarks. The gap is in the cost of delivery, and it is widening at precisely the moment US labs need to increase revenue to justify their capital commitments.

That collision is about to produce a reckoning.

The Inference Cost Narrative Is About to Flip

Not because Chinese models are cheaper — that much has been true since DeepSeek V2. The flip is that US labs are raising prices at the exact moment the technical foundation for cheap inference reaches commodity hardware. Every price increase announced this month is a signal of financial distress masked as market discipline. Every token-billing conversion is a subsidy withdrawal, not a cost-reflective pricing update.

Xiaomi’s 1000 tps is not a benchmark to celebrate and move past. It is a demonstration that a trillion-parameter model can run on a single 8-GPU node at interactive speeds when the software stack is built for efficiency rather than margin recovery.

The same techniques — selective FP4 quantization, block-level speculative decoding, persistent execution models — are not proprietary to Xiaomi. They are published in open papers (DFlash is ICML 2026), running on open-source frameworks (SGLang already supports the configuration), and available for anyone to deploy.

The US labs that built their business models on $10,000+ H100 clusters have a structural cost disadvantage that no amount of token billing can fix. The labs that treat inference efficiency as a first-class engineering problem — not a pricing problem — will own the next phase of this market.

The bill is coming due. But it is not the developers who will pay it.
