The Great Unbundling: Why Nvidia's Crown Won't Fit the Agentic Future

The AI market's biggest blind spot is the gap between answer inference and agentic inference. Nvidia's premium-on-latency bet may miss the mark.

Cerebras is on track for the largest IPO of 2026 — demand is 20× oversubscribed. As Reuters reported on May 10, the company is raising its IPO price range to $150–$160 per share (up from $115–$125) and increasing its share count to 30 million. At the top of the range, Cerebras would raise roughly $4.8 billion.

The market is hungry for an Nvidia alternative, and Cerebras offers something genuinely different: a wafer-scale chip with 44 GB of on-chip SRAM and a blistering 21 PB/s of memory bandwidth. For workloads that fit in that SRAM, it’s extraordinarily fast.

But the Cerebras IPO frenzy is a symptom of a deeper structural shift the industry has not yet fully priced in. The real story isn’t about one company challenging Nvidia. It’s about the kind of inference the market is heading toward — and why that future looks radically different from the one Nvidia’s architecture was built to dominate.

Answer Inference vs. Agentic Inference

Ben Thompson drew a key distinction on Stratechery. He separates AI inference into two categories:

Answer inference is what happens when a human asks a question and expects a response. ChatGPT, Claude, Gemini — these are answer-inference systems. A person types a prompt, waits for a reply, reads it, and moves on. Latency matters because humans notice delays. Token speed directly affects user experience. This is the world GPUs were built for: fast compute, high-bandwidth memory, and tightly integrated systems that minimize the time between prompt and response.

Agentic inference is what happens when machines talk to machines. An agent running a multi-step task — querying a database, calling a tool, writing code, verifying output, and iterating — generates far more tokens than any human conversation ever will. More importantly, there’s no person waiting on the other end. The agent runs until it finishes. Latency is secondary. Completion is primary.

These two modes place different demands on hardware. Thompson’s framing is incisive:

Answer inference optimizes for token speed; agentic inference optimizes for memory.

And that distinction has profound implications for which chip architectures will win.

Where the GPU Premium Breaks Down

Nvidia’s dominance rests on a tight coupling of fast compute, high-bandwidth memory (HBM), and chip-to-chip networking that lets thousands of GPUs act as one system. This works well for training, where every GPU must share results before the next step begins. It works for answer inference, where latency is king.

But agentic inference breaks this model. Agents need context, state, and history. The key-value (KV) cache stores the attention keys and values computed for every earlier token, at every layer, so they do not have to be recomputed. It grows linearly, O(n), with the number of tokens in context, so a long-running agentic task builds up a massive, persistent memory footprint. Every tool call, every retrieved document, every intermediate computation adds to the pressure, and across many concurrent agents the total balloons far beyond what fits in GPU memory.
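To make the linear growth concrete, here is a minimal sketch of the usual back-of-the-envelope KV cache estimate. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) is an illustrative assumption, not a figure from the sources cited here, and the function name is hypothetical.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence.

    Each token stores one key vector and one value vector (the factor 2)
    per KV head, per layer. All defaults are illustrative assumptions.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


if __name__ == "__main__":
    # Doubling the context doubles the footprint: growth is linear in tokens.
    for n in (1_000, 50_000, 1_000_000):
        print(f"{n:>9,} tokens -> {kv_cache_bytes(n) / 1e9:.2f} GB")
```

Under these assumptions each token costs about 320 KiB, so the footprint per agent is modest but adds up quickly once many long-lived contexts have to stay resident at the same time.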

This is where Nvidia’s premium-on-latency strategy looks misaligned.

Consider the cost structure. Epoch AI’s latest analysis shows that high-bandwidth memory has grown from 52% to 63% of total AI chip component costs between Q1 2024 and Q4 2025. Logic dies — the actual compute — stayed flat at roughly 13%. In absolute terms, HBM spending across Nvidia, AMD, Google, and Amazon grew from $12 billion in 2024 to $32 billion in 2025. The industry is spending an ever-growing share of its silicon budget on memory, not compute.
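The figures above can be restated as simple ratios. This sketch uses only the numbers quoted in the paragraph; the function and key names are hypothetical.

```python
def hbm_summary():
    """Derive growth ratios from the figures quoted in the text."""
    share_start, share_end = 0.52, 0.63   # HBM share of component cost
    logic_share = 0.13                    # logic-die share, roughly flat
    spend_2024, spend_2025 = 12e9, 32e9   # HBM spending in USD

    return {
        "share_gain_points": round((share_end - share_start) * 100),
        "spend_multiple": round(spend_2025 / spend_2024, 2),
        "spend_growth_pct": round((spend_2025 / spend_2024 - 1) * 100),
        "hbm_to_logic_ratio": round(share_end / logic_share, 1),
    }


if __name__ == "__main__":
    for name, value in hbm_summary().items():
        print(f"{name}: {value}")
```

In other words, HBM spending grew roughly 2.7× in a year, and by Q4 2025 memory accounted for nearly five times the cost share of the logic dies.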

That tradeoff makes sense when every microsecond of latency matters. It makes less sense when the workload is an agent churning through a multi-hour task, where much of the KV cache could live on cheaper DRAM or even SSDs.

Nvidia Recognizes the Shift — and Is Hedging

Nvidia sees this. The company’s Dynamo inference framework, open-sourced earlier this year, is a clear signal that Nvidia understands the architecture problem. Dynamo disaggregates the prefill and decode phases of inference across separate GPUs, allowing each to be optimized independently. It includes an LLM-aware router that minimizes redundant computation, a KV cache manager that offloads to storage tiers, and a GPU planner that dynamically allocates resources.
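The idea behind a cache-aware router can be shown with a toy example. This is not Dynamo's API: the `Worker` class and the `route` function are hypothetical, and a real router would also weigh load and memory pressure. The sketch only shows the core heuristic of sending a request to the worker that already holds the longest matching prefix, so the fewest tokens need a fresh prefill.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    # Token-id sequences whose KV cache this worker already holds (assumed).
    cached_prefixes: list = field(default_factory=list)


def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt, workers):
    """Return (worker, reused_tokens) for the worker with the best cache overlap."""
    def best_overlap(worker):
        return max(
            (shared_prefix_len(prompt, p) for p in worker.cached_prefixes),
            default=0,
        )

    chosen = max(workers, key=best_overlap)  # ties go to the first worker
    reused = best_overlap(chosen)
    chosen.cached_prefixes.append(tuple(prompt))  # the prompt is now cached there
    return chosen, reused


if __name__ == "__main__":
    pool = [Worker("a", [(1, 2, 3)]), Worker("b", [(1, 2, 9, 9)])]
    worker, reused = route((1, 2, 3, 4), pool)
    print(worker.name, reused)
```

For an agent that keeps extending the same context turn after turn, this kind of routing means each step only pays prefill for the newly appended tokens.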

Nvidia is building software to compensate for the fact that its hardware was designed for a workload that is becoming a minority use case. Disaggregation, KV cache offload to CPU RAM, and topology-aware scheduling are all workarounds for the core tension: GPU memory is expensive and scarce, and agentic inference needs abundant, cheap memory.

Dynamo is a strategic hedge. But it reveals a structural limitation. You can bolt a memory hierarchy onto a GPU cluster, but you’re still paying for compute and HBM you don’t fully need. As Thompson notes, the decode phase of inference strands compute while it waits for memory reads; the prefill phase strands HBM while it burns through compute. The GPU is inherently mismatched for workloads where neither constraint dominates.
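The prefill/decode mismatch can be sketched with a roofline-style estimate. This is a simplification under stated assumptions: it counts only weight reads (ignoring KV cache traffic and attention), assumes about 2 FLOPs per parameter per token, and uses hypothetical hardware figures of 1 PFLOP/s and 3 TB/s that describe no specific product.

```python
PEAK_FLOPS = 1e15      # hypothetical accelerator: 1 PFLOP/s
MEM_BANDWIDTH = 3e12   # hypothetical accelerator: 3 TB/s


def arithmetic_intensity(tokens_per_weight_read, bytes_per_param=2):
    """FLOPs performed per byte of weights read.

    Each parameter contributes ~2 FLOPs (multiply + add) for every token
    processed while that parameter is held on chip.
    """
    return 2 * tokens_per_weight_read / bytes_per_param


def bottleneck(intensity, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH):
    """Classify a phase by comparing its intensity to the machine balance."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte
    return "compute-bound" if intensity >= machine_balance else "memory-bound"


if __name__ == "__main__":
    # Decode at batch size 1: one token per pass over the weights.
    print("decode :", bottleneck(arithmetic_intensity(1)))
    # Prefill: thousands of prompt tokens share each pass over the weights.
    print("prefill:", bottleneck(arithmetic_intensity(4096)))
```

With these numbers the chip can do roughly 333 FLOPs per byte it reads, so single-stream decode leaves most of the compute idle while a long prefill saturates it.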

The Memory Hierarchy Wins

The winning architecture for agentic inference will look more like a traditional data center node than a GPU supercomputer: a large pool of cheap DDR memory, “good enough” compute, and smart caching between tiers.

Consider a coding agent tasked with building a feature end-to-end. It retrieves documentation from a vector store, writes implementation files, runs the test suite, inspects error logs, iterates on broken tests, and commits a final summary. Across 15 minutes of tool calls, that single agent accumulates 50,000+ tokens of context. One such context may still fit in GPU HBM, but a cluster serving many of these agents at once, most of them idle while a tool runs, quickly holds more KV cache than HBM can keep resident. On a GPU cluster, this workload requires constant paging of KV cache between HBM and system RAM, stranding expensive compute cycles while the memory bus catches up. On a system designed for memory capacity (DDR DRAM for active context, CXL-attached memory pools for warm cache, and SSDs for cold logs) the same workload could run at a fraction of the cost with little difference in completion time.
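A toy version of that tiering is sketched below. The class name, the block-count capacities, and the least-recently-used demotion policy are all assumptions chosen for illustration. Real systems move fixed-size KV blocks with asynchronous copies and smarter policies.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy three-tier store: blocks demote DRAM -> CXL -> SSD in LRU order."""

    def __init__(self, dram_blocks, cxl_blocks):
        # SSD is treated as unbounded in this sketch.
        self.capacity = {"dram": dram_blocks, "cxl": cxl_blocks}
        self.tiers = {"dram": OrderedDict(), "cxl": OrderedDict(), "ssd": OrderedDict()}

    def _demote(self, tier, lower):
        # Push least-recently-used blocks down while the tier is over capacity.
        while len(self.tiers[tier]) > self.capacity[tier]:
            key, value = self.tiers[tier].popitem(last=False)
            self.tiers[lower][key] = value

    def put(self, key, value):
        # A block lives in exactly one tier; new or touched blocks go to DRAM.
        for tier in self.tiers.values():
            tier.pop(key, None)
        self.tiers["dram"][key] = value
        self._demote("dram", "cxl")
        self._demote("cxl", "ssd")

    def get(self, key):
        for tier in self.tiers.values():
            if key in tier:
                value = tier[key]
                self.put(key, value)  # promote to DRAM on access
                return value
        raise KeyError(key)

    def tier_of(self, key):
        for name, tier in self.tiers.items():
            if key in tier:
                return name
        return None


if __name__ == "__main__":
    store = TieredKVStore(dram_blocks=2, cxl_blocks=2)
    for block in "abcde":
        store.put(block, f"kv-{block}")
    print({block: store.tier_of(block) for block in "abcde"})
```

An agent that pauses for a slow tool call would see its blocks drift down the hierarchy and return to DRAM on the next access, which is acceptable when nobody is waiting on per-token latency.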

This is where the CPU reenters the picture. For agentic workloads, the speed of tool calling, which is typically CPU-mediated, may matter more than GPU teraflops. The memory hierarchy (KV cache in DRAM, embeddings on SSDs, logs in object stores) becomes the system’s backbone, not a bolt-on afterthought. Interconnect standards like CXL (Compute Express Link) enable this topology, letting CPUs access pools of memory that can be dynamically allocated across agents without the overhead of GPU memory management.

And here’s the part the market hasn’t absorbed: when latency is no longer the binding constraint, Nvidia’s relative advantage collapses. The company charges a premium for the tightest possible integration of compute, memory, and networking. That premium is justified when every millisecond of inference latency translates to user frustration or lost revenue. It is far harder to justify when the consumer is another machine running background tasks.

Who Wins in an Agentic World

This shift has three implications.

Hyperscalers Gain Leverage

Companies like Amazon, Google, and Microsoft already design their own chips (Trainium, TPU, Maia). For agentic inference, they can optimize for memory capacity and total cost of ownership rather than peak FLOPS. The economics favor vertical integration.

Cerebras’s Narrow Lane

Cerebras’s wafer-scale chip is unmatched for workloads that fit in its on-chip SRAM. But that’s a relatively small slice of the agentic inference pie, where the working set of model weights and KV cache routinely exceeds 44 GB. Cerebras is optimized for answer inference speed, not agentic memory capacity. The IPO frenzy reflects a real hunger for Nvidia alternatives, but it overstates the addressable market for Cerebras’s specific architecture.

China’s Window

China is in a stronger position than most realize. As Thompson observed, the country’s lack of leading-edge compute is a genuine disadvantage for training. But for agentic inference, fast-enough GPUs and abundant DRAM are all you need. The entire stack can be assembled from mature process nodes and commodity memory. US export controls that constrain Nvidia’s China business may matter less in a world where the biggest inference market doesn’t require cutting-edge chips.

The Unbundling Has Begun

The GPU is a feat of engineering, but it is also a bundle — compute, HBM, and ultra-fast networking sold as a single integrated package. That bundle made sense when training was the dominant AI workload and answer inference was the only game in town.

Agentic inference unbundles that package. It separates the need for compute from the need for memory, and it de-prioritizes latency in favor of throughput, capacity, and cost efficiency. The hardware that wins this market will not look like a GPU. It will look like a memory system with compute attached.

Nvidia will still dominate training and answer inference — two large and profitable markets. But the biggest market of all, agentic inference, belongs to a different architectural philosophy.

The crown is not falling yet. But the head it was made for is changing shape.
