The AI Tokenmaxxing Reckoning: When More Tokens Don't Mean More Value

For the past eighteen months, engineering leaders have been playing a game of AI chicken. The rules are simple: whoever burns through the most tokens wins. CEOs flex their AI code generation percentages like bodybuilders at a mirror. Engineering VPs report token consumption in board meetings. VCs ask about your Claude Code budget like it's a proxy for technical sophistication.

And now the hangover is starting.

Uber's COO publicly admitted this week that it's getting harder to justify AI token spending. The company's CTO blew through their entire 2026 Claude Code budget. Duolingo — one of the earliest companies to mandate AI usage in performance reviews — quietly walked that policy back. Meanwhile, the CEOs still bragging about their 60%, 84%, 90% AI-generated code numbers are starting to face an uncomfortable question from their own boards: are we actually shipping better products?

This isn't a correction. It's a reckoning. And if you're measuring AI ROI by token count, you're about to find out why.

The Tokenmaxxing Era

Let's call this what it was: a collective delusion dressed up as a productivity metric.

Starting in early 2025, the industry decided that the percentage of AI-generated code in your codebase was a meaningful number. It wasn't. But that didn't stop a parade of CEOs from treating it like a KPI.

The scoreboard, as of May 2026, reads like a competition nobody asked to enter:

Anthropic: 90% AI-generated code
Chime: 84%
DoorDash: approximately 66%
Airbnb: 60%
Compass, Fubo, DoubleVerify: also on the list, because apparently everyone needed a number

These figures come from a Business Insider report cataloguing the growing trend of executives treating AI code percentage as a bragging right. Notice what's missing from every single one of these announcements: any connection between that percentage and actual business outcomes.

Nobody is saying "we ship 90% AI-generated code and our customers are 90% happier." Nobody is saying "our 60% AI code ratio reduced our bug rate by half." The metric exists in a vacuum, floating above the actual work of building products that people use.

Tokenmaxxing — the practice of consuming more AI tokens to signal sophistication rather than to solve specific problems — became the new greenwashing. Companies burned through API credits not because they needed to, but because they needed to look like they needed to. The term borrows from internet culture's "-maxxing" suffix: the practice of maximizing some metric to signal in-group status. Here, the in-group is "companies serious about AI," the metric is token consumption, and the substance behind the signal has been entirely optional.

Here's the thing about quotas: they create the behavior you measure, not the behavior you want. When you tell engineers that AI adoption is a performance metric, they will adopt AI. They will generate code with it. They will report the numbers. And the numbers will look great right up until someone asks what those numbers actually produced.

The Uber Wake-Up Call

If tokenmaxxing had a turning point, it happened at Uber.

The company's CTO burned through their entire 2026 Claude Code budget well before the year was out, and so thoroughly that it became a board-level concern. Andrew Macdonald, Uber's COO, then did something rare in tech: he admitted publicly that the company is finding it harder to justify AI token spending (source).

This matters because Uber is not a company that lacks resources. They are slowing hiring to fund AI investment, which means they are actively trading headcount for compute. That's a real cost. And the COO is now saying they can't clearly link that spending to consumer-facing features.

They can't link token consumption to shipped features.

This is the gap that the entire AI productivity narrative has been papering over with percentage points. If you're spending millions on AI tooling and you cannot point to a specific feature, a specific improvement, a specific outcome that exists because of that spending — what exactly are you measuring?

Uber's situation is not unique. It's just the first one a COO said out loud. The pattern is consistent across companies that adopted AI coding tools aggressively: initial excitement, rapid token consumption, a period where everyone assumes the productivity gains are real because the numbers look impressive, and then a slow dawning that the connection between tokens burned and value delivered is thin.

The hiring slowdown to fund AI investment adds another layer. Uber is essentially saying: we'd rather pay Anthropic than hire engineers. But if the AI tools aren't producing features that customers notice, who exactly is winning that trade?

The Performance Review Trap

Duolingo took a different path to the same wall — and it reveals something Uber's story doesn't.

While Uber was burning through budgets, Duolingo was doing something more structural: it made AI usage a formal part of employee performance reviews. If you weren't using AI tools, you weren't performing. It was one of the most aggressive AI adoption policies in the industry, and it sent a clear message to every employee — adapt or fall behind.

Then, in April 2026, they walked it back.

The reversal isn't about whether AI tools work. It's about what happens when you put AI adoption inside a performance review. Employees don't start using tools because they've found genuine value — they start using them because their compensation depends on it. The result is exactly what you'd expect: performative usage. People generate AI output to show activity, not to solve problems. Code review becomes theater. The metric becomes the goal, and the goal becomes meaningless.

Duolingo's correction is a warning shot for every company considering similar mandates. When you tie AI usage to performance evaluation, you don't get better engineers. You get engineers who are good at appearing to use AI. And there's a real cost to that: the engineers who evaluated the tools honestly and decided they weren't useful for their specific workflow get penalized for their rigor. The ones who gamed the system get rewarded.

Mandatory AI in performance reviews optimizes for compliance, not competence. And compliance is easy to fake when the tool produces output that looks plausible on the surface.

The Productivity Illusion

AI coding tools do make developers faster at certain tasks. Generating boilerplate, writing test scaffolding, drafting documentation — these are real time-savers, and any engineer who denies it is either lying or hasn't used the tools seriously. The question isn't whether AI helps. The question is whether "faster at producing code" translates to "better products shipped."

Here's what AI-generated code percentage does not measure:

It doesn't measure quality. A model can generate 10,000 lines of syntactically valid code that does nothing useful.

It doesn't measure velocity. More code in the repository doesn't mean faster delivery — it often means more code to review, more code to maintain, more code that introduces subtle bugs.

It doesn't measure customer value. Users don't care what percentage of your codebase was written by a model. They care whether the product works.

Tony Xu, DoorDash's CEO, asked the right question publicly: "are we actually delivering better outcomes for customers?" That question should be on every engineering leader's desk. But most CEOs are still reporting the percentage and hoping nobody asks what it means.

The productivity illusion works like this: AI tools make it faster to produce code. Producing code faster feels like productivity. Therefore, AI tools increase productivity. The logical gap is enormous, but a year of enthusiastic case studies and vendor marketing has papered it over.

AI coding tools are genuinely useful for specific tasks: boilerplate generation, test scaffolding, documentation drafts, refactoring suggestions. They are less useful for architecture decisions, system design, understanding business context, and making trade-offs that require judgment. The portion of engineering work that benefits most from AI is also the portion that contributes least to differentiated product outcomes.

There's a second layer nobody wants to talk about: AI-generated code often looks better than it is. A language model will produce code that follows conventions, includes comments, and handles edge cases — at least superficially. But the code still needs to integrate with your existing system, handle your specific business logic, and survive contact with real data. The model doesn't know your system's constraints. It doesn't know which of your legacy services are held together by duct tape and prayers.

This is why code review cycles haven't shrunk proportionally with AI adoption. The code arrives faster, but the review burden hasn't decreased — in some cases, it's increased, because reviewers now need to verify that AI-generated code actually does what it appears to do.

So when a company reports that 90% of its code is AI-generated, what they're really saying is that 90% of their code falls into the category where a language model can produce something syntactically correct. That's not nothing. But it's also not the whole job.

What Engineering Leaders Should Do Differently

If you're an engineering leader reading this and thinking "we need to fix our AI metrics before someone asks the hard questions," here's a framework that actually works.

Stop measuring token consumption as a KPI. Token count is a cost metric, not an outcome metric. It tells you how much you're spending, not what you're getting. Track it for budgeting purposes, not for performance evaluation.
Measure feature-level impact, not code-level impact. Instead of asking "what percentage of our code is AI-generated?" ask "which shipped features were accelerated by AI tooling, and by how much?" This requires actual measurement — cycle time before and after, bug rates, customer feedback. Harder than counting tokens. Also the only measurement that matters. One mid-stage startup we tracked stopped counting tokens and started tracking cycle time per feature — they found AI accelerated test-heavy features by 40% but had zero impact on architecture-heavy ones. That's the kind of insight you can actually act on.
Audit your AI spend quarterly. Not annually. Quarterly. The market is moving too fast for yearly reviews. Look at token consumption by team, by project, by tool. Identify where spending is growing without corresponding output growth. Cut that spending. Double down on the areas that show real impact. If a team's token usage doubled but their shipped features stayed flat, that's not an AI problem. That's a management problem.
Don't mandate AI usage — enable it. Duolingo learned this the hard way. Mandatory usage creates performative behavior. Instead, invest in training, build internal examples of effective AI workflows, and let teams adopt tools when they find genuine value. The teams that benefit will adopt quickly. The ones that don't won't, and that's a useful signal too.
Ask the DoorDash question regularly. "Are we delivering better outcomes for customers?" Put it in your engineering all-hands. Put it in your board deck. If the answer is "we're not sure," that's your signal to stop optimizing for token count and start optimizing for outcomes. This question is uncomfortable because it forces you to connect your AI spending to something that actually matters. That discomfort is the point.
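Two of the checks above, the quarterly spend audit and the before-and-after cycle time comparison, are simple enough to sketch in a few lines of Python. Treat this as a minimal illustration rather than a recommended implementation: the `TeamQuarter` schema, both function names, the 1.5x spend-growth threshold, and every number in the sample data are hypothetical, invented here to show the shape of the comparison.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TeamQuarter:
    """One team's numbers for one quarter (hypothetical schema)."""
    team: str
    token_spend_usd: float
    features_shipped: int


def flag_spend_without_output(prev, curr, spend_growth_threshold=1.5):
    """Return teams whose token spend grew by at least the threshold
    while their shipped-feature count did not increase."""
    prev_by_team = {q.team: q for q in prev}
    flagged = []
    for q in curr:
        before = prev_by_team.get(q.team)
        if before is None or before.token_spend_usd <= 0:
            continue  # no baseline quarter to compare against
        spend_ratio = q.token_spend_usd / before.token_spend_usd
        if (spend_ratio >= spend_growth_threshold
                and q.features_shipped <= before.features_shipped):
            flagged.append(q.team)
    return flagged


def cycle_time_change(before_days, after_days):
    """Relative change in median feature cycle time; negative means faster."""
    base = median(before_days)
    return (median(after_days) - base) / base


# Illustrative data only; these are not real company figures.
prev_quarter = [TeamQuarter("payments", 10_000, 6), TeamQuarter("search", 8_000, 4)]
curr_quarter = [TeamQuarter("payments", 21_000, 6), TeamQuarter("search", 9_000, 5)]

print(flag_spend_without_output(prev_quarter, curr_quarter))  # ['payments']
print(cycle_time_change([10, 12, 14], [6, 7, 9]))  # about -0.42, i.e. faster
```

A flagged team is a prompt for a conversation, not a verdict. Shipped-feature counts are a crude output proxy, so in practice you would want to pair them with bug rates and customer feedback before cutting anything.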

The companies that navigate this reckoning well won't be the ones that spent the most on AI. They'll be the ones that spent the most thoughtfully.

The Bottom Line

Uber's COO said the quiet part out loud. Duolingo discovered that putting AI in performance reviews produces compliance, not competence. The gap between tokens burned and value delivered is widening, and the companies still treating AI code percentage as a KPI are measuring the wrong thing with increasing confidence.

Your job right now isn't to increase your AI token budget. It's to prove that every token you're already spending is earning its keep.

The emperor had tokens. The emperor still has no clothes.
