Is Grep All You Need? The War Over How AI Coding Agents Search Your Code

The title, "Is Grep All You Need?", is the kind of bait that sends the internet into a spiral. A team from PwC put together a tidy factorial experiment: they pitted grep against vector search inside four different agent harnesses, ran both on a long-context QA benchmark (LongMemEval), and reported the numbers. The headline result is that grep generally beats vector search.

If you stop there, you miss the entire paper. The real finding, buried in Table 1 and the discussion, is that the agent harness — how you wire search into the tool loop — determines accuracy more than which search algorithm you pick.

Moving the same model between Chronos (their custom harness) and Claude Code shifts accuracy by margins comparable to swapping grep for vector search. And switching from inline to file-based tool delivery can drop accuracy from 93% to 55% without touching a single line of search code.

This isn't a paper about search. It's a paper about orchestration, and the industry is drawing the wrong lesson.

The False Binary

Let's get the numbers out of the way, because they are interesting. Across ten harness-model combinations, inline grep outperforms inline vector search in every single row. The largest gap: Chronos with Gemini 3.1 Flash-Lite scoring 86.2% with grep vs. 62.9% with vector. The narrowest: Claude Code with Claude Opus 4.6 at 76.7% vs. 75.0%.

But here's what the clickbait headline won't tell you. LongMemEval is a benchmark built around recovering literal witnesses — exact dates, counts, preferences, and spans of text. It's designed for fact-retrieval from long conversations, not for semantic understanding tasks where vector search shines. The paper's authors are transparent about this: "LongMemEval rewards recovering literal witnesses." A grep-friendly benchmark producing grep-favorable numbers is about as surprising as a speed test on a drag strip favoring a sports car.

The more important finding is hiding in the harness rows. Look at Claude Opus 4.6 — the same backbone model — running under Chronos vs. Claude Code:

Harness	Inline grep	Inline vector
Chronos	93.1%	83.6%
Claude Code	76.7%	75.0%

The harness alone accounts for a 16.4-point drop on grep and an 8.6-point drop on vector. That's not a search story. That's an orchestration story. Chronos uses category-conditioned prompting and controlled tool surface area. Claude Code uses its own opaque shell-based tool loop. The underlying retrieval is identical.
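To make the comparison concrete, here is a small sketch that recomputes both effects from the four numbers in the table above. The function names are mine, not the paper's, and the only data used is what the table reports.

```python
# Accuracy (%) for Claude Opus 4.6, copied from the table above.
scores = {
    ("Chronos", "grep"): 93.1,
    ("Chronos", "vector"): 83.6,
    ("Claude Code", "grep"): 76.7,
    ("Claude Code", "vector"): 75.0,
}

def harness_effect(retriever: str) -> float:
    """Points lost moving from Chronos to Claude Code, retriever held fixed."""
    return round(scores[("Chronos", retriever)] - scores[("Claude Code", retriever)], 1)

def retriever_effect(harness: str) -> float:
    """Points lost moving from grep to vector search, harness held fixed."""
    return round(scores[(harness, "grep")] - scores[(harness, "vector")], 1)

for retriever in ("grep", "vector"):
    print(f"harness effect with {retriever}: {harness_effect(retriever)} points")
for harness in ("Chronos", "Claude Code"):
    print(f"retriever effect under {harness}: {retriever_effect(harness)} points")
```

On these four cells, swapping the harness costs 16.4 and 8.6 points, while swapping the retriever costs 9.5 and 1.7. For this model, the harness column moves more than the retriever column in both comparisons.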

A related paper from Li et al., "Beyond Semantic Similarity", makes the same argument from first principles using the concept of direct corpus interaction (DCI): giving agents grep, file reads, shell commands, and lightweight scripts rather than routing everything through a retriever API. Their key claim echoes the PwC findings: "as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus."

Both papers converge on the same insight: retrieval in an agent loop is a fundamentally different problem from retrieval in a static RAG pipeline.

The Delivery Path Is a Stress Test

The paper's experiment with file-based tool delivery is where things get genuinely interesting. Instead of injecting search results inline into the context, results get written to a file that the agent must explicitly read. This is supposed to relieve context pressure. In practice, it adds a multi-step workflow that many agents execute unreliably.
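The difference between the two delivery paths is easier to see in code. What follows is a minimal sketch, not the paper's implementation: the corpus, the tool names (`deliver_inline`, `deliver_to_file`, `read_file`), and the message format are all hypothetical. The point it illustrates is that file-based delivery returns only a pointer, so the agent needs a second, self-initiated tool call before it sees any evidence.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical two-session corpus standing in for a long conversation history.
CORPUS = {
    "session_01.txt": "User: my flight to Lisbon is on March 3.\nAssistant: Noted.",
    "session_02.txt": "User: I adopted a second cat last week.\nAssistant: Congrats!",
}

def grep(pattern: str) -> list[str]:
    """Return 'file:line_no:text' for every corpus line matching the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name, text in CORPUS.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{name}:{line_no}:{line}")
    return hits

def deliver_inline(pattern: str) -> str:
    """Inline delivery: the hits themselves become the tool message."""
    return "\n".join(grep(pattern))

def deliver_to_file(pattern: str, out_dir: Path) -> str:
    """File-based delivery: the tool message holds only a count and a path."""
    hits = grep(pattern)
    out_path = out_dir / "search_results.txt"
    out_path.write_text("\n".join(hits), encoding="utf-8")
    return f"{len(hits)} hit(s) written to {out_path}"

def read_file(path: str) -> str:
    """The follow-up tool call the agent has to decide to make on its own."""
    return Path(path).read_text(encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    print(deliver_inline("lisbon"))                # evidence arrives in one step
    message = deliver_to_file("lisbon", Path(tmp))
    print(message)                                 # no evidence yet, only a pointer
    print(read_file(message.split(" written to ")[1]))
```

Both paths end with the same text in front of the model. The file-based path just adds a step where the agent can stall, skip the read, or misparse the path.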

The starkest example: Codex with GPT-5.4 fell from 93.1% with inline grep to 55.2% with programmatic (file-based) grep. A 37.9-point collapse. The search results were the same. The corpus was the same. The only change was how results were delivered. Codex couldn't reliably close the "read the file, integrate the results, retry if needed" loop.

This is the kind of failure mode that benchmark leaderboards hide. If your agent can't handle file-based tool output, you don't have a search problem — you have a harness problem. And no embedding model will fix it.

Programmatic delivery did help in some cases — vector search with programmatic output exceeded programmatic grep on five of ten harness-model pairs. But the inconsistency tells you everything: file-based routing is a capability test for the agent itself, not a property of the search algorithm.

What ColGREP and FFF Tell Us About the Future

If the paper had been published in isolation, you could dismiss it as a narrow benchmark study with a provocative title. But two open-source projects are already converging on the same conclusion from opposite directions.

ColGREP (part of LightOn's NextPlaid ecosystem) is a semantic code search tool that explicitly combines regex filtering with multi-vector ranking. You can run colgrep -e "async.*await" "error handling" — the -e flag applies a regex pre-filter, then the query runs late-interaction similarity scoring on the surviving candidates. It uses ColBERT-style multi-vector embeddings (keeping ~300 embeddings per document instead of one) with product quantization and memory-mapped indices. Indexing runs on CPU. The latest release (v1.5.4, June 9 2026) ships native integrations for Claude Code, OpenCode, Codex, and Hermes.

What ColGREP gets right is that it doesn't force a choice. It gives the agent both tools and lets the query determine which path wins. Regex for precision, multi-vector for recall, hybrid by default.
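The filter-then-rank pattern can be sketched in a few lines. This is a toy illustration of the idea, not ColGREP's code or API: character-trigram overlap stands in for learned token embeddings, the `hybrid_search` function and the sample files are hypothetical, and letting the regex match across lines (`re.DOTALL`) is my choice for the toy. What it does preserve is the late-interaction scoring shape, where each query token takes its best-matching document token and the maxima are summed.

```python
import re

def trigrams(token: str) -> set[str]:
    """Character trigrams of a padded, lowercased token."""
    padded = f"  {token.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def token_sim(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets; a crude stand-in for an embedding dot product."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def tokenize(text: str) -> list[str]:
    return re.findall(r"[A-Za-z]+", text)

def maxsim(query: str, doc: str) -> float:
    """Late-interaction score: sum over query tokens of the best doc-token match."""
    doc_tokens = tokenize(doc)
    if not doc_tokens:
        return 0.0
    return sum(max(token_sim(q, d) for d in doc_tokens) for q in tokenize(query))

def hybrid_search(pattern: str, query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Regex pre-filter first, then rank only the surviving documents."""
    rx = re.compile(pattern, re.DOTALL)
    survivors = {name: text for name, text in docs.items() if rx.search(text)}
    ranked = [(name, maxsim(query, text)) for name, text in survivors.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical three-file corpus.
DOCS = {
    "fetch.py": "async def fetch(url):\n    try:\n        return await get(url)\n"
                "    except NetworkError as error:\n        handle(error)",
    "tick.py": "async def tick():\n    await sleep(1)",
    "parse.py": "def parse(text):\n    raise ValueError('bad input')",
}

for name, score in hybrid_search(r"async.*await", "error handling", DOCS):
    print(f"{name}: {score:.2f}")
```

Here the regex drops `parse.py` before any scoring happens, and the ranking step puts `fetch.py` ahead of `tick.py` because only `fetch.py` contains tokens resembling the query. The cheap filter sets the candidate pool; the expensive scorer only orders what survives.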

FFF (8.3k stars) takes a different but complementary approach. It's a file search toolkit designed from the ground up for long-running agent processes. The MCP server gives agents tools like ffgrep and fffind with frecency-ranked results, definition-aware classification, smart-case with fuzzy fallback, and git-status annotations. Search results come back as typed objects, not text you have to re-parse. FFF keeps the index resident in memory so the second search costs sub-10 ms instead of the seconds a ripgrep spawn would take.

The key insight FFF makes explicit in its documentation (paraphrased): "FFF is a file search library, not a CLI. Ripgrep and fzf are great tools, but every call forks a new process. FFF keeps the index resident in one long-lived process." For AI agents that might run hundreds of searches per session, that architectural choice matters more than any ranking algorithm. The tradeoff: that resident index competes for RAM with the agent's context, so teams with memory-constrained environments should benchmark before committing.
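The architectural point, build the index once and answer many queries from memory, can be shown with a toy inverted index. This is a sketch of the general pattern only; it is not FFF's data structure or API, and the class name, tokenization, and sample files are hypothetical. It also returns typed tuples rather than text, which is the other property the section describes.

```python
import re
from collections import defaultdict

class ResidentIndex:
    """Toy in-memory inverted index: built once, queried many times in one process."""

    def __init__(self, files: dict[str, str]) -> None:
        self._postings: dict[str, set[tuple[str, int]]] = defaultdict(set)
        self._lines: dict[tuple[str, int], str] = {}
        for name, text in files.items():
            for line_no, line in enumerate(text.splitlines(), start=1):
                self._lines[(name, line_no)] = line
                for token in re.findall(r"\w+", line.lower()):
                    self._postings[token].add((name, line_no))

    def search(self, token: str) -> list[tuple[str, int, str]]:
        """Return (file, line_no, line) for each line containing the token, sorted."""
        keys = sorted(self._postings.get(token.lower(), ()))
        return [(name, line_no, self._lines[(name, line_no)]) for name, line_no in keys]

# Hypothetical two-file repository.
FILES = {
    "a.py": "def load_config(path):\n    return open(path).read()",
    "b.py": "cfg = load_config('app.toml')",
}

index = ResidentIndex(FILES)           # paid once, at process start
for hit in index.search("load_config"):
    print(hit)                         # each later query is a dictionary lookup
```

The memory tradeoff mentioned above is visible here too: `_postings` and `_lines` stay allocated for the life of the process, whether or not another search ever arrives.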

Both projects are saying the same thing: give agents multiple search modalities and optimize the harness, not just the retriever. ColGREP does it with hybrid regex+semantic. FFF does it with programmatic APIs, frecency, and definition-aware indexing. Neither asks "grep or vector?" — they build both into the tool surface and let the agent figure out which to use. The catch with ColGREP's approach: multi-vector indices are larger than single-vector, and CPU indexing trades build speed for accessibility — fine for nightly rebuilds, less ideal for rapidly changing codebases.

What Teams Should Actually Do

If you're building an AI coding agent today, here's what I'd take from all of this:

Don't optimize search in isolation. Run your accuracy benchmarks inside your actual harness, not with a static retriever. The PwC paper shows that harness effects are as large as retriever effects. If you're not measuring both, you're measuring the wrong thing.
Ship multiple search tools, not one. Give the agent grep (or ripgrep) for precision, vector search for recall, and let it decide. The paper shows that the best results often come from giving the model the ability to choose its search strategy dynamically. ColGREP and FFF are existence proofs that this approach ships today.
Watch the delivery path. File-based tool results can double as a stress test for your agent loop. If your agent can't reliably read and integrate file output, that's a harness bug, not a search bug. Fix the harness first.
Benchmark on your data, not on LongMemEval. The paper's results are real but context-dependent. If your task involves semantic understanding, conceptual blending, or code navigation, vector search may win. Test on what you actually do.
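One way to act on these recommendations is to evaluate the full grid of harness, retriever, and delivery path rather than one axis at a time. The sketch below assumes a hypothetical `run(harness, retriever, delivery, question)` callable that wraps your real agent, and it scores by case-insensitive exact match, which is a simplification you would likely replace with your own grader.

```python
from itertools import product
from typing import Callable

# Hypothetical signature: (harness, retriever, delivery, question) -> answer.
RunFn = Callable[[str, str, str, str], str]
Cell = tuple[str, str, str]

def evaluate(
    run: RunFn,
    harnesses: list[str],
    retrievers: list[str],
    deliveries: list[str],
    dataset: list[tuple[str, str]],
) -> dict[Cell, float]:
    """Accuracy for every (harness, retriever, delivery) cell, by exact match."""
    results: dict[Cell, float] = {}
    for cell in product(harnesses, retrievers, deliveries):
        correct = sum(
            run(*cell, question).strip().lower() == answer.strip().lower()
            for question, answer in dataset
        )
        results[cell] = correct / len(dataset)
    return results

def fake_run(harness: str, retriever: str, delivery: str, question: str) -> str:
    """Stand-in for a real agent call: here only the file delivery path fails."""
    return "unknown" if delivery == "file" else "march 3"

dataset = [("When is the flight to Lisbon?", "March 3")]
grid = evaluate(fake_run, ["harness_a", "harness_b"], ["grep", "vector"],
                ["inline", "file"], dataset)
for cell, accuracy in grid.items():
    print(cell, accuracy)
```

With a real `run` function plugged in, reading the grid row by row shows whether your accuracy is moving with the retriever, the harness, or the delivery path.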

The real war isn't between search algorithms — it's between harness designs that treat retrieval as an afterthought and those that make it a first-class part of the agent loop.

Ask yourself not "which retriever should I use," but "how does my agent discover, select, and integrate information from its environment?" The teams that answer that question well will beat the teams with better embeddings every single time.
