File-Based Planning Is Becoming the Universal Agent Protocol

Every week there's a new post-mortem from someone who tried to run an AI coding agent on a real codebase and watched it spiral into an infinite loop, forget what it was building, or hallucinate entire features that don't exist. The dominant diagnosis blames the model: "Claude Sonnet loses the plot after 50 tool calls," or "GPT still can't handle long-range dependencies."

But look closer at what's actually failing. The model isn't suddenly stupid — it's running out of context. Its working memory dries up. The /clear command scrubs everything. A crash wipes the session. The problem isn't model capability; it's that agents treat the context window as their only memory, and that's an architecture problem, not an intelligence problem.

Three independent signals this week suggest the community has found the same fix: file-based planning on disk is becoming the universal protocol for agent reliability. And it's being built with markdown files, SHA-256 attestation, and a specification that runs on 40+ agents without a single proprietary API call.

The Context Window Trap

The root cause of most agent failures: every coding agent currently treats its context window like RAM — volatile, limited, and wiped on every reset. When a session hits context limits and you type /clear, everything the agent learned is gone. Goals drift. Errors aren't tracked. The agent starts from scratch and repeats the same mistakes.

This is the problem planning-with-files solves. It implements a persistent, crash-proof planning system using three markdown files on disk: task_plan.md for phases and progress, findings.md for research, and progress.md for session logs. An agent that loses context simply re-reads these files and picks up where it left off. It's the difference between keeping everything in your head (the context window) and keeping it in a notebook (the filesystem).
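To make the "notebook" idea concrete, here is a minimal sketch of what re-reading state after a context reset could look like. The three file names come from the project as described above. The function names and the resume-prompt layout are hypothetical and are not taken from planning-with-files itself.

```python
from pathlib import Path

# File names as described in the article; everything else is illustrative.
PLAN_FILES = ("task_plan.md", "findings.md", "progress.md")


def load_planning_state(workdir: str) -> dict[str, str]:
    """Read whichever planning files exist so a fresh session can resume."""
    root = Path(workdir)
    state = {}
    for name in PLAN_FILES:
        path = root / name
        # A missing file just means that part of the state was never written.
        state[name] = path.read_text(encoding="utf-8") if path.exists() else ""
    return state


def build_resume_prompt(state: dict[str, str]) -> str:
    """Join the non-empty files into one block for the top of a new context."""
    sections = [
        f"## {name}\n{text.strip()}"
        for name, text in state.items()
        if text.strip()
    ]
    return "\n\n".join(sections)
```

The point of the sketch is that nothing in it depends on the previous session: a crash or a /clear loses the context window, but the inputs to `load_planning_state` are still on disk.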

The project hit v3.0.0 on June 10, 2026 (release notes), adding opt-in autonomous and gated modes for long-running agentic runs. The gated mode includes a completion gate that holds the agent until the plan is actually done: no more "I think I'm finished" false positives. The latest release (v3.1.3) fixes a YAML frontmatter validation issue and brings the test suite to 184 passing tests.
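A completion gate of this kind can be sketched in a few lines. The version below assumes, purely for illustration, that each phase in task_plan.md is a markdown task-list item (`- [ ]` or `- [x]`). The project's real plan format and gate logic may differ, and both function names are hypothetical.

```python
import re

# Assumed convention (not confirmed by the project): one checkbox per phase.
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def unfinished_phases(plan_text: str) -> list[str]:
    """Return the titles of phases whose checkbox is still empty."""
    pending = []
    for line in plan_text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


def completion_gate(plan_text: str) -> bool:
    """Let the agent stop only if the plan has phases and none are pending."""
    has_phases = any(CHECKBOX.match(line) for line in plan_text.splitlines())
    return has_phases and not unfinished_phases(plan_text)
```

The design choice worth noting is that the gate reads the file, not the model's own claim of being finished. A plan with no recognizable phases fails the gate rather than passing by default.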

Security features like SHA-256 plan attestation arrived earlier. The /plan-attest command that locks task_plan.md with a cryptographic hash was introduced in v2.37.0, and hooks were updated to block injection on tamper. What v3.0.0 did was make attestation default-on in autonomous and gated modes: unattested plan bodies are refused at injection (changelog).
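The underlying idea is simple enough to sketch with the standard library: record a SHA-256 digest when the plan is approved, then refuse the plan later if the digest is missing or no longer matches. This is an illustration of the technique only. The sidecar digest file and the function names are assumptions, not how /plan-attest actually stores or checks its hash.

```python
import hashlib
from pathlib import Path


def attest(plan_path: str, digest_path: str) -> str:
    """Record the SHA-256 digest of the plan in a sidecar file (hypothetical layout)."""
    digest = hashlib.sha256(Path(plan_path).read_bytes()).hexdigest()
    Path(digest_path).write_text(digest, encoding="utf-8")
    return digest


def verify(plan_path: str, digest_path: str) -> bool:
    """Accept the plan only if an attestation exists and still matches."""
    recorded = Path(digest_path)
    if not recorded.exists():
        return False  # unattested plans are refused
    current = hashlib.sha256(Path(plan_path).read_bytes()).hexdigest()
    return current == recorded.read_text(encoding="utf-8").strip()
```

Any edit to the plan after attestation, even a single character, changes the digest, so `verify` returns False until the plan is attested again.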

The Standard Beneath the Hype

What makes planning-with-files significant isn't the 23,900 GitHub stars or the clever hook system. It's that it ships via the Agent Skills format — a lightweight, open specification for extending agent capabilities using a SKILL.md file in a standardized folder structure. The specification defines a simple contract: a directory with a SKILL.md containing YAML frontmatter and markdown instructions. No API, no SDK, no platform lock-in.
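Because the contract is just frontmatter plus markdown, a client can read a skill with very little code. The sketch below splits a SKILL.md into metadata and instructions using only the standard library. It handles flat `key: value` pairs and nothing more, so a real client would want a proper YAML parser. The function name and the sample skill text are illustrative, and the `name` and `description` fields are used here on the assumption that they are the fields a skill declares.

```python
def split_skill_md(text: str) -> tuple[dict[str, str], str]:
    """Split SKILL.md into (frontmatter dict, markdown body). Flat keys only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("frontmatter fence is never closed") from None

    meta = {}
    for line in lines[1:end]:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value pair: {line!r}")
        meta[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical skill content, for illustration only.
EXAMPLE = """---
name: demo-skill
description: Keep the plan on disk and re-read it after a reset.
---
# Instructions
Re-read task_plan.md before acting.
"""

meta, body = split_skill_md(EXAMPLE)
print(meta["name"], "|", body.splitlines()[0])
```

Nothing here calls a vendor API, which is the portability argument in miniature: any runtime that can read a file and parse a header can load the same skill.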

The same format runs across 40+ agents — from Claude Code and Cursor to Gemini CLI, GitHub Copilot, Codex, and Kiro. The agentskills.io client showcase lists dozens of compatible runtimes including JetBrains Junie, VS Code, OpenAI Codex, and Snowflake Cortex Code.

Anthropic's Claude Code skills documentation confirms the pattern: every skill needs a SKILL.md file with frontmatter and instructions, following the Agent Skills open standard "originally developed by Anthropic, released as an open standard, and ... adopted by a growing number of agent products" (agentskills.io).

The ecosystem has grown quickly. The awesome-claude-code registry, at 47,300 stars, catalogs skills and orchestrators. The agent-toolkit ships 40+ skills as SKILL.md files. And addyosmani/agent-skills, at 66,700 stars, packages 24 production-grade workflows in the same format.

This convergence reflects a community recognizing that agent capabilities should be open, portable, and file-based.

Google Bets on the Same Pattern

The strongest signal isn't from open source. It's from Google. The company released agents-cli (3,100 stars), a CLI and skill suite that "turns any coding assistant into an expert at creating, evaluating, and deploying AI agents on Google Cloud" (README). The tool ships Agent Skills covering the entire lifecycle: scaffolding, evaluation, deployment, observability. The documentation shows it works with Gemini CLI, Claude Code, Codex, and Antigravity, all through the same open skills format.

Google's agents-cli is explicitly described as "a tool for coding agents, not a coding agent itself." It uses skills to teach your existing agent how to build, evaluate, and deploy on Google Cloud. This is a platform-level bet that the Agent Skills format — files on disk, no proprietary runtime — is the right abstraction layer.

The Evidence: 96.7% vs. 6.7%

The benchmark data backs up the architecture argument. planning-with-files was evaluated using Anthropic's skill-creator framework across 5 task types — building a CLI tool, researching frameworks, debugging FastAPI, planning a Django migration, and designing a CI/CD pipeline. The results:

Configuration	Pass Rate
With skill (file-based planning)	96.7%
Without skill (ad-hoc)	6.7%
Delta	+90 pp

The full methodology is in the evals report. The benchmark measures file-pattern fidelity (whether the agent maintains the three-file structure), not goal drift over long runs. As the README cautions, "newer models and the autonomous-mode work are not yet covered." The 96.7% figure comes from v2.21.0 on claude-sonnet-4-6 (March 2026).

But the delta is striking. In a blind A/B comparison, three independent comparator agents preferred the file-based planning output 3 out of 3 times without knowing which was which. One judge noted that the unassisted agent "delivered real, runnable code... but it did not fulfill the structural expectations."

The code was functional — the process was unreliable.

The Counter-Narrative That Misses the Point

The dominant frame in agent reliability discussions is "wait for the next model." GPT-6, Claude Opus 5, Gemini Ultra 3 — the idea that we're one capability jump away from agents that just work. This frame has been dominant for two years, and it keeps getting disproven by experience. Better models make fewer reasoning errors, but they don't solve context loss. They don't make goal drift stop. They don't make the session survive a crash.

File-based planning directly addresses the architectural bottleneck that model improvements can't touch: deterministic state management. A model can be 10x smarter and still forget what it was doing after a /clear. But a plan on disk survives anything.

As AI-generated contributions flood open source, auditability becomes critical. File-based planning makes agent behavior inspectable — task_plan.md records the plan, progress.md logs execution, and SHA-256 attestation prevents tampering.

What This Convergence Means

File-based planning works: the 96.7% pass rate is real, though it measures file-pattern fidelity rather than goal drift over long runs.
The Agent Skills standard is winning — 40+ agents packaging capabilities as files on disk.
Platform companies are adopting, not fighting — Google and Anthropic are converging on the same architecture.

The agent Wild West is ending not because someone built the perfect proprietary platform, but because the community found a protocol so simple it was hiding in plain sight: markdown files on disk.

For teams building agent-based workflows: stop optimizing for model quality alone. The bottleneck in your agent pipeline is probably architectural. Invest in state management — persistent plans, structured progress tracking, deterministic completion gates. These are the patterns that survive context loss, crashes, and model swaps.

The next time your agent forgets what it's doing, ask yourself: is the problem really the model's intelligence, or is it the lack of a notebook?
