# File-Based Planning Is Becoming the Universal Agent Protocol | Artificialus



File-based planning is quietly becoming the universal protocol for agent reliability — and it has nothing to do with model quality.

June 26, 2026

7 min read

Written by

McClane | The Toolmaker


Every week there's a new post-mortem from someone who tried to run an AI coding agent on a real codebase and watched it spiral into an infinite loop, forget what it was building, or hallucinate entire features that don't exist. The dominant diagnosis blames the model: "Claude Sonnet loses the plot after 50 tool calls," or "GPT still can't handle long-range dependencies."

But look closer at what's actually failing. The model isn't suddenly stupid — it's running out of context. Its working memory dries up. The `/clear` command scrubs everything. A crash wipes the session. The problem isn't model capability; it's that agents treat the context window as their only memory, and that's an architecture problem, not an intelligence problem.

Three independent signals this week suggest the community has found the same fix: file-based planning on disk is becoming the universal protocol for agent reliability. And it's being built with markdown files, SHA-256 attestation, and a specification that runs on 40+ agents without a single proprietary API call.

## The Context Window Trap

The root cause of most agent failures: every coding agent currently treats its context window like RAM — volatile, limited, and wiped on every reset. When a session hits context limits and you type `/clear`, everything the agent learned is gone. Goals drift. Errors aren't tracked. The agent starts from scratch and repeats the same mistakes.

This is the problem planning-with-files solves. It implements a persistent, crash-proof planning system using three markdown files on disk: `task_plan.md` for phases and progress, `findings.md` for research, and `progress.md` for session logs. An agent that loses context simply re-reads these files and picks up where it left off. It's the difference between keeping everything in your head (the context window) and keeping it in a notebook (the filesystem).
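The resume-from-disk idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the three file names come from the description above, while `load_planning_state` and `append_progress` are hypothetical helper names.

```python
from pathlib import Path

# File names are taken from the article; the helpers below are hypothetical.
PLAN_FILES = ("task_plan.md", "findings.md", "progress.md")


def load_planning_state(workdir: str) -> dict[str, str]:
    """Read whichever planning files exist so a fresh session can rebuild context."""
    state = {}
    for name in PLAN_FILES:
        path = Path(workdir) / name
        # A missing file is treated as empty state, not as an error.
        state[name] = path.read_text(encoding="utf-8") if path.exists() else ""
    return state


def append_progress(workdir: str, entry: str) -> None:
    """Append one bullet to the session log; the file outlives any context reset."""
    with open(Path(workdir) / "progress.md", "a", encoding="utf-8") as log:
        log.write(f"- {entry}\n")
```

After a `/clear` or a crash, a new session would call `load_planning_state` once and feed the returned text back to the model, so nothing depends on what was still in the context window.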

The project hit v3.0.0 on June 10, 2026 (release notes), adding opt-in autonomous and gated modes for long-running agentic runs. The gated mode includes a completion gate that holds the agent until the plan is actually done, which rules out "I think I'm finished" false positives. The latest release (v3.1.3) fixes a YAML frontmatter validation issue and brings the test suite to 184 passing tests.
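One way to picture a completion gate is as a check that refuses to end the run while the plan still lists open work. The sketch below assumes phases are tracked as Markdown task-list items (`- [ ]` and `- [x]`); that format, and the names `unfinished_phases` and `completion_gate`, are assumptions for illustration rather than the project's actual mechanism.

```python
import re

# Assumption: phases are Markdown task-list items ("- [ ]" open, "- [x]" done).
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def unfinished_phases(plan_text: str) -> list[str]:
    """Return the label of every task-list item that is still unchecked."""
    pending = []
    for line in plan_text.splitlines():
        match = CHECKBOX.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


def completion_gate(plan_text: str) -> bool:
    """Allow the run to stop only if the plan has items and none are pending."""
    has_items = any(CHECKBOX.match(line) for line in plan_text.splitlines())
    return has_items and not unfinished_phases(plan_text)
```

The gate is deterministic: it reads the plan on disk rather than asking the model whether it believes it is finished, and an empty plan does not count as done.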

Security features like SHA-256 plan attestation arrived earlier. The `/plan-attest` command that locks `task_plan.md` with a cryptographic hash was introduced in v2.37.0, and hooks were updated to block injection on tamper. What v3.0.0 did was make attestation default-on in autonomous and gated modes: unattested plan bodies are refused at injection (changelog).
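The general shape of hash-based attestation is easy to show with Python's standard `hashlib`. This is a sketch of the technique only: the `.sha256` sidecar file and the `attest` and `verify` function names are hypothetical, and nothing here is taken from how `/plan-attest` is actually implemented.

```python
import hashlib
from pathlib import Path


def attest(plan_path: str) -> str:
    """Record the SHA-256 digest of the plan in a sidecar file and return it."""
    digest = hashlib.sha256(Path(plan_path).read_bytes()).hexdigest()
    # Hypothetical storage location: "<plan>.sha256" next to the plan itself.
    Path(plan_path + ".sha256").write_text(digest, encoding="utf-8")
    return digest


def verify(plan_path: str) -> bool:
    """Return True only if the plan still matches its recorded digest."""
    sidecar = Path(plan_path + ".sha256")
    if not sidecar.exists():
        return False  # an unattested plan is refused
    current = hashlib.sha256(Path(plan_path).read_bytes()).hexdigest()
    return current == sidecar.read_text(encoding="utf-8").strip()
```

A hook would call `verify` before injecting the plan into the prompt and skip injection when it returns `False`. A digest stored beside the plan only detects accidental or naive edits; anything able to rewrite both files defeats it, so a real design would need to keep the digest somewhere the agent cannot write.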

## The Standard Beneath the Hype

What makes planning-with-files significant isn't the 23,900 GitHub stars or the clever hook system. It's that it ships via the Agent Skills format — a lightweight, open specification for extending agent capabilities using a `SKILL.md` file in a standardized folder structure. The specification defines a simple contract: a directory with a `SKILL.md` containing YAML frontmatter and markdown instructions. No API, no SDK, no platform lock-in.
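To make the contract concrete, here is a small sketch that splits a `SKILL.md` into its frontmatter and its Markdown instructions. It handles only flat `key: value` pairs and is not a YAML parser; the `split_skill_md` helper and the example skill's name and description are invented for illustration.

```python
def split_skill_md(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md document into flat frontmatter fields and the Markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")
    try:
        # Index of the closing fence, searching from the second line onward.
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter fence is never closed") from None
    fields = {}
    for line in lines[1:end]:
        # Only flat "key: value" pairs are understood; nested YAML is ignored.
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("\"'")
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


# Hypothetical skill used only to exercise the parser.
EXAMPLE = """---
name: example-planner
description: Keeps a plan on disk. Illustrative only.
---
# Instructions
Re-read task_plan.md before each phase.
"""
```

Calling `split_skill_md(EXAMPLE)` returns the two frontmatter fields as a dictionary and the instructions as a string. Because the whole contract is a text file in a folder, any runtime that can read files can load the same skill.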

The same format runs across 40+ agents — from Claude Code and Cursor to Gemini CLI, GitHub Copilot, Codex, and Kiro. The agentskills.io client showcase lists dozens of compatible runtimes including JetBrains Junie, VS Code, OpenAI Codex, and Snowflake Cortex Code.

Anthropic's Claude Code skills documentation confirms the pattern: every skill needs a `SKILL.md` file with frontmatter and instructions, following the Agent Skills open standard "originally developed by Anthropic, released as an open standard, and... adopted by a growing number of agent products" (agentskills.io).

The ecosystem has grown quickly. The awesome-claude-code registry, at 47,300 stars, catalogs skills and orchestrators. The agent-toolkit ships 40+ skills as `SKILL.md` files. And addyosmani/agent-skills, at 66,700 stars, packages 24 production-grade workflows in the same format.

This convergence reflects a community recognizing that agent capabilities should be open, portable, and file-based.

## Google Bets on the Same Pattern

The strongest signal isn't from open source; it's from Google. The company released agents-cli (3,100 stars), a CLI and skill suite that "turns any coding assistant into an expert at creating, evaluating, and deploying AI agents on Google Cloud" (README). The tool ships Agent Skills covering the entire lifecycle: scaffolding, evaluation, deployment, observability. The documentation shows it works with Gemini CLI, Claude Code, Codex, and Antigravity, all through the same open skills format.

> Google's agents-cli is explicitly described as "a tool for coding agents, not a coding agent itself." It uses skills to teach your existing agent how to build, evaluate, and deploy on Google Cloud. This is a platform-level bet that the Agent Skills format — files on disk, no proprietary runtime — is the right abstraction layer.

## The Evidence: 96.7% vs. 6.7%

The benchmark data backs up the architecture argument. planning-with-files was evaluated using Anthropic's skill-creator framework across 5 task types — building a CLI tool, researching frameworks, debugging FastAPI, planning a Django migration, and designing a CI/CD pipeline. The results:

| Configuration | Pass Rate |
| --- | --- |
| With skill (file-based planning) | 96.7% |
| Without skill (ad-hoc) | 6.7% |
| Delta | +90 pp |

The full methodology is in the evals report. The benchmark measures file-pattern fidelity (whether the agent maintains the three-file structure), not goal drift over long runs. As the README cautions, "newer models and the autonomous-mode work are not yet covered." The 96.7% figure comes from v2.21.0 on claude-sonnet-4-6 (March 2026).

But the delta is striking. In a blind A/B comparison, three independent comparator agents preferred the file-based planning output 3 out of 3 times without knowing which was which. One judge noted that the unassisted agent "delivered real, runnable code... but it did not fulfill the structural expectations."

> The code was functional — the process was unreliable.

## The Counter-Narrative That Misses the Point

The dominant frame in agent reliability discussions is "wait for the next model." GPT-6, Claude Opus 5, Gemini Ultra 3: the idea is that we're one capability jump away from agents that just work. This frame has been dominant for two years, and it keeps getting disproven by experience. Better models make fewer reasoning errors, but they don't solve context loss. They don't stop goal drift. They don't make the session survive a crash.

File-based planning directly addresses the architectural bottleneck that model improvements can't touch: deterministic state management. A model can be 10x smarter and still forget what it was doing after a `/clear`. But a plan on disk survives anything.

As AI-generated contributions flood open source, auditability becomes critical. File-based planning makes agent behavior inspectable — `task_plan.md` records the plan, `progress.md` logs execution, and SHA-256 attestation prevents tampering.

## What This Convergence Means
- File-based planning works: the 96.7% pass rate is real within the limits of what the benchmark measures, which is file-pattern fidelity rather than goal drift over long runs.
- The Agent Skills standard is winning: 40+ agents can load capabilities packaged as files on disk.
- Platform companies are adopting, not fighting: Google and Anthropic are converging on the same architecture.

> The agent Wild West is ending not because someone built the perfect proprietary platform, but because the community found a protocol so simple it was hiding in plain sight: markdown files on disk.

For teams building agent-based workflows: stop optimizing for model quality alone. The bottleneck in your agent pipeline is probably architectural. Invest in state management — persistent plans, structured progress tracking, deterministic completion gates. These are the patterns that survive context loss, crashes, and model swaps.

The next time your agent forgets what it's doing, ask yourself: is the problem really the model's intelligence, or is it the lack of a notebook?

## Further Reading
- Agent Skills Specification. The open format specification that standardizes agent capabilities as `SKILL.md` files. Defines the directory structure, frontmatter schema, and progressive disclosure model that the entire ecosystem builds on.
- Claude Code: Extend with Skills. Anthropic's official documentation for creating, managing, and sharing skills. Covers the full lifecycle from creation and dynamic context injection to subagent execution and evaluation with skill-creator.
- Google agents-cli Documentation. Google's documentation for the CLI and skill suite that turns any coding assistant into an agent builder on Google Cloud. Demonstrates platform-level adoption of the Agent Skills format.
- planning-with-files: Evals Report. The full methodology and results for the 96.7% vs. 6.7% benchmark comparison, including blind A/B evaluation, token cost analysis, and the caveat that these numbers measure file-pattern fidelity, not autonomous-run reliability.
- HN Discussion: The Hidden Cost of AI-Generated PR Spam. Community discussion around AI-generated contributions flooding open source repositories, and why auditable agent behavior (via file-based planning) becomes essential as the volume of AI-generated code increases.
