Late — High-Leverage AI Agent Orchestration

Orchestrate an entire AI dev team on 5GB VRAM

mlhher · Closed source · Since 2026

Orchestrate an entire AI dev team on 5GB VRAM using ephemeral subagents, exact-match diffs, and a zero-dependency Go binary. Works with any OpenAI-compatible model — local or cloud.

+ Pros

Ephemeral subagents prevent context pollution — each task gets a fresh isolated context that is destroyed on completion
Zero-dependency single Go binary — no Node.js, Python, or runtime environment required
Runs on as little as 5GB VRAM for local inference with llama.cpp
Hybrid model routing lets you architect with large reasoning models and execute with cheap local models
Stateful session persistence across reboots — pick up exactly where you left off
Exact-match diffs fail loud instead of silently corrupting files

− Cons

Relatively new project — smaller community, fewer third-party integrations compared to Claude Code or Copilot
Custom license with commercial restrictions may deter enterprise adoption without legal review
CLI-only interface — no GUI, web dashboard, or visual workflow builder
Subagent orchestration adds latency for simple single-file edits that a monolithic agent would handle faster

Pricing

Free (Builders)

Use freely to write code for any project, including commercial ones. Your output is yours.

Commercial License

Custom

Required if wrapping Late's orchestration engine into a paid service or deploying as enterprise infrastructure

Overview

Most AI coding agents share a fatal design flaw: they do everything in one context window. Planning, implementation, retries, self-healing, error recovery — it all piles into the same conversation history until the model can't tell signal from noise. You blame the model. The model is fine.

Late takes a different approach. It is a zero-dependency, single-static-binary AI coding agent written in Go that splits the workflow into two layers: a lean orchestrator that handles planning and coordination, and ephemeral subagents that execute individual tasks in isolated contexts. Each subagent gets one job, a fresh context, and is destroyed when done. The orchestrator only ever sees plans and outcomes, never the implementation mess.

The result is that the same model feels sharper in Late because it reasons from a clean signal. Late manages its KV cache ruthlessly, keeps its system prompt around 1,000 tokens, and runs comfortably on 5GB VRAM for local inference. No Node.js, no Python, no runtime dependencies — just a single Go binary you drop in your PATH.

How It Works

Late's architecture rests on one insight: separation of concerns between planning and execution. When you run `late` in a project directory, the orchestrator reads your codebase, forms a plan, and spawns subagents one at a time. Each subagent receives a self-contained task with all the context it needs and nothing more.

The orchestrator's system prompt is deliberately lean at roughly 1,000 tokens. It does not accumulate edit history, retry logs, or implementation details. Its context grows only from your instructions and its own planning decisions. Everything the subagent did to produce a result — the failed attempts, the search/replace iterations, the intermediate reasoning — is destroyed with the subagent.
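The split between a lean orchestrator and throwaway subagents can be sketched in a few lines of Go. This is an illustration of the pattern only, not Late's source: `Task`, `Outcome`, `Orchestrator`, and `runSubagent` are hypothetical names, and the model calls are replaced by a stub loop.

```go
package main

import (
	"fmt"
	"strings"
)

// Task is a self-contained unit of work handed to one subagent (hypothetical type).
type Task struct {
	ID          int
	Instruction string
}

// Outcome is the only thing that crosses back to the orchestrator.
type Outcome struct {
	TaskID  int
	Success bool
	Summary string
}

// runSubagent simulates an ephemeral subagent. Its scratch context is a local
// slice, so it becomes unreachable as soon as the function returns.
func runSubagent(t Task) Outcome {
	scratch := []string{"system: you have exactly one job", "task: " + t.Instruction}
	// Stand-in for model calls, failed attempts, and retries.
	for attempt := 1; attempt <= 3; attempt++ {
		scratch = append(scratch, fmt.Sprintf("attempt %d: intermediate reasoning", attempt))
	}
	return Outcome{
		TaskID:  t.ID,
		Success: true,
		Summary: fmt.Sprintf("done after %d scratch messages", len(scratch)),
	}
}

// Orchestrator keeps only the plan and the outcomes, never subagent scratch.
type Orchestrator struct {
	History []string
}

// Run spawns one subagent per task, one at a time, and records only results.
func (o *Orchestrator) Run(plan []Task) []Outcome {
	outcomes := make([]Outcome, 0, len(plan))
	for _, t := range plan {
		o.History = append(o.History, "plan: "+t.Instruction)
		out := runSubagent(t)
		o.History = append(o.History, "outcome: "+out.Summary)
		outcomes = append(outcomes, out)
	}
	return outcomes
}

func main() {
	o := &Orchestrator{}
	o.Run([]Task{
		{ID: 1, Instruction: "add input validation to parser.go"},
		{ID: 2, Instruction: "write tests for parser.go"},
	})
	fmt.Println(strings.Join(o.History, "\n"))
}
```

The point of the sketch is the shape of `Orchestrator.History`: it grows by two short lines per task no matter how many attempts the subagent burned through.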

Subagents communicate back to the orchestrator through a strict protocol: exact-match diffs using search/replace blocks with autonomous self-healing. If a diff cannot be applied cleanly, the failure is reported back rather than silently patched. The orchestrator can then retry with a fresh subagent, armed with better context from the failure report.
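A fail-loud search/replace step might look like the following. This is a minimal sketch under one stated assumption: an edit is accepted only when the search block occurs exactly once in the file. Whether Late uses that precise rule is not documented here, `ApplyExact` and both error values are hypothetical names, and the self-healing step is not modeled.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrNoMatch and ErrAmbiguous are hypothetical error values for this sketch.
var (
	ErrNoMatch   = errors.New("search block not found")
	ErrAmbiguous = errors.New("search block matches more than once")
)

// ApplyExact replaces search with replace only when search occurs exactly once
// in content. In every other case it returns an error and the content unchanged.
func ApplyExact(content, search, replace string) (string, error) {
	if search == "" {
		return content, ErrNoMatch
	}
	switch n := strings.Count(content, search); {
	case n == 0:
		return content, ErrNoMatch
	case n > 1:
		return content, fmt.Errorf("%w (%d occurrences)", ErrAmbiguous, n)
	}
	return strings.Replace(content, search, replace, 1), nil
}

func main() {
	src := "func add(a, b int) int {\n\treturn a - b\n}\n"

	fixed, err := ApplyExact(src, "return a - b", "return a + b")
	fmt.Println("applied:", err == nil && strings.Contains(fixed, "a + b"))

	// Whitespace drift in the search block is a hard failure, not a fuzzy match.
	_, err = ApplyExact(src, "return a  -  b", "return a + b")
	fmt.Println("rejected:", errors.Is(err, ErrNoMatch))
}
```

Returning the untouched content alongside the error means a caller can never write a half-applied edit to disk by accident.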

This architecture enables hybrid model routing. You can architect the plan with a large reasoning model like DeepSeek V4, then spawn subagents that execute using cheaper, faster local models like Gemma 4. The orchestrator handles the expensive reasoning; subagents handle the grunt work.
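Routing by role reduces to a small lookup. The sketch below uses the two variable names that appear on this page, `LATE_SUBAGENT_MODEL` and `OPENAI_MODEL`. The fallback from one to the other is an assumption made for illustration, and `Role` and `ModelFor` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
)

// Role distinguishes the planner from the workers (hypothetical type).
type Role int

const (
	RoleOrchestrator Role = iota
	RoleSubagent
)

// ModelFor picks the model name for a role. getenv is injected so the logic
// can be exercised without touching the real environment.
// Assumption: subagents fall back to OPENAI_MODEL when LATE_SUBAGENT_MODEL is unset.
func ModelFor(role Role, getenv func(string) string) string {
	if role == RoleSubagent {
		if m := getenv("LATE_SUBAGENT_MODEL"); m != "" {
			return m
		}
	}
	return getenv("OPENAI_MODEL")
}

func main() {
	fmt.Println("orchestrator:", ModelFor(RoleOrchestrator, os.Getenv))
	fmt.Println("subagent:    ", ModelFor(RoleSubagent, os.Getenv))
}
```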

Key Features

Hybrid Model Routing — Decouple your brain from your brawn. Use DeepSeek V4 or Claude for architecture decisions, Gemma 4 or Qwen for the actual implementation. Set it via LATE_SUBAGENT_MODEL or config.json and move on.
Ephemeral Subagents — Each subagent starts with a clean room: only its assigned task, nothing else. On completion, the entire context is torched. The orchestrator sees only the diff outcome.
Exact-Match Diffs — Strict search/replace blocks with autonomous self-healing on mismatch. Edits fail loud — no silent corruption, no fuzzy patching, no surprises.
Human-in-the-Loop — Read-only commands (file reads, directory listings) skip approval for velocity. Mutations hard-stop at [y/N]. Session, project, and global trust scopes with TTL decay so you are not spammed every time.
Stateful Resilience — Continuous session history written to disk. Close your terminal, reboot, pick up exactly where you left off. The orchestrator restores its full state, no questions asked.
MCP Integration — Native Model Context Protocol server support via stdio. Map external tools directly into Late's agent workflow. If your tool speaks MCP, Late speaks MCP.
Agent Skills — Drop-in reusable instruction and script sets in YAML frontmatter format. Zero configuration. They just work.
Git Worktree Support — Run independent, parallel agent instances across multiple branches. No context bleeding between them.
Gemma 4 Thinking Mode — Standard API wrappers cannot trigger Gemma's reasoning tokens. Late includes a dedicated flag that injects the exact tokens required. You get Gemma's thinking mode, not a half-baked approximation.
Zero-Dependency Static Binary — Single Go binary. No Node.js, no Python, no package manager, no runtime. Download, drop in PATH, run.
Any OpenAI-Compatible API — Claude, DeepSeek, Qwen, Gemma, OpenRouter, local llama.cpp servers. Set OPENAI_BASE_URL, OPENAI_API_KEY, OPENAI_MODEL. You are running.
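The Human-in-the-Loop behavior above (read-only commands pass, mutations stop at a prompt, approvals expire) can be modeled as a small approval cache. This is a sketch, not Late's implementation: only the three scope names come from the feature list, while `TrustStore`, `Grant`, `NeedsPrompt`, and the ten-minute TTL are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Scope mirrors the three trust scopes named above; the rest is hypothetical.
type Scope string

const (
	ScopeSession Scope = "session"
	ScopeProject Scope = "project"
	ScopeGlobal  Scope = "global"
)

type grantKey struct {
	scope   Scope
	command string
}

// TrustStore remembers approvals until they expire.
type TrustStore struct {
	expires map[grantKey]time.Time
}

func NewTrustStore() *TrustStore {
	return &TrustStore{expires: make(map[grantKey]time.Time)}
}

// Grant records a user approval for command in scope, valid for ttl from now.
func (s *TrustStore) Grant(scope Scope, command string, ttl time.Duration, now time.Time) {
	s.expires[grantKey{scope, command}] = now.Add(ttl)
}

// NeedsPrompt reports whether the [y/N] prompt must be shown. Read-only
// commands never prompt; mutations prompt unless an unexpired grant exists
// in any scope.
func (s *TrustStore) NeedsPrompt(command string, readOnly bool, now time.Time) bool {
	if readOnly {
		return false
	}
	for _, scope := range []Scope{ScopeSession, ScopeProject, ScopeGlobal} {
		if exp, ok := s.expires[grantKey{scope, command}]; ok && now.Before(exp) {
			return false
		}
	}
	return true
}

// demoStart is a fixed clock origin so the example is deterministic.
var demoStart = time.Date(2026, time.January, 1, 12, 0, 0, 0, time.UTC)

func main() {
	store := NewTrustStore()
	fmt.Println("read-only ls:", store.NeedsPrompt("ls", true, demoStart))
	fmt.Println("first mutation:", store.NeedsPrompt("go mod tidy", false, demoStart))

	store.Grant(ScopeSession, "go mod tidy", 10*time.Minute, demoStart)
	fmt.Println("within TTL:", store.NeedsPrompt("go mod tidy", false, demoStart.Add(5*time.Minute)))
	fmt.Println("after TTL:", store.NeedsPrompt("go mod tidy", false, demoStart.Add(11*time.Minute)))
}
```

Passing `now` explicitly keeps the expiry logic deterministic and easy to test; a real tool would more likely read the clock internally.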

Use Cases

Local-first development is Late's sweet spot. If you have a machine with 5GB or more of VRAM and llama.cpp running, Late works out of the box with zero configuration: no cloud API keys, no OAuth flows, no subscription. It suits developers who want AI assistance without shipping code to third-party servers.
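For a sense of what "OpenAI-compatible" means on the wire, here is a minimal sketch of the chat request any such client sends, driven by the three environment variables named on this page. It is not Late's code: `NewChatRequest` is a hypothetical helper, and the sketch assumes the base URL already carries the version prefix (for example `http://localhost:8080/v1` for a llama.cpp server on its default port).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

// NewChatRequest builds a POST to <baseURL>/chat/completions.
// Assumption: baseURL already includes the version prefix, e.g. ".../v1".
func NewChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	url := strings.TrimRight(baseURL, "/") + "/chat/completions"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	// Local servers often need no key, so the header is only set when one is given.
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	req, err := NewChatRequest(
		os.Getenv("OPENAI_BASE_URL"), os.Getenv("OPENAI_API_KEY"),
		os.Getenv("OPENAI_MODEL"), "Summarize main.go",
	)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	// Only print the target; sending is left to http.DefaultClient.Do(req).
	fmt.Println(req.Method, req.URL.String())
}
```

Switching between a local server and a cloud provider then changes only the environment, not the code path.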

Hybrid cloud/local workflows let you use a powerful cloud model like DeepSeek V4 or Claude for architecture planning while executing edits with a local model. You get the reasoning quality of a large model with the latency and privacy of local execution for the implementation work.

Resource-constrained environments benefit from Late's tiny footprint. The binary is a few megabytes, has no runtime dependencies, and the orchestrator itself consumes negligible context. This makes it viable on low-spec machines, CI runners, and edge devices where a full Node.js-based agent would be impractical.

Parallel development across branches is supported through Git worktree integration. Run multiple Late instances on different branches simultaneously without context pollution.
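Setting up one directory per branch is plain Git. The sketch below only builds the `git worktree add` commands and prints them. The `../worktrees` layout, the branch names, and the `WorktreeCmd` helper are assumptions for illustration, and how Late is then launched in each directory is left out.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// WorktreeCmd returns the git command that checks out branch into its own
// directory under root. It does not run the command.
func WorktreeCmd(root, branch string) *exec.Cmd {
	dir := filepath.Join(root, branch)
	return exec.Command("git", "worktree", "add", dir, branch)
}

func main() {
	// Hypothetical branch names; each gets an isolated working directory.
	for _, branch := range []string{"feature-auth", "fix-parser"} {
		cmd := WorktreeCmd("../worktrees", branch)
		fmt.Println(cmd.Args)
	}
}
```

Because each worktree has its own working directory, an agent instance started in one cannot see uncommitted changes in another.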

Pricing / Licensing

Late is free for builders. Use it to write code for any project, including commercial ones. The output you produce with Late is yours.

The license includes a commercial restriction: you may not monetize Late itself by wrapping the orchestration engine into a paid service or deploying it as enterprise infrastructure without a separate commercial agreement.

The license converts to GPLv2 on February 21, 2030. Full terms are in the LICENSE file in the repository.

Pros & Cons

Pros

Ephemeral subagents keep the orchestrator context lean and free of implementation noise
Single static binary with zero runtime dependencies — install in seconds
Runs on modest hardware (5GB VRAM) for fully local operation
Hybrid model routing optimizes cost and quality
Stateful persistence survives reboots and terminal closures
Exact-match diffs fail loud instead of silently corrupting files

Cons

Young project with a small community and fewer third-party integrations
Custom license requires legal review for commercial deployment at scale
CLI-only with no web UI or visual workflow builder
Subagent orchestration overhead means simple single-file edits may be slower than a monolithic agent

Version History

v1.3.0 Jun 8, 2026

Upgraded TUI theme, double-click viewport copy, keyboard help overlay (Ctrl+H), terminal background color leak fix

v1.2.7 Jun 6, 2026

MCP tool namespace fix, raised MCP truncation limit to 32,768 chars, UTF-8 slicing fix

v1.2.6 May 25, 2026

Sqz tool integration, force-enable image support, universal installation script, API path prefix fix

v1.2.5 May 16, 2026

Queued message injection, chat keybinding fixes, YAML frontmatter 1MB limit for skill parsing

v1.2.0 Apr 28, 2026

AST-based bash command analyzer, scoped lingering approval for tool calls

v1.1.0 Apr 22, 2026

Windows support, subagent model selection for hybrid routing

v1.0.0 Apr 12, 2026

Initial release

Signature Snippet

The user types `late` in a project directory. The orchestrator reads the codebase, builds a plan, spawns an ephemeral subagent to implement each file change, then collects the results, all without polluting the main context window.

More in this Space

Vix

Open source

Vix is a Go-native, open-source (AGPL-3.0) AI coding agent that slashes token costs by 40-50% using a stem agent architecture and Tree-sitter virtual filesystem. It rethinks the plan/execute loop — keeping LLM cache warm across Explore, Plan, and Execute phases — while shipping Programmable Workflows, Whiteboard Mode with voice AI, MCP server support, and a self-evolving agent that writes its own scheduled jobs and watchers.

Paca

Open source

Paca is a free, open-source, self-hosted Scrum board where AI agents work as equal teammates — assigned to sprints, picking up tasks, and collaborating on BDD specs alongside humans. Built as an alternative to Jira and Linear, it treats AI agents as first-class Scrum members.

Nanobot

Open source

Nanobot is an ultra-lightweight, open-source (MIT) personal AI agent that ships with WebUI, multi-channel chat (Telegram, Discord, WeChat, Slack, Feishu, email), MCP support, memory, model routing with fallbacks, cron automation, and a plugin skill system — all pip-installable in seconds. Built on a deliberately small and readable Python core, it lets you truly own your AI agent stack.
