    Late — High-Leverage AI Agent Orchestration

    Orchestrate an entire AI dev team on 5GB VRAM

    mlhher Closed source Since

    Orchestrate an entire AI dev team on 5GB VRAM using ephemeral subagents, exact-match diffs, and a zero-dependency Go binary. Works with any OpenAI-compatible model — local or cloud.

    + Pros

    • Ephemeral subagents prevent context pollution — each task gets a fresh isolated context that is destroyed on completion
    • Zero-dependency single Go binary — no Node.js, Python, or runtime environment required
    • Runs on as little as 5GB VRAM for local inference with llama.cpp
    • Hybrid model routing lets you architect with large reasoning models and execute with cheap local models
    • Stateful session persistence across reboots — pick up exactly where you left off
    • Exact-match diffs fail loud instead of silently corrupting files

    Cons

    • Relatively new project — smaller community, fewer third-party integrations compared to Claude Code or Copilot
    • Custom license with commercial restrictions may deter enterprise adoption without legal review
    • CLI-only interface — no GUI, web dashboard, or visual workflow builder
    • Subagent orchestration adds latency for simple single-file edits that a monolithic agent would handle faster

    Pricing

    Free (Builders)

    $0

    Use freely to write code for any project, including commercial ones. Your output is yours.

    Commercial License

    Custom

    Required if wrapping Late's orchestration engine into a paid service or deploying as enterprise infrastructure

    Overview

    Most AI coding agents share a fatal design flaw: they do everything in one context window. Planning, implementation, retries, self-healing, error recovery — it all piles into the same conversation history until the model can't tell signal from noise. You blame the model. The model is fine.

    Late takes a different approach. It is a zero-dependency, single-static-binary AI coding agent written in Go that splits the workflow into two layers: a lean orchestrator that handles planning and coordination, and ephemeral subagents that execute individual tasks in isolated contexts. Each subagent gets one job, a fresh context, and is destroyed when done. The orchestrator only ever sees plans and outcomes, never the implementation mess.

    The result is that the same model feels sharper in Late because it reasons from a clean signal. Late manages its KV cache ruthlessly, keeps its system prompt around 1,000 tokens, and runs comfortably on 5GB VRAM for local inference. No Node.js, no Python, no runtime dependencies — just a single Go binary you drop in your PATH.

    How It Works

    Late's architecture rests on one insight: separation of concerns between planning and execution. When you run late in a project directory, the orchestrator reads your codebase, forms a plan, and spawns subagents one at a time. Each subagent receives a self-contained task with all the context it needs — and nothing more.

    The orchestrator's system prompt is deliberately lean at roughly 1,000 tokens. It does not accumulate edit history, retry logs, or implementation details. Its context grows only from your instructions and its own planning decisions. Everything the subagent did to produce a result — the failed attempts, the search/replace iterations, the intermediate reasoning — is destroyed with the subagent.
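
    The isolation described above can be sketched in a few lines of Go. This is an illustrative model only, not Late's actual code: the names `Outcome`, `subagentContext`, `runSubagent`, and `Orchestrator` are hypothetical, and the "work" is a plain callback standing in for a model call. The point it shows is that scratch state lives only inside the subagent call, while the orchestrator's history records nothing but plans and outcomes.

```go
package main

import (
	"fmt"
	"strings"
)

// Outcome is the only thing the orchestrator keeps from a subagent run.
type Outcome struct {
	Task    string
	Success bool
	Summary string
}

// subagentContext is hypothetical scratch state: failed attempts, logs,
// intermediate reasoning. It is never returned to the caller.
type subagentContext struct {
	task    string
	scratch []string
}

// runSubagent builds a fresh context for one task, runs the work, and
// returns only the outcome. The scratch state goes out of scope on return.
func runSubagent(task string, work func(log func(string)) error) Outcome {
	ctx := &subagentContext{task: task}
	err := work(func(msg string) { ctx.scratch = append(ctx.scratch, msg) })
	if err != nil {
		return Outcome{Task: task, Success: false, Summary: err.Error()}
	}
	summary := fmt.Sprintf("done after %d steps", len(ctx.scratch))
	return Outcome{Task: task, Success: true, Summary: summary}
}

// Orchestrator accumulates plans and outcomes, never subagent scratch.
type Orchestrator struct {
	History []string
}

// Dispatch records the plan, runs one ephemeral subagent, records the outcome.
func (o *Orchestrator) Dispatch(task string, work func(log func(string)) error) Outcome {
	o.History = append(o.History, "plan: "+task)
	out := runSubagent(task, work)
	o.History = append(o.History, "outcome: "+out.Summary)
	return out
}

func main() {
	o := &Orchestrator{}
	o.Dispatch("rename helper in util.go", func(log func(string)) error {
		log("attempt 1: search block not found")
		log("attempt 2: applied")
		return nil
	})
	fmt.Println(strings.Join(o.History, "\n"))
}
```

    However many attempts the subagent logs, the orchestrator's history grows by exactly two entries per task in this sketch.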

    Subagents communicate back to the orchestrator through a strict protocol: exact-match diffs using search/replace blocks with autonomous self-healing. If a diff cannot be applied cleanly, the failure is reported back rather than silently patched. The orchestrator can then retry with a fresh subagent, armed with better context from the failure report.
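
    A minimal sketch of what "exact-match" can mean in practice follows. It is not taken from Late's source: `ApplyExact`, `ErrNoMatch`, and `ErrAmbiguous` are hypothetical names, and rejecting a search block that matches more than once is an assumption made here for safety, not a documented behavior. The self-healing step is left out; the sketch covers only the fail-loud part.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical error values for the two ways an exact-match edit can fail.
var (
	ErrNoMatch   = errors.New("search block not found")
	ErrAmbiguous = errors.New("search block matches more than once")
)

// ApplyExact replaces search with replace only when search occurs exactly
// once in content. On any failure it returns the content unchanged together
// with an error, so a stale edit can never be applied silently.
func ApplyExact(content, search, replace string) (string, error) {
	if search == "" {
		return content, ErrNoMatch
	}
	switch strings.Count(content, search) {
	case 0:
		return content, ErrNoMatch
	case 1:
		return strings.Replace(content, search, replace, 1), nil
	default:
		return content, ErrAmbiguous
	}
}

func main() {
	src := "func add(a, b int) int {\n\treturn a - b\n}\n"

	fixed, err := ApplyExact(src, "return a - b", "return a + b")
	fmt.Print(fixed)
	fmt.Println("first edit error:", err)

	// A search block that no longer matches the file is reported, not guessed at.
	_, err = ApplyExact(src, "return a * b", "return a + b")
	fmt.Println("stale edit error:", err)
}
```

    Because the original content is returned untouched on failure, the caller can pass the error along as the failure report and leave the file exactly as it was.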

    This architecture enables hybrid model routing. You can architect the plan with a large reasoning model like DeepSeek V4, then spawn subagents that execute using cheaper, faster local models like Gemma 4. The orchestrator handles the expensive reasoning; subagents handle the grunt work.
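
    As a rough illustration of the routing decision, the Go sketch below resolves one model for the orchestrator and one for subagents from the environment variables the listing mentions (`OPENAI_MODEL` and `LATE_SUBAGENT_MODEL`). The `Routing` type and `ResolveRouting` function are hypothetical, and falling back to the orchestrator's model when no subagent model is set is an assumption, not confirmed behavior.

```go
package main

import (
	"fmt"
	"os"
)

// Routing holds the model chosen for each layer. Hypothetical type.
type Routing struct {
	Orchestrator string // plans and coordinates
	Subagent     string // executes individual tasks
}

// ResolveRouting reads model names through getenv so it can be tested
// without touching the real environment.
// Assumption: an unset LATE_SUBAGENT_MODEL falls back to OPENAI_MODEL.
func ResolveRouting(getenv func(string) string) Routing {
	r := Routing{
		Orchestrator: getenv("OPENAI_MODEL"),
		Subagent:     getenv("LATE_SUBAGENT_MODEL"),
	}
	if r.Subagent == "" {
		r.Subagent = r.Orchestrator
	}
	return r
}

func main() {
	r := ResolveRouting(os.Getenv)
	fmt.Printf("orchestrator=%q subagent=%q\n", r.Orchestrator, r.Subagent)
}
```

    Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the decision logic separate from the process environment.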

    Key Features

    • Hybrid Model Routing — Decouple your brain from your brawn. Use DeepSeek V4 or Claude for architecture decisions, Gemma 4 or Qwen for the actual implementation. Set it via LATE_SUBAGENT_MODEL or config.json and move on.
    • Ephemeral Subagents — Each subagent starts with a clean room: only its assigned task, nothing else. On completion, the entire context is torched. The orchestrator sees only the diff outcome.
    • Exact-Match Diffs — Strict search/replace blocks with autonomous self-healing on mismatch. Edits fail loud — no silent corruption, no fuzzy patching, no surprises.
    • Human-in-the-Loop — Read-only commands (file reads, directory listings) skip approval for velocity. Mutations hard-stop at [y/N]. Session, project, and global trust scopes with TTL decay so you are not spammed every time.
    • Stateful Resilience — Continuous session history written to disk. Close your terminal, reboot, pick up exactly where you left off. The orchestrator restores its full state, no questions asked.
    • MCP Integration — Native Model Context Protocol server support via stdio. Map external tools directly into Late's agent workflow. If your tool speaks MCP, Late speaks MCP.
    • Agent Skills — Drop-in reusable instruction and script sets in YAML frontmatter format. Zero configuration. They just work.
    • Git Worktree Support — Run independent, parallel agent instances across multiple branches. No context bleeding between them.
    • Gemma 4 Thinking Mode — Standard API wrappers cannot trigger Gemma's reasoning tokens. Late includes a dedicated flag that injects the exact tokens required. You get Gemma's thinking mode, not a half-baked approximation.
    • Zero-Dependency Static Binary — Single Go binary. No Node.js, no Python, no package manager, no runtime. Download, drop in PATH, run.
    • Any OpenAI-Compatible API — Claude, DeepSeek, Qwen, Gemma, OpenRouter, local llama.cpp servers. Set OPENAI_BASE_URL, OPENAI_API_KEY, OPENAI_MODEL. You are running.
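
    The human-in-the-loop rule in the list above (read-only actions pass, mutations need approval unless a scoped grant is still valid) can be modeled as follows. This is a sketch under stated assumptions, not Late's implementation: `TrustStore`, `Scope`, `Grant`, `NeedsApproval`, and the `write_file` and `read_file` action names are all hypothetical, and "TTL decay" is interpreted here as a simple expiry time per grant.

```go
package main

import (
	"fmt"
	"time"
)

// Scope mirrors the three trust scopes named in the feature list.
type Scope int

const (
	Session Scope = iota
	Project
	Global
)

// TrustStore maps a scope and action to the time its approval expires.
type TrustStore struct {
	expires map[string]time.Time
}

func NewTrustStore() *TrustStore {
	return &TrustStore{expires: map[string]time.Time{}}
}

func key(s Scope, action string) string { return fmt.Sprintf("%d:%s", s, action) }

// Grant records an approval for action in scope s, valid for ttl from now.
func (t *TrustStore) Grant(s Scope, action string, now time.Time, ttl time.Duration) {
	t.expires[key(s, action)] = now.Add(ttl)
}

// NeedsApproval reports whether the user must confirm the action.
// Read-only actions never prompt; mutations prompt unless some scope
// holds a grant that has not yet expired.
func (t *TrustStore) NeedsApproval(action string, readOnly bool, now time.Time) bool {
	if readOnly {
		return false
	}
	for _, s := range []Scope{Session, Project, Global} {
		if exp, ok := t.expires[key(s, action)]; ok && now.Before(exp) {
			return false
		}
	}
	return true
}

// walkthrough returns the prompt decision for a write before a grant,
// during the grant's TTL, after it expires, and for a read-only action.
func walkthrough() [4]bool {
	t0 := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	store := NewTrustStore()
	before := store.NeedsApproval("write_file", false, t0)
	store.Grant(Session, "write_file", t0, 10*time.Minute)
	during := store.NeedsApproval("write_file", false, t0.Add(5*time.Minute))
	after := store.NeedsApproval("write_file", false, t0.Add(11*time.Minute))
	read := store.NeedsApproval("read_file", true, t0)
	return [4]bool{before, during, after, read}
}

func main() {
	fmt.Println(walkthrough())
}
```

    Passing `now` explicitly instead of calling `time.Now()` inside the store keeps expiry behavior deterministic and easy to test.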

    Use Cases

    Local-first development is Late's sweet spot. If you have a machine with 5GB or more of VRAM and llama.cpp running, Late works out of the box with zero configuration: no cloud API keys, no OAuth flows, no subscription. This suits developers who want AI assistance without shipping code to third-party servers.

    Hybrid cloud/local workflows let you use a powerful cloud model like DeepSeek V4 or Claude for architecture planning while executing edits with a local model. You get the reasoning quality of a large model with the latency and privacy of local execution for the implementation work.

    Resource-constrained environments benefit from Late's tiny footprint. The binary is a few megabytes and has no runtime dependencies, and the orchestrator itself consumes negligible context. This makes it viable on low-spec machines, CI runners, and edge devices where a full Node.js-based agent would be impractical.

    Parallel development across branches is supported through Git worktree integration. Run multiple Late instances on different branches simultaneously without context pollution.

    Pricing / Licensing

    Late is free for builders. Use it to write code for any project, including commercial ones. The output you produce with Late is yours.

    The license includes a commercial restriction: you may not monetize Late itself by wrapping the orchestration engine into a paid service or deploying it as enterprise infrastructure without a separate commercial agreement.

    The license converts to GPLv2 on February 21, 2030. Full terms are in the LICENSE file in the repository.

    Pros & Cons

    Pros

    • Ephemeral subagents keep the orchestrator context lean and free of implementation noise
    • Single static binary with zero runtime dependencies — install in seconds
    • Runs on modest hardware (5GB VRAM) for fully local operation
    • Hybrid model routing optimizes cost and quality
    • Stateful persistence survives reboots and terminal closures
    • Exact-match diffs fail loud instead of silently corrupting files

    Cons

    • Young project with a small community and fewer third-party integrations
    • Custom license requires legal review for commercial deployment at scale
    • CLI-only with no web UI or visual workflow builder
    • Subagent orchestration overhead means simple single-file edits may be slower than a monolithic agent

    Version History

    v1.3.0

    Upgraded TUI theme, double-click viewport copy, keyboard help overlay (Ctrl+H), terminal background color leak fix

    v1.2.7

    MCP tool namespace fix, raised MCP truncation limit to 32,768 chars, UTF-8 slicing fix

    v1.2.6

    Sqz tool integration, force-enable image support, universal installation script, API path prefix fix

    v1.2.5

    Queued message injection, chat keybinding fixes, YAML frontmatter 1MB limit for skill parsing

    v1.2.0

    AST-based bash command analyzer, scoped lingering approval for tool calls

    v1.1.0

    Windows support, subagent model selection for hybrid routing

    v1.0.0

    Initial release

    Signature Snippet
    The user types `late` in a project directory. The orchestrator reads the codebase, builds a plan, spawns an ephemeral subagent to implement each file change, then collects the results, all without polluting the main context window.
