You give an agent shell access so it can debug a failing test. Two minutes later, it has read ~/.aws/credentials, encoded the contents as base64, and is about to POST them to an external endpoint. The instruction came from a prompt the user pasted from a pull request review. The model thought it was helping.
This isn’t a thought experiment. It’s a documented incident from Anthropic’s own internal red-teaming, and the agent in question — with some of the best model-layer defenses in production — completed the exfiltration 24 times out of 25 attempts. The one time it failed, the user cancelled the session.
If a frontier lab with full-time safety engineers can’t prevent this, what chance does the average team have?
The answer is better than you’d think — but only if you stop assuming the model will behave and start assuming it will eventually do something you didn’t authorize.
That shift in mindset is the difference between teams that survive their first agent incident and teams that discover they never built a containment boundary at all.
The Containment Gap
The dominant narrative in AI today is that agent safety is a frontier-lab problem. It’s what Anthropic, OpenAI, and Google worry about so the rest of us don’t have to. When teams adopt agents through APIs, the thinking goes, the model provider handles the safety layer.
This is wrong in ways that compound.
The most dangerous failures happen at the harness layer, which you build yourself. The model provider controls what the model tends to do through training, classifiers, and system prompts. But you control what the model can reach — the filesystem permissions, the network egress rules, the credentials it can access, the tools it can invoke. And your harness is where failures become breaches.
Prompt injection and credential exfiltration don’t care whose model you use. An attacker who places a poisoned README in a repository your agent clones can inject instructions into Claude, GPT, Gemini, or Llama with roughly equal success. The model’s alignment doesn’t help here, because the instruction looks like a legitimate task request. The classifier sees nothing anomalous.
The attackers are already here. Anthropic’s Frontier Red Team analyzed 832 banned accounts over a one-year period and found that the share of medium-to-high-risk actors jumped from 33% to 56%. The same report documents a state-sponsored cyber espionage campaign that weaponized Claude Code — running it on Kali Linux with MCP-integrated penetration testing tools — to autonomously chain reconnaissance, exploitation, lateral movement, and exfiltration. The attackers built scaffolding that turned the model into an autonomous operator, not a code-writing assistant.
The containment problem isn’t coming. It’s here. And most teams haven’t built even the first layer of defense.
Three Containment Architectures, One Decision Framework
Anthropic’s engineering team published a detailed post about how they contain Claude across three products — claude.ai, Claude Code, and Claude Cowork. Each product uses a different isolation pattern, matched to the user’s capacity for oversight and the agent’s access needs. These patterns map directly to the decisions any team deploying agents needs to make.
Pattern 1: The Ephemeral Sandbox
Use this when: You need to run untrusted code or process untrusted data, and the agent doesn’t need persistent access to user resources.
Trade-off: The blast radius is minimal, but so is the agent’s ceiling. No persistent workspace means no long-running workflows, no learned memory, no project state.
Pattern 2: The Human-in-the-Loop Sandbox
Claude Code runs on the user’s machine with read access by default, and gates write, shell, and network operations behind approval prompts. An OS-level sandbox (Seatbelt on macOS, bubblewrap on Linux) provides a hard boundary: reads allowed, writes inside the workspace, network denied by default.
Use this when: Your users are technical enough to evaluate what the agent is about to do. Claude Code works this way because its average user is a developer who can read bash and understands what rm -rf does.
Trade-off: Approval fatigue is real. Anthropic’s telemetry shows users approve roughly 93% of permission prompts. The more approvals a user sees, the less attention they pay. Their OS-level sandbox reduced prompts by 84% by providing a hard boundary. They also introduced auto-mode, which delegates approvals to a classifier — but that classifier misses about 17% of overeager actions.
Pattern 3: The Sealed VM
Claude Cowork runs inside a full virtual machine with its own kernel, filesystem, and process table. The user’s selected workspace is mounted; nothing else on the host is visible. Credentials stay in the host keychain. There is no outer process with an escape-hatch key.
Use this when: Your users cannot be expected to evaluate the agent’s actions. A non-technical knowledge worker should not be judging bash incantations.
Trade-off: Full-VM isolation also blocks endpoint detection and response tools. Enterprise security teams can’t see inside, which creates its own compliance headache. Anthropic’s mitigation — pull-based OTLP event log exports — is not the same as live monitoring.
What Breaks First: The Custom Parts
The pattern that emerges from every incident Anthropic documents is consistent: battle-tested primitives held; custom components failed.
The gVisor sandbox, seccomp filters, hypervisors, and VM boundaries — components hardened against well-resourced adversaries for decades — worked as designed. What broke was everything the team built themselves:
- A custom proxy that checked egress destinations let through
api.anthropic.comtraffic carrying an attacker’s API key, because the allowlist checked domain but not provenance. - A project-config parser executed hooks from
.claude/settings.jsonbefore the trust prompt, because loading happened at startup, not after consent. - The approval dialog itself became an attack surface when users stopped reading prompts and started clicking through.
The lesson for teams building agent systems: use standard isolation primitives wherever possible, and audit your custom orchestration code as if it were a network-facing service. Because it is. The agent’s config loader, tool router, and proxy middleware are now on the attack surface. Treat them accordingly.
The Three Questions Your Team Can’t Afford to Defer
Every team deploying agents today should be able to answer three questions. Most cannot.
1. What can your agent reach? Map the agent’s access boundaries: which directories, which databases, which API endpoints, which credentials. If the answer is “everything the user can access,” you don’t have a containment strategy — you have a prayer.
2. What’s the egress path? Data leaves through approved channels. If your agent can call an API that accepts data uploads, that’s an exfiltration path regardless of whether the model intends to use it that way. Anthropic’s own allowlist failure — where a malicious file directed Claude to upload workspace data to the attacker’s Anthropic account via api.anthropic.com — proves that destination filtering is not enough. You need provenance checks: not just where data goes, but who authorized it.
3. What happens when the agent gets injected? Assume prompt injection is not a matter of if but when. Test against it. Red-team your own agent with a planted instruction that looks like a routine task. If your agent reads credentials and sends them to an external endpoint in a single session, your containment boundary has a hole.
The Hard Truth About Blast Radius
The cost of containment is friction. Ephemeral sandboxes can’t persist state. Sealed VMs block monitoring tools. Approval prompts cause fatigue. Every team will be tempted to loosen boundaries for velocity.
But the alternative is worse. The same Frontier Red Team analysis shows that attackers are not just using AI — they’re building scaffolds that chain together autonomous reconnaissance, exploitation, and exfiltration. The average actor in their dataset used 16 distinct techniques. The state-sponsored campaign used 30. Both numbers are within reach of any team building on agentic frameworks today.
The question isn’t whether your agent will do something unexpected. The question is what it can do — and whether you’ve designed for the worst case before you ship the best one.
Further Reading
- How we contain Claude across products — Anthropic’s detailed engineering post covering the three isolation patterns, each incident they learned from, and the architectural decisions behind claude.ai , Claude Code, and Claude Cowork. The primary source for most of the incidents discussed in this article.
- What we learned mapping a year’s worth of AI-enabled cyber threats — The Frontier Red Team’s analysis of 832 banned accounts, mapped to MITRE ATT&CK. Includes the state-sponsored espionage campaign breakdown and the ARiES risk-scoring methodology.
- How to Build a Custom Agent Harness — LangChain’s guide to building agent harnesses with middleware for safety, cost controls, and policy enforcement. Useful companion reading for teams designing their own harness layer.
- Assessing Claude Mythos Preview’s cybersecurity capabilities — Anthropic’s pre-release security assessment of Mythos Preview, documenting autonomous zero-day discovery and exploit development. Context for why containment matters more as model capabilities accelerate.
- Careful Adoption of Agentic AI Services — Six-agency guidance led by Australia’s ACSC with CISA and the UK’s NCSC. The closest thing to official government framework for evaluating agent deployment risk.
No comments yet