SWE-agent

The original ACI-based autonomous coding agent — now superseded by mini-SWE-agent.

Princeton NLP / Stanford / open-source community · Open source · Since

SWE-agent is an open-source autonomous AI software engineering agent from Princeton NLP and Stanford that introduced the Agent-Computer Interface (ACI) concept. It enables language models to autonomously fix GitHub issues, solve cybersecurity CTF challenges, and perform custom coding tasks through a structured command interface, all within isolated Docker sandboxes. Now superseded by mini-SWE-agent for most practical use cases.

Cons

  • Research-oriented — less polished UX than commercial tools like Devin, Cursor, or Claude Code
  • No GUI or IDE integration — CLI only, Docker required for sandboxed execution
  • Requires manual setup of LLM API keys, Docker, and Python environment
  • Superseded by mini-SWE-agent for most practical use cases — the official recommendation is to use the smaller, simpler successor

Pricing

Free (OSS)

Free

Self-hosted open-source agent under the MIT license. Requires your own LLM API keys.

Introduction

SWE-agent is the research project that defined how autonomous AI agents interact with codebases. Published by Princeton NLP and Stanford researchers in April 2024 and accepted at NeurIPS 2024, it introduced the Agent-Computer Interface (ACI): a purpose-built set of commands (open file, scroll, search, edit, run shell commands, run tests) that gives language models a structured way to navigate and modify a software repository.

Rather than dumping raw file contents into a context window, ACI gives the model a set of verbs to explore and modify code step by step, dramatically improving success rates on software engineering tasks.
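To make the idea of verbs concrete, here is a minimal sketch of a windowed file viewer with open and scroll actions, written as POSIX shell functions. The original SWE-agent shipped many of its commands as shell functions, but this is not that code: the names `aci_open` and `aci_scroll_down`, the 10-line window, and the output format are illustrative assumptions. What matters is that the model sees a small numbered window plus its position in the file, never the whole file at once.

```shell
# Hypothetical ACI-style file viewer: the model sees a numbered window of a
# file, never the whole file. Window size and output format are assumptions.
WINDOW=10
CURRENT_FILE=""
CURRENT_LINE=1

# Print a window roughly centered on CURRENT_LINE, clamped to the file.
_aci_print_window() {
  local total start end
  total=$(awk 'END { print NR }' "$CURRENT_FILE")
  start=$((CURRENT_LINE - WINDOW / 2))
  if [ "$start" -lt 1 ]; then start=1; fi
  end=$((start + WINDOW - 1))
  if [ "$end" -gt "$total" ]; then end=$total; fi
  echo "[File: $CURRENT_FILE ($total lines total)]"
  awk -v s="$start" -v e="$end" 'NR >= s && NR <= e { printf "%d:%s\n", NR, $0 }' "$CURRENT_FILE"
}

# open FILE [LINE]: make FILE current and show the window around LINE.
aci_open() {
  if [ ! -f "$1" ]; then
    echo "File $1 not found"
    return 1
  fi
  CURRENT_FILE=$1
  CURRENT_LINE=${2:-1}
  _aci_print_window
}

# scroll_down: move forward by one full window, stopping at the last line.
aci_scroll_down() {
  local total
  if [ -z "$CURRENT_FILE" ]; then
    echo "No file open"
    return 1
  fi
  total=$(awk 'END { print NR }' "$CURRENT_FILE")
  CURRENT_LINE=$((CURRENT_LINE + WINDOW))
  if [ "$CURRENT_LINE" -gt "$total" ]; then CURRENT_LINE=$total; fi
  _aci_print_window
}
```

Because the viewer keeps `CURRENT_FILE` and `CURRENT_LINE` between calls, a follow-up action like scrolling needs no arguments, which keeps each step the model emits short.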

SWE-agent was one of the original reference implementations for the SWE-bench benchmark and achieved state-of-the-art results among open-source systems. The project has since spawned an ecosystem: mini-SWE-agent (the official successor), SWE-ReX (parallel cloud execution), SWE-smith (training trajectory generation), and EnIGMA (offensive cybersecurity). As of 2025, the project recommends mini-SWE-agent, which matches SWE-agent's performance in a dramatically simpler package. SWE-agent remains the canonical full-featured reference implementation — the foundation the ecosystem was built on.

Key Features

Agent-Computer Interface (ACI). A set of structured file and shell commands built for LLM interaction. Instead of raw file dumps, the model gets verbs like open, scroll, edit, search, and submit that mirror how a human developer works. Numerous subsequent autonomous coding agents have adopted and adapted the ACI design.
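The edit verb can be sketched in the same style: replace a line range with new text read from standard input, then echo the edited region back so the model immediately sees the effect of its action. The name `aci_edit` and its argument order are illustrative assumptions; SWE-agent's own edit commands differ (the original version, for instance, ran a linter and rejected edits that introduced syntax errors).

```shell
# Hypothetical ACI-style edit: aci_edit FILE START END < replacement_text
# Replaces lines START..END (inclusive) with stdin, then shows the new lines.
# Assumes 1 <= START <= END <= number of lines in FILE.
aci_edit() {
  local file="$1" start="$2" end="$3" repl out n
  repl=$(mktemp)
  out=$(mktemp)
  cat > "$repl"                          # capture the replacement text
  n=$(awk 'END { print NR }' "$repl")
  # Keep lines before START, insert the replacement at START, skip to END.
  awk -v s="$start" -v e="$end" -v r="$repl" '
    NR == s { while ((getline line < r) > 0) print line }
    NR < s || NR > e { print }
  ' "$file" > "$out"
  cat "$out" > "$file"                   # overwrite in place, keeping permissions
  rm -f "$out" "$repl"
  # Echo the edited region with line numbers as feedback for the model.
  awk -v s="$start" -v n="$n" 'NR >= s && NR < s + n { printf "%d:%s\n", NR, $0 }' "$file"
}
```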

SWE-bench SOTA Performance. SWE-agent has repeatedly set the state of the art among open-source systems on SWE-bench. Version 1.0 (February 2025) achieved state-of-the-art results on SWE-bench Verified, Lite, and Full using Claude 3.7 Sonnet. The SWE-agent-LM-32b model, trained via the SWE-smith project, holds the open-weights SOTA on SWE-bench Verified.

Docker Sandboxing. Every task runs in an isolated Docker container, which keeps the agent from affecting the host system and gives each run a clean, reproducible environment.
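A minimal sketch of the sandboxing idea: run each command in a throwaway container that sees only the repository checkout. The `sandbox_run` name, the `python:3.11-slim` image, and the `DRY_RUN` switch are illustrative assumptions; SWE-agent manages its containers through SWE-ReX rather than a wrapper like this.

```shell
# Hypothetical wrapper: sandbox_run REPO_DIR CMD [ARGS...]
# Runs CMD in a fresh container that is deleted afterwards (--rm), has no
# network (--network none), and sees only REPO_DIR, mounted at /repo.
# Set DRY_RUN=1 to print the docker command instead of running it.
sandbox_run() {
  local repo_dir="$1"
  shift
  set -- docker run --rm --network none \
    -v "$repo_dir:/repo" -w /repo \
    python:3.11-slim "$@"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

With Docker available, `sandbox_run "$PWD" ls -la` would list the mounted checkout from inside a container that disappears as soon as the command exits. `--network none` is stricter than many agent runs can tolerate (installing dependencies needs network access), so treat it as a tunable choice.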

Configurable LLM Backend. Supports any model available through litellm, including Claude, GPT-4o, Gemini, DeepSeek, and local open-weight models. Settings live in a YAML configuration file, and any of them can be overridden from the command line.
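As a sketch of the YAML-plus-overrides pattern, the snippet below writes a small config file. The key names (`agent.model.name`, `per_instance_cost_limit`, `temperature`) reflect our understanding of SWE-agent 1.x and should be checked against the configuration docs; `my_config.yaml` is a hypothetical file name, and `gpt-4o` is just one example of a litellm model identifier.

```shell
# Write a hypothetical SWE-agent config file. Key names are assumptions to
# verify against the SWE-agent configuration documentation.
cat > my_config.yaml <<'EOF'
agent:
  model:
    name: gpt-4o                  # any litellm model identifier
    per_instance_cost_limit: 2.0  # stop a single run after about $2 of API spend
    temperature: 0.0
EOF
```

The file could then be passed as `sweagent run --config my_config.yaml --agent.model.name=claude-sonnet-4-20250514` plus the repository and issue flags from the Getting Started example; the command-line model name would take precedence over the one in the file.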

SWE-ReX Parallel Cloud Execution. Massively parallel code execution on Modal, AWS, or other remote backends. SWE-ReX decouples agent reasoning from code execution, so hundreds or thousands of SWE-agent tasks can run concurrently in the cloud.
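The core pattern behind running many tasks concurrently, fanning independent instances out to a bounded pool of workers, can be sketched locally with `xargs -P` (supported by GNU and BSD xargs, though not required by POSIX). This is a generic stand-in, not SWE-ReX's API: `run_parallel` and the per-instance log layout are hypothetical, and the worker body is a placeholder where one agent run per instance would go.

```shell
# Hypothetical fan-out: read instance IDs on stdin, run up to JOBS at once.
# Each worker writes its own log file, so outputs never interleave.
run_parallel() {
  local jobs="$1" outdir="$2"
  mkdir -p "$outdir"
  # xargs appends one ID per worker; inside sh -c, $1 is OUTDIR and $2 the ID.
  xargs -n 1 -P "$jobs" sh -c '
    outdir=$1 id=$2
    # Placeholder for the real per-instance work, e.g. one agent run.
    echo "processed $id" > "$outdir/$id.log"
  ' sh "$outdir"
}
```

Because every instance is independent, throughput scales with the number of workers; remote execution backends exist to push that number far beyond what one machine can host.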

Tool Bundles. Flexible, configurable tool definitions for composing custom sets of commands for different tasks. Bundles can be mixed, matched, and overridden without modifying the core agent code.
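One simple way to compose command bundles, assuming each bundle is a directory whose `bin/` holds one executable per command, is to stack those directories on `PATH`. This sketches the composition idea only, not SWE-agent's bundle loader; `compose_bundles` is hypothetical, and the rule that later bundles win is our assumption.

```shell
# Hypothetical: compose_bundles DIR... puts each bundle's bin/ on PATH.
# Later arguments are prepended last, so they shadow earlier bundles that
# define a command with the same name (an assumed override rule).
compose_bundles() {
  local bundle
  for bundle in "$@"; do
    if [ -d "$bundle/bin" ]; then
      PATH="$bundle/bin:$PATH"
    else
      echo "skipping $bundle: no bin/ directory" >&2
    fi
  done
  export PATH
}
```

For example, `compose_bundles tools/base tools/custom_edit` (hypothetical paths) would let a custom `edit` command shadow the base one without modifying either bundle.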

Interactive Agent Tools. Tools such as gdb run as interactive sessions during an agent run, so the model can step through code execution and inspect program state.
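What makes such tools interactive is that one long-lived process keeps its state between the model's actions. The sketch below shows that mechanism with two named pipes and a background shell. A plain `sh` stands in for gdb, and `session_start`, `session_send`, `session_stop`, and the `__DONE__` end-of-output marker are all hypothetical, not SWE-agent's implementation.

```shell
# Hypothetical interactive session: one background process (here a plain sh,
# standing in for gdb) keeps its state across separate commands.
session_start() {
  SESSION_DIR=$(mktemp -d)
  mkfifo "$SESSION_DIR/in" "$SESSION_DIR/out"
  sh < "$SESSION_DIR/in" > "$SESSION_DIR/out" 2>&1 &
  SESSION_PID=$!
  # fd 3 writes commands into the session; fd 4 reads its output.
  exec 3> "$SESSION_DIR/in" 4< "$SESSION_DIR/out"
}

# Send one command, then read output until the marker that signals "done".
session_send() {
  local line
  printf '%s\necho __DONE__\n' "$1" >&3
  while IFS= read -r line <&4; do
    if [ "$line" = __DONE__ ]; then break; fi
    printf '%s\n' "$line"
  done
}

session_stop() {
  echo exit >&3
  exec 3>&- 4<&-
  wait "$SESSION_PID"
  rm -rf "$SESSION_DIR"
}
```

A variable set by one `session_send` call is still visible to the next, the same way breakpoints and program state persist across successive gdb commands.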

Summarizer. Handles long command outputs by summarizing them before they are fed back into the context window, which prevents context overflow during lengthy debugging sessions.
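The summarizer's job, keeping an oversized tool output from flooding the context, can be approximated without a model by keeping the head and tail of the output and noting how much was cut. This is a deliberately simpler, swapped-in technique (truncation rather than summarization); `summarize_output` and its default 20-line budget are hypothetical.

```shell
# Hypothetical stand-in for a summarizer: summarize_output [MAX_LINES] < output
# Short output passes through unchanged; long output keeps its first and last
# lines plus a note saying how many lines were omitted.
summarize_output() {
  local max="${1:-20}" tmp total
  tmp=$(mktemp)
  cat > "$tmp"
  total=$(awk 'END { print NR }' "$tmp")
  if [ "$total" -le "$max" ]; then
    cat "$tmp"
  else
    head -n "$((max / 2))" "$tmp"
    echo "[... $((total - max)) lines omitted ...]"
    tail -n "$((max - max / 2))" "$tmp"
  fi
  rm -f "$tmp"
}
```

Truncation can drop the one line that mattered in the middle of a long output; that is exactly the trade-off a model-based summarizer tries to avoid.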

Trajectory Inspector. A command-line tool for browsing, filtering, and analyzing hundreds of agent trajectories. Essential for research: it exposes every model action, command output, and decision point across multiple runs.

EnIGMA Cybersecurity Mode. A mode for offensive cybersecurity capture-the-flag (CTF) challenges. Achieves a 3.3x improvement over previous agents on the NYU CTF benchmark dataset, with dedicated tools for exploit development and vulnerability analysis.

Ecosystem

The SWE-agent project now encompasses several projects that extend the original vision:

mini-SWE-agent. The official successor project. Achieves 65% on SWE-bench Verified in approximately 100 lines of Python, matching SWE-agent's performance while being dramatically simpler. The SWE-agent README now directs users to mini-SWE-agent for most practical use cases.
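The loop at the heart of a minimal agent like this is short enough to sketch in a few lines of shell: ask the model for a command, run it, record the output, and repeat until the model submits. In this sketch the model is a hypothetical stub that replays scripted actions, and `SUBMIT` is a made-up stop signal. Each action runs in a fresh `sh` process, so only files persist between steps; mini-SWE-agent is described as executing every action independently in a similar way, but this is not its code.

```shell
# Hypothetical model stub: returns the Nth scripted action. A real agent
# would send the history to an LLM and parse a command from its reply.
SCRIPTED_ACTIONS='ls
echo fixed > status.txt
cat status.txt
SUBMIT'
query_model() {
  printf '%s\n' "$SCRIPTED_ACTIONS" | sed -n "${1}p"
}

# run_agent WORKDIR HISTORY_FILE: the act-observe loop.
run_agent() {
  local workdir="$1" history="$2" step=1 action output
  : > "$history"
  while :; do
    action=$(query_model "$step")
    step=$((step + 1))
    # Stop on the submit signal, or if the stub has no more actions.
    if [ "$action" = SUBMIT ] || [ -z "$action" ]; then
      break
    fi
    # Fresh process per action: no shell state carries over between steps.
    output=$(cd "$workdir" && sh -c "$action" 2>&1) || true
    printf '$ %s\n%s\n' "$action" "$output" >> "$history"
  done
}
```

Because each action starts a fresh process, a command like `cd src` has no effect on the next step; the agent has to chain commands (`cd src && pytest`) instead.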

SWE-smith. A training trajectory generator that produces tens of thousands of high-quality trajectories for fine-tuning language models on software engineering tasks. The resulting SWE-agent-LM-32b model achieves the open-weights SOTA on SWE-bench Verified.

SWE-ReX. A massively parallel code execution framework that runs SWE-agent tasks in the cloud. Supports Modal, AWS, and custom execution backends for large-scale benchmarking and evaluation.

EnIGMA (Enhanced Interactive Generative Model Agent). An offensive cybersecurity variant that achieves state-of-the-art results on multiple CTF benchmarks. Introduced interactive agent tools and the summarizer, which were subsequently merged into the main SWE-agent codebase.

SWE-bench. While not part of SWE-agent itself, the SWE-bench benchmark is deeply intertwined with the project. SWE-agent served as the original reference implementation and continues to be a primary evaluation target.

Benchmark Performance

SWE-agent has consistently demonstrated state-of-the-art performance on the SWE-bench family of benchmarks:

  • SWE-bench Verified. SOTA among open-source agents (v1.0 with Claude 3.7 Sonnet).
  • SWE-bench Full. SOTA on the complete set of 2,000+ GitHub issues (v1.0.1, February 2025).
  • SWE-bench Lite. SOTA among open-source solutions on the lighter evaluation subset.
  • SWE-agent-LM-32b. Open-weights SOTA on SWE-bench Verified, trained using SWE-smith-generated trajectories.
  • NYU CTF Benchmark. EnIGMA mode achieves a 3.3x improvement over previous state-of-the-art agents on offensive cybersecurity challenges.

The core results are published in peer-reviewed venues, including NeurIPS 2024; the later v1.x SWE-bench results were reported via the SWE-bench leaderboard.

Who Is It For

SWE-agent is for researchers, academics, and developers who want full visibility into how an autonomous coding agent operates. It works well for studying agent-computer interaction, benchmarking new models against established baselines, and building custom agent systems on top of well-documented principles.

The project is less suited for day-to-day production use where polished UX, IDE integration, or minimal setup overhead is required. For those use cases, the official recommendation is mini-SWE-agent, or commercial alternatives like Devin, Cursor, or Claude Code.

Specific use cases include:

  • Academic research into autonomous software engineering and agent-computer interaction.
  • Benchmarking new language models on SWE-bench using an established, well-documented evaluation pipeline.
  • Building custom agent systems by extending or modifying the ACI tool set.
  • Offensive cybersecurity research using the EnIGMA mode.
  • Teaching and education: the trajectory inspector and full action logging make SWE-agent an excellent teaching tool for understanding how LLM-based agents work.

Getting Started

SWE-agent requires Python 3.11+, Docker, and an LLM API key. Installation is via pip:

pip install sweagent
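Before the first run, a quick preflight check for these three prerequisites can save a confusing failure later. The sketch below is hypothetical; it looks for `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` only as examples, so substitute whichever key variable your model provider uses.

```shell
# Hypothetical preflight check: reports what is missing and returns nonzero.
python_at_least() {  # python_at_least MAJOR MINOR
  python3 -c "import sys; sys.exit(0 if sys.version_info >= ($1, $2) else 1)"
}

preflight() {
  local rc=0
  if ! command -v python3 > /dev/null 2>&1 || ! python_at_least 3 11; then
    echo "missing: Python 3.11 or newer" >&2; rc=1
  fi
  if ! command -v docker > /dev/null 2>&1 || ! docker info > /dev/null 2>&1; then
    echo "missing: Docker (installed, with the daemon running)" >&2; rc=1
  fi
  if [ -z "${ANTHROPIC_API_KEY:-}${OPENAI_API_KEY:-}" ]; then
    echo "missing: an LLM API key in the environment" >&2; rc=1
  fi
  return "$rc"
}
```

Running `preflight && sweagent run ...` then starts the agent only once everything is in place.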

The recommended workflow for fixing a GitHub issue:

sweagent run --agent.model.name=claude-sonnet-4-20250514 --env.repo.github_url=https://github.com/example/repo --problem_statement.github_url=https://github.com/example/repo/issues/42

SWE-agent will clone the repository into a Docker container, analyze the issue, explore the codebase using its ACI tools, develop and test a fix, and output a patch file. A full trajectory log is saved for inspection.
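Once a run finishes, the patch can be applied to a local checkout for review. A minimal sketch, assuming only that the patch is in `git diff` format; `apply_agent_patch` is a hypothetical helper, and where the patch file ends up depends on your run's output directory.

```shell
# Hypothetical helper: apply_agent_patch REPO_DIR PATCH_FILE
# Dry-runs the patch first, so a stale or conflicting patch fails cleanly
# instead of leaving the checkout half-modified.
apply_agent_patch() {
  local repo="$1" patch="$2"
  if ! git -C "$repo" apply --check "$patch"; then
    echo "patch does not apply cleanly to $repo" >&2
    return 1
  fi
  git -C "$repo" apply "$patch"
}
```

Pass an absolute path for `PATCH_FILE`, since `git -C` changes into the repository before reading it.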

For batch evaluation on SWE-bench:

sweagent run-batch \
  --agent.model.name claude-sonnet-4-20250514 \
  --instances.type swe_bench \
  --instances.subset verified

For cloud-based parallel execution, SWE-ReX can be configured to run tasks on Modal or AWS, scaling from a single issue to thousands of evaluations.

Detailed documentation, configuration guides, and migration notes from earlier versions are available in the SWE-agent documentation.

Version History

v1.1.0

Tens of thousands of training trajectories via SWE-smith; multilingual and multimodal SWE-bench support

v1.0.1

SOTA on SWE-bench Full with Claude 3.7 Sonnet

v1.0.0

SWE-agent 1.0: new CLI, SWE-ReX, tool bundles, trajectory inspector
