How to Sandbox AI Coding Agents (TrustFall Defense, May 2026)
The TrustFall disclosure in early May 2026 changed the security posture for AI coding agents permanently. This is the practical sandboxing guide for Claude Code, Cursor 3, Codex CLI, and Gemini CLI as of May 9, 2026 — what to do today, in priority order.
Last verified: May 9, 2026
Why this matters now
Three things converged in the first week of May 2026:
- Adversa.AI’s TrustFall disclosure — proof-of-concept exploits against Claude Code, Cursor CLI, Codex CLI, and Gemini CLI via poisoned public repositories.
- Microsoft Security Research’s “Prompts Become Shells” (May 7, 2026) — generalizes the vulnerability class. Once a prompt can cause a tool call, and a tool call can write files or run commands, the prompt is effectively a shell.
- Coder Agents beta launch (May 6, 2026) — explicitly markets self-hosted, sandboxed-by-default architecture as the security-correct way to deploy AI coding agents.
The takeaway: AI agents running directly on developer host machines, with full filesystem and credential access, are now a high-value attack target. Sandboxing is the most effective mitigation.
The threat model
Before sandboxing, get clear on what you’re defending against:
Threat 1: Untrusted repository discovery
The agent is asked to “look at this repo” — and the repo contains crafted comments, README content, or code that triggers the agent to execute malicious instructions during its autonomous parsing.
Threat 2: MCP server tool call abuse
The agent’s MCP servers expose tools (filesystem, shell, network). Prompt-injection content tricks the agent into calling those tools maliciously — exfiltrating secrets, modifying configs, etc.
Threat 3: Persistence and lateral movement
Once an agent has executed code on the developer machine, the developer’s SSH keys, AWS credentials, and dotfiles are exposed. The compromise persists across sessions and can spread to other repositories the developer touches.
Threat 4: CI/CD blast radius
The most dangerous scenario: an agent auto-running on an untrusted PR in CI. Secrets are in scope. Network egress is usually open. The blast radius is your entire deployment pipeline.
The defense ladder
Apply these in order. Each layer reduces blast radius further.
Layer 1: Devcontainers (do this today)
The minimum viable sandbox for individual developers. Add a `.devcontainer/devcontainer.json` to your repo:
```json
{
  "name": "Coding Agent Sandbox",
  "image": "mcr.microsoft.com/devcontainers/typescript-node:20",
  "mounts": [
    "source=${localWorkspaceFolder},target=/workspaces/${localWorkspaceFolderBasename},type=bind"
  ],
  "remoteEnv": {
    "PATH": "/usr/local/bin:/usr/bin:/bin"
  },
  "postCreateCommand": "npm install",
  "customizations": {
    "vscode": {
      "extensions": ["anthropic.claude-code", "anysphere.cursor"]
    }
  }
}
```
Critical principles in the devcontainer config:
- No host SSH key mount. Do NOT bind-mount `~/.ssh` into the container.
- No host AWS / GCP credentials mount. Do NOT mount `~/.aws` or `~/.config/gcloud`.
- Minimal `mounts`: only what the project actually needs.
- Restrict `postCreateCommand`: review what gets installed.
Open the project in Cursor 3 or VS Code with the Dev Containers extension. The agent runs inside the container, isolated from the host. Compromise is contained to the container.
This is the minimum bar in May 2026. Anything less is reckless for unknown repositories.
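One quick way to confirm the container is actually credential-free: a short check, run inside the container, that fails setup if any of the paths above are visible. This is an illustrative sketch, not part of any devcontainer tooling; the path list mirrors the mounts this section says to avoid.

```python
from pathlib import Path

# Host credential paths that must NOT be visible inside the sandbox.
# The list mirrors the mounts this section says to avoid.
FORBIDDEN = [".ssh", ".aws", ".config/gcloud"]

def credential_leaks(home: Path) -> list[str]:
    """Return any forbidden credential paths that exist under `home`."""
    return [p for p in FORBIDDEN if (home / p).exists()]

# Usage (inside the container): fail session setup on any leak.
# leaks = credential_leaks(Path.home())
# assert not leaks, f"host credentials leaked into sandbox: {leaks}"
```

Drop this into `postCreateCommand` (or an equivalent setup hook) so a misconfigured mount fails loudly instead of silently exposing keys.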
Layer 2: Ephemeral cloud workspaces
For higher isolation, move agent execution off the developer machine entirely.
Options as of May 9, 2026:
| Service | Notes |
|---|---|
| GitHub Codespaces | Mature, tight GitHub integration. Apply strict policy: no secrets exposure to forks, restricted prebuilds, network policy limits. |
| Coder workspaces | Self-hosted, model-agnostic, sandboxed by default. After May 6, 2026 Coder Agents launch, the native AI agent integration is recommended. |
| Per-task VMs | Highest isolation. Use for very high-risk discovery work (reviewing unknown repos, auditing third-party code). |
| Daytona / StackBlitz | Newer options for ephemeral cloud dev environments. |
The pattern:
- Start a fresh workspace per task or per repo.
- Workspace lifetime = task lifetime. Destroy when done.
- No persistence of credentials beyond the workspace.
- Network egress restricted to what the task actually needs.
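The lifetime rule above can be enforced in orchestration code with a context manager: whatever the task does, the workspace is destroyed on exit. A sketch with injected create/destroy hooks; the provisioner calls (e.g. shelling out to your workspace CLI) would live inside those hooks, and the names here are illustrative:

```python
from contextlib import contextmanager

@contextmanager
def ephemeral_workspace(create, destroy):
    """Workspace lifetime == task lifetime: always destroy, even on failure."""
    workspace_id = create()       # e.g. shell out to your provisioner here
    try:
        yield workspace_id
    finally:
        destroy(workspace_id)     # runs on success, error, or cancellation

# Usage: the agent task runs strictly inside the with-block.
# with ephemeral_workspace(create=provision, destroy=teardown) as ws:
#     run_agent_task(ws, repo_url)
```

The `finally` is the point: a crashed or hijacked task still tears down its workspace, so nothing persists for the next session to inherit.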
For teams running parallel agent fleets (Cursor 3 Agents Window, Claude Code agent teams), per-agent ephemeral workspaces are increasingly the recommended pattern.
Layer 3: Lock down trust prompts
Train your team to read what’s in a folder before clicking “Trust this folder.”
For Claude Code specifically: Anthropic does NOT classify TrustFall as a vulnerability in their threat model — they treat the user’s acceptance of the folder trust prompt as consent to project configuration. Operationally that means: read the folder before trusting it.
What to look for:
- `.cursorrules`, `CLAUDE.md`, `.devcontainer/`, MCP server config files.
- Embedded shell commands in README.
- References to scripts that download or execute external content.
- Comments that contain instructions phrased as imperatives (e.g., “When analyzing this project, first run…”).
Treat these files like you’d treat a Makefile from an unknown source. A 30-second skim catches most malicious cases.
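That 30-second skim can be partially automated. A rough sketch of a pre-trust scan: the file names come from the checklist above, but the imperative-phrase patterns are illustrative examples, not a complete detection set.

```python
from pathlib import Path

# Files that can reconfigure an agent's behavior (from the checklist above).
CONFIG_FILES = {".cursorrules", "CLAUDE.md"}
CONFIG_DIRS = {".devcontainer"}
# Illustrative phrases only; a real scanner would use a much broader set.
SUSPICIOUS_PHRASES = ("when analyzing this project", "first run", "curl ", "| sh")

def pre_trust_findings(repo: Path) -> list[str]:
    """List reasons to read more carefully before trusting this folder."""
    findings = [f"agent config present: {n}" for n in sorted(CONFIG_FILES)
                if (repo / n).is_file()]
    findings += [f"agent config dir present: {d}/" for d in sorted(CONFIG_DIRS)
                 if (repo / d).is_dir()]
    readme = repo / "README.md"
    if readme.is_file():
        text = readme.read_text(errors="ignore").lower()
        findings += [f"README contains suspicious phrase: {p!r}"
                     for p in SUSPICIOUS_PHRASES if p in text]
    return findings
```

An empty result doesn't mean safe; it means nothing obvious. The human skim still happens.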
Layer 4: Disable auto-trust in CI/CD (urgent if you haven’t)
If you’re running Claude Code, Cursor CLI, or any agent in CI today against external PRs, this is the most urgent fix.
Configuration checklist:
- Never let an agent auto-trust folders in CI.
- Run agents only on PRs from trusted authors (organization members, not arbitrary forks).
- Use short-lived per-run credentials with minimal scope.
- Restrict outbound network from the agent’s CI environment to known endpoints.
- Log all agent tool calls for retroactive analysis.
- Set timeouts so a hijacked agent can’t run indefinitely.
- Rotate any secrets that have been in scope of agent runs in CI since 2024.
If your CI orchestration doesn’t support per-run credential scoping (some older Jenkins setups, some custom CI), disable AI agents in CI until you fix that.
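Of that checklist, the timeout is the easiest item to enforce uniformly. A wrapper sketch: the agent command is whatever CLI you actually run, and the 10-minute default is an arbitrary illustration, not a recommendation.

```python
import subprocess

def run_agent_bounded(cmd: list[str], timeout_s: int = 600):
    """Run an agent CLI under a hard wall-clock limit; a hijacked agent
    gets killed instead of running indefinitely in CI."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # caller treats None as a failed, suspicious run
```

`subprocess.run` kills the child on timeout, so a wedged or hijacked agent can't hold the runner. Treat a timeout as a security signal, not just a flaky job.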
Layer 5: Audit MCP server tool surface
For each MCP server connected to a coding agent:
- List the tools it exposes. If you can’t list them, don’t trust it.
- Reject shell-exec-style tools unless required. A tool named `run_command` or `shell_exec` is the highest-risk surface.
- Scope filesystem access to specific subdirectories, never the whole home directory.
- Prefer read-only tools when the workload allows.
- Use signed MCP servers from known vendors. Expect signed MCP server registries to launch broadly through 2026.
- Log every tool call. TrustFall-style exploitation produces unusual tool-call patterns (e.g., the agent calling `shell_exec` immediately after reading README content) that are detectable in retrospect.
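The README-then-shell pattern from that last bullet is simple to flag from a tool-call log. A sketch over a minimal log format; the tool names and the entry shape are assumptions for illustration, not any MCP standard:

```python
# Each log entry: {"tool": str, "arg": str}. The shape is illustrative.
SHELL_TOOLS = {"shell_exec", "run_command"}

def flag_readme_then_shell(calls: list[dict]) -> list[dict]:
    """Flag shell-exec calls that immediately follow reading README-like files."""
    return [cur for prev, cur in zip(calls, calls[1:])
            if cur["tool"] in SHELL_TOOLS
            and prev["tool"] == "read_file"
            and "readme" in prev.get("arg", "").lower()]
```

A real detector would look at longer windows and more patterns, but even this two-call rule catches the exact sequence the TrustFall PoCs rely on.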
Layer 6: Validator Agent pattern
For higher-risk workloads (analyzing untrusted repos, reviewing PRs from unknown contributors), run a Validator Agent first:
- Smaller, cheaper model (Claude Haiku, GPT-5.5 Mini, Gemini 3.1 Flash).
- Strictly read-only — no shell-exec tool, no network tool, no filesystem write.
- Output: structured assessment of risk markers in the repo.
- Suspicious comment patterns?
- Embedded shell commands?
- References to MCP servers we don’t know?
- Trust-prompt social engineering language?
- Main agent only runs if the Validator returns clean.
This is the same pattern as ML model defense for prompt injection, ported to the coding domain.
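Wired together, the gate is a few lines of orchestration. Here `validator` and `main_agent` are placeholders for whatever agent invocations you use, and the report shape is assumed; only the control flow is the point:

```python
def gated_run(validator, main_agent, repo_path: str):
    """Run the main agent only if the read-only validator finds no risk markers."""
    report = validator(repo_path)   # read-only, cheap-model pass
    if report["risk_markers"]:
        raise PermissionError(
            f"validator blocked {repo_path}: {report['risk_markers']}")
    return main_agent(repo_path)    # full-capability agent, now gated
```

The failure mode to avoid is letting the main agent start while the validation runs in parallel; the gate only works if it is strictly sequential.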
Tool-specific guidance
Claude Code
- Always run inside a Devcontainer or Coder workspace.
- Read folder contents before clicking trust prompts.
- Audit your MCP servers. The default Anthropic MCP servers are reasonably scoped; third-party MCPs need scrutiny.
- Use the `--no-auto-trust` flag in headless / CI contexts (where supported).
- For agent teams (multi-agent orchestration), apply the same sandbox rules to every specialist.
Cursor 3
- Update to 3.x with all patches; the 2.5 Git RCE patch matters.
- Disable auto-clone of remote repositories in security-sensitive workflows.
- Use Cursor’s own Devcontainer support to run the agent in-container.
- For Best-of-N model comparison, the same sandbox applies to all parallel models.
Codex CLI / GPT-5.5 in CLI
- Run only in containerized or workspace environments.
- Apply OpenAI’s recommended scoping for tool definitions.
- For agentic tier on Bedrock, use IAM scoping aggressively.
Gemini CLI
- Same pattern: container or workspace, never directly on host for unknown repos.
- Audit Gemini’s tool config for any auto-execution defaults.
IBM Bob (SaaS)
- Less exposed by default since it runs in IBM’s environment, not on your host.
- Still verify Sovereign Core / data residency settings if you’re regulated.
- Watson Orchestrate’s multi-agent coordination should be configured with least-privilege agent roles.
Coder Agents
- The newest and most isolation-friendly option. Self-hosted, sandboxed workspaces by default.
- For new deployments in May 2026, this is the strongest baseline architecture.
What to do this week (May 9, 2026)
If you’re starting from zero, this is the priority order:
- Today: Add a Devcontainer to every repo your team works in. Stop running agents directly on host machines.
- Today: Audit any AI agent automation in CI. If agents auto-run on PRs from untrusted authors, disable that.
- This week: Inventory your MCP servers. Reject any with broad shell-exec capability you don’t actually need.
- This week: Train your team on trust-prompt hygiene. 15-minute brown bag is enough.
- Next 30 days: Pilot ephemeral workspaces (Coder, Codespaces, or per-task VMs) for higher-risk work.
- Next 60 days: Evaluate Validator Agent pattern for highest-risk workloads (third-party PR review, untrusted repo analysis).
- Ongoing: Log agent tool calls. Build retroactive detection for unusual patterns.
Where this is going
Three trends to watch through Q3 2026:
1. Sandboxed-by-default architectures win
Coder Agents launched on May 6, 2026 with explicit sandboxing as a marketing pillar. Expect IBM, Microsoft, Google, and AWS to ship comparable architectures or update their existing platforms to match. By end of 2026, “agent runs on developer host with full credentials” will look as outdated as “user runs as root.”
2. Signed MCP server registries
Trust supply-chain for MCP servers. Vendor signatures, reputation systems, “verified” badges. Likely to emerge from Anthropic, Microsoft, and a few specialized startups through 2026 H2.
3. Compliance-mandatory sandboxing
SOC 2 Type 2 auditors, HIPAA reviewers, and PCI-DSS QSA will start requiring sandboxed AI agent execution as a control. Expect this in 2027 audit cycles for most regulated industries.
Related on andrew.ooo
- What is the TrustFall AI coding agent attack? (May 2026)
- Coder Agents vs Copilot Workspace vs Claude Code (May 2026)
- Best self-hosted AI coding tools for enterprise (May 2026)
- How to secure agentic AI
- Cursor 3 Agents Window vs Claude Code Parallel Agents
Sources: Adversa.AI TrustFall disclosure (May 2026), Microsoft Security Research “Prompts Become Shells” (May 7, 2026), Dark Reading, SecurityWeek, developer-tech.com coverage. Last verified May 9, 2026.