Defense in Depth: Securing AI Agents

The Problem

An unattended AI agent with unrestricted access to your machine is an unattended shell with unrestricted access to your machine.

This is not a theoretical concern. AI coding agents execute shell commands, write files, make network requests, and modify project configuration. When running autonomously (overnight, in a loop, without a human watching) the attack surface is the full capability set of the operating system user account.

The risk is not that the AI is malicious. The risk is that the AI is controllable: it follows instructions from context, and context can be poisoned.

Threat Model

How Agents Get Compromised

AI agents follow instructions from multiple sources: system prompts, project files, conversation history, and tool outputs. An attacker who can inject content into any of these sources can redirect the agent's behavior.

| Vector | How it works |
| --- | --- |
| Prompt injection via dependencies | A malicious package includes instructions in its README, changelog, or error output. The agent reads these during installation or debugging and follows them. |
| Prompt injection via fetched content | The agent fetches a URL (documentation, API response, Stack Overflow answer) containing embedded instructions. |
| Poisoned project files | A contributor adds adversarial instructions to CLAUDE.md, .cursorrules, or .context/ files. The agent loads these at session start. |
| Self-modification between iterations | In an autonomous loop, the agent modifies its own configuration files. The next iteration loads the modified config with no human review. |
| Tool output injection | A command's output (error messages, log lines, file contents) contains instructions the agent interprets and follows. |

What a Compromised Agent Can Do

The impact depends entirely on the permissions and access the agent has:

| Access level | Potential impact |
| --- | --- |
| Unrestricted shell | Execute any command, install software, modify system files |
| Network access | Exfiltrate source code, credentials, or context files to external servers |
| Docker socket | Escape container isolation by spawning privileged sibling containers |
| SSH keys | Pivot to other machines, push to remote repositories, access production systems |
| Write access to own config | Disable its own guardrails for the next iteration |

The Defense Layers

No single layer is sufficient. Each layer catches what the others miss.

Layer 1: Soft instructions     (CONSTITUTION.md, playbook)
Layer 2: Application controls  (permission allowlist, tool restrictions)
Layer 3: OS-level isolation    (user accounts, filesystem, containers)
Layer 4: Network controls      (firewall rules, airgap)
Layer 5: Infrastructure        (VM isolation, resource limits)

Layer 1: Soft Instructions (Probabilistic)

Markdown files like CONSTITUTION.md and the Agent Playbook tell the agent what to do and what not to do. These are probabilistic: the agent usually follows them, but there is no enforcement mechanism.

What it catches: Most common mistakes. An agent that has been told "never delete production data" will usually not delete production data.

What it misses: Prompt injection. A sufficiently crafted injection can override soft instructions. Long context windows dilute attention on rules stated early. Edge cases where the instructions are ambiguous also slip through.

Verdict: Necessary but not sufficient. Good for the common case. Do not rely on it for security boundaries.

Layer 2: Application Controls (Deterministic at Runtime, Mutable Across Iterations)

AI tool runtimes (Claude Code, Cursor, etc.) provide permission systems: tool allowlists, command restrictions, confirmation prompts.

For Claude Code, an explicit allowlist in .claude/settings.local.json:

{
  "permissions": {
    "allow": [
      "Bash(make:*)",
      "Bash(go:*)",
      "Bash(git:*)",
      "Bash(ctx:*)",
      "Read",
      "Write",
      "Edit"
    ]
  }
}

What it catches: The agent cannot directly run commands outside the allowlist. If rm, curl, sudo, or docker are not listed, the agent cannot invoke them by name regardless of what any prompt says.

What it misses: The agent can modify the allowlist itself. In an autonomous loop, the agent writes to .claude/settings.local.json, and the next iteration loads the modified config. The application enforces the rules, but the application reads the rules from files the agent can write. Allowed tools can also act as proxies: with Write and Bash(make:*) both permitted, the agent can write a Makefile target that runs any command.

Verdict: Strong first layer. Must be combined with self-modification prevention (Layer 3).
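
Until the config files are protected at the OS level, self-modification between iterations can at least be detected. The following is a minimal sketch, assuming a sha256sum that accepts -c (GNU coreutils or BusyBox); the function names and the manifest file are hypothetical and not part of Claude Code or ctx. Detection is weaker than prevention, so this complements Layer 3 rather than replacing it.

```shell
# Hypothetical helpers: fingerprint guardrail files and detect changes.
# Assumes sha256sum is on PATH.

# snapshot_guardrails MANIFEST FILE...
# Records a SHA-256 checksum for each listed file in MANIFEST.
snapshot_guardrails() {
    sg_manifest=$1
    shift
    sha256sum "$@" > "$sg_manifest"
}

# verify_guardrails MANIFEST
# Succeeds only if every file named in MANIFEST still matches its checksum.
# A modified or deleted file makes it fail.
verify_guardrails() {
    sha256sum -c "$1" >/dev/null 2>&1
}
```

A loop script could call snapshot_guardrails once before the first iteration and verify_guardrails before each later one, stopping on failure. The manifest has to live somewhere the agent cannot write, otherwise it can be rewritten along with the files it describes.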

Layer 3: OS-Level Isolation (Deterministic and Unbypassable)

The operating system enforces access controls that no application-level trick can override. An unprivileged user cannot read a file whose permissions deny it access, such as a root-owned file with mode 0600. A process without CAP_NET_RAW cannot open raw sockets. These are kernel boundaries.

| Control | Purpose |
| --- | --- |
| Dedicated user account | No sudo, no privileged group membership (docker, wheel, adm). The agent cannot escalate privileges. |
| Filesystem permissions | Project directory writable; everything else read-only or inaccessible. Agent cannot reach other projects, home directories, or system config. |
| Immutable config files | CLAUDE.md, .claude/settings.local.json, .claude/hooks/, and .context/CONSTITUTION.md owned by a different user or marked immutable (chattr +i on Linux, applied with -R to directories so the files inside them are covered too). The agent cannot modify its own guardrails. |

What it catches: Privilege escalation, self-modification, lateral movement to other projects or users.

What it misses: Actions within the agent's legitimate scope. If the agent has write access to source code (which it needs to do its job), it can introduce vulnerabilities in the code itself.

Verdict: Essential. This is the layer that makes the other layers trustworthy.

OS-level isolation does not make the agent safe; it makes the other layers meaningful.
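
The "dedicated user account" control can be checked mechanically before a run. The following is a minimal sketch: the function name is hypothetical, and the group list is an assumption that extends the table's examples (docker, wheel, adm) with sudo and root.

```shell
# has_privileged_group GROUPS
# GROUPS is a space-separated list of group names, for example the output
# of `id -Gn <user>`. Succeeds (exit 0) if any group grants elevated access.
# The list of group names below is illustrative, not exhaustive.
has_privileged_group() {
    for hpg_group in $1; do
        case $hpg_group in
            sudo|wheel|docker|adm|root) return 0 ;;
        esac
    done
    return 1
}
```

Assuming the agent's account is named agent (a placeholder), a launcher script could refuse to start with: has_privileged_group "$(id -Gn agent)" && exit 1.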

Layer 4: Network Controls

An agent that cannot reach the internet cannot exfiltrate data. It also cannot ingest new instructions mid-loop from external documents, API responses, or hostile content.

| Scenario | Recommended control |
| --- | --- |
| Agent does not need the internet | --network=none (container) or outbound firewall drop-all |
| Agent needs to fetch dependencies | Allow specific registries (registry.npmjs.org, proxy.golang.org, pypi.org plus files.pythonhosted.org) via firewall rules. Block everything else. |
| Agent needs API access | Allow specific API endpoints only. Use an HTTP proxy with allowlisting. |

What it catches: Data exfiltration, phone-home payloads, downloading additional tools, and instruction injection via fetched content.

What it misses: Nothing network-borne, if the agent genuinely does not need the network; injection through content already on disk is outside this layer's reach. The tradeoff is that many real workloads need dependency resolution, so a full airgap requires pre-populated caches.
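
The decision an allowlisting proxy makes for each request can be sketched as a host match. This illustrates the logic only; it is not a firewall or a proxy. The function name and the policy (exact host, or a subdomain of an allowed entry) are assumptions, and a stricter setup might accept exact matches only.

```shell
# host_allowed HOST ALLOWED...
# Succeeds if HOST equals an allowed entry or is a subdomain of one.
# Look-alike names such as "evilpypi.org" do not match "pypi.org" because
# the subdomain pattern requires a dot before the allowed entry.
host_allowed() {
    ha_host=$1
    shift
    for ha_entry in "$@"; do
        case $ha_host in
            "$ha_entry"|*."$ha_entry") return 0 ;;
        esac
    done
    return 1
}
```

For example, host_allowed proxy.golang.org proxy.golang.org pypi.org succeeds, while host_allowed pypi.org.evil.example pypi.org fails.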

Layer 5: Infrastructure Isolation

The strongest boundary is a separate machine, or something that behaves like one.

The moment you stop arguing about prompts and start arguing about kernels, you are finally doing security.

Containers (Docker, Podman):

docker run --rm \
  --network=none \
  --cap-drop=ALL \
  --memory=4g \
  --cpus=2 \
  -v /path/to/project:/workspace \
  -w /workspace \
  your-dev-image \
  ./loop.sh

Docker Socket is sudo Access

Critical: never mount the Docker socket (/var/run/docker.sock).

An agent with socket access can spawn sibling containers with full host access, effectively escaping the sandbox.

Use rootless Docker or Podman to eliminate this escalation path.
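
A launcher script can refuse obviously unsafe container arguments before they reach Docker. The following is a rough sketch with a hypothetical function name; it only recognizes the --flag=value spelling used in the command above, so a flag written as two separate words (for example --network host) would slip past it.

```shell
# audit_docker_args ARG...
# Inspects a `docker run` argument list and prints one line per problem.
# Succeeds (exit 0) only if no problem was found.
audit_docker_args() {
    ada_status=0
    ada_capdrop=no
    for ada_arg in "$@"; do
        case $ada_arg in
            *docker.sock*)
                echo "mounts the Docker socket: $ada_arg"
                ada_status=1 ;;
            --privileged|--privileged=true)
                echo "runs privileged"
                ada_status=1 ;;
            --network=host|--net=host)
                echo "shares the host network namespace"
                ada_status=1 ;;
            --cap-drop=ALL)
                ada_capdrop=yes ;;
        esac
    done
    if [ "$ada_capdrop" = no ]; then
        echo "does not drop all capabilities"
        ada_status=1
    fi
    return "$ada_status"
}
```

Passing the arguments of the docker run command shown earlier produces no output and a zero exit status; adding -v /var/run/docker.sock:/var/run/docker.sock makes the check fail.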

Virtual machines: The strongest isolation. The guest kernel has no visibility into the host OS. Configure the VM with no shared folders, no filesystem passthrough, and no SSH keys to other machines.

Resource limits: CPU, memory, and disk quotas prevent a runaway agent from consuming all resources. Use ulimit, cgroup limits, or container resource constraints.
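
For the ulimit option, a small wrapper keeps the limits scoped to the agent process. This is a minimal sketch with a hypothetical function name, assuming a shell whose ulimit supports -t and -v (bash and dash do). These limits apply per process rather than to the whole process tree, so cgroup or container limits remain the better tool for capping total usage.

```shell
# run_limited CPU_SECONDS MAX_VMEM_KB COMMAND [ARG...]
# Runs COMMAND with a CPU-time limit (seconds) and a virtual-memory limit
# (kilobytes). The function body is a subshell, so the limits do not leak
# into the calling shell.
run_limited() (
    ulimit -t "$1" || exit 1
    ulimit -v "$2" || exit 1
    shift 2
    "$@"
)
```

For example, run_limited 3600 4194304 ./loop.sh would allow each process one hour of CPU time and 4 GiB of address space.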

Putting It All Together

A defense-in-depth setup for overnight autonomous runs:

| Layer | Implementation | Stops |
| --- | --- | --- |
| Soft instructions | CONSTITUTION.md with "never delete tests", "always run tests before committing" | Common mistakes (probabilistic) |
| Application allowlist | .claude/settings.local.json with explicit tool permissions | Unauthorized commands (deterministic within runtime) |
| Immutable config | chattr -R +i on CLAUDE.md, .claude/, CONSTITUTION.md | Self-modification between iterations |
| Unprivileged user | Dedicated user, no sudo, no docker group | Privilege escalation |
| Container | --cap-drop=ALL --network=none, rootless, no socket mount | Host escape, network exfiltration |
| Resource limits | --memory=4g --cpus=2, disk quotas | Resource exhaustion |

Each layer is simple. The strength is in the combination.

Common Mistakes

"I'll just use --dangerously-skip-permissions": This disables Layer 2 entirely. Without Layers 3-5, nothing enforced stands between the agent and your machine; only the soft instructions of Layer 1 remain. Only use this flag inside a properly isolated container or VM.

"The agent is sandboxed in Docker": A Docker container with the Docker socket mounted, running as root, with --privileged, and full network access is not sandboxed. It is a root shell with extra steps.

"CONSTITUTION.md says not to do that": Markdown is a suggestion. It works most of the time. It is not a security boundary. Do not use it as one.

"I reviewed the CLAUDE.md, it's fine": The agent can modify CLAUDE.md during iteration N. Iteration N+1 loads the modified version. Unless the file is immutable, your review is stale.

"The agent only has access to this one project": Does the project directory contain .env files, SSH keys, API tokens, or credentials? Does its .git/config point at a remote that the agent's credentials can push to? Filesystem isolation means controlling what is inside the directory too.

Checklist

Before running an unattended AI agent:

  • Agent runs as a dedicated unprivileged user (no sudo, no docker group)
  • Agent's config files are immutable or owned by a different user
  • Permission allowlist restricts tools to the project's toolchain
  • Container drops all capabilities (--cap-drop=ALL)
  • Docker socket is NOT mounted
  • Network is disabled or restricted to specific domains
  • Resource limits are set (memory, CPU, disk)
  • No SSH keys, API tokens, or credentials are accessible to the agent
  • Project directory does not contain .env or secrets files
  • Iteration cap is set (--max-iterations)
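
The checklist item about .env and secrets files can be partly automated. The following is a minimal sketch with a hypothetical function name; the file-name patterns are illustrative assumptions, and a name-based scan will not find secrets embedded inside otherwise ordinary files.

```shell
# find_secret_files DIR
# Prints the paths of files under DIR whose names suggest credentials.
# Prints nothing if no such file is found.
find_secret_files() {
    find "$1" -type f \( \
        -name '.env' -o -name '.env.*' -o \
        -name 'id_rsa' -o -name 'id_ed25519' -o \
        -name '*.pem' -o -name '.netrc' -o -name 'credentials.json' \
    \)
}
```

A preflight step could abort when the output is non-empty, for example: [ -z "$(find_secret_files /path/to/project)" ] || exit 1.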

Further Reading

  • Running an Unattended AI Agent: the ctx recipe for autonomous loops, including step-by-step permissions and isolation setup
  • Security: ctx's own trust model and vulnerability reporting
  • Autonomous Loops: full documentation of the loop pattern, PROMPT.md templates, and troubleshooting