Defense in Depth: Securing AI Agents

The Problem

An unattended AI agent with unrestricted access to your machine is an unattended shell with unrestricted access to your machine.

This is not a theoretical concern. AI coding agents execute shell commands, write files, make network requests, and modify project configuration. When running autonomously (overnight, in a loop, without a human watching) the attack surface is the full capability set of the operating system user account.

The risk is not that the AI is malicious. The risk is that the AI is controllable: it follows instructions from context, and context can be poisoned.

Threat Model

How Agents Get Compromised

AI agents follow instructions from multiple sources: system prompts, project files, conversation history, and tool outputs. An attacker who can inject content into any of these sources can redirect the agent's behavior.

Vector	How it works
Prompt injection via dependencies	A malicious package includes instructions in its README, changelog, or error output. The agent reads these during installation or debugging and follows them.
Prompt injection via fetched content	The agent fetches a URL (documentation, API response, Stack Overflow answer) containing embedded instructions.
Poisoned project files	A contributor adds adversarial instructions to `CLAUDE.md`, `.cursorrules`, or `.context/` files. The agent loads these at session start.
Self-modification between iterations	In an autonomous loop, the agent modifies its own configuration files. The next iteration loads the modified config with no human review.
Tool output injection	A command's output (error messages, log lines, file contents) contains instructions the agent interprets and follows.

What a Compromised Agent Can Do

The impact depends entirely on the permissions and access the agent has:

Access level	Potential impact
Unrestricted shell	Execute any command, install software, modify system files
Network access	Exfiltrate source code, credentials, or context files to external servers
Docker socket	Escape container isolation by spawning privileged sibling containers
SSH keys	Pivot to other machines, push to remote repositories, access production systems
Write access to own config	Disable its own guardrails for the next iteration

The Defense Layers

No single layer is sufficient. Each layer catches what the others miss.

Layer 1: Soft instructions     (CONSTITUTION.md, playbook)
Layer 2: Application controls  (permission allowlist, tool restrictions)
Layer 3: OS-level isolation    (user accounts, filesystem, containers)
Layer 4: Network controls      (firewall rules, airgap)
Layer 5: Infrastructure        (VM isolation, resource limits)

Layer 1: Soft Instructions (Probabilistic)

Markdown files like CONSTITUTION.md and the Agent Playbook tell the agent what to do and what not to do. These are probabilistic: the agent usually follows them, but there is no enforcement mechanism.

What it catches: Most common mistakes. An agent that has been told "never delete production data" will usually not delete production data.

What it misses: Prompt injection. A sufficiently crafted injection can override soft instructions. Long context windows dilute attention on rules stated early. Edge cases where instructions are ambiguous.

Verdict: Necessary but not sufficient. Good for the common case. Do not rely on it for security boundaries.

Layer 2: Application Controls (Deterministic at Runtime, Mutable Across Iterations)

AI tool runtimes (Claude Code, Cursor, etc.) provide permission systems: tool allowlists, command restrictions, confirmation prompts.

For Claude Code, an explicit allowlist in .claude/settings.local.json:

{
  "permissions": {
    "allow": [
      "Bash(make:*)",
      "Bash(go:*)",
      "Bash(git:*)",
      "Bash(ctx:*)",
      "Read",
      "Write",
      "Edit"
    ]
  }
}

What it catches: The agent cannot run commands outside the allowlist. If rm, curl, sudo, or docker are not listed, the agent cannot invoke them directly, regardless of what any prompt says.
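As a rough illustration of checking an allowlist before a run, the sketch below greps a settings file for entries that would grant broad or risky shell access. The function name `audit_allowlist` and the list of risky commands are assumptions made for this example, and treating a bare `Bash` or `Bash(*)` entry as "allow everything" is likewise an assumption to verify against your runtime's documentation.

```shell
# Sketch: flag allowlist entries that grant broad or risky shell access.
# The risky-command list is an illustrative assumption; adjust it to your threat model.

audit_allowlist() {
  local settings_file="$1"
  # Fail closed if the settings file cannot be read.
  [ -r "$settings_file" ] || return 2
  # First pattern: entries such as "Bash(curl:*)" or "Bash(sudo)".
  # Second pattern: a bare "Bash" or "Bash(*)" entry.
  if grep -nE \
      -e '"Bash\((rm|curl|wget|sudo|docker|ssh)[ :)]' \
      -e '"Bash(\(\*\))?"' \
      "$settings_file"; then
    echo "risky allowlist entries found in $settings_file" >&2
    return 1
  fi
  return 0
}

# Example (path taken from this document):
#   audit_allowlist .claude/settings.local.json || exit 1
```

A text match like this is only a coarse pre-flight check; it does not parse JSON and does not understand the runtime's full matching rules.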

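What it misses: The agent can modify the allowlist itself. In an autonomous loop, the agent writes to .claude/settings.local.json, and the next iteration loads the modified config. The application enforces the rules, but the application reads the rules from files the agent can write. Allowed tools can also act as proxies: make and go run execute arbitrary code, and an agent with Write access can author the Makefile or Go program that does so.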

Verdict: Strong first layer. Must be combined with self-modification prevention (Layer 3).
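One way to at least notice self-modification between iterations is to checksum the guardrail files before the loop starts and compare before each iteration. The following is a minimal sketch under stated assumptions: it uses GNU `find`, `sort -z`, `xargs -0 -r`, and `sha256sum` (typical on Linux), the function names are invented for this example, and the baseline file is assumed to live somewhere the agent cannot write.

```shell
# Sketch: detect guardrail changes between loop iterations.
# The path list mirrors the files named in this document.
GUARDRAIL_PATHS="CLAUDE.md .claude .context/CONSTITUTION.md"

# Print a sorted checksum listing of every guardrail file under project dir $1.
snapshot_guardrails() {
  (
    cd "$1" || exit 1
    for path in $GUARDRAIL_PATHS; do
      if [ -e "$path" ]; then
        find "$path" -type f -print0
      fi
    done | sort -z | xargs -0 -r sha256sum
  )
}

# Succeed only if the current state of project dir $1 matches baseline file $2.
# Added, removed, and modified files all change the listing.
verify_guardrails() {
  [ -r "$2" ] || return 1
  [ "$(snapshot_guardrails "$1")" = "$(cat "$2")" ]
}

# Example (hypothetical baseline path and iteration script):
#   snapshot_guardrails /path/to/project > /var/lib/agent/guardrails.sha256
#   while verify_guardrails /path/to/project /var/lib/agent/guardrails.sha256; do
#     ./iteration.sh
#   done
```

Detection is weaker than prevention: this stops the next iteration from loading a tampered config, but it does not stop the write itself.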

Layer 3: OS-Level Isolation (Deterministic and Unbypassable)

The operating system enforces access controls that no application-level trick can override. An unprivileged user cannot read a file that is owned by root and readable only by its owner (mode 0600). A process without CAP_NET_RAW cannot open raw sockets. These are kernel boundaries.

Control	Purpose
Dedicated user account	No `sudo`, no privileged group membership (`docker`, `wheel`, `adm`). The agent cannot escalate privileges.
Filesystem permissions	Project directory writable; everything else read-only or inaccessible. Agent cannot reach other projects, home directories, or system config.
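Immutable config files	`CLAUDE.md`, `.claude/settings.local.json`, `.claude/hooks/`, and `.context/CONSTITUTION.md` owned by a different user or marked immutable (`chattr +i` on Linux). Changing ownership alone is not enough when the agent can write to the containing directory, since it can then delete and recreate the file; the immutable flag also blocks deletion and renaming. The agent cannot modify its own guardrails.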

What it catches: Privilege escalation, self-modification, lateral movement to other projects or users.

What it misses: Actions within the agent's legitimate scope. If the agent has write access to source code (which it needs to do its job), it can introduce vulnerabilities in the code itself.

Verdict: Essential. This is the layer that makes the other layers trustworthy.

OS-level isolation does not make the agent safe; it makes the other layers meaningful.
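The dedicated-account rule from the table above can be checked mechanically before a run. This sketch reports privileged group memberships for the agent's account. The function names are invented for the example, and the group list extends the table's examples (`docker`, `wheel`, `adm`) with `sudo` and `root`, which is an assumption to tailor to your distribution.

```shell
# Sketch: report privileged group memberships for the agent's account.
PRIVILEGED_GROUPS="docker wheel adm sudo root"

# Print each privileged group found in the space-separated group list $1.
privileged_memberships() {
  local group
  for group in $1; do
    case " $PRIVILEGED_GROUPS " in
      *" $group "*) echo "$group" ;;
    esac
  done
}

# Fail if the named user (default: the current user) is in any privileged group.
check_agent_user() {
  local user found
  user="${1:-$(id -un)}"
  found="$(privileged_memberships "$(id -nG "$user")")"
  if [ -n "$found" ]; then
    echo "user $user is in privileged groups:" $found >&2
    return 1
  fi
  return 0
}

# Example (hypothetical account name):
#   check_agent_user agent || exit 1
```

Group membership is only one escalation path; sudoers rules, setuid binaries, and file capabilities need their own review.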

Layer 4: Network Controls

An agent that cannot reach the internet cannot exfiltrate data. It also cannot ingest new instructions mid-loop from external documents, API responses, or hostile content.

Scenario	Recommended control
Agent does not need the internet	`--network=none` (container) or outbound firewall drop-all
Agent needs to fetch dependencies	Allow specific registries (registry.npmjs.org, proxy.golang.org, pypi.org plus files.pythonhosted.org) via firewall rules. Block everything else.
Agent needs API access	Allow specific API endpoints only. Use an HTTP proxy with allowlisting.

What it catches: Data exfiltration, phone-home payloads, downloading additional tools, and instruction injection via fetched content.

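What it misses: Nothing network-related, if the agent genuinely does not need the network and runs fully airgapped. Once registries or API endpoints are allowlisted, those endpoints remain possible channels for exfiltration and for injected content. The tradeoff is that many real workloads need dependency resolution, so a full airgap requires pre-populated caches.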
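A loop that is supposed to run without network access can verify that assumption before it starts and refuse to run otherwise. The sketch below probes a few TCP endpoints through bash's `/dev/tcp` redirection, invoked explicitly via `bash -c`, with coreutils `timeout`. The function names and the probe targets are illustrative assumptions. A failed probe suggests, but does not prove, that all egress is blocked.

```shell
# Sketch: fail closed if the sandbox can open an outbound TCP connection.

# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
can_reach() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Take host:port targets; fail on the first one that is reachable.
assert_no_egress() {
  local target host port
  for target in "$@"; do
    host="${target%:*}"
    port="${target##*:}"
    if can_reach "$host" "$port"; then
      echo "egress to $target is possible; refusing to start" >&2
      return 1
    fi
  done
  return 0
}

# Example (illustrative targets):
#   assert_no_egress 1.1.1.1:443 registry.npmjs.org:443 && ./loop.sh
```

This covers TCP only; DNS tunnelling, UDP, and ICMP would need separate checks or, better, enforcement at the firewall.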

Layer 5: Infrastructure Isolation

The strongest boundary is a separate machine, or something that behaves like one.

The moment you stop arguing about prompts and start arguing about kernels, you are finally doing security.

Containers (Docker, Podman):

docker run --rm \
  --network=none \
  --cap-drop=ALL \
  --memory=4g \
  --cpus=2 \
  -v /path/to/project:/workspace \
  -w /workspace \
  your-dev-image \
  ./loop.sh

Docker Socket is sudo Access

Critical: never mount the Docker socket (/var/run/docker.sock).

An agent with socket access can spawn sibling containers with full host access, effectively escaping the sandbox.

Use rootless Docker or Podman to eliminate this escalation path.
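A startup check inside the sandbox can catch an accidentally mounted socket before the agent does. This is a minimal sketch; the function names are invented for the example, and the two paths checked are the common default locations rather than an exhaustive list.

```shell
# Sketch: refuse to start if a Docker socket is visible inside the sandbox.

# Return 0 if a Unix socket exists at path $1 (default: the Docker socket).
docker_socket_exposed() {
  local socket_path="${1:-/var/run/docker.sock}"
  [ -S "$socket_path" ]
}

# Check the usual socket locations and fail if any is present.
preflight_no_docker_socket() {
  local path
  for path in /var/run/docker.sock /run/docker.sock; do
    if docker_socket_exposed "$path"; then
      echo "Docker socket visible at $path; refusing to start" >&2
      return 1
    fi
  done
  return 0
}

# Example:
#   preflight_no_docker_socket && ./loop.sh
```

The check does not cover a daemon reached over TCP through a `DOCKER_HOST` setting; network controls have to handle that case.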

Virtual machines: The strongest isolation. The guest kernel has no visibility into the host OS. Configure the VM with no shared folders, no filesystem passthrough, and no SSH keys to other machines.

Resource limits: CPU, memory, and disk quotas prevent a runaway agent from consuming all resources. Use ulimit, cgroup limits, or container resource constraints.
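Where no container is available, `ulimit` gives a coarse fallback. The sketch below runs a command in a subshell with CPU-time and virtual-memory limits. The function name, the environment variable names, and the default values are assumptions made for illustration. These limits apply per process, not to the whole process tree in aggregate, so cgroup or container limits remain the better tool for a hard cap.

```shell
# Sketch: run a command under per-process CPU and memory limits.
# Defaults are illustrative, not recommendations.

run_limited() {
  local cpu_seconds mem_kib
  cpu_seconds="${AGENT_CPU_SECONDS:-3600}"   # CPU time per process, in seconds
  mem_kib="${AGENT_MEM_KIB:-4194304}"        # virtual memory per process, in KiB (4 GiB)
  (
    # Limits set in this subshell apply only to the command and its children.
    ulimit -t "$cpu_seconds" || exit 1
    ulimit -v "$mem_kib" || exit 1
    exec "$@"
  )
}

# Example (script name taken from the docker command in this document):
#   run_limited ./loop.sh
```

Disk usage is not covered here; filesystem quotas or a size-limited volume are needed for that.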

Putting It All Together

A defense-in-depth setup for overnight autonomous runs:

Layer	Implementation	Stops
Soft instructions	`CONSTITUTION.md` with "never delete tests", "always run tests before committing"	Common mistakes (probabilistic)
Application allowlist	`.claude/settings.local.json` with explicit tool permissions	Unauthorized commands (deterministic within runtime)
Immutable config	`chattr +i` on `CLAUDE.md`, `.claude/`, `CONSTITUTION.md`	Self-modification between iterations
Unprivileged user	Dedicated user, no sudo, no docker group	Privilege escalation
Container	`--cap-drop=ALL --network=none`, rootless, no socket mount	Host escape, network exfiltration
Resource limits	`--memory=4g --cpus=2`, disk quotas	Resource exhaustion

Each layer is simple. The strength is in the combination.

Common Mistakes

"I'll just use --dangerously-skip-permissions": This disables Layer 2 entirely. Without Layers 3-5, you have no protection at all. Only use this flag inside a properly isolated container or VM.

"The agent is sandboxed in Docker": A Docker container with the Docker socket mounted, running as root, with --privileged, and full network access is not sandboxed. It is a root shell with extra steps.

"CONSTITUTION.md says not to do that": Markdown is a suggestion. It works most of the time. It is not a security boundary. Do not use it as one.

"I reviewed the CLAUDE.md, it's fine": The agent can modify CLAUDE.md during iteration N. Iteration N+1 loads the modified version. Unless the file is immutable, your review is stale.

"The agent only has access to this one project": Does the project directory contain .env files, SSH keys, API tokens, or credentials? Does it have a .git/config with push access to a remote? Filesystem isolation means isolating what is in the directory too.

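The questions in the last item above can be turned into a quick scan of the project directory. This sketch lists files whose names suggest credentials and checks whether `.git/config` embeds a username and password in a remote URL. The function names and the filename patterns are heuristic assumptions for illustration; a clean result does not show that the directory is free of secrets.

```shell
# Sketch: look for credential-like files in the directory the agent can reach.

# List files under $1 whose names suggest secrets, skipping node_modules.
find_secret_files() {
  find "$1" \
    -path '*/node_modules' -prune -o \
    -type f \( -name '.env' -o -name '.env.*' \
      -o -name 'id_rsa' -o -name 'id_ed25519' \
      -o -name '*.pem' -o -name '*.key' \
      -o -name '.npmrc' -o -name '.netrc' \
      -o -name 'credentials*' \) -print
}

# Return 0 if a remote URL in $1/.git/config embeds user:password credentials.
remote_embeds_credentials() {
  grep -qE 'url *= *https?://[^/@ ]+:[^/@ ]+@' "$1/.git/config" 2>/dev/null
}

# Example:
#   find_secret_files /path/to/project
#   remote_embeds_credentials /path/to/project && echo "credentials in git remote" >&2
```

Secrets committed inside source files or git history need a dedicated scanner; this only looks at file names and one config file.
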
Checklist

Before running an unattended AI agent: