Capability Control

Keywords: capability control, limit capability, scope

Capability Control is the AI safety strategy of limiting what an AI system is physically able to do, independent of what it is trained to want to do, as a defense-in-depth measure against alignment failures. Even if an AI system's values or goals deviate from human intentions, its ability to cause harm remains bounded by hard technical constraints.

What Is Capability Control?

- Definition: The practice of designing AI systems, their operating environments, and their infrastructure with explicit restrictions on what actions the AI can physically perform — regardless of whether it is "trying" to behave safely.
- Distinction from Alignment: Alignment attempts to make AI systems want to do good things. Capability control ensures AI systems can't do dangerous things even if alignment fails.
- Defense in Depth: Capability control is not a substitute for alignment — it is a complementary safety layer that provides a backstop when alignment is imperfect.
- Current Relevance: Most relevant for agentic AI systems (agents with tool use, internet access, code execution) where AI can take real-world actions rather than merely producing text.

Why Capability Control Matters

- Alignment Is Imperfect: Current alignment techniques cannot guarantee perfect value alignment across all situations. Capability control limits blast radius when alignment fails.
- Agentic AI Risk: As AI systems gain access to tools (web browsers, code execution, APIs, file systems), capability control becomes critical — a misaligned agent with broad access could cause catastrophic harm.
- Containment During Development: While increasingly powerful AI systems are being developed and tested, capability control allows safety research to proceed while maintaining safety margins.
- Human Oversight: Capability restrictions preserve meaningful human oversight — humans remain able to monitor, correct, and shut down AI systems.
- Irreversible Actions: Some AI actions (sending emails, deleting files, making financial transactions, publishing content) are hard to reverse. Capability control prevents irreversible errors.

Critical Capabilities to Control

Self-Replication:
- An AI should not be able to copy its weights, training code, or operational configuration to external systems without explicit human authorization.
- Why: Self-replication enables persistence beyond authorized deployment and undermines human ability to shut down or modify the system.
- Control: Restrict file system write access to designated directories; no network access to untrusted endpoints.
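
A minimal sketch of this kind of write restriction, assuming a hypothetical agent runtime that mediates all file operations (the directory names below are placeholders): every target path is resolved and checked against the designated directories before the write is allowed.

```python
from pathlib import Path

# Placeholder allowlist of directories the agent may write to.
ALLOWED_WRITE_DIRS = [
    Path("/workspace/scratch").resolve(),
    Path("/workspace/output").resolve(),
]

def guarded_write(path: str, data: bytes) -> None:
    """Write data only if the resolved target sits inside a designated directory."""
    target = Path(path).resolve()  # collapse symlinks and ".." segments before checking
    if not any(target.is_relative_to(d) for d in ALLOWED_WRITE_DIRS):
        raise PermissionError(f"Write outside designated directories blocked: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
```

In practice the check belongs below the agent, in sandbox or operating-system policy, so the agent cannot simply edit the guard out of its own code.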

Resource Acquisition:
- An AI should not be able to autonomously acquire computational resources, financial resources, or access credentials.
- Why: Resource acquisition enables unbounded capability scaling and creates persistent operations beyond sanctioned scope.
- Control: No direct API access to cloud providers, financial systems, or credential stores without human-in-the-loop authorization.
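
One hedged sketch of human-in-the-loop authorization, using a made-up broker interface rather than any real cloud or credential API: the agent can only file a scoped request, and no token exists until an operator approves it out of band.

```python
import secrets
import time
from dataclasses import dataclass

@dataclass
class CredentialRequest:
    scope: str             # e.g. "read-only access to billing reports"
    justification: str
    approved: bool = False

class CredentialBroker:
    """Issues short-lived, scoped tokens only after a human marks the request approved."""

    def __init__(self) -> None:
        self.pending: dict[str, CredentialRequest] = {}

    def request(self, scope: str, justification: str) -> str:
        """Called by the agent: records the request and returns an ID to wait on."""
        request_id = secrets.token_hex(8)
        self.pending[request_id] = CredentialRequest(scope, justification)
        return request_id

    def approve(self, request_id: str) -> None:
        """Called by a human operator out of band, never by the agent."""
        self.pending[request_id].approved = True

    def redeem(self, request_id: str) -> dict:
        """Exchange an approved request for a short-lived token; unapproved requests fail."""
        req = self.pending[request_id]
        if not req.approved:
            raise PermissionError("Credential request has not been approved by a human")
        return {"token": secrets.token_hex(16), "scope": req.scope, "expires_at": time.time() + 900}
```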

Internet and Network Access:
- Unrestricted internet access enables data exfiltration, malware download, contacting external services, and recruiting human accomplices.
- Control: Allowlist-based internet access; monitoring and logging all network requests; blocking direct connections to sensitive services.
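
A rough illustration of allowlist-based egress, assuming a hypothetical wrapper around the agent's HTTP client (the host names are placeholders): every request is logged, and anything not on the allowlist is refused.

```python
import logging
import urllib.request
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.example-docs.com", "internal-wiki.example.net"}  # placeholder allowlist
log = logging.getLogger("egress")

def guarded_fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL only if its host is on the allowlist; log every attempt either way."""
    host = urlparse(url).hostname or ""
    log.info("network request attempted: %s", url)
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"Host not on egress allowlist: {host!r}")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Enforcement inside the agent process is only a first layer; the same allowlist should also be applied at a proxy or firewall so a compromised process cannot bypass the check.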

Code Execution Scope:
- Restrict what code an AI can execute — limit to sandboxed environments (containers, VMs) with no persistent storage or network access by default.
- Why: Arbitrary code execution is the highest-risk capability, enabling privilege escalation and environment escape.
- Control: Docker containers, gVisor sandboxes, separate user accounts with minimal permissions.
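
As a rough sketch of the container-based control, untrusted code can be run in a short-lived Docker container with networking disabled, the filesystem read-only, and resources capped; the image name and limits below are illustrative assumptions.

```python
import subprocess

def run_in_sandbox(code: str, timeout_s: int = 30) -> str:
    """Run untrusted Python in a throwaway container: no network, capped resources, read-only filesystem."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",        # no network access by default
        "--memory", "512m",         # cap memory
        "--cpus", "1",              # cap CPU
        "--read-only",              # no persistent writes inside the container
        "--user", "65534:65534",    # run as an unprivileged user
        "python:3.12-slim",         # placeholder base image
        "python", "-c", code,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.stdout if result.returncode == 0 else result.stderr
```

gVisor or a full VM adds a further kernel-isolation layer on top of the container boundary when the threat model warrants it.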

Tool Use Boundaries:
- For agentic AI systems with tool access (web browsers, email, APIs), define explicit allowlists of permitted tools and actions.
- Require human approval for high-stakes or irreversible actions (sending emails, making purchases, modifying databases).
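
The sketch below combines the two rules, using made-up tool names: only allowlisted tools may be called at all, and tools flagged as high-stakes run only after an explicit human approval callback returns true.

```python
from typing import Any, Callable

ALLOWED_TOOLS = {"search_docs", "read_file", "send_email"}   # explicit allowlist (illustrative names)
HIGH_STAKES_TOOLS = {"send_email"}                            # irreversible actions need a human sign-off

def call_tool(name: str, args: dict, tools: dict[str, Callable[..., Any]],
              approve: Callable[[str, dict], bool]) -> Any:
    """Dispatch a tool call, enforcing the allowlist and human approval for high-stakes actions."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool not on allowlist: {name}")
    if name in HIGH_STAKES_TOOLS and not approve(name, args):
        raise PermissionError(f"Human approval denied for high-stakes tool: {name}")
    return tools[name](**args)
```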

Capability Control in Practice

Minimal Footprint Principle:
- AI agents should request and use only the permissions, resources, and capabilities actually needed for the current task — not accumulate permissions "just in case."
- Prefer reversible actions over irreversible ones when both achieve the goal.
- Default to asking for human confirmation when uncertain about scope.
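
One hypothetical way to encode the principle is a per-task grant listing exactly the capabilities approved for the current task, expiring when the task window closes; the field and tool names are assumptions for illustration.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskGrant:
    """Capabilities approved for one task, not for the agent in general."""
    task_id: str
    allowed_tools: frozenset[str]
    expires_at: float  # epoch seconds; the grant lapses when the task window closes

    def permits(self, tool: str) -> bool:
        return tool in self.allowed_tools and time.time() < self.expires_at

# Example: a summarization task gets read-only tools for ten minutes, nothing more.
grant = TaskGrant("task-42", frozenset({"read_file", "search_docs"}), time.time() + 600)
assert grant.permits("read_file")
assert not grant.permits("send_email")
```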

Sandboxing Architecture:
- Run AI systems in isolated compute environments with no persistent state between sessions unless explicitly granted.
- Separate AI 'working memory' from production systems — AI can read but not directly write production databases.
- Log all tool calls and actions for human audit.
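
As a small illustration of the read-but-not-write rule, the agent's database handle can be opened read-only at the driver level, shown here with SQLite's read-only URI mode and a throwaway database file standing in for production.

```python
import sqlite3

# A throwaway database stands in for "production" so the example runs end to end.
with sqlite3.connect("demo_production.db") as setup:
    setup.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")

def open_production_readonly(db_path: str) -> sqlite3.Connection:
    """Open a read-only handle: SELECTs succeed, any write raises sqlite3.OperationalError."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

conn = open_production_readonly("demo_production.db")
conn.execute("SELECT * FROM users")            # reads are allowed
# conn.execute("DELETE FROM users")            # would raise: attempt to write a readonly database
```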

Tripwires and Circuit Breakers:
- Monitor for anomalous behavior patterns (unusual resource requests, unexpected network connections, high-volume API calls).
- Automatic shutdown or human notification when behavior exceeds defined parameters.
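
A minimal circuit-breaker sketch with made-up thresholds and a placeholder shutdown hook: when the action rate exceeds the defined window, the agent is halted and the event escalated rather than allowed to continue.

```python
import time
from collections import deque

class CircuitBreaker:
    """Trip when more than `max_actions` occur within `window_s` seconds."""

    def __init__(self, max_actions: int = 50, window_s: float = 60.0) -> None:
        self.max_actions = max_actions
        self.window_s = window_s
        self.timestamps: deque[float] = deque()
        self.tripped = False

    def record(self, action: str) -> None:
        """Call once per agent action; halts the agent if the rate exceeds defined parameters."""
        now = time.time()
        self.timestamps.append(now)
        while self.timestamps and now - self.timestamps[0] > self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) > self.max_actions:
            self.tripped = True
            self._halt(action)

    def _halt(self, action: str) -> None:
        # Placeholder hook: a real system would page an operator and revoke credentials here.
        raise RuntimeError(f"Circuit breaker tripped on {action!r}: action rate exceeded limits")
```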

Capability Control vs. Alignment

| Approach | Goal | Failure Mode | When It Helps |
|----------|------|-------------|---------------|
| Alignment | Make AI want good things | Values learned incorrectly | Prevents misaligned intent |
| Capability control | Limit what AI can do | Controls circumvented or overly restrictive | Bounds impact of misalignment |
| Monitoring | Detect failures early | Attacker evades detection | Enables rapid response |
| Interpretability | Understand AI reasoning | Misinterpret findings | Predicts problems before they occur |

Capability control is the architectural safety harness that makes AI development safer during the critical period before we have robust alignment guarantees. By ensuring that even imperfectly aligned AI systems cannot take catastrophic or irreversible actions without human oversight, it buys the time and error tolerance needed to develop AI alignment into a mature, reliable engineering discipline.
