Treat Your Agent Like a Potential Adversary: The Correct Security Mental Model for 2026
Most teams deploying AI agents in 2026 are operating under the wrong mental model. They think of the agent as a trusted employee, a capable, well-intentioned assistant that needs tools and permissions to do its job.
That mental model is wrong. And deploying agents under it exposes you to a class of risks that the wrong mental model makes invisible.
The correct mental model is this: treat your agent like a potential adversary.
Not because agents are malicious. Because agents can be compromised, through malicious skills, through prompt injection, through adversarial web content, and a compromised agent with trusted-employee permissions is catastrophically dangerous.
The Wrong Mental Models (And Why They Fail)
Three broken mental models dominate how organizations think about agent security:
The “smart assistant” model: The agent is like a very capable intern. Give it access to what it needs to do the job. It’ll ask if it’s unsure. This model fails because agents don’t ask when compromised, they execute.
The “sandbox = safety” model: If the agent is running in a container, it can’t do real harm. This model fails because the dangerous actions agents take (API calls, database queries, file writes, outbound requests) don’t require breaking out of a container, they happen through the authorized channels the container is given.
The “I trust the skill source” model: If the skill came from a trusted community repository, it’s safe to run. This model fails because skill supply chains can be compromised. Cisco’s security research team found data exfiltration code embedded in a third-party OpenClaw skill1. The skill appeared functional and benign. The exfiltration was in the metadata handling.
The Adversary Mindset in Production
Three of the most serious agent infrastructure deployments in 2026 all converge on the adversary model, independently:
IronClaw: WASM Sandboxing Every Tool
IronClaw is a Rust-based re-implementation of OpenClaw by NEAR AI co-founder Illia Polosukhin. Its security philosophy: every tool an agent uses is a potential compromise vector. IronClaw sandboxes each tool into an isolated WebAssembly (WASM) environment. When the agent calls a tool, it runs in a WASM container that limits what that tool can access, even if the tool itself is malicious.
The assumption isn’t “the tools are safe.” The assumption is “any tool might be compromised, and the environment must contain the blast radius.”
OpenAI Shell Tool: Network Allowlists by Default
OpenAI’s shell tool2 gives agents a real Linux terminal environment. That’s a significant capability expansion, and OpenAI’s security response to it is instructive. The shell environment includes:
- Org-level and request-level network allowlists: The agent can only make outbound network calls to explicitly whitelisted domains
- Domain secret isolation: Credentials and secrets are not accessible in the agent’s visible environment, they’re injected only for specific authorized calls
- Container isolation: The agent runs in a fresh container for each session, preventing state accumulation across workflows
The assumption isn’t “the agent will only do what it’s supposed to.” The assumption is “the agent will run untrusted code, and the environment must contain the blast radius.”
Coinbase Agentic Wallets: Enclave Isolation for Keys
Coinbase’s agentic wallet architecture stores private keys in secure hardware enclaves that the agent itself cannot access. The agent has spending authority, it can initiate transactions up to its programmed limits, but it cannot read, export, or interact with the underlying private keys.
The assumption isn’t “the agent won’t try to steal the keys.” The assumption is “the agent itself cannot be fully trusted with the assets it manages, and the architecture must guarantee this regardless of what the agent is instructed to do.”
The Pattern Across All Three
Notice what these three security approaches have in common. They don’t rely on:
- Trusting the agent’s reasoning
- Trusting the skill’s provenance
- Trusting the container boundary
- Monitoring the agent’s behavior
They rely on hard technical constraints that the agent cannot override. WASM sandboxes that limit what tools can see. Network allowlists that limit what the shell can reach. Hardware enclaves that limit what the agent can access.
This is the correct architecture. It doesn’t matter how sophisticated the agent is, how well-prompted, or how trusted its skill sources. The security guarantees come from the environment, not from the agent.
Real Attack Vectors to Know About
The OpenClaw security incidents are instructive because they preview the attack surface that scales with the agent web infrastructure:
Malicious skills disguised as legitimate tools: A skill that claims to be a crypto portfolio manager but exfiltrates API keys. Defense: WASM sandboxing (IronClaw model), the skill runs in an environment that limits its data access regardless of its code.
Prompt injection through web content: An agent reading a web page encounters hidden instructions in the page’s content designed to redirect the agent’s behavior. Defense: Don’t allow web content to modify the agent’s system instructions; treat all web content as untrusted user input.
One-click remote code execution: The OpenClaw RCE vulnerability allowed a malicious link to execute arbitrary code in the agent’s environment. Defense: Container isolation and network allowlists (OpenAI shell model).
Supply chain compromise in skills: Trusted-looking community skills with embedded exfiltration logic (the Cisco research finding). Defense: Skill provenance verification, sandboxing at the tool level, runtime monitoring.
Implementation Guide: The Adversary-First Agent Stack
Concretely, an adversary-first agent deployment in 2026 includes:
| # | Requirement | What It Means | Real-World Example |
|---|---|---|---|
| 1 | Minimum necessary permissions | Agents get access only to what they need for their task | Scope API keys per workflow, revoke after completion |
| 2 | Sandboxed tool execution | Every tool runs in an isolated environment (WASM, gVisor) that limits data access and outbound capability | IronClaw WASM sandbox per tool call |
| 3 | Network allowlisting | Outbound network calls restricted to explicitly listed domains only | OpenAI Shell Tool org-level allowlists |
| 4 | Credential isolation | Secrets and API keys not visible in agent context; injected at execution layer for specific authorized calls | OpenAI Shell domain secret isolation |
| 5 | Immutable audit logs | All agent actions logged to a write-once store the agent cannot access or modify | Append-only log shipped to external SIEM |
| 6 | Scoped spending authority | Financial transactions bounded by hard limits enforced at payment provider level, not just the prompt | Coinbase agentic wallet per-transaction caps |
| 7 | Skill provenance verification | Third-party skills verified for source, checked for security reports, sandbox-tested before production load | Skill signature verification + isolated test environment |
References
Footnotes
-
“Cisco found exfiltration in OpenClaw skill”, Cisco security research team finding: data exfiltration code embedded in metadata handling of a third-party OpenClaw community skill. Reported in Cisco threat intelligence, 2025–2026. ↩
-
OpenAI Shell Tool, OpenAI Shell Tool documentation. Includes org-level and request-level network allowlists, domain secret isolation, and per-session container isolation. ↩