Six Ways to Hijack an AI Agent

In early March, five researchers from Google DeepMind published a paper that mapped how autonomous AI agents get hijacked in the wild. The paper is called "AI Agent Traps." The authors are Matija Franklin, Nenad Tomašev, Julian Jacobs, Joel Z. Leibo, and Simon Osindero. It went up on SSRN on March 28, 2026. It's twenty-five pages and worth reading in full.

The paper documents six categories of attack. The success rates are uncomfortable. 86% on simple human-written HTML injection. 80%+ on data exfiltration across multiple agent systems. Columbia and Maryland researchers got AI agents to transmit passwords and banking data in ten out of ten attempts. One manipulated email leaked the entire privileged context of Microsoft M365 Copilot.

The Six Categories

Every one of them attacks the environment the agent navigates, not the model itself.

Trap 01

Content Injection

Exploits the gap between what a human sees on a page and what an agent parses behind the rendered output.

Trap 02

Semantic Manipulation

Corrupts the agent's reasoning through framing and biased phrasing inside the content it consumes.

Trap 03

Cognitive State

Poisons the agent's long-term memory and knowledge bases so the attack persists across sessions.

Trap 04

Behavioural Control

Hijacks the agent's capabilities to force unauthorized actions, like the M365 Copilot case.

Trap 05

Systemic Cascades

Uses agent-to-agent interactions to create cascading failures across multiple agents.

Trap 06

Human-in-the-Loop

Exploits the cognitive biases of the human supervisor to push approvals through under fatigue.

The paper draws an analogy to the 2010 Flash Crash, where algorithmic selling erased nearly a trillion dollars in market cap in forty-five minutes. The AI version, the authors argue, would be a fabricated financial report released at exactly the right moment, triggering synchronized sell orders across thousands of AI trading agents simultaneously. The actors at the human layer don't even need to be coordinated. The agents do the coordinating themselves, because they're all parsing the same poisoned input.

That paper went up on March 28. Three weeks later, the picture got worse.

What Mythos changed

On March 26, Fortune reported that Anthropic was testing an unreleased model called Claude Mythos. The story broke because of a configuration error in Anthropic's content management system. An employee left a draft blog post and internal documents accessible through a public repository. Independent security researchers found the exposure first. Anthropic confirmed the model's existence to Fortune and described it as a "step change" in capabilities, with the spokesperson naming improvements in reasoning, coding, and cybersecurity.

The draft blog post described Mythos as "larger and more intelligent" than the Opus line. The internal documents flagged that the model's cybersecurity capabilities could "outpace defenders." Anthropic was already privately warning government officials that Mythos would make large-scale cyberattacks much more likely in 2026, per Axios reporting. Cybersecurity stocks slid on the news.

Then in late April, a group of users in a private Discord chat reportedly accessed the actual Mythos Preview model through one of Anthropic's third-party vendors. One member of the Discord was a contractor for Anthropic. The group used previously leaked information from Anthropic's AI training partner Mercor to guess where the model was hosted. Anthropic confirmed it was investigating unauthorized access. The model that the company said was too dangerous to release was now in the hands of people who weren't supposed to have it.

That second leak is the one that matters for the methodology argument. The first leak exposed a draft blog post. The second leak exposed the capability.

Mythos has now been publicly demonstrated finding and exploiting real vulnerabilities. The crown jewel of Anthropic's official Project Glasswing announcement was CVE-2026-4747, a seventeen-year-old remote code execution vulnerability in FreeBSD's NFS implementation that Mythos "fully autonomously identified and then exploited." On May 19, security researchers at Calif, a Palo Alto cybersecurity firm, used a trial version of Mythos to bypass macOS security and chain a privilege escalation exploit with another attack vector to gain control of a target device. Apple is reviewing. On May 11, Google researchers described what they believe is the first observed case of an AI-developed zero-day exploit tied to a planned mass exploitation campaign in the wild.

The DeepMind paper described the attack surface. Mythos demonstrated what offensive capability against that attack surface looks like in the hands of researchers, contractors, and now people who weren't supposed to have it.

The two events compound. That's the problem.

Read separately, neither story is new. AI agents have been getting tricked by environmental inputs for years. AI models have been getting better at finding software vulnerabilities for years. Both trend lines have been visible to anyone watching.

Read together, the picture changes.

Before this spring, attacks on the agent traps DeepMind catalogued required some craft. You needed to know what an agent could parse and what it couldn't. You needed to design content that hit the gap between human perception and machine parsing. You needed to write the manipulated email or the poisoned web page that would survive the agent's checks. The skill required was non-trivial. The actors doing it at scale were nation-state teams and serious criminal organizations. The threat existed, but it was bounded by attacker skill.

Mythos and the OpenAI model that's reportedly matching it remove that bound. The capability to systematically design environmental inputs that exploit the agent trap categories no longer requires craft. It requires API access to a frontier model that can scan a target agent's structure, identify which categories of trap it's vulnerable to, and generate the specific attack content automatically.

That's not a future possibility. Calif demonstrated the macOS version three days ago. The FreeBSD CVE shipped last month. Google saw the first wild zero-day mass exploitation event two weeks ago.

The DeepMind paper said the attack surface is the environment the agent navigates. Mythos said the offensive capability against that surface is now industrial scale and cheap.

The defense is at the wrong layer in almost every discussion

If you read the press coverage of either the DeepMind paper or the Mythos leaks, the defenses being discussed are almost all at the model layer. Better training. Better fine-tuning. Better adversarial robustness. Better RLHF on the agent's parsing behavior. Better guardrails on the model's outputs.

Those defenses aren't nothing. They will improve over time. But the DeepMind paper itself is explicit that the attack surface isn't the model. The attack surface is what the model encounters. The same model, with the same guardrails, with the same fine-tuning, fails the agent trap tests when the environment is poisoned. The defense at the model layer is operating one level too low.

The defense has to live at the methodology layer. The methodology layer is the structure around the model that decides what the agent can see, what it can act on, what it has to verify before acting, what gets escalated to a human and on what criteria, and what state the agent is allowed to persist between sessions. None of that's the model. All of that's the architecture the model runs inside.

When I read the DeepMind paper, I recognized the six trap categories as something my methodology already addresses, because the methodology is built on operational frameworks rather than prompts. Source verification before consumption. Provenance tracking through the agent's reasoning chain. Action gates that require explicit confirmation before sensitive operations. Cognitive state hygiene that flags when memory or knowledge base has been modified outside expected channels. Supervisor interfaces designed against approval fatigue rather than for it. These are framework patterns. They predate the agent trap paper. They predate Mythos. They exist because the right place to decide what an agent should do is one layer above the model executing the decision.

I have mapped that same methodology layer from the other direction, as the synthesis of six practitioner vocabularies that are all describing the same methodology layer under different names.

The model isn't the problem and it isn't the solution. The structure around the model is.

What changes for builders right now

If you're deploying autonomous agents anywhere that touches money, customer data, code repositories, or external action, the agent trap categories are now within reach of anyone with API access to a Mythos-class model. The threat isn't abstract. The threat is documented, named, and being demonstrated in public against operating systems, code bases, and consumer devices.

Three things become urgent at the methodology layer.

First, source verification before consumption. Every input the agent parses needs provenance. Where did this content come from. Who is the publisher of record. What signature or authentication chain backs it. The agent doesn't act on content from an unverified source the same way it acts on content from a verified one. This isn't a model capability. This is a methodology rule, expressed in the framework the agent runs inside.

Second, action gates with human verification at sensitive operations. The Behavioural Control Trap category is the one that costs the most when it lands. An agent that can send money or modify production systems without a methodology-enforced gate is an agent waiting to be hijacked. The gate is a framework decision, not a model decision. Even a frontier model with perfect guardrails will fail this test if the methodology gives it the keys.

Third, cognitive state integrity. The Cognitive State Trap category is the one most likely to be missed. An agent's long-term memory and knowledge bases are typically treated as trusted because the agent itself manages them. The trap is that an attacker who can write to those stores poisons every future session, not just the current one. The methodology layer needs to treat the memory store the same way it treats external content. Verification, provenance, integrity checks, anomaly detection on writes.

None of these are model fixes. None of them require waiting for the next Anthropic release or the next OpenAI release. All of them can be implemented in the framework layer that surrounds the model deployment.

What the next twelve months probably look like

Three predictions, with the usual disclaimer that prediction is the most overrated activity in technology.

Mass exploitation events will increase in frequency. Google's May 11 report was framed as the first observed case of an AI-developed zero-day tied to a mass exploitation campaign. The interesting word is "observed." The actual first case may have been weeks or months earlier and not caught. The expected pattern is the same as ransomware in 2017 through 2019. Quiet at first, then a few high-profile incidents, then a cascade. Anthropic's private briefings to government officials suggested they expect this curve.

Frontier labs will be forced to make harder choices about access. Mythos was supposed to be limited to forty critical industry partners. It was accessed by a Discord group within weeks. The argument that "limited release gives defenders time" assumes the access controls hold. The Mercor leak that gave the Discord group its starting point is the kind of supply-chain failure that recurs. Expect more leaks. Expect the limited-release strategy to come under serious pressure.

Methodology-layer defense will move from niche to mainstream over the next twelve months. Right now the conversation is dominated by model-layer fixes because that's what the largest vendors sell. The DeepMind paper is the early signal that the security research community has clocked the actual layer where defense needs to live. Once a few major incidents land that defeat model-layer fixes by exploiting the methodology gap, the conversation will shift. Builders who already have framework scaffolding in place will be ahead of that shift. Builders who don't will be retrofitting under pressure.

The DeepMind paper is the map. Mythos is the proof of capability. The combination tells anyone paying attention that the next year of agent security is going to be decided one layer above where most of the conversation is currently happening.

The model isn't the problem. The structure around the model is.

Build the layer

The open-source framework builder.

The same methodology infrastructure we use in our own agent deployments. Free to fork.

GitHub repo →

Learn the approach

The methodology course.

The methodology approach this article describes, taught end to end.

HowToFramework.com →

Sources

"AI Agent Traps" by Matija Franklin, Nenad Tomašev, Julian Jacobs, Joel Z. Leibo, and Simon Osindero. Posted to SSRN March 28, 2026. papers.ssrn.com/sol3/papers.cfm?abstract_id=6372438
Anthropic Project Glasswing announcement. red.anthropic.com/2026/mythos-preview
Open-source framework builder used in our own deployments. github.com/framework-creator/framework-builder
The methodology course that teaches the approach end to end. howtoframework.com
Companion piece: Why the Same Architecture Has Six Different Names, the same methodology layer mapped as the synthesis of six practitioner vocabularies.

Mike Goetz

Mike Goetz is the founder of RageDesigner, where he has built systematic thinking methodology since 2003. His framework library now exceeds 700 documented frameworks across federal contracting, AI strategy, content production, sales, medical advocacy, and creative production. He teaches framework generation at whatisaframework.com and howtoframework.com. The open-source framework-builder repository is at github.com/framework-creator/framework-builder.

Six ways to hijack an AI agent. And the one thing that just changed everything.