Fable 5 proves prompt injections aren't fixable

Frontier AI models ship with guardrails. People break them constantly. It’s tempting to read each bypass as a patch someone forgot to apply. It isn’t. Prompt injections and jailbreaks aren’t a solvable problem. That’s why the labs building these models design around containment, not prevention.

Fable 5 is the clearest example yet. It’s one of Anthropic’s new frontier models, as capable as their top-tier Mythos model but wrapped in guardrails meant to stop it being used for cyberattacks and bioweapons research. It lasted three days in the open.

The technique that got past those guardrails was, by Anthropic’s own account, nothing exotic: ask the model to read a codebase and fix the security flaws it finds. The US government treated that as a vulnerability Anthropic could patch and suspended access. There was nothing to patch.

A jailbreak is when someone manipulates a model into ignoring its own rules. A prompt injection is when hostile instructions are hidden inside content the model reads - a webpage, a document, a codebase - and the model follows them. Same weakness underneath: a model can’t reliably tell a legitimate instruction from a malicious one.

In Anthropic’s Defence

Anthropic didn’t cut corners. They layered several defences and put real effort into breaking them. At the core is a classifier - a separate AI system, outside the main model, that scans each request for anything fishing for help with cyber, biology, or chemistry. When it flags something, the request gets rerouted to a weaker, safer model. The harm a bad actor can pull out drops.
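To make the routing idea concrete, here is a minimal sketch of a classifier sitting in front of two models. Everything in it is a hypothetical stand-in: the keyword list, the function names, and the two "models" are invented for illustration, and a real classifier would be a trained model rather than a keyword check. It shows only the shape of the design described above.

```python
# Hypothetical illustration of classifier-based routing.
# None of these names or rules come from Anthropic.
RESTRICTED_TOPICS = ("exploit", "pathogen", "nerve agent")


def toy_classifier(request: str) -> bool:
    """Return True if the request looks like it is fishing for restricted help.

    Placeholder logic: a real classifier is a separate trained model.
    """
    text = request.lower()
    return any(topic in text for topic in RESTRICTED_TOPICS)


def strong_model(request: str) -> str:
    # Stand-in for the full-capability model.
    return f"[strong] {request}"


def weak_model(request: str) -> str:
    # Stand-in for the weaker, safer fallback model.
    return f"[weak] {request}"


def route(request: str, is_flagged=toy_classifier) -> str:
    """Send flagged requests to the weak model, everything else to the strong one."""
    handler = weak_model if is_flagged(request) else strong_model
    return handler(request)
```

In this toy version, `route("Write an exploit for this server")` goes to the weak model, while `route("Read this codebase and fix the security flaws you find")` goes to the strong one, because nothing in the request trips the check. The toy rule is far cruder than anything a lab would ship, but the structure is the same: the classifier has to draw a line, and some requests will land on the permissive side of it.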

They red-teamed it hard: internal testing, outside organisations, and a bug-bounty programme that turned up no universal jailbreak in over 1,000 hours, though the UK’s AI Safety Institute made early progress towards one. A universal jailbreak is one reusable method that switches the safeguards off across the board. A non-universal one is a narrow crack that works in a single context, or has to be rebuilt each time.

Then comes the part that matters. Anthropic expects non-universal jailbreaks to keep happening, and plans to monitor for them and shut down universal ones before they spread. A team that tested this hard still built the system around the assumption it would be partially broken. OpenAI has said much the same: prompt injection stays possible for a long time unless the underlying architecture changes.

Why The Guards Leak

This isn’t only hard in practice. Research shows why it is hard in principle. A classifier guarding a model is itself a model, drawing a line between “safe” and “harmful” across a near-infinite space of inputs. Adversarial machine learning has shown for years that those lines leak. You can almost always find an input that lands on the wrong side of the boundary.
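A toy example shows what "landing on the wrong side of the boundary" means. The sketch below assumes the simplest possible guard, a linear classifier that flags an input when a weighted sum of its features is positive. The weights and the input are made up, and real classifiers are non-linear and operate on text rather than three numbers, so treat this as an illustration of the geometry and not as an attack on any real system.

```python
import math


def score(w, b, x):
    """Linear decision function: flagged as harmful when the score is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def minimal_flip(w, b, x, margin=1e-3):
    """Move x the shortest distance needed to cross the decision boundary.

    For a linear classifier, the shortest path to the boundary runs along w,
    so we step along w just far enough to land `margin` past the other side.
    """
    s = score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    target = -margin if s > 0 else margin
    step = (target - s) / norm_sq
    return [xi + step * wi for xi, wi in zip(x, w)]


# Made-up weights and input, chosen only for illustration.
w, b = [2.0, -1.0, 0.5], -0.5
x = [1.0, 0.2, 0.4]            # flagged: score(w, b, x) is positive
x_adv = minimal_flip(w, b, x)  # nearby input that is no longer flagged

# How far the input had to move (Euclidean distance).
distance = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, x_adv)))
```

The perturbed input `x_adv` sits close to `x` but scores on the other side of the line. In a real system the attacker does not get the weights handed to them and has to search, but the boundary is still there to be found.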

The attacks also travel. A 2017 study found that a perturbed input built to fool one model often fools others too, because different models settle on surprisingly similar decision boundaries. So a jailbreak found against one model tends to work on the next. Kill one bypass, the class survives. The guard and the model it protects share the same blind spots.

Assume The Bypass

If the labs building these models can’t make prompt-level defences airtight, the teams deploying them on top won’t either. The model will get talked into things. The only question that matters is what it can reach when it does.
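One way to picture "what it can reach" is a permission check that sits outside the model and runs before every tool call. The sketch below is a hypothetical example: the class, the tool names, and the workspace path are all invented here, and a production setup would lean on operating-system sandboxing and scoped credentials as well. The point is that the check does not depend on what the model was told.

```python
from pathlib import Path


class ToolDenied(Exception):
    """Raised when the agent asks for something outside its sandbox."""


class ContainedAgent:
    """Hypothetical wrapper that limits which tools and files an agent can touch."""

    def __init__(self, workspace, allowed_tools):
        self.workspace = Path(workspace).resolve()
        self.allowed_tools = frozenset(allowed_tools)

    def authorize(self, tool, path=None):
        """Raise ToolDenied unless the tool is allowlisted and the path stays inside the workspace."""
        if tool not in self.allowed_tools:
            raise ToolDenied(f"tool not allowed: {tool}")
        if path is not None:
            # Resolve ".." segments and absolute paths before comparing.
            resolved = (self.workspace / path).resolve()
            inside = resolved == self.workspace or self.workspace in resolved.parents
            if not inside:
                raise ToolDenied(f"path escapes workspace: {path}")
```

With `agent = ContainedAgent("/tmp/agent-workspace", {"read_file"})`, a call to `agent.authorize("read_file", "src/main.py")` passes, while `agent.authorize("shell")` and `agent.authorize("read_file", "../../etc/passwd")` both raise `ToolDenied`. A model that has been talked into something still hits the same wall.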

Stop trying to secure the prompt. Start containing the agent.
