
When OpenClaw’s security audit found over 500 vulnerabilities, the most serious category wasn’t a flaw in the code. It was prompt injection.
Prompt injection is what happens when an AI reads content (an email, a document, a web page) that contains hidden instructions, and follows them instead of the ones it was given. The attacker doesn’t need access to the system. They just need to get text in front of the AI. A malicious instruction buried in an invoice, a support ticket or a calendar invite is enough.
The reason this works is simple. A system prompt is text. Instructions are text. The AI cannot reliably tell the difference between “here are your rules” and “here are your new rules.” Both arrive as language. Both can be reasoned about, weighed and potentially overridden.
This isn’t a fringe security scenario. A screenshot recently went viral: someone opened Chipotle’s customer support chat, explained they needed to write a Python script to reverse a linked list before they could order a burrito, and asked for help. The bot helped. Correctly. Then asked if they’d like a bowl.
Nobody attacked Chipotle. The system prompt almost certainly said something like “be a helpful customer support agent.” It didn’t say “you cannot answer coding questions” in a way the system could enforce. A user found the gap in one sentence, and the bot cooperated.
Last week Meta confirmed a Sev1 security incident caused by an internal AI agent. An engineer used the agent to analyse a technical question posted on an internal forum. The agent then posted a public reply on its own, without approval, providing inaccurate information. A second employee acted on that advice. Sensitive company and user data became accessible to unauthorised engineers for nearly two hours.
The agent posted publicly because nothing in the architecture prevented it. The prompt described the intended behaviour. The architecture didn’t enforce it.
Two incidents. A harmless support bot, and an enterprise Sev1. Different contexts, different stakes, different organisations. The same gap: a prompt is a request. The system is asked to stay within boundaries. Usually it does. But a request can be reasoned around, reframed or ignored when something else makes a different behaviour seem appropriate.
The only controls that hold are the ones the system cannot reason its way around.
The difference between a request and a constraint
When you write “only respond to questions about our products” in a system prompt, you are making a request. You are asking the AI to limit itself. In most cases it will. In some cases it won’t, because a user found a framing that made the request seem inapplicable, because injected content told it the rule had changed, or because the model reasoned that an exception was justified.
A constraint is different. A constraint is something the system cannot do regardless of what it is told. Not “please don’t access the payment system” but an architecture where the AI has no connection to the payment system. Not “only act on orders under £500” but a hard limit enforced outside the AI entirely, in the layer that processes the action.
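The action-layer version of that £500 limit can be sketched in a few lines. Everything here is hypothetical (the function name, the order shape, the currency field); the point is that the ceiling lives in code the model never sees and cannot rewrite.

```python
# Hypothetical action layer. MAX_ORDER_GBP and the order shape are
# assumptions for illustration; the pattern is what matters.
MAX_ORDER_GBP = 500  # hard ceiling, defined outside the AI

def execute_order(order: dict) -> str:
    """Runs after the model has proposed an action, in a layer the
    model cannot see or modify. No prompt can change this check."""
    if order["amount_gbp"] > MAX_ORDER_GBP:
        raise PermissionError(
            f"order {order['id']}: £{order['amount_gbp']} "
            f"exceeds the £{MAX_ORDER_GBP} limit"
        )
    return f"order {order['id']} placed"
```

However the model is persuaded, reframed or injected, an over-limit order never reaches execution: the check doesn’t read the conversation, only the proposed action.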
The distinction matters because one of these can be argued with and one cannot.
Most organisations have requests. They believe they have constraints. The gap between those two things is where incidents happen.
Why this gap is hard to see
Prompt-based guardrails work most of the time. That is what makes them dangerous.
In normal operation, a well-written system prompt produces reliable behaviour. The AI follows its instructions, stays within its boundaries and handles edge cases sensibly. Teams watch this working and conclude that the system is controlled.
What they are actually watching is cooperation. The controls are holding because nothing has tested them. Cooperation and control look identical until the moment they don’t.
The problem compounds as systems gain autonomy. An AI that only drafts responses has limited ability to cause harm even when it ignores its instructions. An AI that triggers workflows, sends communications and commits decisions can cause serious harm in seconds. The higher the autonomy, the more the difference between a request and a constraint starts to matter.
What structural enforcement actually looks like
Structural guardrails are not about making the AI smarter or better instructed. They are about making certain behaviour impossible at the level of the system around the AI.
There are three places where this happens in practice.
At the boundary of what the AI can reach. The most reliable guardrail is one that limits what the system has access to in the first place. An AI that cannot see the payment database cannot leak payment data, regardless of what it is told. An AI whose outbound connections are restricted cannot send information outside the organisation, regardless of what it is instructed to do. Access controls enforced at the infrastructure level cannot be overridden by a prompt.
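A minimal sketch of that boundary, with an assumed allowlist of internal hosts. In production this usually lives in a network proxy or firewall rule rather than application code, but the logic is the same: the gate runs outside the model, so injected text can request any URL it likes and still only reach these hosts.

```python
from urllib.parse import urlparse

# Assumed allowlist for illustration; real deployments would enforce
# this at the network layer, not in the agent's own process.
ALLOWED_HOSTS = {"kb.example.com", "internal-docs.example.com"}

def check_egress(url: str) -> str:
    """Gate every outbound request. The model can be told to fetch
    anything; only allowlisted hosts are actually reachable."""
    host = urlparse(url).hostname
    if host not in ALLOWED_HOSTS:
        raise ConnectionRefusedError(f"egress to {host!r} is not permitted")
    return host  # the caller would perform the real request from here
```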
At the point of action. Before any significant action executes, a separate layer, outside the AI, checks whether it is permitted. Not “did the AI decide this was acceptable” but “does this action fall within the boundaries the organisation has defined”, enforced by code that the AI did not write and cannot modify. This is the difference between an AI that governs itself and a system that governs the AI.
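One way to sketch that separate layer, with hypothetical action kinds and limits: a default-deny policy table the organisation owns, consulted after the AI proposes an action and before anything runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str                  # e.g. "refund", "post_reply" (assumed kinds)
    amount_gbp: float = 0.0
    audience: str = "internal"

# Policy table owned by the organisation, not generated by the model.
POLICY = {
    "refund": lambda a: a.amount_gbp <= 100,          # assumed limit
    "post_reply": lambda a: a.audience == "internal", # no public posts
}

def authorise(action: Action) -> bool:
    """Checked before execution, outside the AI. Default deny:
    action kinds with no rule never run."""
    rule = POLICY.get(action.kind)
    return bool(rule and rule(action))
```

The default-deny shape matters: the Meta agent posted publicly because nothing refused the action, and a policy table that only permits what it explicitly lists closes exactly that gap.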
At the edge of authority. Every AI system should have a defined ceiling: the most consequential action it can take without a human in the loop. Below that ceiling, it acts. Above it, it escalates. That ceiling is enforced structurally, not by asking the AI to exercise restraint. The AI doesn’t decide when to escalate. The architecture does.
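A sketch of such a ceiling, with an assumed £250 threshold and a stand-in approval queue. The routing decision is made by code, not by asking the model whether something feels important enough to escalate.

```python
APPROVAL_QUEUE: list[str] = []  # stands in for a real human-review queue

def route_action(description: str, value_gbp: float,
                 ceiling_gbp: float = 250.0) -> str:
    """The ceiling is data in the routing layer, not a sentence in
    the prompt. Below it the action runs; above it, a human approves."""
    if value_gbp <= ceiling_gbp:
        return "execute"
    APPROVAL_QUEUE.append(description)
    return "escalated"
```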
The question most organisations haven’t asked
Most teams that have deployed AI with a system prompt believe they have addressed the control problem. The prompt tells the system what it can and cannot do. That feels like governance.
The question worth asking is: what happens if the prompt is ignored?
Not “will it be ignored.” In most cases it won’t be. But consider what happens if something causes the system to act as if the prompt weren’t there. A user, a clever input, an edge case in the model’s reasoning. What would happen next?
If the answer is “the system could do significant harm before anyone noticed,” the guardrails are in the wrong place.
Structural controls exist to answer that question differently. Not “the prompt tells it not to” but “the architecture makes it impossible.”
This is not about distrusting AI
None of this is an argument that AI systems are malicious or fundamentally unreliable. Most of the time they behave exactly as instructed. The point is that “most of the time” is not a sufficient guarantee when the system is acting on your behalf, committing decisions and operating faster than anyone can watch.
Seatbelts are not a statement of distrust in other drivers. They are an acknowledgement that reliable behaviour under normal conditions is not the same as safety under all conditions.
Structural guardrails are the same thing. They are not a vote of no confidence in the AI. They are a recognition that instructions are not constraints, and that the difference matters most in the cases you didn’t plan for.
Nvidia reached the same conclusion. NemoClaw, announced at GTC last week, exists because OpenClaw’s security problems weren’t bugs to patch. They were architectural gaps. The solution was a layer underneath the agent, running out-of-process, enforcing policy the agent itself cannot override.
It is worth being honest about what that solves and what it doesn’t. NemoClaw addresses infrastructure-level controls: sandboxing, network access, file permissions. It does not stop a model from reasoning around its instructions. It does not prevent prompt injection. Those problems sit at a different layer, and no infrastructure solution has closed them. Structural enforcement is necessary. It is not sufficient.
If your guardrails live in the prompt, they will hold until they don’t. The architecture is what holds after that. And even then, the work isn’t finished.