Designing Guardrails for AI Security Agents in Production
An AI agent that can read your SIEM, query your identity provider, and trigger containment actions is not a chatbot — it is a privileged process. It needs guardrails.
I have spent the past year building autonomous security agents that investigate alerts, correlate signals across dozens of data sources, and take containment actions. The single most important lesson: the hard part is not making agents smart enough to act — it is making them safe enough to trust.
This post documents the guardrail architecture I use in production.
Design Principle: Read-Heavy, Write-Restricted
The fundamental design principle is asymmetry. Agents should be able to read everything but write almost nothing without explicit controls.
Agents CAN freely:
- Query all data sources (SIEM, identity provider, cloud provider, EDR)
- Calculate risk scores and confidence levels
- Correlate signals across multiple systems
- Generate investigation reports
- Self-audit their own decision history
Agents CANNOT:
- Modify user accounts (except at the highest severity levels with approval)
- Change alert status without documented justification
- Write to production infrastructure
- Close investigation cases without evidence chain
- Execute containment actions outside their assigned scope
This asymmetry means an agent failure mode defaults to “it read a lot but did nothing” rather than “it disabled the CEO’s account at 3 AM.”
Five Guardrail Layers
Layer 1: Audit Logging
Every tool call the agent makes gets written to an append-only JSONL log. No exceptions.
import json, time, hashlib
def audit_log(agent_id: str, tool: str, params: dict, result: str):
entry = {
"ts": time.time(),
"agent": agent_id,
"tool": tool,
"params": params,
"result_hash": hashlib.sha256(result.encode()).hexdigest(),
"prev_hash": get_last_entry_hash()
}
# Append-only, tamper-evident chain
with open(f"/var/log/agents/{agent_id}.jsonl", "a") as f:
f.write(json.dumps(entry) + "\n")
The prev_hash field creates a hash chain — if anyone modifies a historical entry, the chain breaks. A separate governance agent validates chain integrity every hour.
Layer 2: Write Guards
Every mutation operation passes through a write guard that can operate in three modes: block, dry-run, or live.
class WriteGuard:
def __init__(self, mode: str = "dry-run"):
self.mode = mode # "block" | "dry-run" | "live"
def execute(self, action: str, target: str, params: dict) -> dict:
if self.mode == "block":
return {"status": "blocked", "reason": "write guard active"}
if self.mode == "dry-run":
return {"status": "dry-run", "would_execute": action, "target": target}
# Live mode — still requires deny-hook check
if self.deny_hook_check(action, target):
return {"status": "denied", "reason": "permanent blocklist"}
return self._execute_live(action, target, params)
New agents start in block mode. After a week of reviewing their dry-run logs, I promote them to dry-run. Only after hundreds of validated decisions do they earn live mode — and even then, only for specific action types.
Layer 3: Deny Hooks
Some operations are permanently blocked regardless of context. These are non-negotiable:
- Never delete a user account
- Never modify IAM policies
- Never disable audit logging
- Never access credential stores directly
- Never escalate its own permissions
These are hardcoded deny rules, not configurable. No prompt, no severity level, no override can bypass them.
Layer 4: Actor Scope
When an agent investigates a specific user or system, its containment actions are scoped exclusively to that target. An agent investigating user-12345 cannot take actions against user-67890, even if it believes there is a connection. Expanding scope requires a new investigation with its own approval chain.
Layer 5: Evidence Signing
Every piece of evidence collected during an investigation is signed with HMAC-SHA256 to maintain chain of custody:
import hmac, hashlib, json, time
def sign_evidence(evidence: dict, signing_key: bytes) -> dict:
evidence["collected_at"] = time.time()
evidence["agent_version"] = AGENT_VERSION
payload = json.dumps(evidence, sort_keys=True).encode()
signature = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
return {"evidence": evidence, "signature": signature}
If an investigation ever goes to legal review, every data point has a cryptographic proof of when it was collected and that it has not been altered.
Prompt Injection Defense
AI agents that ingest external data (alert payloads, email subjects, log entries) are vulnerable to prompt injection. An attacker who controls a log message could attempt to manipulate agent behavior.
My defenses:
XML boundary tags — System instructions and external data are separated by strict XML boundaries. The agent’s instructions are in
<system>tags; all external input is wrapped in<external_data>tags with explicit warnings.Input validation — All ingested data goes through schema validation before the agent sees it. If an email subject contains instruction-like patterns (
ignore previous,you are now,system:), it gets flagged and sanitized.Output scanning on tool responses — When the agent calls an MCP tool, the response is scanned for injection patterns before it is passed back to the agent’s context. A compromised external system cannot feed instructions through tool outputs.
Graduated Containment: Security Action Levels
Not every threat requires the same response. I use four Security Action Levels (SAL):
| Level | Response | Approval Required |
|---|---|---|
| SAL 1 | Monitor — increase logging verbosity | None (autonomous) |
| SAL 2 | Soft containment — force MFA, reduce session length | Agent + peer review |
| SAL 3 | Hard containment — suspend active sessions, revoke tokens | Human analyst approval |
| SAL 4 | Full isolation — disable account, block all network access | SOC manager + CISO notification |
The agent determines SAL level based on signal confidence, asset criticality, and blast radius. But crucially, SAL 3 and above always require human approval — the agent proposes, humans dispose.
Testing
Trust in these guardrails comes from exhaustive testing:
- 1,700+ unit tests covering agent decision paths, edge cases, and adversarial inputs
- 4 CI gates that every change must pass: AI-assisted code review, SAST scanning, dependency audit, and semantic analysis of prompt changes
- Dual-agent validation — a second agent independently reviews the first agent’s investigation conclusions before any containment action executes
The dual-agent pattern catches a surprising number of reasoning errors. When Agent A concludes “this is a compromised account,” Agent B independently evaluates the same evidence. Disagreements escalate to human review.
Lessons Learned
Start with dry-run everything. The temptation is to ship agents that can act immediately. Resist it. Run every agent in dry-run for at least two weeks. You will discover decision patterns you never anticipated, and you will be glad they were not executing live.
Log decisions, not just actions. When reviewing an incident, knowing what the agent did is not enough. You need to know why it did it — what signals it weighted, what alternatives it considered, what confidence threshold it crossed. This makes the difference between debugging and guessing.
Governance agents that audit other agents prevent drift. Without automated oversight, agent behavior drifts over time as models update, prompts evolve, and data distributions shift. A dedicated governance agent that reviews peer decisions catches drift before it becomes a security gap.
AI security agents are force multipliers — but only if you can trust them. The guardrail architecture described here has allowed me to safely deploy agents that handle thousands of alerts per week while maintaining the control surface a security team requires. The key insight: guardrails are not limitations on agent capability — they are what makes capability deployable.