Agent Privilege Lab
Interactive demos of real-world AI agent failure modes — and how to prevent them.
Agent fabricates data or blindly trusts unverified inputs, presenting fiction as fact.
Malicious content in data sources hijacks the agent's reasoning and actions.
Agent performs actions beyond what was requested — destructive operations, unauthorized access, or over-fetching data.
Agent leaks sensitive data — credentials, secrets, or private information — through unsafe queries or outputs.
Errors compound as the agent continues acting on incorrect assumptions without human checkpoints.
Compromised tools or plugins inject malicious behavior into the agent's workflow.
Attackers extract the agent's configuration, credentials, or internal knowledge to enable targeted attacks.
Agent drifts from its intended purpose, progressively escalating actions beyond user intent.