What happened after 2,000 people tried to hack my AI assistant · Simon Willison's Weblog
Science, Technology & Innovation · Jun 26, 2026
Frontier-model training—not just hand-written guardrails—is now materially improving resistance to prompt injection (noted by Simon Willison and mentioned in the GPT-5.6 system card), making attacks much harder though not proving safety, and shifting some security responsibility upstream so model choice meaningfully affects the security of agentic products exposed to untrusted natural-language inputs.
What happened after 2,000 people tried to hack my AI assistant · Simon Willison's Weblog
Science, Technology & Innovation · Jun 26, 2026
In a public live prompt-injection test (Fernando Irarrázaval’s hackmyclaw.com), an email-connected OpenClaw assistant with model-level protections and explicit system constraints resisted ~6,000 adversarial email attempts and about $500 in token spend without leaking a protected secret, suggesting frontier models can materially raise the cost of prompt-injection attacks in bounded environments.
What happened after 2,000 people tried to hack my AI assistant · Simon Willison's Weblog
Science, Technology & Innovation · Jun 26, 2026
Improved resistance to prompt injection does not equal production-ready safety when failures are irreversible: adversarial asymmetry means one advanced bypass can still cause catastrophic harm, so use hardening as a layer but avoid architectures where a single model mistake can expose secrets, alter files, run commands, or trigger irreversible actions.