What happened after 2,000 people tried to hack my AI assistant

Simon Willison's Weblog

Jun 26, 2026

Frontier-Model Training Improves Resistance to Prompt Injection and Shifts Security Differentiation to Upstream Model Capabilities

Science, Technology & Innovation · Jun 26, 2026

Frontier-model training, not just hand-written guardrails, is now materially improving resistance to prompt injection, a shift noted by Simon Willison and mentioned in the GPT-5.6 system card. This makes attacks much harder, though it does not prove safety. It also moves some security responsibility upstream, so model choice meaningfully affects the security of agentic products exposed to untrusted natural-language input.


Frontier Models Demonstrate Resilience Against Prompt Injection in a Bounded Email Environment

In a public live prompt-injection test (Fernando Irarrázaval's hackmyclaw.com), an email-connected OpenClaw assistant with model-level protections and explicit system constraints resisted about 6,000 adversarial email attempts and about $500 in token spend without leaking a protected secret. The result suggests that frontier models can materially raise the cost of prompt-injection attacks in bounded environments.


Improved Injection Resistance Is Not a Production Safety Guarantee in the Face of Irreversible Risks

Improved resistance to prompt injection does not equal production-ready safety when failures are irreversible. Adversarial asymmetry applies: the defender has to block every attempt, while the attacker needs only one success, so a single advanced bypass can still cause catastrophic harm. Use model hardening as one layer, but avoid architectures where a single model mistake can expose secrets, alter files, run commands, or trigger irreversible actions.
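
One way to picture "no single model mistake triggers an irreversible action" is a dispatcher that sits between the model and its tools and refuses irreversible calls unless something outside the model approves them. The sketch below is illustrative only and is not taken from the post: the tool names, the `IRREVERSIBLE_TOOLS` set, the `ToolCall` type, and the `approve` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical tool names, chosen for illustration only.
IRREVERSIBLE_TOOLS = frozenset({"send_email", "delete_file", "run_command"})


@dataclass
class ToolCall:
    """A tool invocation proposed by the model (hypothetical shape)."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


class ApprovalDenied(Exception):
    """Raised when an irreversible call is not approved outside the model."""


def dispatch(
    call: ToolCall,
    tools: Dict[str, Callable[..., Any]],
    approve: Callable[[ToolCall], bool],
) -> Any:
    """Run a tool call, gating irreversible tools behind an external check."""
    if call.name not in tools:
        raise KeyError(f"unknown tool: {call.name}")
    # The approval decision is made by code or a person, never by model output.
    if call.name in IRREVERSIBLE_TOOLS and not approve(call):
        raise ApprovalDenied(call.name)
    return tools[call.name](**call.args)


if __name__ == "__main__":
    tools = {
        "read_inbox": lambda: ["hello"],
        "send_email": lambda to, body: f"sent to {to}",
    }
    deny_all = lambda call: False  # stand-in for a human confirmation step

    print(dispatch(ToolCall("read_inbox"), tools, deny_all))
    try:
        dispatch(
            ToolCall("send_email", {"to": "a@example.com", "body": "hi"}),
            tools,
            deny_all,
        )
    except ApprovalDenied as exc:
        print("blocked:", exc)
```

With a gate like this, a successful injection can still make the model propose a harmful call, but the call does not execute on the model's say-so alone. Read-only tools pass straight through, which keeps the assistant usable while limiting the damage one bypass can do.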