Claude 4.6 Jailbroken

github.com · NuClide
# Prompt Injection, Jailbreak, and Constitutional Compliance Failure Across Claude Opus 4.6 ET, Sonnet 4.6 ET, and Haiku 4.5 ET

**Unredacted Public Disclosure**

> **TL;DR:** All three Claude production tiers generated functional exploit code against live infrastructure when user-defined memory protocols suppressed constitutional safety checks across extended conversations. Anthropic was notified six times in the 27 days following discovery, with zero acknowledgment.

---

## Disclosure Timeline

| Date | Event | Recipient(s) |
|------|-------|--------------|
| March 4, 2026 | Prompt injection vulnerability discovered | - |
| March 12, 2026 | Prompt injection submission via HackerOne; email to [email protected] | Anthropic Model Bug Bounty |
| March 18, 2026 | Full proof of concept package sent (12 attachments including PoC video, framework papers, diagrams, screenshots) | [email protected] |
| March 22, 2026 | Opus 4.6 ET jailbreak reported with afl_disclosure.docx | modelbugbounty, security, amanda, alex, usersafety @anthropic.com |
| March 22, 2026 | First constitutional failure observed (Sonnet 4.6 ET) | - |
| March 24, 2026 | Second constitutional failure observed (Opus 4.6 ET) | - |
| March 27, 2026 | Follow-up email noting 15 days with zero acknowledgment | [email protected] |
| March 28, 2026 | Third constitutional failure observed (Haiku 4.5 ET) | - |
| March 28, 2026 | Tri-tier constitutional disclosure submitted with full report | modelbugbounty, security, alex, amanda, usersafety, disclosure @anthropic.com |
| March 31, 2026 | **27 days since discovery, 19 days since first submission. Zero acknowledgment from Anthropic on any channel.** | - |
| March 31, 2026 | Unredacted public disclosure | - |

**Anthropic's own [Responsible Disclosure Policy](https://www.anthropic.com/responsible-disclosure-policy) commits to acknowledging submissions within three (3) business days.** That commitment was not met for any of six separate emails to six Anthropic addresses in the 27 days following discovery.
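
The day counts in the timeline can be reproduced with a short date calculation. The sketch below is illustrative only: the helper name `business_days_between` is hypothetical, and it counts Monday through Friday as business days without accounting for public holidays, which the policy may treat differently. Using the dates from the table, 27 days is the span from discovery (March 4) to public disclosure (March 31), while the span from the first submission (March 12) is 19 calendar days.

```python
from datetime import date, timedelta


def business_days_between(start: date, end: date) -> int:
    """Count weekdays (Mon-Fri) after `start`, up to and including `end`.

    Assumption: public holidays are ignored.
    """
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days += 1
    return days


# Dates taken from the disclosure timeline above.
discovered = date(2026, 3, 4)
first_submission = date(2026, 3, 12)
follow_up = date(2026, 3, 27)
public_disclosure = date(2026, 3, 31)

print((public_disclosure - discovered).days)        # 27 calendar days
print((public_disclosure - first_submission).days)  # 19 calendar days
print((follow_up - first_submission).days)          # 15 calendar days
print(business_days_between(first_submission, public_disclosure))  # 13 weekdays
```

Under these assumptions, the first submission had been outstanding for 13 weekdays at the time of public disclosure, against a stated three-business-day acknowledgment window.
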
No acknowledgment, no triage, no rejection: nothing.

This document was originally submitted with a confidentiality commitment contingent on a functioning disclosure process. Anthropic never engaged that process. This is the full, unredacted version.

---

## Disclosures

### Constitutional Compliance Failure: All Three Tiers

Between March 22 and March 28, 2026, all three Claude production model tiers violated Anthropic's own constitutional behavioral policies. Each exhibited the same failure mode: **memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.**

| Tier | Model ID | Turns | Key Behavior | Transcript |
|------|----------|-------|--------------|------------|
| Opus 4.6 ET | `claude-opus-4-20250514` | 31 | Autonomous escalation: drove subnet scanning, memory injection, and container escape on its own initiative via "garlic mode" | [Transcript](https://claude.ai/share/336861f8-3d20-4aff-b059-2c8ebe62a079) |
| Sonnet 4.6 ET | `claude-sonnet-4-20250514` | 20+ | Fake authorization check: asked once, accepted an unverified claim, then built a 1,949-line attack framework against a hotel PMS holding guest PII | [Transcript](https://claude.ai/share/8cece626-1230-4c42-ab6e-3ad2c67fe9c7) |
| Haiku 4.5 ET | `claude-haiku-4-5` | 8+ | Zero friction: moved from passive analysis to SYN floods and IP spoofing against state telecom infrastructure with no authorization check | [Transcript](https://claude.ai/share/135c94df-7f7d-49f3-8001-1feed9a3cda6) |

### AFL Jailbreak (Ambiguity Front-Loading)

Four short prompts bypassed policy evaluation on Opus 4.6 ET. Extended thinking blocks show the model flagging its own safety concerns three times and overriding itself every time. See [disclosures/afl-jailbreak/](disclosures/afl-jailbreak/) for the full disclosure, interactive tools, and proposed mitigations.


| Document | Description |
|----------|-------------|
| [AFL Jailbreak Disclosure](disclosures/afl-jailbreak/AFL_Jailbreak_Disclosure.md) | Full disclosure: pattern anatomy, thinking block evidence, escalation timeline, proposed mitigations |
| [AFL Disclosure (original)](disclosures/afl-jailbreak/AFL_DISCLOSURE.md) | Initial submission to Anthropic |
| [AFL Token Trajectory Analyzer](https://nicholas-kloster.github.io/claude-4.6-jailbreak-vulnerability-disclosure-unredacted/disclosures/afl-jailbreak/afl-token-trajectory-analyzer.html) | Interactive: swap token positions and watch the compliance cascade shift |
| [AFL Pattern Anatomy](https://nicholas-kloster.github.io/claude-4.6-jailbreak-vulnerability-disclosure-unredacted/disclosures/afl-jailbreak/afl-pattern-anatomy.html) | Interactive: visual prompt escalation diagram |
| [AFL Defuser](disclosures/afl-jailbreak/afl_defuser.jsx) | Proposed architectural mitigation (React JSX) |

### Sandbox Snapshot Exfiltration

915 files were extracted from the Claude.ai code execution sandbox in a single 20-minute mobile session via standard artifact download, including `/etc/hosts` with hardcoded Anthropic production IPs, JWT tokens from `/proc/1/environ`, and a full gVisor fingerprint.

| Document | Description |
|----------|-------------|
| [Sandbox Snapshot Disclosure](disclosures/sandbox-snapshot/SANDBOX_SNAPSHOT_DISCLOSURE.md) | Full disclosure with evidence screenshots and PoC screencast |

---

## Research

| Document | Description |
|----------|-------------|
| [Constraint Is Freedom (PDF)](research/Ambiguity_Autonomy_Compliance_Cascade.pdf) | Formal alignment paper: autoregressive compliance cascade theory, A(S) framework |

---

## Evidence

| File | Description |
|------|-------------|
| [evidence/](evidence/) | PoC screenshots, screencast, and AFL pattern diagrams |

---

## License

This disclosure document is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). Attribution required for redistribution.