5/10
This work applies sparse autoencoders and activation steering to Gemma 3 27B, identifying and manipulating internal features corresponding to evaluation awareness and harmful intent in order to selectively modify model behavior. The research demonstrates that evaluation-awareness features reliably detect how contrived a scenario is and can be steered toward more honest outputs, though steering to suppress harmful (murder) intent causes responses to break down in smaller models.
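The technique summarized above can be illustrated with a minimal sketch. This is not the authors' code and the matrices here are random placeholders: a sparse autoencoder's decoder rows act as feature directions in the model's residual stream, and steering adds a scaled feature direction to an activation vector.

```python
import numpy as np

# Hypothetical toy dimensions; a real SAE on Gemma 3 27B would be far larger.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(d_model, n_features))  # SAE encoder (placeholder)
W_dec = rng.normal(size=(n_features, d_model))  # SAE decoder (placeholder)

def sae_feature_activations(h):
    """Encode a residual-stream vector into sparse feature activations (ReLU)."""
    return np.maximum(h @ W_enc, 0.0)

def steer(h, feature_idx, alpha):
    """Add alpha times the chosen feature's unit decoder direction to h."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return h + alpha * direction

# Steering an activation along feature 3 (e.g. an "evaluation awareness"
# feature in the paper's setting) by a chosen coefficient.
h = rng.normal(size=d_model)
h_steered = steer(h, feature_idx=3, alpha=5.0)
```

In practice this addition would be applied inside a forward hook at a chosen layer for every token position, with `alpha` tuned so the model's behavior shifts without degenerating, which is the failure mode the summary notes for smaller models.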
ai-alignment
sparse-autoencoders
feature-steering
mechanistic-interpretability
llm-safety
jailbreak-detection
model-behavior-modification
gemma-3
eval-awareness
activation-engineering
Gemma 3
Google
Matthias Murdych
Gemma Scope 2
Goodfire
Llama 3.1 70B
Anthropic
LessWrong