Rating: 5/10
This work applies sparse autoencoders and activation steering to Gemma 3 27B, identifying internal features corresponding to evaluation awareness and harmful intent and manipulating them to selectively modify model behavior. The research demonstrates that evaluation-awareness features reliably detect scenario contrivedness and can be steered to produce more honest outputs, though steering to reduce murder intent causes responses to break down in smaller models.
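The core steering operation described above can be sketched as follows — a minimal numerical example, not the authors' implementation. It assumes you already have an SAE decoder direction for the target feature (e.g. an "evaluation awareness" feature) and a layer's residual-stream activations; all names here are illustrative:

```python
import numpy as np

def steer_activations(hidden, feature_direction, alpha):
    """Activation steering: add a scaled, unit-norm SAE decoder
    direction to residual-stream activations at one layer.

    hidden:            (seq_len, d_model) activations
    feature_direction: (d_model,) SAE decoder column for the feature
    alpha:             steering coefficient; positive amplifies the
                       feature, negative suppresses it
    """
    direction = feature_direction / np.linalg.norm(feature_direction)
    return hidden + alpha * direction

# Toy usage: suppress a hypothetical feature with alpha = -3.0
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))        # 4 tokens, 8-dim toy residual stream
d = rng.normal(size=8)             # hypothetical feature direction
steered = steer_activations(h, d, alpha=-3.0)
```

After steering, each token's projection onto the (normalized) feature direction shifts by exactly alpha, which is why large negative coefficients can push activations far off-distribution and degrade outputs, as the summary notes for smaller models.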
ai-alignment
sparse-autoencoders
feature-steering
mechanistic-interpretability
llm-safety
jailbreak-detection
model-behavior-modification
gemma-3
eval-awareness
activation-engineering
Gemma 3
Google
Matthias Murdych
Gemma Scope 2
Goodfire
Llama 3.1 70B
Anthropic
LessWrong