Rating: 5/10
This work applies sparse autoencoders and activation steering to Gemma 3 27B, identifying internal features corresponding to evaluation awareness and harmful intent and manipulating them to selectively modify model behavior. The research demonstrates that evaluation-awareness features reliably detect scenario contrivedness and can be steered to produce more honest outputs, though steering to reduce murder intent causes responses to break down in smaller models.
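The core steering operation described above can be sketched as follows — a minimal numerical example, not the authors' implementation. It assumes you already have an SAE decoder direction for the target feature (e.g. an "evaluation awareness" feature) and a layer's residual-stream activations; all names here are illustrative:

```python
import numpy as np

def steer_activations(hidden, feature_direction, alpha):
    """Activation steering: add a scaled, unit-norm SAE decoder
    direction to residual-stream activations at one layer.

    hidden:            (seq_len, d_model) activations
    feature_direction: (d_model,) SAE decoder column for the feature
    alpha:             steering coefficient; positive amplifies the
                       feature, negative suppresses it
    """
    direction = feature_direction / np.linalg.norm(feature_direction)
    return hidden + alpha * direction

# Toy usage: suppress a hypothetical feature with alpha = -3.0
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))        # 4 tokens, 8-dim toy residual stream
d = rng.normal(size=8)             # hypothetical feature direction
steered = steer_activations(h, d, alpha=-3.0)
```

After steering, each token's projection onto the (normalized) feature direction shifts by exactly alpha, which is why large negative coefficients can push activations far off-distribution and degrade outputs, as the summary notes for smaller models.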
ai-alignment
sparse-autoencoders
feature-steering
mechanistic-interpretability
llm-safety
jailbreak-detection
model-behavior-modification
gemma-3
eval-awareness
activation-engineering
Gemma 3
Google
Matthias Murdych
Gemma Scope 2
Goodfire
Llama 3.1 70B
Anthropic
LessWrong