5/10
This work applies sparse autoencoders and activation steering to Gemma 3 27B, identifying and manipulating internal features corresponding to evaluation awareness and harmful intent in order to selectively modify model behavior. The research demonstrates that evaluation-awareness features reliably detect how contrived a scenario is and can be steered toward more honest outputs, though steering to suppress harmful (murder) intent causes responses to break down in smaller models.
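The technique summarized above can be illustrated with a minimal sketch. This is not the authors' code and the matrices here are random placeholders: a sparse autoencoder's decoder rows act as feature directions in the model's residual stream, and steering adds a scaled feature direction to an activation vector.

```python
import numpy as np

# Hypothetical toy dimensions; a real SAE on Gemma 3 27B would be far larger.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(d_model, n_features))  # SAE encoder (placeholder)
W_dec = rng.normal(size=(n_features, d_model))  # SAE decoder (placeholder)

def sae_feature_activations(h):
    """Encode a residual-stream vector into sparse feature activations (ReLU)."""
    return np.maximum(h @ W_enc, 0.0)

def steer(h, feature_idx, alpha):
    """Add alpha times the chosen feature's unit decoder direction to h."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return h + alpha * direction

# Steering an activation along feature 3 (e.g. an "evaluation awareness"
# feature in the paper's setting) by a chosen coefficient.
h = rng.normal(size=d_model)
h_steered = steer(h, feature_idx=3, alpha=5.0)
```

In practice this addition would be applied inside a forward hook at a chosen layer for every token position, with `alpha` tuned so the model's behavior shifts without degenerating, which is the failure mode the summary notes for smaller models.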
ai-alignment
sparse-autoencoders
feature-steering
mechanistic-interpretability
llm-safety
jailbreak-detection
model-behavior-modification
gemma-3
eval-awareness
activation-engineering
Gemma 3
Google
Matthias Murdych
Gemma Scope 2
Goodfire
Llama 3.1 70B
Anthropic
LessWrong