Rating: 5/10
This work applies sparse autoencoders and activation steering to Gemma 3 27B, selectively modifying model behavior by identifying and manipulating internal features that correspond to evaluation awareness and harmful intent. The research demonstrates that evaluation-awareness features reliably detect how contrived a scenario is and can be steered to produce more honest outputs, though steering to reduce murder intent causes responses to break down in smaller models.
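The core mechanism described above, activation steering along an SAE feature direction, amounts to adding a scaled, unit-normalized decoder vector to the residual-stream activations. A minimal NumPy sketch of that arithmetic (the function name, dimensions, and random vectors here are illustrative, not from the paper):

```python
import numpy as np

def steer_activations(h, direction, alpha):
    """Add a scaled, unit-normalized feature direction to hidden states.

    h: (seq_len, d_model) activations at some layer.
    direction: (d_model,) a feature's decoder vector from an SAE.
    alpha > 0 amplifies the feature; alpha < 0 suppresses it.
    """
    unit = direction / np.linalg.norm(direction)
    return h + alpha * unit

# Toy demo: 4 token positions, 8-dim hidden states, a made-up direction.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
direction = rng.normal(size=8)
steered = steer_activations(h, direction, alpha=-2.0)

# Each position's projection onto the feature direction shifts by alpha.
unit = direction / np.linalg.norm(direction)
shift = steered @ unit - h @ unit
print(np.allclose(shift, -2.0))  # True
```

In practice this addition is applied at a chosen layer during the forward pass (e.g. via a forward hook), with alpha swept to trade off steering strength against the response degradation the summary notes in smaller models.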
ai-alignment
sparse-autoencoders
feature-steering
mechanistic-interpretability
llm-safety
jailbreak-detection
model-behavior-modification
gemma-3
eval-awareness
activation-engineering
Gemma 3
Google
Matthias Murdych
Gemma Scope 2
Goodfire
Llama 3.1 70B
Anthropic
LessWrong