bug-bounty (521)
xss (284)
rce (145)
bragging-post (118)
google (113)
account-takeover (110)
open-source (94)
exploit (91)
authentication-bypass (89)
privilege-escalation (88)
csrf (86)
facebook (81)
microsoft (79)
stored-xss (75)
cve (69)
malware (69)
access-control (68)
web-security (65)
ai-agents (64)
reflected-xss (63)
writeup (55)
ssrf (51)
input-validation (51)
phishing (50)
smart-contract (49)
sql-injection (49)
defi (48)
cross-site-scripting (48)
privacy (47)
tool (47)
information-disclosure (46)
ethereum (45)
api-security (45)
web-application (40)
cloudflare (40)
vulnerability-disclosure (39)
apple (38)
reverse-engineering (37)
llm (37)
automation (36)
opinion (36)
burp-suite (36)
oauth (35)
dos (35)
web3 (35)
responsible-disclosure (35)
browser (33)
smart-contract-vulnerability (33)
html-injection (33)
lfi (33)
5/10
This work applies sparse autoencoders and activation steering to Gemma 3 27B, selectively modifying model behavior by identifying and manipulating internal features corresponding to evaluation awareness and harmful intent. The research demonstrates that evaluation-awareness features reliably detect how contrived a scenario is and can be steered to produce more honest outputs, though steering to reduce murder intent causes responses to break down in smaller models.
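The summary describes steering model behavior by adding an SAE feature's decoder direction to a residual-stream activation. A minimal toy sketch of that mechanism, assuming tied unit-norm weights and random data (this is illustrative only, not the authors' code or Gemma Scope's actual parameters):

```python
import numpy as np

# Hypothetical toy sketch of SAE-based activation steering.
# An SAE reconstructs a residual-stream activation x from sparse features:
#   f(x) = relu(W_enc @ x + b_enc),   x_hat = W_dec @ f(x) + b_dec
# "Steering" adds a multiple of one feature's decoder direction to x.

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                  # toy dimensions, not Gemma's

W_dec = rng.normal(size=(d_model, d_sae))
W_dec /= np.linalg.norm(W_dec, axis=0)   # unit-norm decoder directions
W_enc = W_dec.T                          # tied weights: a common simplification
b_enc = np.zeros(d_sae)

def sae_features(x):
    """Sparse feature activations for a residual-stream vector x."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def steer(x, feature_idx, coeff):
    """Add coeff times the decoder direction of feature_idx to x."""
    return x + coeff * W_dec[:, feature_idx]

x = rng.normal(size=d_model)
idx = int(np.argmax(sae_features(x)))    # most active feature in this toy input
x_steered = steer(x, idx, coeff=5.0)

# The targeted feature's activation rises by roughly coeff (tied, unit-norm
# weights), which is how a behavior-linked feature is dialed up or down.
assert sae_features(x_steered)[idx] > sae_features(x)[idx]
```

In practice the steered vector would be written back into the model's forward pass (e.g. via a hook at the chosen layer) so downstream computation reflects the amplified or suppressed feature; here the effect is only checked through the SAE encoder.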
ai-alignment
sparse-autoencoders
feature-steering
mechanistic-interpretability
llm-safety
jailbreak-detection
model-behavior-modification
gemma-3
eval-awareness
activation-engineering
Gemma 3
Google
Matthias Murdych
Gemma Scope 2
Goodfire
Llama 3.1 70B
Anthropic
LessWrong