bug-bounty (521)
xss (284)
rce (147)
bragging-post (118)
google (113)
account-takeover (110)
open-source (94)
exploit (92)
authentication-bypass (88)
privilege-escalation (87)
csrf (86)
facebook (81)
microsoft (79)
stored-xss (75)
cve (70)
malware (69)
access-control (67)
web-security (65)
ai-agents (64)
reflected-xss (63)
writeup (55)
input-validation (51)
ssrf (51)
phishing (50)
smart-contract (48)
sql-injection (48)
defi (48)
cross-site-scripting (48)
privacy (47)
tool (47)
information-disclosure (45)
api-security (44)
ethereum (44)
web-application (40)
cloudflare (40)
apple (38)
reverse-engineering (37)
vulnerability-disclosure (37)
llm (37)
burp-suite (36)
opinion (36)
automation (36)
web3 (36)
oauth (35)
dos (35)
responsible-disclosure (34)
html-injection (33)
browser (33)
idor (33)
smart-contract-vulnerability (33)
5/10
This work applies sparse autoencoders and activation steering to Gemma 3 27B, selectively modifying model behavior by identifying and manipulating internal features corresponding to evaluation awareness and harmful intent. The research demonstrates that evaluation-awareness features reliably detect how contrived a scenario is and can be steered to produce more honest outputs, though steering to suppress murder intent causes response breakdown in smaller models.
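The activation-steering technique the summary describes can be sketched minimally: add a scaled, unit-normalized SAE feature direction to the residual-stream activations at some layer. The function name, shapes, and coefficient below are illustrative assumptions, not details from the post:

```python
import numpy as np

def steer(activations: np.ndarray, feature_dir: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled, unit-normalized feature direction to residual-stream activations.

    activations: (seq_len, d_model) activations at one layer.
    feature_dir: (d_model,) SAE decoder direction for the target feature.
    alpha: steering coefficient; positive amplifies the feature, negative suppresses it.
    """
    unit = feature_dir / np.linalg.norm(feature_dir)
    return activations + alpha * unit

# Toy example: 4 token positions, 8-dim residual stream, feature along axis 0.
acts = np.zeros((4, 8))
direction = np.eye(8)[0]
steered = steer(acts, direction, alpha=2.0)
print(steered[:, 0])  # every position shifted by 2.0 along the feature direction
```

In practice this addition is applied inside the model via a forward hook at a chosen layer, and negative alpha is what "suppressing" a feature such as harmful intent would correspond to.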
ai-alignment
sparse-autoencoders
feature-steering
mechanistic-interpretability
llm-safety
jailbreak-detection
model-behavior-modification
gemma-3
eval-awareness
activation-engineering
Gemma 3
Google
Matthias Murdych
Gemma Scope 2
Goodfire
Llama 3.1 70B
Anthropic
LessWrong