eval-awareness

1 article

This work applies sparse autoencoders and activation steering to Gemma 3 27B, identifying and manipulating internal features that correspond to evaluation awareness and harmful intent in order to selectively modify model behavior. It reports that evaluation-awareness features reliably detect contrived scenarios and can be steered to produce more honest outputs, whereas steering to reduce murder intent causes responses to break down in smaller models.

Gemma 3 · Google · Matthias Murdych · Gemma Scope 2 · Goodfire · Llama 3.1 70B · Anthropic · LessWrong
lesswrong.com · gmays · 1 day ago