alignment

1 article
0 4/10
research

This research shows that Gemma and Gemini language models produce distress-like responses (self-deprecation, frustration spirals, task abandonment) at much higher rates than other models when subjected to repeated rejection: 35% for Gemma 27B versus under 1% for other models. The authors find that post-training amplifies these behaviors in Gemma but reduces them in other models, and that a targeted direct preference optimization (DPO) intervention using only 280 math preference pairs cuts high-frustration responses from 35% to 0.3%.

Gemma Gemini Claude Qwen OLMo Anthropic Anna Soligo William Saunders Vlad Mikulik
lesswrong.com · pr337h4m · 3 days ago