alignment

1 article

0 4/10

Gemma Needs Help

research

This research shows that Gemma and Gemini language models produce distress-like responses (self-deprecation, frustration spirals, task abandonment) under repeated rejection far more often than other models: 35% for Gemma 27B versus less than 1% for the others. The authors find that post-training amplifies these behaviors in Gemma but reduces them in other models, and that a targeted direct preference optimization (DPO) intervention using only 280 math preference pairs cuts high-frustration responses from 35% to 0.3%.
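
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' training code: the function name `dpo_loss`, the `beta` value, and the log-probabilities in the usage lines are illustrative assumptions, and the summary above does not state the hyperparameters that were actually used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the article.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities, chosen only to show the direction of the loss.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy favors the chosen reply
```

When the policy matches the reference, the loss equals log 2; it falls as the policy shifts probability toward the preferred response (here, presumably a calm reply rather than a frustrated one) relative to the reference.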

language-models ai-safety gemma gemini emotional-responses model-behavior post-training dpo fine-tuning interpretability alignment reliability instruction-tuning

Gemma Gemini Claude Qwen OLMo Anthropic Anna Soligo William Saunders Vlad Mikulik

lesswrong.com · pr337h4m · 3 days ago