This work uses sparse autoencoders and activation steering on Gemma 3 27B to selectively modify model behavior by identifying and manipulating internal features corresponding to evaluation awareness and harmful intent. The research demonstrates that evaluation-awareness features reliably detect scenario contrivedness and can be steered to produce more honest outputs, though steering to reduce murder intent causes response breakdown in smaller models.
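The core mechanic summarized above, steering along an SAE feature direction, can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's code: `steer`, the random stand-in vectors, and the scale `alpha` are all assumptions; in practice the direction would come from an SAE decoder column and the activation from a residual-stream hook.

```python
import numpy as np

def steer(activation: np.ndarray, feature_dir: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha times the unit feature direction to an activation vector."""
    unit = feature_dir / np.linalg.norm(feature_dir)
    return activation + alpha * unit

rng = np.random.default_rng(0)
act = rng.normal(size=8)        # stand-in for a residual-stream activation
direction = rng.normal(size=8)  # stand-in for an SAE feature direction

steered = steer(act, direction, alpha=4.0)

# By construction, the projection onto the feature direction grows by alpha.
unit = direction / np.linalg.norm(direction)
delta = float((steered - act) @ unit)
print(round(delta, 6))  # → 4.0
```

Negative `alpha` suppresses the feature instead, which is the mode the summary's "reducing murder intent" experiments correspond to.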
The article argues that individual AI productivity gains (10x per person) haven't translated into organizational value because institutions haven't redesigned themselves around the technology, analogous to how textile mills took 30 years to benefit from electrification. It proposes seven pillars of "Institutional AI," including coordination, signal extraction from noise, and bias mitigation through deterministic agentic systems.
A policy and governance analysis arguing that Anthropic's refusal to remove its ethical redlines on mass surveillance and autonomous weapons sets a necessary precedent against future government coercion of AI companies, and that widespread AI adoption will structurally enable authoritarian surveillance unless normative guardrails are established now.