This work uses sparse autoencoders and activation steering on Gemma 3 27B to selectively modify model behavior by identifying and manipulating internal features corresponding to evaluation awareness and harmful intent. The research demonstrates that evaluation-awareness features reliably detect scenario contrivedness and can be steered to produce more honest outputs, though steering to reduce murder intent causes response breakdown in smaller models.
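The core mechanic summarized above, steering along an SAE feature direction, can be sketched in a few lines. This is a hypothetical toy illustration, not the paper's code: `steer`, the random stand-in vectors, and the scale `alpha` are all assumptions; in practice the direction would come from an SAE decoder column and the activation from a residual-stream hook.

```python
import numpy as np

def steer(activation: np.ndarray, feature_dir: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha times the unit feature direction to an activation vector."""
    unit = feature_dir / np.linalg.norm(feature_dir)
    return activation + alpha * unit

rng = np.random.default_rng(0)
act = rng.normal(size=8)        # stand-in for a residual-stream activation
direction = rng.normal(size=8)  # stand-in for an SAE feature direction

steered = steer(act, direction, alpha=4.0)

# By construction, the projection onto the feature direction grows by alpha.
unit = direction / np.linalg.norm(direction)
delta = float((steered - act) @ unit)
print(round(delta, 6))  # → 4.0
```

Negative `alpha` suppresses the feature instead, which is the mode the summary's "reducing murder intent" experiments correspond to.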
The article argues that individual AI productivity gains (10x per person) haven't translated into organizational value because institutions haven't redesigned themselves around the technology, analogous to how textile mills took 30 years to benefit from electrification. It proposes seven pillars of "Institutional AI," including coordination, signal extraction from noise, and bias mitigation through deterministic agentic systems.
A policy and governance analysis arguing that Anthropic's refusal to remove its ethical redlines on mass surveillance and autonomous weapons sets a necessary precedent against future government coercion of AI companies, and that widespread AI adoption will structurally enable authoritarian surveillance unless normative guardrails are established now.