LLMs audit code from the same blind spot they wrote it from. Here's the fix
Key findings:

- Confidence-Coverage Divergence (CCD): repeating the same probe axis decreases output entropy (rising false certainty) while bug-class coverage stays flat.
- P2 floor: when the false-positive rate crosses ~40% on two consecutive fresh-axis waves with zero new critical bugs, the surface is clean. The FP rate acts as an entropy meter.
- Rotation > diversity: rotating a single model across three orthogonal axes outperformed using three different models on the same axis.
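As a concrete reading of the P2-floor stopping rule, here is a minimal Python sketch. The names (`Wave`, `p2_floor_reached`) and the exact predicate are my own illustration, not code from the paper:

```python
# Hypothetical sketch of the P2-floor stopping rule: halt auditing once two
# consecutive fresh-axis waves each exceed ~40% false positives while
# surfacing zero new critical (P0/P1) bugs.
from dataclasses import dataclass

@dataclass
class Wave:
    axis: str           # probe axis used for this audit wave
    findings: int       # total findings flagged by the model
    false_positives: int
    new_critical: int   # previously unseen critical bugs

def fp_rate(w: Wave) -> float:
    return w.false_positives / w.findings if w.findings else 0.0

def p2_floor_reached(waves: list[Wave], threshold: float = 0.40) -> bool:
    """True when the last two waves used fresh (different) axes, both
    crossed the FP threshold, and neither found a new critical bug."""
    if len(waves) < 2:
        return False
    prev, last = waves[-2], waves[-1]
    return (last.axis != prev.axis
            and fp_rate(last) > threshold and fp_rate(prev) > threshold
            and last.new_critical == 0 and prev.new_critical == 0)
```

The FP rate serves as the entropy meter here: a rising share of hallucinated findings on fresh axes, with nothing real left to find, is the stop signal.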
Scale of the test: earlier this week I ran a 36-hour marathon audit across 150+ surfaces, taking the full 350K-line codebase to its systemic P2 floor. Yield: 60+ P0 bugs fixed and ~150 P1 bugs catalogued (e.g., OAuth sentinel bypasses, silent cache-invalidation race conditions), each invisible to the other probe axes. Same-axis repetition plateaued at ~20% bug-class discovery yield, while orthogonal rotation reached ~80%, a 4–5× advantage. The app is perceptibly faster as a result.

I wrote a short paper formalizing the method and the supporting topological observations. To verify this wasn't just a prompting trick, I ran persistent homology (Vietoris-Rips on Gemini semantic embeddings of 58 production bug classes). It revealed 20 significant β₁ interior loops: evidence that the bug classes form geometric structure in semantic space that same-axis probing structurally cannot exhaust.

Preprint (Zenodo): https://doi.org/10.5281/zenodo.19223166

Caveats: this is a single real-world codebase, not a controlled experiment. The survival curves are strong evidence, not final proof.

What I'm genuinely curious about:
- Has anyone else seen meaningfully better LLM bug detection by rotating audit axes?
- Does Confidence-Coverage Divergence (CCD) appear in LLM evaluation loops (RLHF, Constitutional AI)?
- What does the survival curve look like on a codebase you didn't build yourself?
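For anyone who wants a feel for what the β₁ loops mean without running a full Vietoris-Rips filtration: the toy Python sketch below (my own stand-in, not from the paper) counts independent loops in the 1-skeleton of an embedding graph at a single distance threshold, using the cycle rank E - V + C. This overcounts true β₁, since it ignores the triangles that fill loops in at higher dimensions, but it shows the kind of structure being measured:

```python
# Toy loop-counting on a point cloud: build the graph of points within
# distance eps, then compute its cycle rank (edges - vertices + components),
# the first Betti number of the graph (an upper bound on Rips-complex beta_1).
import math

def cycle_rank(points, eps):
    n = len(points)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if math.dist(points[i], points[j]) <= eps]
    # Union-find to count connected components.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    components = len({find(i) for i in range(n)})
    return len(edges) - n + components
```

Four points on a unit square form one loop at threshold 1.0 and none at 0.5. The real analysis sweeps the threshold over a full filtration (e.g., with a library like `ripser`) and keeps only loops that persist across scales.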
(19-year Ontario teacher | M.A., B.A. Philosophy · B.Sc. Physics. Built this for real families.)