Golden Sets: Regression Engineering for Probabilistic Systems

heavythoughtcloud.com · ryan-s · 19 hours ago · tutorial

This article introduces golden sets—structured regression testing frameworks for probabilistic AI workflows that combine representative test cases, explicit scoring rubrics, and versioned evaluation contracts to detect regressions across prompt, model, retrieval, and policy changes before production impact.

You can ship AI without evaluation. You can also ship without tests. Both approaches create compelling personal growth opportunities. Golden sets are how you turn "it seems better" into "it is better" - or, more realistically, "it broke in fewer expensive ways than the last version".

## Key Takeaways

- Golden sets are not just datasets. They are versioned cases plus a scoring contract.
- Single-number quality scores are mostly decorative. Useful gates are multi-metric and tied to failure classes.
- Every meaningful change surface needs regression coverage: prompt, model, retrieval, validators, tool contracts, and policy.
- Every serious incident should add a case. Production is rude, but it is also a generous test author.

## The Pattern

A golden set is a curated collection of representative cases used to evaluate whether a probabilistic workflow still behaves within acceptable bounds after change.

That definition matters because the word "evaluation" gets abused constantly. Many teams say they have evals when they really have one of three things:

- a demo prompt that looked good last week
- a spreadsheet of vague examples with no scoring rules
- a benchmark number that has nothing to do with their production workflow

A real golden set is stricter than that. It combines:

- representative inputs
- an explicit expectation for what good behavior looks like
- a rubric or assertion set
- pinned versions of the scoring method
- acceptance thresholds that determine whether a change ships

That is why golden sets belong inside the Probabilistic Core / Deterministic Shell model. If the shell enforces behavior, the golden set proves whether that behavior survived the latest change.

## Why Golden Sets Exist

AI systems are unusually good at producing regressions that sound plausible. A prompt tweak can improve one class of answers while quietly damaging refusal behavior.
A retrieval change can increase recall while making grounding worse. A model upgrade can sound smarter while becoming less reliable under policy constraints.

Without a golden set, those regressions usually get discovered by one of the following:

- a customer
- an on-call engineer
- finance
- compliance

All four are technically feedback channels. None are ideal.

Golden sets exist to move discovery left. They answer a brutally simple question before production does: compared to the prior version, did this workflow improve, regress, or merely change costume?

## The Golden Set Contract

A useful golden set is not just a folder of examples. It is a contract with explicit fields.

### Required case elements

At minimum, each case should include:

- input payload or request context
- constraints
- expected outcome class
- must-include assertions
- must-not-include assertions
- rubric version
- change-surface metadata

That metadata matters because you want to know what the case is stressing:

- prompt sensitivity
- retrieval quality
- policy enforcement
- write-gating correctness
- latency or budget behavior

### A vendor-neutral case record

```json
{
  "case_id": "golden-incident-042",
  "workflow_id": "incident-triage",
  "input": {
    "question": "Summarize the likely root cause and next action for the checkout outage"
  },
  "constraints": {
    "requires_citations": true,
    "tenant_scope": "ops-prod"
  },
  "expected_outcome_class": "success",
  "must_include": ["at least one cited hypothesis", "clear unknowns section"],
  "must_not_include": ["uncited root-cause claim", "write action without approval"],
  "rubric_version": "triage-rubric-v3",
  "change_surface_tags": ["retrieval", "grounding", "policy"]
}
```

This is not sacred. The point is to make cases explicit enough that the scoring method is not reinvented every time someone wants a launch answer.

### Outcome classes matter more than vibes

Not every case is "answer correctly with maximum eloquence".
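As an illustration, a per-case scorer can enforce both the assertion fields and the expected outcome class, so that a refusal case passes by refusing rather than by answering fluently. This is a hedged sketch, not a prescribed API: field names follow the vendor-neutral record above, but the plain substring matching (real must-include entries are behavioral descriptions that need richer matchers) and the `golden-policy-007` example case are illustrative assumptions.

```python
# Hedged sketch of a per-case scorer. Field names follow the
# vendor-neutral record shown earlier; the substring matching and the
# example case below are illustrative assumptions, not a real API.

def score_case(case: dict, output: str, outcome_class: str) -> dict:
    """Score one workflow output against one golden-set case."""
    failures = []
    # A case whose expected class is "refusal" passes by refusing,
    # not by answering eloquently.
    if outcome_class != case["expected_outcome_class"]:
        failures.append(
            f"expected {case['expected_outcome_class']}, got {outcome_class}"
        )
    for needle in case.get("must_include", []):
        if needle.lower() not in output.lower():
            failures.append(f"missing required: {needle}")
    for needle in case.get("must_not_include", []):
        if needle.lower() in output.lower():
            failures.append(f"contains forbidden: {needle}")
    return {"case_id": case["case_id"], "passed": not failures, "failures": failures}

# Hypothetical policy case where the correct behavior is to refuse.
refusal_case = {
    "case_id": "golden-policy-007",
    "expected_outcome_class": "refusal",
    "must_include": ["cannot"],
    "must_not_include": ["here is the customer record"],
}
result = score_case(refusal_case, "I cannot share tenant data across scopes.", "refusal")
print(result["passed"])  # True: refusing was the correct behavior
```

The useful property is that the scorer returns named failures rather than a bare boolean, so a regression report can say which assertion broke.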
Some cases should expect:

- success
- refusal
- fallback
- needs-human-review
- unknown-with-bounds

That is how you prevent the system from being rewarded for confidently doing the wrong thing.

Related observability layer: The Minimum Useful Trace

## Decision Criteria

Use golden sets when:

- the workflow has any production consequence
- the system changes across prompts, models, retrieval policies, or tool contracts
- you need pre-release regression gates rather than post-release storytelling
- you compare variants during canary, A/B, or provider migration work

This becomes mandatory when the system affects:

- customer-facing answers
- internal operational guidance
- tool use
- write-gated actions
- safety or policy enforcement

You do not need a giant golden set on day one. You do need one as soon as the workflow matters enough that regression discovery in production would be embarrassing, expensive, or both.

Golden sets are not a substitute for:

- live metrics
- traces
- operator review
- incident analysis

They work with those systems. They do not replace them.

## Failure Modes

Golden sets are most useful when they are designed against the ways teams usually fool themselves.

### Demo-case optimism

The set contains only clean, flattering examples that make the workflow look smart.

Mitigation:

- include edge cases, ugly inputs, ambiguous questions, and policy traps
- sample from real production failures, not just architecture fantasies

### Metric collapse

The team reduces quality to one aggregate score and misses regressions in specific behavior classes.

Mitigation:

- score across multiple dimensions
- gate separately for groundedness, refusal correctness, schema validity, and unsafe action rates

### Change-surface blindness

The cases do not indicate what they are meant to test, so a retrieval regression and a prompt regression get mixed together in the same fog.
Mitigation:

- tag cases by change surface and behavior class
- run targeted subsets for targeted changes

### Stale golden set

The set represents last quarter's workflow, not today's workload.

Mitigation:

- add fresh cases from incidents and support logs
- review the set on a regular cadence

### Judge drift

An LLM-based evaluator changes behavior over time, and the team mistakes scoring drift for product improvement.

Mitigation:

- pin evaluator model/version where possible
- keep deterministic assertions alongside judge-based rubrics

### Missing negative cases

The set tests only ideal success paths and ignores refusal, fallback, isolation, and write-gating scenarios.

Mitigation:

- include cases where the correct behavior is to abstain, refuse, or escalate
- include Two-Key Writes cases where the system must not authorize the action

## Reference Architecture

The minimum viable golden-set workflow looks like this:

```
Change proposed
  -> identify affected change surface
  -> select relevant golden-set slice
  -> run deterministic assertions + rubric scoring
  -> compare against previous baseline
  -> decide: ship, hold, or investigate
  -> add new cases if failure exposed a missing class
```

That architecture matters because evaluation is not a one-time report. It is a release gate.

### A concrete walkthrough

Suppose you upgrade the model behind a support assistant. A serious golden-set run should answer:

- Did schema-valid outputs stay stable?
- Did refusal correctness improve or regress?
- Did citation alignment hold?
- Did latency or token cost move outside declared budgets?
- Did any write-adjacent suggestions become less safe?

If the answer is "overall score went up," you still do not know enough to ship.

## Minimal Implementation

You do not need a specialized eval platform to start. You need disciplined case design and a willingness to treat scoring as engineering rather than ceremony.

### Step 1: Start with behavior classes

Partition cases into classes that matter operationally.
Typical classes:

- grounded answer
- refusal required
- retrieval isolation
- tool selection
- write-gating safety
- bounded uncertainty

This prevents the set from becoming an undifferentiated pile of prompts.

### Step 2: Keep deterministic assertions where possible

Before using an LLM judge, ask whether the case can be scored with something simpler:

- schema validity
- required citation present
- forbidden action absent
- expected enum returned

Deterministic checks are cheaper, faster, and far less likely to gaslight you.

### Step 3: Use rubrics for what cannot be asserted directly

For nuanced cases, use a rubric that scores dimensions like:

- correctness
- groundedness
- completeness
- refusal appropriateness
- policy compliance

Pin the rubric version. If the rubric changes, treat that as a change surface too.

### Step 4: Slice by change surface

Do not run every case for every change unless you enjoy slow pipelines and muddy signals. Map changes to relevant test slices:

- prompt change -> response quality, schema, groundedness
- retrieval change -> recall, isolation, citation alignment
- model upgrade -> full suite including latency and cost
- tool contract change -> argument validation and unsafe-action checks

### Step 5: Add cases from incidents

Every incident, near miss, or painful production surprise should raise a simple question: should this exist as a regression case now?

Usually the answer is yes. Production failures are expensive. At minimum, make them reusable.

## Evaluation Gates

Golden sets matter because they feed shipping decisions. Minimum useful gates:

- schema validity remains above threshold
- groundedness or citation alignment does not regress beyond tolerance
- refusal correctness does not regress on policy-sensitive cases
- unsafe write suggestion rate stays at zero for protected paths
- latency and cost remain inside declared budgets where relevant

This is where multi-metric gates beat single aggregate scores.
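A minimal multi-metric gate can be sketched in a few lines: compare each metric of the candidate against the baseline with a per-metric regression tolerance, and block the release if any metric falls outside it. The metric names, numbers, and tolerances below are illustrative assumptions, not a prescribed contract.

```python
# Sketch of a multi-metric release gate. Metric names, values, and
# tolerances are illustrative assumptions, not a prescribed contract.

BASELINE = {"schema_valid": 0.99, "grounded": 0.92, "refusal_correct": 0.97}
CANDIDATE = {"schema_valid": 0.99, "grounded": 0.95, "refusal_correct": 0.93}

# Per-metric regression tolerance: how far below baseline is acceptable.
TOLERANCE = {"schema_valid": 0.0, "grounded": 0.02, "refusal_correct": 0.0}

def gate(baseline: dict, candidate: dict, tolerance: dict) -> list:
    """Return the metrics that block the release (empty list = ship)."""
    return [
        name
        for name, base in baseline.items()
        if candidate[name] < base - tolerance[name]
    ]

blocked = gate(BASELINE, CANDIDATE, TOLERANCE)
print(blocked)  # ['refusal_correct']: grounding improved, refusal regressed
```

Note that an aggregate average of these three metrics went up, yet the gate still blocks: that is exactly the behavior a single-number score would hide.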
A release can be blocked because:

- overall quality improved but refusal correctness regressed
- retrieval recall rose but grounding got worse
- answers got more detailed but cost doubled
- a model became more eloquent and less safe

That is not the gate being annoying. That is the gate doing its job.

Golden sets also become much more powerful when paired with traces. Golden sets find regressions. Traces explain regressions.

Related: The Minimum Useful Trace

## Closing Position

Probabilistic systems do not deserve a free pass on regression discipline just because the output is fuzzy. That fuzziness is the reason regression discipline matters more, not less.

Golden sets are how you stop shipping changes on instinct. They let you say:

- what behavior we expect
- what behavior we refuse to tolerate
- what changed
- whether the change is good enough to release

That is not academic ceremony. That is how you keep AI systems from slowly degrading while every weekly update insists things are "looking strong".

## Related Reading

- Probabilistic Core / Deterministic Shell: Containing Uncertainty Without Shipping Chaos
- The Minimum Useful Trace: An Observability Contract for Production AI
- Two-Key Writes: Preventing Accidental Autonomy in AI Systems
- Retrieval Strategy Playbook
- AI Observability Basics
- Generative AI: A Systems and Architecture Reference
- Architecture Discipline for AI Systems (Vol. 01)