Cryptographic Provenance for LLM Inference
CommitLLM · Active development · Lean 4 formalization in progress

You have no proof your LLM provider ran the model they claim. CommitLLM is a cryptographic commit-and-audit protocol that closes that gap: the provider serves normally on GPU and returns a compact receipt, and a verifier checks the receipt and opened trace on CPU.

- Linear shell: algebraic checks
- Nonlinear shell: canonical replay
- Attention: bounded approximate replay
- Prefix/KV: statistical unless deep audit

**Measured on the kept path**

| Metric | Value |
| --- | --- |
| Routine audit (Llama 70B) | 1.3 ms/tok |
| Online tracing overhead | ~12–14% |
| Full audit (1 tok, 70B) | ~10 ms |
| Within 1 quant bucket | >99.8% |
| Verifier | CPU only |
| Provider | Normal GPU |

## The gap: between fingerprints and zero-knowledge proofs

Two unsatisfying extremes, and a design point between them where real deployments need to live.

- **Fingerprinting (insufficient).** Statistical heuristics provide evidence but not exact per-response verification. A determined provider can game them.
- **CommitLLM (commit-and-audit).** Commitment-bound end-to-end. Information-theoretically sound algebraic checks for large linear layers, canonical replay for supported nonlinear components, CPU-only verification.
- **ZK proofs (impractical).** Strong proof objects, but prover costs remain too high for production LLM serving. Impractical at scale today.

## Protocol: set up once, commit every response, verify on challenge

The verifier holds a secret key derived from public weights. The provider commits during normal inference. Expensive work happens only when challenged.

### Phase 0 · Setup: build the verifier key

From a public checkpoint, the verifier computes a Merkle root over weights, secret Freivalds vectors for eight matrix families (Wq, Wk, Wv, Wo, Wgate, Wup, Wdown, LM_head), and the model configuration needed for canonical replay.
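The weight commitment in the setup phase can be sketched as a binary Merkle tree over serialized weight tensors. This is an illustrative construction only: the leaf encoding, hash function, and odd-node handling here are assumptions, not the project's actual tree layout.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Binary Merkle root over hashed leaves (last node duplicated on odd levels)."""
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate-last padding
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical leaves: one serialized tensor per matrix family.
families = ["Wq", "Wk", "Wv", "Wo", "Wgate", "Wup", "Wdown", "LM_head"]
leaves = [name.encode() + b"|<serialized tensor bytes>" for name in families]
root = merkle_root(leaves)
print(root.hex())
```

Any change to any tensor changes the root, so the verifier key pins the exact checkpoint while individual weights can later be opened with logarithmic-size proofs.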
### Phase 1 · Commit: serve normally, return a receipt

The provider runs inference on the normal GPU path with a tracing sidecar that captures retained state. It returns the response plus a compact receipt binding the execution trace, KV state, deployment manifest, prompt, sampling randomness, and token count.

### Phase 2 · Audit: challenge specific positions and layers

The verifier challenges token positions and layers after the commitment. The provider opens the requested region. Routine audit samples prefix state; deep audit opens everything.

### Phase 3 · Verify: CPU-only checks

Embedding Merkle proof. Freivalds on shell matmuls. Exact INT8 bridge recomputation. KV provenance. Attention replay against committed post-attention output. Final-token tail from captured residual. LM-head binding. Decode and output policy replay.

## Guarantee boundary: what is exact, approximate, and statistical

Commitment-bound end-to-end, with explicit boundaries for each verification class. Not "uniformly exact", but honestly delineated.

| Component | Guarantee |
| --- | --- |
| Input | Exact |
| Embedding | Exact |
| Shell matmuls | Freivalds (algebraic check) |
| INT8 bridges | Exact |
| Prefix/KV | Statistical* |
| Attention | Approximate (FP16/BF16) |
| Final tail | Exact |
| LM head | Freivalds (algebraic check) |
| Decode | Fail-closed |

*Statistical in routine audit; upgraded to exact in deep audit.

The attention interior remains approximate because native GPU FP16/BF16 attention is not bit-reproducible across devices, or even across runs. CommitLLM constrains it strongly (shell-verified Q, K, and V on both sides; commitment-verified prefix state; independent verifier replay; cross-layer consistency through the residual stream) but does not pretend it is exact. In routine audit mode, prefix/KV provenance is statistical: Merkle binding is exact, sampled positions are shell-verified exactly, but unopened positions are covered probabilistically. Deep audit upgrades this to exact full-prefix verification.
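The statistical coverage of routine prefix sampling can be quantified with a simple hypergeometric calculation: the chance that a uniform sample of audited positions misses every tampered one. The prefix length, tamper count, and sample size below are illustrative assumptions, not the protocol's parameters.

```python
from math import comb

def escape_probability(n: int, corrupted: int, sampled: int) -> float:
    """Probability that a uniform sample of `sampled` positions out of `n`
    misses all `corrupted` tampered positions (hypergeometric tail)."""
    if sampled > n - corrupted:
        return 0.0  # sample is too large to avoid every tampered position
    return comb(n - corrupted, sampled) / comb(n, sampled)

# Hypothetical numbers: 4096-token prefix, 32 tampered positions,
# routine audit opens 128 sampled positions.
p = escape_probability(4096, 32, 128)
print(f"escape probability per audit ≈ {p:.3f}")
```

Because commitments are fixed before the challenge, escape probabilities multiply across independent audits, which is why deep audit (opening everything) removes the statistical gap entirely.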
The honest claim is not "uniformly exact end-to-end" but a precisely delineated guarantee boundary.

## Audit modes: routine audit stays cheap, deep audit upgrades coverage

CommitLLM uses the same receipt in both modes. Routine audit keeps steady-state verification light; deep audit opens the full retained window and upgrades prefix provenance to exact verification.

### Routine audit: low-friction spot checks

Designed for normal operation, when you want frequent verification without opening the full trace every time.

- Freivalds-based checks on large linear layers
- Canonical replay for supported nonlinear subcomputations
- Sampled prefix and KV provenance with statistical coverage
- Bounded approximate attention replay on CPU

### Deep audit: escalate when the stakes are higher

Uses the same commitment, but requires a larger opening. This removes the routine-audit statistical gap on the retained prefix window.

- Full-prefix and KV openings across the retained audit window
- Exact prefix provenance instead of sampled coverage
- The same algebraic, replay, and decode checks as routine audit
- Higher bandwidth and storage cost, not a different serving path

Operationally, routine audit is the default posture; deep audit is the escalation path when a response is high value, disputed, or randomly selected for full review.

## Core mechanism: verify huge matrix multiplies cheaply

The provider claims z = W·x for a public weight matrix W. Recomputing the full product is expensive. Freivalds' algorithm gives a much cheaper check: the verifier precomputes v = rᵀW with a secret random vector r, then checks v·x ≟ rᵀ·z in the finite field F_p, where p = 2³² − 5. If z ≠ Wx, the check fails with probability ≥ 1 − 1/p over the choice of r. This is information-theoretically sound. Transformers are mostly matrix multiplication; once those multiplies are cheap to audit, the verifier can check model identity without rerunning the full model.
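The Freivalds check above fits in a few lines over F_p. This toy mirrors the 3×3 interactive demo with an honest and a cheating provider; the matrix contents and RNG seed are arbitrary.

```python
import random

P = 2**32 - 5  # field modulus used by the check

def matvec(W, x):
    """Honest provider: compute z = W x over F_p."""
    return [sum(w * xi for w, xi in zip(row, x)) % P for row in W]

def freivalds_check(W, x, z, r):
    """Accept iff (r^T W) · x == r^T · z over F_p."""
    v = [sum(r[i] * W[i][j] for i in range(len(W))) % P for j in range(len(W[0]))]
    lhs = sum(vj * xj for vj, xj in zip(v, x)) % P
    rhs = sum(ri * zi for ri, zi in zip(r, z)) % P
    return lhs == rhs

random.seed(0)
n = 3
W = [[random.randrange(P) for _ in range(n)] for _ in range(n)]
x = [random.randrange(P) for _ in range(n)]
r = [random.randrange(P) for _ in range(n)]  # verifier-secret vector

honest = matvec(W, x)
cheating = honest[:]
cheating[0] = (cheating[0] + 1) % P  # tamper with one output entry

print(freivalds_check(W, x, honest, r))    # True
print(freivalds_check(W, x, cheating, r))  # False unless r happens to cancel the error
```

In deployment the verifier precomputes v = rᵀW once at setup, so each per-response check costs two inner products instead of a full matrix multiply.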
Interactive demo · 3×3 Freivalds check over F_p, per matrix family (Wq, Wk, Wv, Wo, Wgate, Wup, Wdown, LM head), with honest and cheating providers.

## Measurements: performance on the corrected replay path

Measured on Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8. Attention mismatch is single-digit and bounded.

| Setting | Value |
| --- | --- |
| Measured today | Qwen2.5-7B-W8A8 and Llama-3.1-8B-W8A8 |
| Verifier hardware | CPU only |
| Provider path | Normal GPU serving with tracing |
| Verifier cost · Llama 70B | Routine 1.3 ms · Full 10 ms |
| Online tracing overhead | +12–14% over base |
| Attention corridor · Qwen2.5-7B-W8A8 | L∞ 8 · frac_eq >92% · frac≤1 >99.8% |
| Attention corridor · Llama-3.1-8B-W8A8 | L∞ 9 · frac_eq 94–96% · frac≤1 >99.9% |

## Deployment fit: built to sit beside real serving stacks

CommitLLM is not a replacement inference engine. The provider keeps the normal GPU path and produces request-scoped evidence alongside it.

- **Supported now · Continuous batching and paged attention.** Many user requests can share the same GPU microbatch. CommitLLM still produces per-request receipts and per-request audits.
- **Supported now · Tensor parallelism and fused kernels.** The tracing layer follows the existing execution path instead of replacing production kernels with proof-friendly substitutes.
- **Supported now · Quantized serving.** Quantization metadata is receipt-bound, and the kept path is measured on production-style W8A8 checkpoints.
- **Not the current story · Cross-request cache reuse and shortcut decoding.** Cross-request prefix caching, speculative decoding, and other semantics-changing shortcuts need more protocol work. Unsupported paths should fail closed.
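The attention-corridor metrics reported above (L∞, frac_eq, frac≤1) can be computed from two per-element bucket sequences, one from the GPU path and one from the CPU replay. This is a sketch under the assumption that the corridor is measured in quantization-bucket units; the toy arrays below are made-up values, not measured data.

```python
def corridor_stats(gpu_buckets, cpu_buckets):
    """L_inf bucket distance and agreement fractions between two runs."""
    diffs = [abs(a - b) for a, b in zip(gpu_buckets, cpu_buckets)]
    n = len(diffs)
    return {
        "Linf": max(diffs),                          # worst-case bucket gap
        "frac_eq": sum(d == 0 for d in diffs) / n,   # exact bucket matches
        "frac_le1": sum(d <= 1 for d in diffs) / n,  # within one bucket
    }

# Hypothetical per-element bucket indices from the two paths.
gpu = [10, 11, 12, 13, 200]
cpu = [10, 11, 13, 13, 205]  # small drift on two positions
print(corridor_stats(gpu, cpu))  # → {'Linf': 5, 'frac_eq': 0.6, 'frac_le1': 0.8}
```

A verifier would accept when L∞ stays under the published bound and frac≤1 stays above its threshold, and fail closed otherwise.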
## Commitment scope: four specs, one receipt

CommitLLM binds the entire deployment surface that affects outputs, not just "some model ran."

| Spec | What it binds |
| --- | --- |
| input_spec_hash | Tokenizer, chat template, BOS/EOS, truncation, padding, system prompt |
| model_spec_hash | Checkpoint identity R_W, quantization, LoRA/adapter, RoPE config, RMSNorm ε |
| decode_spec_hash | Sampler, temperature, top-k/p, penalties, logit bias, grammar, stop rules |
| output_spec_hash | Detokenization, cleanup, whitespace normalization |

## Why it matters: provenance for every deployment

- **Enterprise procurement.** Paying for Llama 70B? Get proof the provider actually served that checkpoint, not a smaller distillation.
- **Regulated deployments.** Banks, hospitals, and legal teams get an auditable chain from decision to model version, decode policy, and output.
- **Decentralized compute.** Networks like Gensyn, Ritual, or Bittensor cannot rely on "trust the node." CommitLLM provides the missing layer.
- **Agent systems.** When an agent takes action, which model produced the decision becomes a liability and governance question.

## Abstract

Federico Carrone, Diego Kingston, Manuel Puebla, Mauro Toscano
Lambda Class · Centro de Criptografía y Seguridad Digital, UBA

Large language models are increasingly used in settings where integrity matters, but users still lack technical assurance that a provider actually ran the claimed model, decode policy, and output behavior. Fingerprinting and statistical heuristics can provide signals, but not exact per-response verification. Zero-knowledge proof systems provide stronger guarantees, but at prover costs that remain impractical for production LLM serving. We present CommitLLM, a cryptographic commit-and-audit protocol for open-weight LLM inference. CommitLLM keeps the provider on the normal serving path and keeps verifier work fast and CPU-only.
It combines commitment binding, direct audit, and randomized algebraic fingerprints, including Freivalds-style checks for large matrix products, rather than per-response proof generation or full re-execution. Its main costs are retained-state memory over the audit window and audit bandwidth, not per-response proving. The protocol is commitment-bound end-to-end. Within that binding, large linear layers are verified by verifier-secret, information-theoretically sound algebraic checks; quantization/dequantization boundaries and supported nonlinear subcomputations are checked by canonical re-execution; attention is verified by bounded approximate replay; and routine prefix-state provenance is statistical unless deep audit is used. Unsupported semantics fail closed.

## Repository: code layout

The public project name is CommitLLM. Some internal crate and package paths still use the legacy verilm-* prefix while the rename is being completed.

| Component | Path |
| --- | --- |
| Core types and traits | crates/verilm-core |
| Key generation | crates/verilm-keygen |
| Verifier | crates/verilm-verify |
| Prover (Rust) | crates/verilm-prover |
| Python sidecar | sidecar/ |
| Python bindings | crates/verilm-py |
| Test vectors | crates/verilm-test-vectors |
| Lean formalization | lean/ |
| Paper | paper/main.pdf |