Tinygrad NVIDIA P2P hack on Talos OS: 3 days with AI, would have been 2 weeks without
GitHub: https://github.com/himekifee/talos-tinygrad-nvidia-p2p-driver
Why this exists: The tinygrad community's P2P kernel patch (aikitoria's donor patch) enables PCIe peer-to-peer / GPUDirect on NVIDIA's open kernel modules — direct GPU-to-GPU memory access without bouncing through host RAM. But getting patched kernel modules to actually load on Talos Linux is an entirely separate war. Talos's immutable, signature-enforced architecture fights you at every turn.
What it took: Almost 3 full days with OpenCode + Claude Opus 4.6 + GPT-5.4 xHigh doing the heavy thinking, iterating, and debugging alongside me. Done by hand, this would have been 2 weeks minimum — and even with LLMs, they kept getting trapped in the same pitfalls because there is essentially zero documentation on how to build custom kernel modules for Talos.
The real problem isn't NVIDIA — it's Talos:
The NVIDIA side is fine. The P2P patch applies, the driver compiles.
Talos is the problem. I genuinely think Talos is well-designed for running Kubernetes. But building against it is a completely different experience. You're dealing with a three-repo architecture (talos, pkgs, extensions) that's completely opaque to outsiders — no guide explains how kernel builds, module signing, sysext packaging, and installer composition fit together. You reverse-engineer it from Dockerfiles, bldr manifests, and Go source.
The security model that makes Talos good is exactly what makes hacking on it painful. Every .ko file must be signed by the exact key from the exact build run that compiled the booted kernel. Not the same version. Not the same config. The same compilation. And Talos enforces this silently — modules just don't load. No error, no log entry.
The signing key trap (where most of the 3 days went):
The signing key must be consistent across a chain of 5 artifacts: kernel image → installer-base → P2P module package → sysext wrapper → final installer. Break any single link, and GPU support silently vanishes. Kbuild can quietly regenerate keys, BuildKit layer caching can serve stale .ko files signed by dead keys, and initramfs can stomp your modules with older ones. A successful `make` doesn't mean your modules will load.
We eventually wrote verify-same-signing-key.sh to extract and compare signer metadata across all images. Once we made that a mandatory build gate, the cycle finally stopped.
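The gate is simple in spirit: extract the signer fingerprint from every module in the chain and refuse to proceed unless they all match. A minimal sketch of that comparison logic (illustrative only; the repo's verify-same-signing-key.sh also pulls the modules out of each container image, and the fingerprints below are placeholders):

```shell
#!/bin/sh
# Sketch of a signing-key build gate. `modinfo -F sig_key <module>.ko`
# prints the fingerprint of the key that signed a module; every artifact
# in the chain must report the same fingerprint, or the booted kernel
# will silently refuse to load the module.

same_signer() {
  # Succeed only if every fingerprint passed in is identical.
  first="$1"
  for sig in "$@"; do
    [ "$sig" = "$first" ] || return 1
  done
  return 0
}

# Real usage collects a fingerprint from each artifact, e.g.:
#   sig_mod=$(modinfo -F sig_key /lib/modules/.../nvidia.ko)
# Here, placeholder values stand in for the extracted fingerprints.
if same_signer "3A:91:F0" "3A:91:F0" "3A:91:F0"; then
  echo "signing chain OK"
else
  echo "signing chain BROKEN"
fi
```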
Other things learned the hard way (check the GitHub docs for full details):
- Talos squashfs overlay ordering can silently wipe NVIDIA library entries: glibc and the NVIDIA container toolkit both ship `ld.so.cache`, and last write wins.
- The tinygrad P2P patch targets GitHub's source tree layout, but NVIDIA's redistribution archive uses different paths.
- The patch alone isn't enough; you also need RM registry overrides via `modprobe.d` at boot.
- `nvidia-smi topo -p2p a` lies about atomics. Trust `cuDeviceCanAccessPeer` and actual bandwidth tests.
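For the RM registry overrides, a modprobe.d fragment is enough to make them stick across boots. A sketch, assuming the `PeerMappingOverride` knob that the upstream tinygrad P2P instructions use (the filename is illustrative, and your patch may need additional keys):

```
# /etc/modprobe.d/nvidia-p2p.conf (filename illustrative; on Talos this
# has to ship inside the sysext, since the root filesystem is immutable)
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"
```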
On using LLMs for deep systems work:
The LLMs were genuine force multipliers: reading Dockerfiles, tracing build chains, suggesting what to check next (even this post was drafted with their help). But without documentation to ground them, they kept falling into the same traps I did. I'd explain the signing key fix, context would rotate, and the next iteration would confidently suggest the exact wrong approach again. On an undocumented system, LLMs multiply your confusion just as efficiently as your productivity.
End result: P2P returns OK across all 4 GPUs, cuCtxEnablePeerAccess succeeds bidirectionally, bandwidth tests confirm direct connectivity. Everything is documented in the repo, so the next person doesn't have to reverse-engineer Talos's build system from scratch.