Show HN: SiMM – Distributed KV Cache for the Long-Context and Agent Era
SiMM is an open-source distributed KV cache engine that addresses GPU memory constraints in LLM inference by storing KV cache in RDMA-backed memory pools, achieving 3.1× speedup over no cache and up to 9× lower KV I/O latency on long-context multi-turn workloads.
With long Chain-of-Thought reasoning and multi-turn agents, prompts are getting much longer. According to OpenRouter’s State of AI 2025, average context length has grown about 4× in the past year.
This creates two problems in inference systems:
• Slow TTFT: long contexts make prefill expensive
• High GPU memory cost: KV cache quickly exhausts HBM
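To give a rough sense of the memory side, here is a back-of-the-envelope sketch of KV cache size for a single sequence. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example chosen for illustration, not a configuration from our tests.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for num_tokens of one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical model shape (illustrative assumption, not from the post).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **shape)
full_context = kv_cache_bytes(128 * 1024, **shape)

print(per_token // 1024, "KiB per token")
print(full_context / 2**30, "GiB for one 128K-token sequence")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of KV cache, before counting model weights or any other concurrent requests.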
Instead of recomputing long prompts or keeping all KV cache in GPU memory, we explored a different approach:
treat KV cache as a distributed memory system.
SiMM is an open-source distributed KV cache engine for LLM inference. It stores KV cache in a high-speed RDMA-backed memory pool and lets engines like SGLang and vLLM reuse cached states across requests.
When a request's prefix is already cached, this replaces most of the compute-heavy prefill step with a fast I/O lookup; only the new tokens still need to be computed.
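As a rough illustration of prefix reuse, the sketch below keys fixed-size token blocks by a chained hash, so a block's key depends on every token before it, and then looks up the longest cached prefix. This is a generic toy, not SiMM's API or key scheme: `ToyKVPool`, `block_keys`, and `cached_prefix_len` are hypothetical names, and a Python dict stands in for the remote memory pool.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (tiny, for illustration only)


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key depends on all earlier tokens."""
    keys = []
    prev = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = ",".join(map(str, block)).encode()
        digest = hashlib.sha256(prev + payload).digest()
        keys.append(digest.hex())
        prev = digest  # chain the hash so identical blocks at different prefixes differ
    return keys


class ToyKVPool:
    """In-process dict standing in for a remote memory pool (hypothetical)."""

    def __init__(self):
        self._store = {}

    def put(self, key, kv_block):
        self._store[key] = kv_block

    def get(self, key):
        return self._store.get(key)


def cached_prefix_len(pool, token_ids):
    """Number of leading tokens whose KV blocks are already in the pool."""
    hits = 0
    for key in block_keys(token_ids):
        if pool.get(key) is None:
            break  # first miss ends the reusable prefix
        hits += 1
    return hits * BLOCK_SIZE


pool = ToyKVPool()

# Turn 1: after prefill, store a placeholder per block instead of real KV tensors.
turn1 = [1, 2, 3, 4, 5, 6, 7, 8]
for i, key in enumerate(block_keys(turn1)):
    pool.put(key, f"kv-block-{i}")

# Turn 2 extends the same conversation, so its first 8 tokens match turn 1.
turn2 = turn1 + [9, 10, 11, 12, 13]
reused = cached_prefix_len(pool, turn2)
print(f"reuse {reused} tokens, prefill {len(turn2) - reused}")
```

In a multi-turn conversation each new turn shares the whole earlier history as its prefix, which is why the fraction of reusable tokens grows as the context gets longer.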
In our tests with long-context multi-turn workloads:
• 3.1× speedup vs. no cache
• 2.1× speedup vs. local CPU cache
• up to 9× lower KV I/O latency
SiMM scales horizontally across nodes and fully utilizes RDMA NIC bandwidth.
GitHub: https://github.com/scitix/SiMM