long-context

1 article

SiMM is an open-source distributed KV cache engine that addresses GPU memory constraints in LLM inference by storing KV cache in RDMA-backed memory pools, achieving a 3.1× speedup over a no-cache baseline and up to 9× lower KV I/O latency on long-context multi-turn workloads.
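To make the idea concrete, here is a minimal sketch of prefix-keyed KV cache reuse: an in-process stand-in for a remote cache pool that maps a hash of the token prefix to stored KV blocks. The class and method names (KVCachePool, put/get) and the use of an in-memory dict are illustrative assumptions, not SiMM's API; SiMM keeps such blocks in RDMA-backed memory pools across nodes.

```python
# Illustrative sketch only: a toy, in-process stand-in for a remote KV-cache
# pool, keyed by a hash of the token prefix. Names here are hypothetical,
# not SiMM's actual interfaces.
import hashlib
from typing import Optional

import numpy as np


class KVCachePool:
    """Toy prefix-keyed KV cache: maps a token-prefix hash to KV tensors."""

    def __init__(self) -> None:
        self._store: dict[str, np.ndarray] = {}

    @staticmethod
    def _prefix_key(token_ids: list[int]) -> str:
        # Hash the token prefix so identical multi-turn contexts hit the same entry.
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def put(self, token_ids: list[int], kv_block: np.ndarray) -> None:
        self._store[self._prefix_key(token_ids)] = kv_block

    def get(self, token_ids: list[int]) -> Optional[np.ndarray]:
        # On a hit, the serving engine can skip recomputing the prefix's KV cache.
        return self._store.get(self._prefix_key(token_ids))


if __name__ == "__main__":
    pool = KVCachePool()
    prefix = [1, 2, 3, 4]                  # token ids of a shared conversation prefix
    pool.put(prefix, np.zeros((2, 4, 8)))  # fake KV block: (layers, heads, head_dim)
    print(pool.get(prefix) is not None)    # True: reuse the cached prefix instead of recomputing
```

The claimed speedups come from exactly this reuse pattern on multi-turn workloads: repeated conversation prefixes hit the shared pool instead of being recomputed or evicted under GPU memory pressure.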

SiMM SGLang vLLM OpenRouter RDMA
github.com · SherryWong · 14 hours ago