llm-inference

2 articles

Systematic benchmarking of NVIDIA Blackwell consumer GPUs for LLM inference across quantization formats and workloads, demonstrating that SMEs can run private deployments at 40-200x lower cost than cloud APIs, with sub-second latency for most use cases; a back-of-envelope cost sketch follows this entry.

NVIDIA Blackwell · RTX 5060 Ti · RTX 5070 Ti · RTX 5090 · Qwen3-8B · Gemma3-12B · Gemma3-27B · GPT-OSS-20B · Jonathan Knoop · Hendrik Holtmann
arxiv.org · rohansood15 · 13 hours ago
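
To make the headline cost claim concrete, here is a back-of-envelope amortized-cost comparison between a locally hosted GPU and a cloud API. Every figure in it (GPU price, power draw, electricity rate, throughput, cloud price) is an illustrative assumption, not a number taken from the paper, and it assumes the GPU is kept fully utilized.

```python
# Back-of-envelope: amortized local-GPU cost vs. cloud API cost per 1M tokens.
# ALL numbers below are illustrative assumptions, not figures from the paper.

GPU_PRICE_USD = 2200.0          # assumed RTX 5090 purchase price
GPU_POWER_KW = 0.45             # assumed average draw under inference load
ELECTRICITY_USD_PER_KWH = 0.30  # assumed energy price
AMORTIZATION_YEARS = 3          # assumed useful hardware lifetime
THROUGHPUT_TOK_PER_S = 1000     # assumed batched decode throughput
CLOUD_USD_PER_MTOK = 5.00       # assumed cloud API price per 1M tokens

def local_cost_per_mtok() -> float:
    """Hardware amortization plus energy per million generated tokens,
    assuming the GPU runs at full utilization for its whole lifetime."""
    hours = (1_000_000 / THROUGHPUT_TOK_PER_S) / 3600
    hw_per_hour = GPU_PRICE_USD / (AMORTIZATION_YEARS * 365 * 24)
    energy_per_hour = GPU_POWER_KW * ELECTRICITY_USD_PER_KWH
    return hours * (hw_per_hour + energy_per_hour)

local = local_cost_per_mtok()
print(f"local : ${local:.3f} per 1M tokens")
print(f"cloud : ${CLOUD_USD_PER_MTOK:.2f} per 1M tokens")
print(f"ratio : {CLOUD_USD_PER_MTOK / local:.0f}x")
```

With these particular assumptions the ratio lands around 80x, inside the 40-200x range the abstract reports; real ratios depend heavily on utilization and the cloud pricing tier.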

SiMM is an open-source distributed KV cache engine that addresses GPU memory constraints in LLM inference by offloading the KV cache to RDMA-backed memory pools, achieving a 3.1× speedup over running without a cache and up to 9× lower KV I/O latency on long-context, multi-turn workloads; a minimal sketch of the prefix-reuse idea follows this entry.

SiMM · SGLang · vLLM · OpenRouter · RDMA
github.com · SherryWong · 14 hours ago
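
The entry above describes offloading KV cache into remote memory pools so that a shared token prefix, such as the earlier turns of a conversation, can be fetched over RDMA instead of recomputed via prefill. Below is a minimal Python sketch of that prefix-reuse idea; SiMM's real API is not documented here, so the class and function names are hypothetical and a plain dict stands in for the RDMA-backed pool.

```python
# Toy sketch of prefix-keyed KV-cache offloading, in the spirit of what the
# summary describes. SiMM's actual API, RDMA transport, and eviction policy
# are not shown; `RemotePool` is a hypothetical stand-in.
import hashlib

class RemotePool:
    """Stand-in for a distributed, RDMA-backed memory pool."""
    def __init__(self):
        self._blocks = {}             # key -> serialized KV block
    def put(self, key: str, block: bytes) -> None:
        self._blocks[key] = block     # real system: RDMA write to a remote node
    def get(self, key: str):
        return self._blocks.get(key)  # real system: one-sided RDMA read

def prefix_key(token_ids: list[int]) -> str:
    """Content hash of a token prefix; identical prefixes across turns
    (or across requests) map to the same cached KV block."""
    raw = b"".join(t.to_bytes(4, "little") for t in token_ids)
    return hashlib.sha256(raw).hexdigest()

def fetch_or_compute(pool, token_ids, compute_kv):
    """Reuse the KV block for this prefix if cached, else compute and store it."""
    key = prefix_key(token_ids)
    cached = pool.get(key)
    if cached is not None:
        return cached, True           # cache hit: skip prefill for this prefix
    block = compute_kv(token_ids)     # cache miss: run prefill, then offload
    pool.put(key, block)
    return block, False

# Toy usage: the second request reuses the first request's prefix.
pool = RemotePool()
kv1, hit1 = fetch_or_compute(pool, [1, 2, 3], lambda t: b"kv:" + bytes(t))
kv2, hit2 = fetch_or_compute(pool, [1, 2, 3], lambda t: b"kv:" + bytes(t))
assert not hit1 and hit2
```

In a multi-turn chat, the second turn's prompt begins with the first turn's tokens, so the lookup hits the pool for the shared prefix and only the new tokens need prefill; the up-to-9× lower KV I/O latency reported above is about making that remote fetch cheap.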