ml-infrastructure


Cumulus Labs has launched IonRouter, a low-cost inference API for open-source and fine-tuned models. It is backed by IonAttention, a custom C++ inference runtime built specifically for NVIDIA's GH200 architecture, which reaches 588 tokens/s on multimodal workloads through optimizations in cache coherence, KV-block writeback, and attention scheduling.
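The summary mentions KV-block writeback as one of the runtime's optimizations. As context, a minimal sketch of the general idea, block-granular KV-cache management where fully filled blocks can be flushed in one pass, is shown below. This is an illustration of the technique in general, not IonAttention's actual C++ implementation; the block size, class names, and writeback policy are assumptions for explanation only.

```python
# Illustrative paged KV cache with block-granular writeback.
# NOT IonAttention's implementation; names and policy are assumed.

BLOCK_SIZE = 16  # tokens per KV block (assumed)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.seq_blocks: dict[int, list[int]] = {}   # seq_id -> block ids
        self.seq_len: dict[int, int] = {}            # tokens stored per seq
        self.written_back: list[int] = []            # block ids flushed out

    def append_token(self, seq_id: int) -> None:
        """Reserve room for one new token; allocate a block at boundaries."""
        n = self.seq_len.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                      # block boundary: new block
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted")
            self.seq_blocks.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_len[seq_id] = n + 1

    def writeback_full_blocks(self, seq_id: int) -> int:
        """Flush completely filled blocks (e.g. to host memory) in one pass,
        keeping only the partially filled tail block resident."""
        full = self.seq_len[seq_id] // BLOCK_SIZE
        flushed = self.seq_blocks[seq_id][:full]
        self.written_back.extend(flushed)
        self.free_blocks.extend(flushed)
        self.seq_blocks[seq_id] = self.seq_blocks[seq_id][full:]
        self.seq_len[seq_id] -= full * BLOCK_SIZE
        return len(flushed)

cache = PagedKVCache(num_blocks=8)
for _ in range(40):          # 40 tokens -> 2 full blocks + an 8-token tail
    cache.append_token(seq_id=0)
flushed = cache.writeback_full_blocks(seq_id=0)
print(flushed)               # -> 2 full blocks written back
```

Batching writeback at block granularity rather than per token is what makes this kind of scheme attractive on hardware like the GH200, where coherent CPU-GPU memory traffic is cheapest in large contiguous transfers.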

Tags: IonRouter · Cumulus Labs · IonAttention · TensorDock · Palantir · Together AI · Fireworks · Modal · RunPod · vLLM · GH200 · OpenAI · Veer Suryaa
ionrouter.io · vshah1016 · 1 day ago · details · hn