llm-inference

2 articles

Systematic benchmarking of NVIDIA Blackwell consumer GPUs for LLM inference across quantization formats and workloads, demonstrating that SMEs can run private deployments at 40-200x lower cost than cloud APIs, with sub-second latency for most use cases; a back-of-envelope cost sketch follows this entry.

NVIDIA Blackwell · RTX 5060 Ti · RTX 5070 Ti · RTX 5090 · Qwen3-8B · Gemma3-12B · Gemma3-27B · GPT-OSS-20B · Jonathan Knoop · Hendrik Holtmann
arxiv.org · rohansood15 · 13 hours ago
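
To make the headline cost claim concrete, here is a back-of-envelope amortized-cost comparison between a locally hosted GPU and a cloud API. Every figure in it (GPU price, power draw, electricity rate, throughput, cloud price) is an illustrative assumption, not a number taken from the paper, and it assumes the GPU is kept fully utilized.

```python
# Back-of-envelope: amortized local-GPU cost vs. cloud API cost per 1M tokens.
# ALL numbers below are illustrative assumptions, not figures from the paper.

GPU_PRICE_USD = 2200.0          # assumed RTX 5090 purchase price
GPU_POWER_KW = 0.45             # assumed average draw under inference load
ELECTRICITY_USD_PER_KWH = 0.30  # assumed energy price
AMORTIZATION_YEARS = 3          # assumed useful hardware lifetime
THROUGHPUT_TOK_PER_S = 1000     # assumed batched decode throughput
CLOUD_USD_PER_MTOK = 5.00       # assumed cloud API price per 1M tokens

def local_cost_per_mtok() -> float:
    """Hardware amortization plus energy per million generated tokens,
    assuming the GPU runs at full utilization for its whole lifetime."""
    hours = (1_000_000 / THROUGHPUT_TOK_PER_S) / 3600
    hw_per_hour = GPU_PRICE_USD / (AMORTIZATION_YEARS * 365 * 24)
    energy_per_hour = GPU_POWER_KW * ELECTRICITY_USD_PER_KWH
    return hours * (hw_per_hour + energy_per_hour)

local = local_cost_per_mtok()
print(f"local : ${local:.3f} per 1M tokens")
print(f"cloud : ${CLOUD_USD_PER_MTOK:.2f} per 1M tokens")
print(f"ratio : {CLOUD_USD_PER_MTOK / local:.0f}x")
```

With these particular assumptions the ratio lands around 80x, inside the 40-200x range the abstract reports; real ratios depend heavily on utilization and the cloud pricing tier.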

SiMM is an open-source distributed KV cache engine that addresses GPU memory constraints in LLM inference by offloading the KV cache to RDMA-backed memory pools, achieving a 3.1× speedup over running without a cache and up to 9× lower KV I/O latency on long-context, multi-turn workloads; a minimal sketch of the prefix-reuse idea follows this entry.

SiMM · SGLang · vLLM · OpenRouter · RDMA
github.com · SherryWong · 14 hours ago
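
The entry above describes offloading KV cache into remote memory pools so that a shared token prefix, such as the earlier turns of a conversation, can be fetched over RDMA instead of recomputed via prefill. Below is a minimal Python sketch of that prefix-reuse idea; SiMM's real API is not documented here, so the class and function names are hypothetical and a plain dict stands in for the RDMA-backed pool.

```python
# Toy sketch of prefix-keyed KV-cache offloading, in the spirit of what the
# summary describes. SiMM's actual API, RDMA transport, and eviction policy
# are not shown; `RemotePool` is a hypothetical stand-in.
import hashlib

class RemotePool:
    """Stand-in for a distributed, RDMA-backed memory pool."""
    def __init__(self):
        self._blocks = {}             # key -> serialized KV block
    def put(self, key: str, block: bytes) -> None:
        self._blocks[key] = block     # real system: RDMA write to a remote node
    def get(self, key: str):
        return self._blocks.get(key)  # real system: one-sided RDMA read

def prefix_key(token_ids: list[int]) -> str:
    """Content hash of a token prefix; identical prefixes across turns
    (or across requests) map to the same cached KV block."""
    raw = b"".join(t.to_bytes(4, "little") for t in token_ids)
    return hashlib.sha256(raw).hexdigest()

def fetch_or_compute(pool, token_ids, compute_kv):
    """Reuse the KV block for this prefix if cached, else compute and store it."""
    key = prefix_key(token_ids)
    cached = pool.get(key)
    if cached is not None:
        return cached, True           # cache hit: skip prefill for this prefix
    block = compute_kv(token_ids)     # cache miss: run prefill, then offload
    pool.put(key, block)
    return block, False

# Toy usage: the second request reuses the first request's prefix.
pool = RemotePool()
kv1, hit1 = fetch_or_compute(pool, [1, 2, 3], lambda t: b"kv:" + bytes(t))
kv2, hit2 = fetch_or_compute(pool, [1, 2, 3], lambda t: b"kv:" + bytes(t))
assert not hit1 and hit2
```

In a multi-turn chat, the second turn's prompt begins with the first turn's tokens, so the lookup hits the pool for the shared prefix and only the new tokens need prefill; the up-to-9× lower KV I/O latency reported above is about making that remote fetch cheap.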