model-serving

2 articles

Systematic benchmarking of NVIDIA Blackwell consumer GPUs for LLM inference across quantization formats and workloads, showing that private deployment can be cost-effective for SMEs, with 40-200x lower cost than cloud APIs and sub-second latency for most use cases (a rough cost sketch follows this entry).

NVIDIA Blackwell RTX 5060 Ti RTX 5070 Ti RTX 5090 Qwen3-8B Gemma3-12B Gemma3-27B GPT-OSS-20B Jonathan Knoop Hendrik Holtmann
arxiv.org · rohansood15 · 14 hours ago
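
A back-of-envelope sketch of how a per-million-token cost comparison like this can be set up. Every number below is an illustrative assumption, not a figure from the paper: GPU price, amortization period, utilization, power draw, throughput, and the cloud API price are placeholders to replace with your own measurements, so the resulting ratio will not necessarily match the reported 40-200x range.

# Compare amortized local-GPU cost against a cloud API price, per 1M output tokens.
# All inputs are illustrative assumptions to be replaced with measured values.

def local_cost_per_m_tokens(
    gpu_price_usd: float = 2000.0,        # assumed consumer GPU purchase price
    amortization_years: float = 3.0,      # assumed useful lifetime
    utilization: float = 0.25,            # fraction of wall-clock time the GPU is busy
    power_watts: float = 450.0,           # assumed average draw under load
    electricity_usd_per_kwh: float = 0.30,
    tokens_per_second: float = 400.0,     # assumed aggregate (batched) decode throughput
) -> float:
    busy_seconds = amortization_years * 365 * 24 * 3600 * utilization
    total_tokens = tokens_per_second * busy_seconds
    energy_usd = power_watts / 1000 * (busy_seconds / 3600) * electricity_usd_per_kwh
    return (gpu_price_usd + energy_usd) / (total_tokens / 1e6)

api_cost_per_m_tokens = 15.0              # assumed cloud API output-token price (USD per 1M)

local = local_cost_per_m_tokens()
print(f"local: ${local:.2f} per 1M tokens")
print(f"cloud: ${api_cost_per_m_tokens:.2f} per 1M tokens")
print(f"ratio: {api_cost_per_m_tokens / local:.0f}x cheaper locally")

The ratio this prints moves substantially with the chosen throughput, utilization, and API pricing; the paper's 40-200x range reflects its own measured configurations.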

Step-by-step guide to running open-source LLMs locally with Claude Code via llama.cpp, covering deployment of models such as Qwen3.5 and GLM-4.7-Flash with quantization and GPU optimization for coding tasks (a quick endpoint-check sketch follows this entry).

Unsloth Claude Code Qwen3.5 GLM-4.7-Flash llama.cpp DeepSeek Gemma Qwen3-Coder-Next OpenAI
unsloth.ai · armcat · 1 day ago
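
A minimal sketch for sanity-checking a local setup like this, assuming llama.cpp's llama-server is already running on its default port with a quantized GGUF loaded and GPU offload enabled. The model filename, port, model name, and prompt are illustrative assumptions, not values from the Unsloth guide; how the guide wires the server into Claude Code is covered in the article itself, and the sketch only exercises the local serving endpoint.

# Query a locally running llama.cpp server through its OpenAI-compatible API.
# Assumes llama-server is already up, e.g. started with a quantized GGUF such as:
#   llama-server -m some-quantized-model-Q4_K_M.gguf -ngl 99 --port 8080
# (filename is a placeholder; adjust flags to your hardware)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",   # llama-server's OpenAI-compatible endpoint
    api_key="not-needed-for-local-server",  # local server does not require a real key
)

response = client.chat.completions.create(
    model="local-gguf",  # name is largely informational when a single model is loaded
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

If this returns a sensible completion, the local server is serving correctly and can then be connected to a coding assistant per the guide's instructions.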