quantization

5 articles
0 3/10

NVIDIA introduces NVFP4, a 4-bit floating-point format for NVIDIA Blackwell GPUs that enables efficient low-precision inference while maintaining model accuracy. It uses a two-level scaling strategy, combining fine-grained E4M3 block-level scales with an FP32 tensor-level scale, reducing memory footprint by 3.5x versus FP16 with less than 1% accuracy degradation on language models.

NVIDIA NVIDIA Blackwell NVFP4 MXFP4 FP4 E4M3 Tensor Cores Eduardo Alvarez Omri Almog Eric Chung Simon Layton Dusan Stosic Ronny Krashinsky Kyle Aubrey
developer.nvidia.com · tosh · 14 hours ago · details · hn
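The two-level scheme described in the summary can be sketched as fake-quantization in NumPy. This is a simplified illustration, not NVIDIA's implementation: block scales are kept in FP32 rather than actually cast to E4M3 (only their range limit of 448 is respected via the tensor-level scale), and no bit packing is done.

```python
import numpy as np

# Representable magnitudes of the 4-bit E2M1 element format (plus a sign bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(x, block=16):
    """Simplified NVFP4-style fake-quantization:
    an FP32 tensor-level scale, a per-16-element block scale
    (E4M3-range-limited, kept in FP32 here), then snap each
    element to the nearest E2M1 value."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad)).reshape(-1, block)
    amax_blocks = np.abs(xp).max(axis=1)
    # Tensor-level FP32 scale chosen so block scales fit the
    # E4M3 range (max representable value 448).
    gmax = amax_blocks.max()
    tensor_scale = gmax / (6.0 * 448.0) if gmax > 0 else 1.0
    dq = np.empty_like(xp)
    for i, blk in enumerate(xp):
        # Block-level scale maps the block max onto E2M1's max (6.0).
        bscale = amax_blocks[i] / 6.0 / tensor_scale if amax_blocks[i] > 0 else 1.0
        vals = np.abs(blk) / (bscale * tensor_scale)
        idx = np.abs(vals[:, None] - E2M1).argmin(axis=1)  # nearest E2M1 code
        dq[i] = np.sign(blk) * E2M1[idx] * bscale * tensor_scale
    return dq.reshape(-1)[:len(x)]
```

The memory math in the summary follows from the layout: 4 bits per element plus one 8-bit scale per 16 elements is 4.5 bits/element, versus 16 bits for FP16, roughly the quoted 3.5x reduction.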
0 5/10

Systematic benchmarking of NVIDIA Blackwell consumer GPUs for LLM inference across quantization formats and workloads, demonstrating cost-effective private deployment for SMEs with 40-200x lower costs than cloud APIs and sub-second latency for most use cases.

NVIDIA Blackwell RTX 5060 Ti RTX 5070 Ti RTX 5090 Qwen3-8B Gemma3-12B Gemma3-27B GPT-OSS-20B Jonathan Knoop Hendrik Holtmann
arxiv.org · rohansood15 · 14 hours ago · details · hn
0 4/10

A technical analysis of sparsity versus quantization as hardware optimization strategies for neural networks, exploring architectural challenges (unstructured sparse data chaos vs. quantization metadata overhead) and current compromises (structured sparsity patterns and algorithmic co-design techniques) used in modern AI accelerators.

NVIDIA Ampere EIE SCNN BitNet b1.58 GPTQ QuIP SmoothQuant AWQ StreamingLLM OCP Microscaling Formats Deep Compression
sigarch.org · matt_d · 21 hours ago · details · hn
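The "structured sparsity pattern" compromise the summary mentions, NVIDIA Ampere's 2:4 scheme, is easy to sketch: within every group of 4 weights, the 2 smallest magnitudes are zeroed, so the metadata is a fixed 2 indices per group and memory access stays regular, unlike unstructured sparsity. A minimal NumPy illustration of the pruning step (not the hardware's packed storage format):

```python
import numpy as np

def prune_2_4(w):
    """2:4 structured pruning: zero the 2 smallest-magnitude
    weights in each contiguous group of 4."""
    w = np.asarray(w, dtype=np.float32).copy()
    assert w.size % 4 == 0, "weight count must be a multiple of 4"
    g = w.reshape(-1, 4)  # view into the copy
    # Indices of the two smallest-magnitude entries per group.
    drop = np.argsort(np.abs(g), axis=1)[:, :2]
    np.put_along_axis(g, drop, 0.0, axis=1)
    return w
```

The trade-off the article analyzes is visible here: the pattern guarantees exactly 50% sparsity at fixed positions per group, which the Sparse Tensor Cores can exploit, but it cannot represent a group where 3 of 4 weights matter.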
0 5/10

Technical analysis estimating Claude Opus 4.5/4.6 active parameter counts (100-154B depending on quantization scheme) by reverse-engineering token generation throughput ratios on Google Vertex infrastructure and calibrating against known Chinese model specifications.

Claude Opus 4.5 Claude Opus 4.6 Claude Opus 4.1 Claude Opus 4.0 Claude Sonnet 4.6 Claude Sonnet 4.5 Deepseek V3.1 GLM-4.7 Kimi K2 Google Vertex Amazon Bedrock OpenRouter Anthropic
unexcitedneurons.substack.com · jychang · 23 hours ago · details · hn
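The estimation logic the summary describes can be sketched under one common assumption: that token generation is memory-bandwidth-bound, so decode throughput scales inversely with active parameter bytes on comparable hardware. All numbers below are hypothetical placeholders for illustration, not the article's measurements, and the bytes-per-parameter ratio is why such an estimate depends on the assumed quantization scheme.

```python
def estimate_active_params(tps_target, tps_ref, ref_active_b,
                           bytes_ref=1.0, bytes_target=1.0):
    """Estimate a model's active parameters (in billions) from decode
    throughput, assuming bandwidth-bound generation on identical hardware:
        tokens/s ~ bandwidth / (active_params * bytes_per_param)
    so active params scale inversely with throughput."""
    return ref_active_b * (tps_ref / tps_target) * (bytes_ref / bytes_target)

# Hypothetical: a reference MoE with 37B active params decoding 60 tok/s,
# and a target model measured at 45 tok/s, both at the same precision.
est = estimate_active_params(tps_target=45, tps_ref=60, ref_active_b=37)
```

Varying `bytes_target` (e.g. FP8 vs INT4 weights) moves the estimate substantially, which is consistent with the article quoting a 100-154B range rather than a single figure.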
0 5/10

Step-by-step guide for running open-source LLMs locally with Claude Code using llama.cpp, demonstrating deployment of models like Qwen3.5 and GLM-4.7-Flash with quantization and GPU optimization for coding tasks.

Unsloth Claude Code Qwen3.5 GLM-4.7-Flash llama.cpp DeepSeek Gemma Qwen3-Coder-Next OpenAI
unsloth.ai · armcat · 1 day ago · details · hn