model-compression

2 articles

NVIDIA introduces NVFP4, a 4-bit floating-point format for NVIDIA Blackwell GPUs. It enables efficient low-precision inference while preserving model accuracy through a two-level scaling strategy: fine-grained E4M3 block-level scales combined with an FP32 tensor-level scale. The result is a 3.5x smaller memory footprint than FP16 with less than 1% accuracy degradation on language models.

NVIDIA · NVIDIA Blackwell · NVFP4 · MXFP4 · FP4 · E4M3 · Tensor Cores · Eduardo Alvarez · Omri Almog · Eric Chung · Simon Layton · Dusan Stosic · Ronny Krashinsky · Kyle Aubrey
developer.nvidia.com · tosh · 16 hours ago · hn
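As a rough illustration of the two-level scaling idea in the summary above, the sketch below emulates an NVFP4-style quantize/dequantize round trip in NumPy. This is an assumption-laden approximation, not NVIDIA's actual encoding: real hardware stores 4-bit E2M1 values with E4M3 block scales, while here everything (grid lookup, scale rounding) is simulated in full precision.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; the sign is a separate bit
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def nvfp4_roundtrip(x, block=16):
    """Quantize-dequantize sketch of two-level scaling: one FP32
    tensor-level scale plus a per-16-element block scale (stored as
    E4M3 on real hardware; emulated here as plain floats)."""
    flat = np.asarray(x, dtype=np.float32).ravel()
    n = flat.size
    pad = (-n) % block
    blocks = np.pad(flat, (0, pad)).reshape(-1, block)

    # Level 1: tensor-level FP32 scale keeps block scales within E4M3 range
    t_scale = max(float(np.abs(blocks).max()) / (FP4_GRID[-1] * E4M3_MAX), 1e-12)
    # Level 2: fine-grained per-block scale
    b_scale = np.maximum(
        np.abs(blocks).max(axis=1, keepdims=True) / (FP4_GRID[-1] * t_scale),
        1e-12)

    scaled = blocks / (b_scale * t_scale)
    # Round each magnitude to the nearest FP4 grid point, keeping the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * b_scale * t_scale).ravel()[:n].reshape(np.shape(x))
```

Values that land exactly on the E2M1 grid after scaling survive the round trip unchanged; everything else is snapped to the nearest representable magnitude within its 16-element block.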

A technical analysis of sparsity and quantization as hardware optimization strategies for neural networks. It explores the architectural challenges of each (the irregular memory access of unstructured sparse data versus the metadata overhead of quantization) and the compromises modern AI accelerators adopt, such as structured sparsity patterns and algorithm-hardware co-design.

NVIDIA Ampere · EIE · SCNN · BitNet b1.58 · GPTQ · QuIP · SmoothQuant · AWQ · StreamingLLM · OCP Microscaling Formats · Deep Compression
sigarch.org · matt_d · 23 hours ago · hn
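The structured-sparsity compromise mentioned in the summary above can be illustrated with the 2:4 pattern that NVIDIA Ampere sparse Tensor Cores accelerate. The NumPy sketch below is an illustration, not a production pruner: it assumes the weight count is divisible by four and simply keeps the two largest-magnitude weights in every group of four consecutive weights.

```python
import numpy as np

def prune_2_4(weights):
    """2:4 structured sparsity sketch: in every group of 4 consecutive
    weights, keep the 2 with largest magnitude and zero the other 2,
    yielding the regular pattern that sparse hardware can exploit."""
    w = np.asarray(weights, dtype=np.float32)
    groups = w.reshape(-1, 4)  # assumes w.size is a multiple of 4
    # Indices of the two smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)
```

Because exactly two of every four weights are zero, the hardware only needs a small per-group index to locate the survivors, avoiding the bookkeeping chaos of fully unstructured sparsity.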