NVIDIA introduces NVFP4, a 4-bit floating-point format for Blackwell GPUs that enables efficient low-precision inference while preserving model accuracy. Its two-level scaling strategy, fine-grained E4M3 scales per block combined with an FP32 scale per tensor, reduces memory footprint by 3.5x versus FP16 with less than 1% accuracy degradation on language models.
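The two-level scheme can be illustrated with a minimal NumPy round-trip sketch. Assumptions not stated in the summary: blocks of 16 elements, an E4M3 maximum of 448, and an E2M1 (FP4) maximum of 6.0; for simplicity the block scales are kept in float rather than actually cast to E4M3.

```python
import numpy as np

# The positive values representable in E2M1 (FP4)
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x):
    """Round each element to the nearest signed E2M1 value."""
    sign = np.sign(x)
    mag = np.abs(x)
    idx = np.abs(mag[..., None] - FP4_VALUES).argmin(axis=-1)
    return sign * FP4_VALUES[idx]

def nvfp4_roundtrip(x, block=16):
    """Quantize to FP4 with two-level scaling, then dequantize."""
    x = x.reshape(-1, block)
    block_amax = np.abs(x).max(axis=1)
    # Level 1: FP32 per-tensor scale, chosen so the largest per-block
    # scale lands at the top of the E4M3 range (max 448)
    tensor_scale = max(block_amax.max() / (448.0 * 6.0), 1e-12)
    # Level 2: per-block scale mapping each block's amax to FP4 max (6.0);
    # in real NVFP4 this value is itself stored in E4M3
    block_scales = np.maximum(block_amax / 6.0 / tensor_scale, 1e-12)
    combined = (block_scales * tensor_scale)[:, None]
    q = quantize_fp4(x / combined)
    return (q * combined).ravel()
```

With block scaling, the worst-case round-trip error for in-range inputs stays within one FP4 quantization step of the block maximum, which is why the format holds up far better than naive per-tensor 4-bit scaling.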
Systematic benchmarking of NVIDIA Blackwell consumer GPUs for LLM inference across quantization formats and workloads, demonstrating cost-effective private deployment for SMEs with 40-200x lower costs than cloud APIs and sub-second latency for most use cases.
A technical analysis of sparsity versus quantization as hardware optimization strategies for neural networks, exploring their architectural challenges (the chaotic memory-access patterns of unstructured sparsity vs. the metadata overhead of quantization) and the compromises current AI accelerators make, namely structured sparsity patterns and algorithmic co-design techniques.
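To make the structured-sparsity compromise concrete, here is a small sketch of the 2:4 pattern supported by NVIDIA's sparse tensor cores: within every group of four weights, only the two largest-magnitude entries survive, so the hardware can store two values plus 2-bit position metadata per group instead of chasing arbitrary sparse indices. (The function name and test matrix are illustrative, not from the article.)

```python
import numpy as np

def prune_2_of_4(w):
    """Apply 2:4 structured sparsity: in every group of 4 consecutive
    weights, zero out the two with the smallest magnitude."""
    groups = w.reshape(-1, 4)
    # column indices of the two largest-magnitude entries per group
    keep = np.argsort(np.abs(groups), axis=1)[:, 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask).reshape(w.shape)
```

The fixed 50% density is the key design choice: it sacrifices the flexibility of unstructured pruning in exchange for a predictable memory layout the hardware can exploit without per-element index lookups.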
Technical analysis estimating Claude Opus 4.5/4.6 active parameter counts (100-154B depending on quantization scheme) by reverse-engineering token generation throughput ratios on Google Vertex infrastructure and calibrating against known Chinese model specifications.
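The throughput-ratio method rests on a simple identity: in the memory-bandwidth-bound decode regime, tokens/s ≈ bandwidth / (active parameters × bytes per parameter), so on identical infrastructure the inverse ratio of throughputs estimates the ratio of active parameter bytes. A worked sketch, with all numbers hypothetical rather than taken from the article:

```python
def estimate_active_params(known_active_b, known_tps, target_tps,
                           known_bytes=1.0, target_bytes=1.0):
    """Scale a reference model's known active-parameter count (in billions)
    by the inverse throughput ratio, correcting for weight precision."""
    return known_active_b * (known_tps / target_tps) * (known_bytes / target_bytes)

# Hypothetical: a 37B-active reference model decoding at 60 tok/s vs. a
# target model at 18 tok/s on the same hardware, both with 8-bit weights
estimate_active_params(37, 60, 18)  # ≈ 123B active
```

The quantization correction is what produces a range rather than a point estimate: the same throughput ratio implies different parameter counts depending on whether the target's weights are assumed to be 4-bit, 8-bit, or 16-bit.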
Step-by-step guide for running open-source LLMs locally with Claude Code using llama.cpp, demonstrating deployment of models like Qwen3.5 and GLM-4.7-Flash with quantization and GPU optimization for coding tasks.