Systematic benchmarking of NVIDIA Blackwell consumer GPUs for LLM inference across quantization formats and workloads, demonstrating that private deployment is cost-effective for SMEs, with 40-200x lower costs than cloud APIs and sub-second latency for most use cases.
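The cloud-vs-local cost comparison behind a claim like this reduces to simple amortization arithmetic. The sketch below illustrates the shape of that calculation only; every price, wattage, and throughput figure is a hypothetical placeholder, not a number from the benchmark.

```python
# Back-of-envelope cost comparison: cloud API billing vs. amortized local GPU.
# All numbers below are hypothetical placeholders for illustration only.

def cloud_cost_per_mtok(price_per_mtok: float) -> float:
    """Cloud APIs bill per million tokens directly."""
    return price_per_mtok

def local_cost_per_mtok(gpu_price: float, lifetime_years: float,
                        power_watts: float, kwh_price: float,
                        tokens_per_second: float) -> float:
    """Amortize GPU purchase plus electricity over tokens generated,
    assuming the GPU runs saturated for its whole lifetime."""
    seconds = lifetime_years * 365 * 24 * 3600
    total_tokens = tokens_per_second * seconds
    energy_cost = (power_watts / 1000) * (seconds / 3600) * kwh_price
    return (gpu_price + energy_cost) / total_tokens * 1e6

# Hypothetical example: $2000 GPU, 3-year life, 350 W, $0.30/kWh, 50 tok/s.
local = local_cost_per_mtok(2000, 3, 350, 0.30, 50)
cloud = cloud_cost_per_mtok(10.0)  # e.g. a $10-per-Mtok cloud model
print(f"local ~ ${local:.2f}/Mtok, cloud = ${cloud:.2f}/Mtok, "
      f"ratio ~ {cloud / local:.0f}x")
```

The exact ratio depends heavily on utilization: a fully saturated GPU amortizes far better than one serving bursty traffic, which is part of why the reported advantage spans such a wide 40-200x range.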
LAPIS is a compiler framework built on MLIR that optimizes sparse linear algebra operations across diverse architectures, using a Kokkos dialect for performance portability and a partition dialect for distributed-memory execution. The framework demonstrates MLIR's capability to enable linear-algebra-level optimizations for both sparse and dense kernels on GPUs, with applications to graph algorithms, relational databases, and scientific machine learning.
Step-by-step guide to running open-source LLMs locally with Claude Code via llama.cpp, demonstrating deployment of models like Qwen3.5 and GLM-4.7-Flash with quantization and GPU offload for coding tasks.
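Once a quantized model is loaded, llama.cpp's `llama-server` exposes an OpenAI-compatible chat endpoint that tools can target. A minimal sketch of querying it from Python is below; the host, port, and prompt are illustrative defaults, not values taken from the guide, and it assumes a server is already running locally.

```python
import json
from urllib import request

# Illustrative default: llama-server listens on port 8080 and serves an
# OpenAI-compatible /v1/chat/completions endpoint.
LLAMA_SERVER = "http://127.0.0.1:8080/v1/chat/completions"

def build_request(prompt: str, max_tokens: int = 256) -> request.Request:
    """Build an OpenAI-style chat-completion request for the local server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return request.Request(
        LLAMA_SERVER,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Write a Python function that reverses a string.")
# Uncomment when llama-server is running locally:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI wire format, any client or coding assistant that accepts a custom base URL can be pointed at the local server the same way.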
A comprehensive survey of 16 open-source reinforcement learning libraries that implement asynchronous training architectures, analyzing design choices across seven axes, including orchestration, buffer design, weight-sync protocols, staleness management, LoRA support, and distributed backends, and showing how disaggregating inference and training workloads improves GPU utilization.
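Two of the surveyed axes, weight synchronization and staleness management, interact directly: rollout workers generate samples against a possibly stale policy while the trainer keeps updating weights. The sketch below is a generic illustration of that interaction, not the API of any surveyed library; all class and method names are invented for the example.

```python
import queue
import threading

class StalenessBuffer:
    """Illustrative buffer that drops rollouts whose policy version lags
    the trainer's current version by more than max_staleness updates."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self.trainer_version = 0
        self.samples = queue.Queue()
        self.lock = threading.Lock()

    def sync_weights(self) -> None:
        """Called by the trainer after each update (weight-sync axis)."""
        with self.lock:
            self.trainer_version += 1

    def put(self, sample, policy_version: int) -> None:
        """Inference worker submits a rollout tagged with the policy
        version it was generated under (staleness-management axis)."""
        with self.lock:
            if self.trainer_version - policy_version <= self.max_staleness:
                self.samples.put((sample, policy_version))
            # otherwise the rollout is silently dropped as too stale

buf = StalenessBuffer(max_staleness=1)
buf.put("rollout-A", policy_version=0)  # staleness 0 <= 1: accepted
buf.sync_weights()
buf.sync_weights()                      # trainer now at version 2
buf.put("rollout-B", policy_version=0)  # staleness 2 > 1: dropped
print(buf.samples.qsize())  # prints 1
```

Real systems vary along exactly the axes the survey names: some block rollouts during sync instead of dropping them, some reweight stale samples rather than discarding, and some ship only LoRA deltas to keep the sync cheap.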