This article argues that concerns about function call overhead in Rust async code are often unfounded, demonstrating that modern compilers inline small functions in release builds, which makes the cost of indirection negligible compared to actual I/O and system-level operations. The author emphasizes that code readability and maintainability should take priority over micro-optimizations, and provides concrete benchmarking and profiling techniques for measuring real performance impact.
A deep technical exploration of porting a Flash Attention kernel from GPU (Triton) to TPU using JAX, covering the fundamental differences in programming models, compiler behavior, and hardware architectures. The author details how JAX's functional, immutable paradigm and XLA compilation differ from explicit GPU kernel writing, and includes benchmarking and a custom systolic array emulator to understand TPU data flow.
A practical guide to using C++ for bare metal and embedded systems development, covering STL constraints, template metaprogramming, memory management, and real-time system design without OS services. The author demonstrates how C++'s strengths in code reuse and generic programming can benefit embedded developers, using the embxx library and ARM-based examples.