online-softmax


A deep technical exploration of porting a Flash Attention kernel from GPU (Triton) to TPU using JAX, covering the fundamental differences in programming models, compiler behavior, and hardware architectures. The author details how JAX's functional, immutable paradigm and XLA compilation differ from explicit GPU kernel writing, and includes benchmarking and a custom systolic array emulator to understand TPU data flow.
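The "online softmax" of the title is the core numerical trick behind Flash Attention: computing a numerically stable softmax in a single streaming pass by maintaining a running maximum and a rescaled running sum, so full rows never need to be materialized. A minimal sketch of that technique (an illustration of the general algorithm, not the author's TPU kernel):

```python
import math

def online_softmax(xs):
    # Streaming pass: track the running max (m) and the running sum of
    # exponentials (d), rescaling d whenever a new maximum is seen.
    m, d = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # exp(m - m_new) rescales the old sum to the new reference max;
        # on the first step exp(-inf) == 0.0, so the 0.0 sum is preserved.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # A second pass normalizes; Flash Attention instead fuses this
    # rescaling into the accumulation of the attention output.
    return [math.exp(x - m) / d for x in xs]

probs = online_softmax([1.0, 2.0, 3.0])
```

The result matches the standard two-pass softmax, but the max/sum statistics are updated incrementally, which is what lets the kernel process attention scores tile by tile.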

Archer Zhang JAX XLA Triton TPU Colab Flash Attention
archerzhang.me · azhng · 6 days ago