flash-attention

2 articles
6/10

A detailed account of troubleshooting open-source ML infrastructure while post-training the Kimi-K2-Thinking 1T-parameter model, exposing undocumented bugs and inefficiencies in HuggingFace Transformers and quantization libraries that can hide several layers deep in the dependency stack.

Kimi-K2-Thinking HuggingFace LLaMA-Factory KTransformers DeepSeek-V3 PyTorch vLLM compressed_tensors TriviaQA PEFT Transformers
workshoplabs.ai · addiefoote8 · 4 days ago
8/10

A deep technical exploration of porting a Flash Attention kernel from GPU (Triton) to TPU using JAX, covering the fundamental differences in programming models, compiler behavior, and hardware architectures. The author details how JAX's functional, immutable paradigm and XLA compilation differ from explicit GPU kernel writing, and includes benchmarking and a custom systolic array emulator to understand TPU data flow.

Archer Zhang JAX XLA Triton TPU Colab Flash Attention
archerzhang.me · azhng · 6 days ago
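
The programming-model contrast the post describes can be sketched in a few lines of JAX. The snippet below is a minimal illustration, not the author's Triton or Pallas kernel: it writes the flash-attention online-softmax recurrence as a pure functional scan, where jax.lax.scan over K/V tiles stands in for the explicit block loop a GPU kernel would spell out, and XLA decides the actual tiling and memory movement. The function name and block size of 128 are illustrative choices, not taken from the article.

import jax
import jax.numpy as jnp
from functools import partial

@partial(jax.jit, static_argnames="block")
def flash_attention(q, k, v, block=128):
    # Illustrative sketch, not the article's kernel.
    # q: [seq_q, d]; k, v: [seq_k, d]; seq_k must divide evenly into tiles.
    scale = 1.0 / jnp.sqrt(jnp.asarray(q.shape[-1], q.dtype))
    k_tiles = k.reshape(-1, block, k.shape[-1])
    v_tiles = v.reshape(-1, block, v.shape[-1])

    def step(carry, kv):
        m, l, acc = carry                  # running max, normalizer, output
        kt, vt = kv
        s = (q @ kt.T) * scale             # [seq_q, block] scores for this tile
        m_new = jnp.maximum(m, s.max(axis=-1))
        p = jnp.exp(s - m_new[:, None])    # probs relative to the new running max
        corr = jnp.exp(m - m_new)          # rescale stats from earlier tiles
        return (m_new,
                l * corr + p.sum(axis=-1),
                acc * corr[:, None] + p @ vt), None

    init = (jnp.full(q.shape[0], -jnp.inf, q.dtype),
            jnp.zeros(q.shape[0], q.dtype),
            jnp.zeros_like(q))
    (m, l, acc), _ = jax.lax.scan(step, init, (k_tiles, v_tiles))
    return acc / l[:, None]

q = k = v = jnp.ones((256, 64), jnp.float32)
out = flash_attention(q, k, v)  # equals jax.nn.softmax(q @ k.T / 8.0) @ v

Note that everything here is immutable: the running max, normalizer, and output accumulator are threaded through the scan carry rather than updated in place, which is exactly the functional-paradigm shift from explicit kernel writing that the post explores.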