Paper notes: Efficiently Scaling Transformer Inference
Background Efficient inference of massive transformers is difficult and important. The paper (Efficiently Scaling Transformer Inference) from Google discusses their approach to inference of their massive transformer model in generous (perhaps surprising) detail. This post will explain the paper with some of my own notes. All data, tables, and images are from the paper! I’ll assume some familiarity with the decoder-only transformer architecture and MPI collective communication routines (intro here) as well as the basics of Transformer inference (e.