Speculative Decoding
LLM inference will constitute an increasingly large proportion of compute cost. Unfortunately, for autoregressive LLMs, it is slow. Speculative decoding is a clever technique for speeding it up, described in two concurrent papers by Leviathan et al. 2022 and Chen et al. 2023 (somewhat amusingly, from Google Research and DeepMind respectively). I'll explain the technique, its derivation, and newer variants in this post.
Autoregressive sampling is typically memory-bandwidth bound: tokens are sampled one at a time, and each step must stream the full model weights from memory to produce a single token, leaving most of the accelerator's compute idle. Speculative decoding exploits this slack by having a small draft model propose several tokens cheaply, then verifying them all with a single forward pass of the large target model.
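To make the accept/reject scheme concrete, here's a minimal numpy sketch of one speculative decoding round. `target_probs` and `draft_probs` are hypothetical stand-ins for real model forward passes, and a real implementation would score all drafted positions with one batched target-model call rather than a Python loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(target_probs, draft_probs, prefix, k):
    """One round of speculative decoding.

    target_probs(prefix) -> next-token distribution from the large model
    draft_probs(prefix)  -> next-token distribution from the small draft model
    Returns the tokens produced this round (always at least one).
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    drafted, q = [], []
    for _ in range(k):
        dist = draft_probs(prefix + drafted)
        tok = rng.choice(len(dist), p=dist)
        drafted.append(tok)
        q.append(dist)

    # 2. Target model scores all k+1 positions; in a real system this is
    #    a single forward pass over the drafted sequence.
    p = [target_probs(prefix + drafted[:i]) for i in range(k + 1)]

    # 3. Accept drafted token i with probability min(1, p_i(x) / q_i(x));
    #    this rejection scheme makes the output exactly match the target
    #    model's distribution.
    out = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = np.maximum(p[i] - q[i], 0.0)
            residual /= residual.sum()
            out.append(rng.choice(len(residual), p=residual))
            return out  # stop after the first rejection

    # All k drafts accepted: the target's scores at position k give us one
    # bonus token for free.
    out.append(rng.choice(len(p[k]), p=p[k]))
    return out
```

The payoff: when the draft model agrees with the target, each expensive target-model pass yields up to k+1 tokens instead of one, while the rejection rule guarantees the samples are still distributed exactly as the target model alone would produce.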
Stuff I've been reading #1
Nintil: Massive Input vs Spaced Repetition
Is deep learning a "formal subject"? Deep models and their properties are complex and empirical, like biology; it's one of the few branches of computer science that is experimental. Most hot new papers will be forgotten. So for learning deep learning quickly, I'd lean towards massive input.
Of course, some fundamentals are worth memorizing, particularly on the engineering side. Deeply understanding Efficiently Scaling Transformer Inference was very difficult without a firm grasp of tensor parallelism, so studying the Megatron paper first helped a lot.