Stuff I've been reading #1


Nintil: Massive Input vs Spaced Repetition

Is deep learning a “formal subject”? Mostly not: deep models and their properties are complex and empirical, like biology, and it is one of the few branches of computer science that is genuinely experimental. Most hot new papers will be forgotten. So I’d lean towards massive input for learning deep learning quickly.

Of course, some fundamentals are worth memorizing, particularly on the engineering side. Trying to deeply understand Efficiently Scaling Transformer Inference without a firm grasp of tensor parallelism was rough going, so studying the Megatron-LM paper first helped a lot.
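
To make the tensor-parallelism idea concrete, here is a toy NumPy sketch of the Megatron-style column-parallel / row-parallel MLP split. The names and shapes are mine; a real implementation shards the weights across devices and uses an all-reduce collective instead of the in-process sum below.

```python
# Minimal NumPy sketch of Megatron-style tensor parallelism for an MLP block.
# Names/shapes are illustrative; on real hardware each shard lives on its own
# device and the final sum is a single all-reduce.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_shards = 8, 32, 4

x = rng.normal(size=(2, d_model))        # (batch, d_model)
W1 = rng.normal(size=(d_model, d_ff))    # up-projection
W2 = rng.normal(size=(d_ff, d_model))    # down-projection

# Reference: unsharded forward pass.
ref = np.maximum(x @ W1, 0) @ W2

# Column-parallel first layer: each shard owns a slice of W1's columns,
# so the ReLU can be applied locally with no communication.
W1_shards = np.split(W1, n_shards, axis=1)
# Row-parallel second layer: each shard owns the matching slice of W2's rows.
W2_shards = np.split(W2, n_shards, axis=0)

# Each "device" computes a partial output; summing the partials stands in
# for the single all-reduce at the end of the block.
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
out = sum(partials)

assert np.allclose(out, ref)
```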

Ilya on running out of data

I think the above insight (some sources of data are much more interesting than others) is one reason Ilya appears unconcerned about running out of data. I’d suspect synthetic data is also part of the equation, as many are guessing: spending more compute at inference time, via search under some objective or chain-of-thought, can produce data that carries more knowledge. It might also be possible to rearrange (pre-)training data so that an LM has to build very strong models of the world in order to predict the next word.

Interesting fireside chat, though, and this entire series from the Simons Institute on LLMs is nice.

Sasha Rush’s GPU Puzzles

These were pretty fun. Sometimes the rendering fails even though the actual problem run succeeds; the fix for me was guarding against an out-of-bounds array access.
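
For reference, the usual fix looks something like the guard below. This is my own numba-cuda sketch rather than the puzzles’ scaffolding, and it needs a CUDA-capable GPU to actually run.

```python
# A sketch of the typical bounds guard: the puzzles often launch more threads
# than there are elements, so an unguarded write runs past the end of the array.
from numba import cuda
import numpy as np

@cuda.jit
def add_ten(out, a, size):
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < size:              # skip threads that fall past the end of the array
        out[i] = a[i] + 10

a = np.arange(7, dtype=np.float32)
out = np.zeros_like(a)
add_ten[2, 4](out, a, a.size)  # 8 threads for 7 elements; the guard handles the extra one
print(out)
```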

Simon Boehm: How to Optimize a CUDA Matmul Kernel

Great follow-up to GPU puzzles & very nice visualizations.
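
The core trick his kernels build toward is tiling for data reuse. Here is a NumPy-only sketch of the blocking idea (not his CUDA code): compute C one tile at a time so each pass reuses small slices of A and B, which on the GPU would sit in shared memory.

```python
import numpy as np

def blocked_matmul(A, B, block=32):
    # Tiled matmul: same result as A @ B, computed block by block.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, block):
        for j in range(0, N, block):
            # Accumulate the (i, j) output tile one K-slice at a time.
            for k in range(0, K, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
                )
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3)
```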

C.S. Lewis: The Inner Ring

I won’t provide an excerpt here, but one should sometimes step back from the object level (or the first-order meta-level!) and think about what makes up a good life.

phi-1.5 technical report

Just as not all nats are the same within a training run, not all perplexities are the same between datasets. A model that can predict the next token of a textbook or a conversation knows a lot more about the world than one that can predict the next token of, say, a Buzzfeed article. Remember that the scaling constants are dataset-specific, which is why trying to fit functional forms to downstream tasks is a worthwhile pursuit (though those fits seem less clean and Fundamental). Caveat: I don’t think this particular paper is likely to hold up (possible test-set memorization), nor do I think 1B-scale models are particularly likely to reason under the current training paradigm.
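
As a toy illustration of what fitting a functional form to a downstream metric can look like, here is a scipy sketch on synthetic points. The form acc(c) = a - b * c^(-alpha) and every number below are made up for illustration; nothing comes from the phi-1.5 paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(c, a, b, alpha):
    # c is compute relative to the smallest run, which keeps the fit well scaled.
    return a - b * c ** (-alpha)

rng = np.random.default_rng(0)
compute = np.logspace(0, 5, 8)                        # relative compute, 1x .. 100,000x
acc = saturating_power_law(compute, 0.85, 0.5, 0.3)   # "true" curve (made up)
acc += rng.normal(scale=0.005, size=acc.size)         # a little evaluation noise

(a, b, alpha), _ = curve_fit(saturating_power_law, compute, acc, p0=[0.8, 0.4, 0.2])
print(f"acc(c) ~= {a:.2f} - {b:.2f} * c^(-{alpha:.2f})")
print("extrapolated accuracy at 1e6x compute:", saturating_power_law(1e6, a, b, alpha))
```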

Books:

Computer Architecture: A Quantitative Approach

Reasonably approachable hardware book if you have a software background (though I’ve yet to read the instruction-level parallelism & multicore chapters). It scratches the itch to understand how computers work one layer under the software. I appreciate the quantitative approach, which helps build intuition (e.g. it reminds me of Cell Biology by the Numbers and a cracked-out version of Jeff Dean’s numbers).
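
In that spirit, a quick roofline-style back-of-envelope, with ballpark accelerator numbers I’m assuming rather than quoting from the book:

```python
# Is a large square matmul compute-bound or memory-bound on a hypothetical accelerator?
peak_flops = 300e12        # assumed: ~300 TFLOP/s of fp16 compute
mem_bandwidth = 1.5e12     # assumed: ~1.5 TB/s of HBM bandwidth

M = N = K = 4096
flops = 2 * M * N * K                         # multiply-adds for C = A @ B
bytes_moved = 2 * (M * K + K * N + M * N)     # fp16: read A and B, write C once each

arithmetic_intensity = flops / bytes_moved    # FLOPs needed per byte moved
ridge_point = peak_flops / mem_bandwidth      # roofline "ridge": FLOPs per byte the chip can feed
print(f"arithmetic intensity: {arithmetic_intensity:.0f} FLOP/byte")
print(f"ridge point:          {ridge_point:.0f} FLOP/byte")
print("compute-bound" if arithmetic_intensity > ridge_point else "memory-bound")
```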