Stanford CS336 Lecture Notes 3 - Architectures and Hyperparameters
Notes on transformer architectures, normalization strategies, positional embeddings (sinusoidal, absolute, relative, ALiBi, RoPE), activation functions, attention variants, and hyperparameter choices.