Optimization Methods

Batch Normalization

Batch normalization is a technique that normalizes the inputs to successive layers of a neural network over each mini-batch, giving them zero mean and unit variance before applying a learned scale and shift. This stabilizes the distribution of layer inputs, allowing higher learning rates without per-layer tuning and easing the training of very deep neural networks. According to the original paper, batch normalization alone reduced training time for an ImageNet model by roughly an order of magnitude.
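
A minimal sketch of the forward pass in NumPy (illustrative only: the names `batch_norm`, `gamma`, and `beta` are my own, and this omits the running statistics a real layer tracks for use at inference time):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature to zero mean and unit variance over the
    # mini-batch, then apply the learned scale (gamma) and shift (beta)
    # so the layer can still represent any desired distribution.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy usage: a mini-batch of 4 examples with 3 features each.
x = np.random.randn(4, 3) * 10.0 + 5.0     # poorly scaled activations
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))   # ~0 and ~1 per feature
```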

Resources
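
Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" — https://arxiv.org/abs/1502.03167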

8-bit matrix multiplication

Can be used to perform inference and training using much less GPU memory. Empirically there appears to be little downside in model quality for language models up to 175B parameters (the scale studied in the LLM.int8() paper).
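
A minimal sketch of the underlying idea in NumPy, assuming simple per-tensor absmax scaling (the helper names are hypothetical; production libraries such as bitsandbytes use finer-grained vector-wise scaling and keep outlier features in higher precision):

```python
import numpy as np

def quantize_absmax(x):
    # Map the largest magnitude in the tensor to 127 and round to int8.
    scale = 127.0 / np.max(np.abs(x))
    return np.round(x * scale).astype(np.int8), scale

def int8_matmul(a, b):
    # Quantize both operands to int8, accumulate in int32 to avoid
    # overflow, then rescale the result back to float.
    a_q, a_scale = quantize_absmax(a)
    b_q, b_scale = quantize_absmax(b)
    c_q = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return c_q / (a_scale * b_scale)

a = np.random.randn(8, 16).astype(np.float32)
b = np.random.randn(16, 4).astype(np.float32)
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))  # small quantization error
```

The memory saving comes from storing each element in one byte instead of two (fp16) or four (fp32); the cost is the rounding error visible in the printed difference.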