Optimization Methods

Batch Normalization

Batch normalization is a technique that normalizes the inputs to successive layers of a neural network over each mini-batch, giving them zero mean and unit variance before applying a learned scale and shift. This stabilizes the distribution of layer inputs, allowing higher learning rates without per-layer tuning and easing the training of very deep neural networks. According to the original paper, batch normalization alone reduced training time for an ImageNet model by roughly an order of magnitude.
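
A minimal sketch of the forward pass in NumPy (illustrative only: the names `batch_norm`, `gamma`, and `beta` are my own, and this omits the running statistics a real layer tracks for use at inference time):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature to zero mean and unit variance over the
    # mini-batch, then apply the learned scale (gamma) and shift (beta)
    # so the layer can still represent any desired distribution.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy usage: a mini-batch of 4 examples with 3 features each.
x = np.random.randn(4, 3) * 10.0 + 5.0     # poorly scaled activations
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))   # ~0 and ~1 per feature
```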

Resources
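
Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" — https://arxiv.org/abs/1502.03167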

8-bit matrix multiplication

Can be used to perform inference and training using much less GPU memory. Empirically there appears to be little downside in model quality for language models up to 175B parameters (the scale studied in the LLM.int8() paper).
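
A minimal sketch of the underlying idea in NumPy, assuming simple per-tensor absmax scaling (the helper names are hypothetical; production libraries such as bitsandbytes use finer-grained vector-wise scaling and keep outlier features in higher precision):

```python
import numpy as np

def quantize_absmax(x):
    # Map the largest magnitude in the tensor to 127 and round to int8.
    scale = 127.0 / np.max(np.abs(x))
    return np.round(x * scale).astype(np.int8), scale

def int8_matmul(a, b):
    # Quantize both operands to int8, accumulate in int32 to avoid
    # overflow, then rescale the result back to float.
    a_q, a_scale = quantize_absmax(a)
    b_q, b_scale = quantize_absmax(b)
    c_q = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return c_q / (a_scale * b_scale)

a = np.random.randn(8, 16).astype(np.float32)
b = np.random.randn(16, 4).astype(np.float32)
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))  # small quantization error
```

The memory saving comes from storing each element in one byte instead of two (fp16) or four (fp32); the cost is the rounding error visible in the printed difference.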