Batch normalization is a technique by which the inputs to successive layers of a neural network are normalized, over each mini-batch, to zero mean and unit variance, followed by a learned scale and shift. Because every layer then sees similarly scaled inputs, a single, higher learning rate can be used for the whole network instead of tuning it per layer, which eases the training of very deep neural networks. According to the original paper, batch normalization alone let an ImageNet model reach the same accuracy with roughly an order of magnitude fewer training steps.
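As a rough illustration, here is a minimal NumPy sketch of the batch-norm transform at training time; the function and variable names are illustrative, not from the paper or any particular library.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance,
    then apply a learned per-feature scale (gamma) and shift (beta).

    x: array of shape (batch_size, num_features)
    """
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta               # learned scale and shift

# Example: a batch of 4 examples with 3 poorly scaled features
x = np.random.randn(4, 3) * 10 + 5
out = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))      # ~0 mean and ~1 variance per feature
```

At inference time, frameworks typically replace the batch statistics with running averages collected during training, so single examples can be normalized without a batch; that bookkeeping is omitted from this sketch.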
Resources
Can be used to perform inference and training with much less GPU memory. Empirically, there appears to be little downside for language models with up to 175B parameters.