Types of Networks

Transformer networks

Transformer networks were introduced by the seminal paper Attention Is All You Need (Vaswani et al., 2017), replacing the previous state of the art for sequence modeling, recurrent neural networks (RNNs) based on LSTM units. RNNs were difficult to train because successive hidden states must be computed sequentially, which prevents parallelization across the sequence. Transformer networks instead use a mechanism called attention to selectively attend to different parts of the input, so all positions can be processed in parallel. As the name implies, transformers are often used to transform one kind of sequential input into another, e.g. in machine translation.


Attention Mechanism

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V\]

The attention mechanism is often analogized to a database lookup. Given a query \(Q\), a set of keys \(K\), and values \(V\), the attention mechanism computes a similarity measure between the elements of \(Q\) and the elements of \(K\), then returns a weighted sum of the values \(V\) using those similarity measures as weights. \(QK^\mathrm{T}\) produces raw logits, which are divided by \(\sqrt{d_k}\) (the square root of the key dimension) to keep the softmax inputs in a range where gradients remain stable.
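The formula above can be sketched directly in numpy. This is a minimal illustration of scaled dot-product attention, not a production implementation; the shapes and random inputs are my own assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # raw logits, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V                   # weighted sum of values

# toy shapes: 4 queries, 6 key/value pairs, d_k = 8, d_v = 16
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 16))
out = attention(Q, K, V)  # shape (4, 16): one output row per query
```

Each output row is a convex combination of the value rows, since the softmax weights are nonnegative and sum to one.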


Attention units are often extended into multiple parallel units called “heads”, with the model dimension split between them: each head applies its own learned projections to the queries, keys, and values, computes attention independently in its lower-dimensional subspace, and the head outputs are concatenated and passed through a final output projection. The motivation is that different heads can learn to attend to different kinds of relationships in the input.
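The split-compute-concatenate pattern can be sketched as follows. This is a simplified self-attention example under my own assumptions (single sequence, square projection matrices, no masking or bias terms):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (seq, d_model); all weight matrices: (d_model, d_model)
    seq, d_model = X.shape
    d_head = d_model // n_heads

    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    def split(M):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)

    # scaled dot-product attention per head, batched over the head axis
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                   # (n_heads, seq, d_head)

    # concatenate heads back to (seq, d_model) and apply output projection
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, d_model = 8
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=2)  # (4, 8)
```

Note that each head only sees a d_model / n_heads slice of the projected representation, so the total cost is comparable to one full-dimension attention unit.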



Residual Networks (ResNets)

The 2015 paper Deep Residual Learning for Image Recognition introduced skip-connections (residual connections) to deep networks. The researchers note that this technique had been used before in simple feed-forward networks. The idea is fairly simple and the paper is straightforward. ResNets remain in common use today for classification tasks, and the paper reports that the network achieved a 3.57% top-5 error on the ImageNet test set when introduced.
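The core idea is that a block learns a residual function F(x) and outputs F(x) + x, so the identity is easy to represent and gradients can flow through the skip path. A minimal sketch with dense layers (my own toy shapes, not the convolutional blocks from the paper):

```python
import numpy as np

def residual_block(x, W1, W2):
    # plain two-layer block, but with the input added back before
    # the final activation: output = relu(F(x) + x)
    h = np.maximum(0.0, x @ W1)   # first layer + ReLU
    return np.maximum(0.0, x + h @ W2)

# if the weights are all zero, F(x) = 0 and the block reduces to relu(x):
x = np.array([[1.0, 2.0], [3.0, 4.0]])
W1 = np.zeros((2, 2))
W2 = np.zeros((2, 2))
out = residual_block(x, W1, W2)   # equals x here, since x is nonnegative
```

This is the intuition behind why very deep ResNets train well: an untrained block starts close to the identity instead of close to noise.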


U-Net

U-Net is a convolutional neural network that was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg. [1]

[1] https://en.wikipedia.org/wiki/U-Net

The architecture diagram has a “U” shape, hence the name: a contracting (downsampling) path, a bottleneck, and an expanding (upsampling) path, with skip connections carrying features from each contracting level to the matching expanding level. U-Nets are also used as the denoising backbone in diffusion networks.
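The down/up/skip structure can be sketched with 1-D arrays. This is only a shape-level toy under my own assumptions (average-pool downsampling, nearest-neighbour upsampling, one bottleneck layer), not the convolutional architecture from the paper:

```python
import numpy as np

def down(x):
    # 2x average-pool along the "spatial" axis (rows)
    return x.reshape(x.shape[0] // 2, 2, x.shape[1]).mean(axis=1)

def up(x):
    # nearest-neighbour 2x upsample along the spatial axis
    return np.repeat(x, 2, axis=0)

def tiny_unet(x, W_mid, W_out):
    skip = x                                   # saved for the skip connection
    h = down(x)                                # contracting path
    h = np.maximum(0.0, h @ W_mid)             # "bottleneck"
    h = up(h)                                  # expanding path
    h = np.concatenate([h, skip], axis=1)      # skip concat: the two arms of the "U"
    return h @ W_out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))        # 8 spatial positions, 4 channels
W_mid = rng.normal(size=(4, 4))
W_out = rng.normal(size=(8, 4))    # input doubles to 8 channels after concat
out = tiny_unet(x, W_mid, W_out)   # shape (8, 4), same resolution as the input
```

The key point the sketch shows is that the skip connection lets high-resolution detail bypass the downsampled bottleneck and rejoin the output.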