# Notes: Attention! Transformers!

Sep. 7, 2020

Attention mechanisms, used in encoder-decoder recurrent neural networks, are able to align an input sequence with an output sequence. The Transformer, described in “Attention Is All You Need”, is a network architecture capable of sequence-to-sequence translation without using recurrence, thus allowing parallelization.

## Encoder-Decoder with attention mechanisms

In order to solve sequence transduction or sequence-to-sequence problems (i.e. from $(x_1, x_2, …, x_n)$ to $(y_1, y_2, …, y_m)$), we can use an encoder followed by a decoder. In neural machine translation, standard encoder-decoder networks aim at “encoding a source sentence into a fixed-length vector from which a decoder generates a translation” [1]. According to [1], “the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture”: compressing all the source sentence information into a single fixed-length vector is difficult.

Attention mechanisms, first introduced by [1], do not reduce all the information to a single vector; instead, the model “encodes the input sequence into a sequence of vectors, and chooses a subset of these vectors adaptively while decoding the translation”.

During each encoding step (from $1$ to $n$), an input hidden state $h_j$ is generated by the recurrent neural network. A context vector $c_i$ is built from all those $n$ hidden states. In Bahdanau attention [1], the context vector is computed as the weighted sum of the input hidden states $h_j$. Each weight is computed by applying the softmax function to the alignment model. The alignment model is given by a feedforward neural network that takes the input hidden state $h_j$ and the previous output hidden state $s_{i-1}$.

Some other similar attention mechanisms use different alignment models.

In [4], the attention function was generalised using a query, keys and values that give an output:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values […]. [5]

$$\text{attention function} = \text{Weighted sum of the values, weight computed with a compatibility function between query and key} \\ \Rightarrow \text{output} = \sum \text{values}\times\text{compatibility-function(query, key)}$$

With this, we can write Bahdanau’s attention as:

\begin{align} \text{context} & = \sum h_j \times \text{NN-softmax}(s_{i-1}, h_j) \\ &\text{where:} \\ \text{keys} & = h_j \\ \text{values} & = h_j \\ \text{query} &= s_{i-1} \\ \text{output} &= \text{context} \end{align}
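As a concrete illustration of the equations above, here is a minimal NumPy sketch of Bahdanau attention for one decoding step. The dimensions, the random weights, and the exact form of the alignment network (`tanh` of two linear projections, a common formulation of [1]) are illustrative assumptions, not the authors’ exact configuration:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bahdanau_context(s_prev, h, W_a, U_a, v_a):
    """Context vector for one decoding step.

    s_prev : previous decoder hidden state s_{i-1}, shape (d_s,)
    h      : encoder hidden states h_j (keys = values), shape (n, d_h)
    W_a, U_a, v_a : weights of the alignment feedforward network (assumed form)
    """
    # Alignment model: e_j = v_a . tanh(W_a s_{i-1} + U_a h_j)
    e = np.tanh(s_prev @ W_a + h @ U_a) @ v_a   # shape (n,)
    alpha = softmax(e)                          # attention weights, sum to 1
    return alpha @ h                            # weighted sum of the values

# Toy example with random weights and hypothetical dimensions.
rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 5, 4, 3, 6
context = bahdanau_context(
    rng.normal(size=d_s), rng.normal(size=(n, d_h)),
    rng.normal(size=(d_s, d_a)), rng.normal(size=(d_h, d_a)),
    rng.normal(size=d_a),
)
print(context.shape)  # (4,)
```

The context vector has the same dimension as one encoder hidden state, since it is a convex combination of them.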

### Global vs Local Attention

Global attention allows the model to attend to all the source positions, whereas local attention attends to only a few source positions [2]. Attending to all source words is computationally expensive, hence local attention limits the alignment to a small context window.
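A minimal sketch of the local idea: mask the attention weights outside a window around a chosen position, then renormalize. This is a simplified monotonic version; [2] additionally predicts the window center, which is omitted here:

```python
import numpy as np

def local_attention_weights(scores, center, window=2):
    """Softmax over scores, restricted to [center-window, center+window].

    scores : unnormalized alignment scores, shape (n,)
    center : the source position the window is centered on (assumed given)
    """
    n = scores.shape[0]
    mask = np.zeros(n)
    lo, hi = max(0, center - window), min(n, center + window + 1)
    mask[lo:hi] = 1.0
    e = np.exp(scores - scores.max()) * mask  # zero weight outside the window
    return e / e.sum()                        # renormalize inside the window

w = local_attention_weights(np.arange(8, dtype=float), center=3, window=1)
print(np.nonzero(w)[0])  # [2 3 4]
```

Only the positions inside the window receive non-zero weight, so the weighted sum over values touches a constant number of source states instead of all $n$.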

In [3], hard and soft attention were presented, which are similar to local and global attention:

• Soft-attention: similar to global attention, where “weights are softly placed over all patches”
• Hard-attention: similar to local attention, but more complicated and not differentiable

### Self-attention

Self-attention (or intra-attention), introduced in [4] also creates three vectors from the encoder input: the query, the key and the value.

$$\text{context}_t = \sum_j \text{value}_j \times \text{softmax}\left(\frac{\text{query}_t \cdot \text{key}_j^T}{\sqrt{d_{\text{key}}}}\right)$$

Query, key and value (at time step $t$) are vectors extracted from the input embedding $t$: we compute them by multiplying the input $t$ by the trained weight matrices $W^Q$, $W^K$ and $W^V$. Self-attention is attention between words within the same sentence, whereas Bahdanau’s attention, for example, attends between words of the source sequence and the output sequence (between the encoder and the decoder).
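Putting the projections and the scaled dot-product together, self-attention over a whole sequence can be sketched in NumPy as follows (dimensions and random weights are illustrative):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a sequence.

    X : input embeddings, shape (n, d_model)
    W_Q, W_K, W_V : trained projection matrices, shape (d_model, d_k)
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # compatibility: scaled dot product
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V                           # weighted sum of the values

rng = np.random.default_rng(1)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
out = self_attention(X, rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))
print(out.shape)  # (4, 8)
```

Every row of the output is a context vector for the corresponding input position, built from all positions of the same sentence.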

##### What’s the difference between this self-attention mechanism and the standard attention?
• In “standard” attention, the query vector is the decoder RNN hidden state: it doesn’t come from the encoder network. Here, the query is just a transformation of the input $x_t$ using the weight matrix $W^Q$: the data comes from the same network.
• The compatibility function here is a scaled dot product followed by a softmax, whereas in standard attention it is the softmax of the output of a feedforward neural network.
• Key and value vectors are here different vectors, whereas in standard attention both were the encoder hidden state $h_j$.

The difference between intra and inter attention is described in [6]:

• Intra-attention : inside the encoder or decoder themselves
• Inter-attention : “standard” attention, between the encoder and the decoder

##### How are the weight matrices $W^Q$, $W^K$ and $W^V$ learned?

They are learned during training, like any other network parameters.

Using multi-head attention, as described in [5], not a single self-attention function is computed: instead, $h$ different self-attention “heads” are computed, concatenated together, and then projected again using another weight matrix $W^O$. Each attention head is computed in parallel.
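The head-concatenate-project scheme can be sketched as below, reusing a scaled dot-product attention function. The head count, dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (softmax over key positions).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

def multi_head(X, heads, W_O):
    """heads : list of (W_Q, W_K, W_V) tuples, one per head.

    Each head attends independently (in practice, in parallel);
    the head outputs are concatenated and projected with W_O.
    """
    outs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O

rng = np.random.default_rng(2)
n, d_model, h = 4, 8, 2
d_k = d_model // h  # each head works in a smaller subspace, as in [5]
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head(rng.normal(size=(n, d_model)), heads, W_O)
print(out.shape)  # (4, 8)
```

Because each head projects into a subspace of dimension $d_{\text{model}}/h$, the total cost is similar to a single full-dimension attention.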

## The Transformer

Like classic recurrent (RNN) encoder-decoder networks, the Transformer is made of an encoder connected to a decoder. In “Attention Is All You Need”, 6 encoder layers are stacked together, followed by 6 decoder layers.

Three types of multi-head attention are used in the Transformer:

• Self-attention inside the encoder layers.
• Self-attention inside the decoder layers.
• Inter-attention / standard attention inside the decoder, with queries from the decoder and keys and values from the encoder.

With this layout, an encoder layer is made of a self-attention component followed by a feed-forward NN. A decoder layer contains a self-attention component, a classical encoder-decoder inter-attention component, and then another feed-forward NN.
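The encoder-layer layout above can be sketched as follows. This is a simplified sketch with illustrative dimensions: the paper additionally wraps each sub-layer in a residual connection and layer normalization, which are omitted here:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    # Scaled dot-product self-attention, as in the previous sketches.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

def encoder_layer(X, attn_weights, W_1, b_1, W_2, b_2):
    """One simplified encoder layer: self-attention, then a
    position-wise feed-forward network applied to every position."""
    A = self_attention(X, *attn_weights)
    return np.maximum(0.0, A @ W_1 + b_1) @ W_2 + b_2  # ReLU feed-forward

rng = np.random.default_rng(3)
n, d_model, d_ff = 4, 8, 16
X = rng.normal(size=(n, d_model))
attn_w = tuple(rng.normal(size=(d_model, d_model)) for _ in range(3))
out = encoder_layer(X, attn_w,
                    rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                    rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)  # (4, 8)
```

Since the output has the same shape as the input, such layers can be stacked, 6 times in the paper, with each layer’s output feeding the next.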