Notes: Attention! Transformers!

Sep. 7, 2020

Attention mechanisms, used in encoder-decoder recurrent neural networks, are able to align an input sequence with an output sequence. The Transformer, described in “Attention Is All You Need”, is a network architecture capable of sequence-to-sequence translation without using recurrence, which allows parallelization.

Encoder-Decoder with attention mechanisms

In order to solve sequence transduction or sequence-to-sequence problems (i.e. from $(x_1, x_2, …, x_n)$ to $(y_1, y_2, …, y_m)$), we can use an encoder followed by a decoder. In neural machine translation, standard encoder-decoder networks aim at “encoding a source sentence into a fixed-length vector from which a decoder generates a translation” [1]. According to [1], “the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture”. Compressing all the information of the source sentence into a single fixed-length vector is hard.

The attention mechanism, first introduced in [1], does not try to squeeze all the information into a single vector. Instead, it “encodes the input sequence into a sequence of vectors, and chooses a subset of these vectors adaptively while decoding the translation”.

During each encoding step $j$ (from $1$ to $n$), an input hidden state $h_j$ is generated by the recurrent neural network. For each decoding step $i$, a context vector $c_i$ is built from all $n$ of these hidden states. In Bahdanau attention [1], the context vector is computed as the weighted sum of the input hidden states $h_j$. The weights are obtained by applying the softmax function to the scores of the alignment model. The alignment model is a feed-forward neural network that takes the input hidden state $h_j$ and the previous output hidden state $s_{i-1}$.
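The computation of one context vector can be sketched in plain Python. This is only an illustration: the additive scoring form $v \cdot \tanh(W_s s_{i-1} + W_h h_j)$ follows [1], but the names `W_s`, `W_h`, `v` and all the numbers below are made up for the example, not trained parameters.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def alignment_score(s_prev, h_j, W_s, W_h, v):
    """Feed-forward alignment model: v . tanh(W_s s_prev + W_h h_j)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_s, s_prev), matvec(W_h, h_j))]
    return sum(v_k * z for v_k, z in zip(v, hidden))

def bahdanau_context(s_prev, encoder_states, W_s, W_h, v):
    """Return (context vector c_i, attention weights) for one decoding step."""
    scores = [alignment_score(s_prev, h_j, W_s, W_h, v) for h_j in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return context, weights

# Toy example: three encoder hidden states h_j and one decoder state s_{i-1}.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_s = [[0.1, 0.2], [0.3, 0.4]]    # hypothetical parameters
W_h = [[0.5, -0.1], [0.2, 0.3]]   # hypothetical parameters
v = [1.0, -1.0]                   # hypothetical parameters
context, weights = bahdanau_context(s_prev, encoder_states, W_s, W_h, v)
```

The weights form a probability distribution over the source positions, so the context vector is always a convex combination of the encoder hidden states.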

Other, similar attention mechanisms use different alignment models.

In [5], the attention function was generalised using a query and a set of keys and values that produce an output:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values […]. [5]

$$ \text{attention function} = \text{weighted sum of the values, weights computed with a compatibility function between query and key} \\ \Rightarrow \text{output} = \sum \text{values}\times\text{compatibility-function(query, key)} $$

With this, we can write Bahdanau’s attention as:

$$ \begin{align} \text{context} & = \sum h_j \times \text{NN-softmax}(s_{i-1}, h_j) \\ &\text{where:} \\ \text{keys} & = h_j \\ \text{values} & = h_j \\ \text{query} &= s_{i-1} \\ \text{output} &= \text{context} \end{align} $$
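The query/key/value view maps naturally onto a function that takes the compatibility function as a parameter. The sketch below is an assumption-laden illustration: unlike the formula above, the softmax is applied separately from the compatibility function, and a plain dot product is swapped in for Bahdanau's feed-forward alignment model to keep the example short.

```python
import math

def attention(query, keys, values, compatibility):
    """Weighted sum of values; weights = softmax of compatibility(query, key)."""
    scores = [compatibility(query, key) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * val[k] for w, val in zip(weights, values))
              for k in range(dim)]
    return output, weights

def dot_product(query, key):
    """A simple compatibility function (not the one used in [1])."""
    return sum(q * k for q, k in zip(query, key))

# Bahdanau-style roles: keys and values are both the encoder states h_j,
# and the query is the previous decoder state s_{i-1}.
h = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [2.0, 0.0]
context, weights = attention(s_prev, keys=h, values=h, compatibility=dot_product)
```

Here the first encoder state is more compatible with the query than the second, so it receives the larger weight and dominates the context.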

Global vs Local Attention

Global attention attends to all the source positions, whereas local attention attends to only a few source positions [2]. Attending to all source words is computationally expensive, which is why local attention limits the alignment to a small context window.
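The difference can be illustrated with a small sketch. It is a simplification, not the method of [2]: the window centre is passed in by hand rather than predicted, the score is a plain dot product, and the parameter names `center` and `half_width` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def global_attention_weights(query, source_states):
    """One non-zero weight per source position."""
    return softmax([dot(query, h) for h in source_states])

def local_attention_weights(query, source_states, center, half_width):
    """Weights only inside [center - half_width, center + half_width]; zero elsewhere."""
    lo = max(0, center - half_width)
    hi = min(len(source_states), center + half_width + 1)
    window = softmax([dot(query, h) for h in source_states[lo:hi]])
    return [0.0] * lo + window + [0.0] * (len(source_states) - hi)

source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
query = [1.0, 1.0]
global_w = global_attention_weights(query, source_states)
local_w = local_attention_weights(query, source_states, center=2, half_width=1)
```

With the window above, only positions 1 to 3 are scored, so the cost per decoding step depends on the window size rather than on the source length.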

Hard and soft attention were presented in [3]; they are similar to local and global attention.


Self-attention (or intra-attention), introduced in [4], also creates three vectors from the encoder input: the query, the key and the value.

$$ \text{context}_t = \sum_j \text{value}_j \times \text{softmax}_j\left(\frac{\text{query}_t \cdot \text{key}_j^T}{\sqrt{d_{\text{key}}}}\right) $$

The query, key and value at time step $t$ are vectors extracted from the input embedding at position $t$: we compute them by multiplying input $t$ by the trained weight matrices $W^Q$, $W^K$ and $W^V$. Self-attention is attention between words within the same sentence, whereas Bahdanau’s attention, for example, attends between the words of the source sequence and those of the output sequence (between the encoder and the decoder).
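A minimal sketch of scaled dot-product self-attention, in plain Python, may help make the roles of $W^Q$, $W^K$ and $W^V$ concrete. The embeddings and weight matrices below are made-up values chosen for illustration; in a real model they would be learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over the rows of X (one row per word)."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    contexts, all_weights = [], []
    for q in Q:  # one query per position t
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        contexts.append([sum(w * v[c] for w, v in zip(weights, V))
                         for c in range(len(V[0]))])
        all_weights.append(weights)
    return contexts, all_weights

# Three word embeddings of size 4, projected down to size 2 (hypothetical values).
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
W_q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
contexts, weights = self_attention(X, W_q, W_k, W_v)
```

Queries, keys and values all come from the same sequence `X`, which is what distinguishes self-attention from encoder-decoder attention, where the queries come from the decoder.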

What’s the difference between this self-attention mechanism and the standard attention?

The difference between intra- and inter-attention is described in [6].

How are the weights matrices $W^Q$, $W^K$ and $W^V$ learned?

They are learned during training, together with the other parameters of the network.

Multi-head attention

Using multi-head attention, as described in [5], we do not compute a single self-attention function. Instead, $h$ different self-attention “heads” are computed, concatenated together, and then projected again using another weight matrix $W^O$. The attention heads are computed in parallel.
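In the notation of [5], this can be written as follows, where each head $i$ has its own projection matrices $W_i^Q$, $W_i^K$ and $W_i^V$, and $\text{Attention}$ is the scaled dot-product attention function:

$$ \begin{align} \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, …, \text{head}_h) \, W^O \\ \text{head}_i &= \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \end{align} $$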

The Transformer

Like classic recurrent (RNN) encoder-decoder networks, the Transformer is made of an encoder connected to a decoder. In “Attention Is All You Need”, 6 encoder layers are stacked together, followed by 6 decoder layers.

Three types of multi-head attention are used in the Transformer: self-attention in the encoder, self-attention in the decoder, and encoder-decoder attention.

With this layout, an encoder layer is made of a self-attention component followed by a feed-forward NN. A decoder layer contains a self-attention component, a classical encoder-decoder inter-attention component and then another feed-forward NN.
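The order of the components can be sketched as a purely structural example. Everything here is a simplification: the sublayers are stubs that only record when they are called, the same stubs are shared by all layers (a real model has separate parameters per layer), and details of [5] such as residual connections, layer normalisation, masking and positional encodings are left out. All function names are hypothetical.

```python
def encoder_layer(x, self_attention, feed_forward):
    """Self-attention followed by a feed-forward network."""
    return feed_forward(self_attention(x, x))

def decoder_layer(y, memory, self_attention, cross_attention, feed_forward):
    """Self-attention, then encoder-decoder attention, then feed-forward."""
    y = self_attention(y, y)
    y = cross_attention(y, memory)  # queries from the decoder, keys/values from the encoder
    return feed_forward(y)

def transformer(source, target, n_layers, self_attention, cross_attention, feed_forward):
    """Run a stack of encoder layers, then a stack of decoder layers."""
    memory = source
    for _ in range(n_layers):
        memory = encoder_layer(memory, self_attention, feed_forward)
    out = target
    for _ in range(n_layers):
        out = decoder_layer(out, memory, self_attention, cross_attention, feed_forward)
    return out

# Stub sublayers: they return their input unchanged and log their name.
calls = []

def stub(name):
    def sublayer(x, context=None):
        calls.append(name)
        return x
    return sublayer

result = transformer("source", "target", n_layers=2,
                     self_attention=stub("self"),
                     cross_attention=stub("cross"),
                     feed_forward=stub("ffn"))
```

After the call, `calls` lists the sublayers in execution order: two encoder layers (self-attention, feed-forward), then two decoder layers (self-attention, encoder-decoder attention, feed-forward). Setting `n_layers=6` gives the depth used in [5].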


  1. Neural Machine Translation by Jointly Learning to Align and Translate
  2. Effective Approaches to Attention-based Neural Machine Translation
  3. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
  4. Hierarchical Attention Networks for Document Classification
  5. Attention Is All You Need
  6. Long Short-Term Memory-Networks for Machine Reading
  7. The Illustrated Transformer
  8. Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
  9. Attention? Attention!
  10. Attention and its Different Forms
  11. [Reddit] What is the difference between self attention and attention