Transformer

Attention Is All You Need

Section 1

What It Does

Existing models for tasks like machine translation, which process sequences of words, faced two main challenges: they were either slow because they had to process words one after another, or they struggled to understand how distant words in a long sentence related to each other. This paper introduces the Transformer, a novel model that completely abandons these traditional approaches. Instead, it relies entirely on a mechanism called 'attention,' which allows the model to simultaneously look at all other words in a sentence and decide which ones are most relevant to understanding the current word. The Transformer uses this attention mechanism in both an 'encoder' that processes the input sentence and a 'decoder' that generates the output sentence, and it still captures the order of words by adding special 'positional encodings.'

This design allows the model to process all words in a sentence simultaneously, dramatically speeding up training compared to older methods. Crucially, it also makes it much easier for the model to identify connections between words, no matter how far apart they are in a sentence. As a result, the Transformer achieved state-of-the-art performance on complex tasks like machine translation, setting a new standard for how sequence data is processed in AI.

Section 2

The Mechanism

The Transformer model processes sequence-to-sequence tasks using an encoder-decoder architecture, as illustrated in Figure 1. The encoder maps an input sequence into a continuous representation, which the decoder then uses to generate an output sequence one token at a time. The entire process is built upon attention mechanisms, eschewing recurrence and convolutions.

2.1. Input and Positional Encoding

The process begins by converting input tokens into high-dimensional vectors.

  1. Token Embedding: Both the source and target sequences are first passed through separate embedding layers to convert each token ID into a vector of dimension d_model = 512.

  2. Positional Encoding: Since the model contains no recurrent or convolutional layers, it has no inherent sense of token order. To provide this crucial information, positional encodings are added to the input embeddings. These encodings are fixed, non-learned vectors calculated using sine and cosine functions of different frequencies:

    PE<sub>(pos, 2i)</sub> = sin(pos / 10000<sup>2i / d_model</sup>)
    PE<sub>(pos, 2i+1)</sub> = cos(pos / 10000<sup>2i / d_model</sup>)

    Here, pos is the position of the token in the sequence, and i is the dimension within the embedding vector. This sinusoidal approach allows the model to learn relative positional relationships, as the encoding for any position pos + k can be represented as a linear function of the encoding for position pos (Section 3.5).
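
As an illustration, here is a minimal sketch of how these fixed encodings could be pre-computed in PyTorch. The function name and the max_len parameter are illustrative choices, not from the paper (the maximum length is left unspecified; see the Missing Details section):

import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # pos runs over sequence positions, i over embedding dimension pairs (d_model assumed even).
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)      # (max_len, 1)
    div_term = 10000.0 ** (torch.arange(0, d_model, 2).float() / d_model)   # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position / div_term)  # odd dimensions use cosine
    return pe  # added to the token embeddings before the first layer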

2.2. The Encoder Stack

The encoder's role is to build a rich, context-aware representation of the entire input sequence. It consists of a stack of N=6 identical layers. The output of the final encoder layer is a sequence of vectors, one for each input token, which is then passed to every layer in the decoder. Each encoder layer has two sub-layers.

2.2.1. Multi-Head Self-Attention

The first sub-layer allows each token in the input sequence to look at all other tokens in the sequence to better inform its own representation. This is the core mechanism for understanding context. It is built from a fundamental unit called Scaled Dot-Product Attention, depicted in Figure 2 (left).

The attention score is calculated as:

Attention(Q, K, V) = softmax( (QK<sup>T</sup>) / √d<sub>k</sub> ) V --- (Eq. 1)

  • Prerequisite: The input to this layer is a sequence of vectors. For each vector, three new vectors are created through linear projections: a Query (Q), a Key (K), and a Value (V). In self-attention, Q, K, and V are all derived from the same input sequence (the output of the previous layer).
  • Step 1: Score Calculation: The dot product of a query vector Q with each key vector K is computed, producing a raw similarity score for every query-key pair.
  • Step 2: Scaling: This score is scaled down by dividing by √d<sub>k</sub>, where d_k is the dimension of the key vectors (here, d_k=64). This scaling is necessary because for large values of d_k, the dot products can become very large, pushing the softmax function into regions with vanishingly small gradients, which would impede learning (Section 3.2.1).
  • Step 3: Weighting: A softmax function is applied to the scaled scores to obtain attention weights, which are positive and sum to one. These weights determine how much focus to place on each input token when encoding the current token.
  • Step 4: Output: The final output for the query is a weighted sum of all the value vectors V, using the computed attention weights.
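
The four steps above can be written as a compact function; a minimal sketch (the mask handling anticipates the decoder's look-ahead mask described in Section 2.3.1):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # Steps 1-2: dot products, scaled
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # the optional masking step
    weights = F.softmax(scores, dim=-1)                        # Step 3: attention weights
    return weights @ v                                         # Step 4: weighted sum of values (Eq. 1)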

Instead of performing a single attention function, the model employs Multi-Head Attention (Figure 2, right). This involves projecting the Q, K, and V vectors h=8 times with different, learned linear projections. Each of these projected versions of Q, K, and V is fed into a separate attention "head." The outputs of these 8 heads are then concatenated and projected again with another learned linear transformation to produce the final output. This allows the model to jointly attend to information from different representation subspaces at different positions, enriching its ability to capture complex relationships (Section 3.2.2).
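
In tensor terms, the h=8 heads are typically realized by reshaping a single projection rather than by eight separate modules. A sketch of the shape bookkeeping (batch and sequence sizes are arbitrary examples):

import torch

batch, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_model // h                                        # 64 dimensions per head
x = torch.randn(batch, seq_len, d_model)                  # a projected Q, K, or V tensor
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)    # (batch, h, seq_len, d_k)
# ...attention runs independently in each head...
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)  # concatenation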

2.2.2. Position-wise Feed-Forward Network

The second sub-layer is a simple, fully connected feed-forward network (FFN) applied to each position's vector independently. It consists of two linear transformations with a ReLU activation in between:

FFN(x) = max(0, xW<sub>1</sub> + b<sub>1</sub>)W<sub>2</sub> + b<sub>2</sub> --- (Eq. 2)

  • Here, x is the output from the attention sub-layer. W_1 and b_1 are the weight matrix and bias for the first linear transformation (which expands the dimension from d_model=512 to d_ff=2048), and W_2 and b_2 are for the second (which projects it back to d_model=512). This sub-layer provides non-linearity and further transforms the representations.
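
A minimal sketch of this sub-layer (dimensions follow the base model; dropout placement is deliberately left to the implementation map below):

import torch.nn as nn

class FeedForward(nn.Module):
    """Eq. 2: two linear transformations with a ReLU in between, applied per position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)  # expand: 512 -> 2048
        self.w_2 = nn.Linear(d_ff, d_model)  # project back: 2048 -> 512

    def forward(self, x):
        return self.w_2(self.w_1(x).relu())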

2.2.3. Residual Connections and Normalization

Each of the two sub-layers (Multi-Head Attention and FFN) in a layer has a residual connection around it, followed by layer normalization (Section 3.1). The output of each sub-layer is LayerNorm(x + Sublayer(x)). This is a critical component that allows for the training of a deep stack of N layers by preventing gradients from vanishing and stabilizing the learning process.

2.3. The Decoder Stack

The decoder's role is to generate the output sequence one token at a time, using the encoder's output as context. It also consists of a stack of N=6 identical layers. For each step in the output sequence, the decoder takes the previously generated tokens as input.

  • Prerequisite: The decoder input is the target sequence, "shifted right." This means that for predicting the token at position i, the decoder is only given the ground-truth tokens from positions 1 to i-1. This preserves the auto-regressive property, ensuring predictions are based only on past information.

Each decoder layer has three sub-layers.

2.3.1. Masked Multi-Head Self-Attention

This sub-layer is nearly identical to the self-attention mechanism in the encoder. However, it is modified to prevent positions from attending to subsequent positions. This is achieved by applying a "look-ahead mask" inside the Scaled Dot-Product Attention (the "Mask (opt.)" step in Figure 2). Before the softmax step, the mask sets all values corresponding to future positions to negative infinity, effectively zeroing out their attention weights. This is necessary to maintain the auto-regressive property during training.
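
A sketch of how such a mask can be built; the convention (True/1 for visible positions, 0 for masked ones) matches the implementation map below:

import torch

def look_ahead_mask(size: int) -> torch.Tensor:
    # Lower-triangular matrix: position i may attend only to positions 0..i.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# look_ahead_mask(4):
# [[1, 0, 0, 0],
#  [1, 1, 0, 0],
#  [1, 1, 1, 0],
#  [1, 1, 1, 1]]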

2.3.2. Multi-Head Cross-Attention

This second sub-layer is what connects the encoder and decoder. It performs multi-head attention, but its inputs are different:

  • The Queries (Q) come from the output of the previous decoder sub-layer (the masked self-attention).
  • The Keys (K) and Values (V) come from the output of the final layer of the encoder stack.

This mechanism allows every position in the decoder to attend over all positions in the input sequence, enabling it to weigh the importance of different parts of the source sentence when generating the next target token.

2.3.3. Position-wise Feed-Forward Network

This third sub-layer is identical in structure and function to the FFN in the encoder layer.

As in the encoder, each of these three sub-layers is wrapped with a residual connection and layer normalization.

2.4. Final Output Generation

After the final decoder layer produces its output vectors, a final linear transformation projects each vector into a much larger vector whose dimension equals the size of the target vocabulary. A softmax function is then applied to convert these scores (logits) into a probability distribution over the vocabulary. In the simplest (greedy) decoding scheme, the token with the highest probability is chosen as the output for that time step; the paper's reported translation results instead use beam search with a beam size of 4 (see the Training Recipe).
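
A sketch of this final step, using the EN-DE vocabulary size from the training recipe (the random decoder output is a stand-in):

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 37000
decoder_output = torch.randn(1, 5, d_model)    # (batch, tgt_len, d_model), stand-in values
generator = nn.Linear(d_model, vocab_size)     # final linear projection to vocabulary logits
probs = F.softmax(generator(decoder_output), dim=-1)
next_token = probs[:, -1].argmax(dim=-1)       # greedy choice for the newest position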

Section 3

Prerequisites

Here are the foundational concepts required to understand the "Attention Is All You Need" paper, presented in a dependency-ordered list:

1. Recurrent Neural Networks (RNNs)

  1. Problem: How can a neural network process sequential data (like sentences) where the order of elements is crucial and the input length can vary? Traditional feed-forward networks struggle with variable-length inputs and with maintaining context over time.

  2. Solution: RNNs introduce a 'memory' or 'hidden state' that is updated at each step of the sequence. The output at a given step t is a function of the input at step t and the hidden state from step t-1. This recurrent loop allows information to persist through the sequence, capturing temporal dependencies. Variants like LSTMs and GRUs address vanishing/exploding gradients in long sequences.

  3. Usage in this paper: The Transformer is explicitly designed to replace RNNs. The paper argues that the inherently sequential nature of RNNs (where h_t depends on h_{t-1}) is a major bottleneck for parallelization during training and limits their efficiency for very long sequences. The Transformer's attention-only architecture removes these limitations.

2. Encoder-Decoder Architecture

  1. Problem: How can a model handle sequence-to-sequence tasks (like machine translation or text summarization) where the input and output sequences can have different lengths, vocabularies, and grammatical structures?

  2. Solution: This architecture is split into two main parts: an 'encoder' that reads the entire input sequence and transforms it into a rich, fixed-size context representation (or a sequence of context vectors), and a 'decoder' that takes this context and generates the output sequence one element at a time, conditioned on previously generated elements.

  3. Usage in this paper: The Transformer follows this high-level encoder-decoder structure. The left side of Figure 1 in the paper illustrates the encoder stack, which processes the input sentence, and the right side shows the decoder stack, which generates the translated sentence.

3. Attention Mechanisms

  1. Problem: In the basic encoder-decoder architecture (especially with RNNs), the entire meaning of a long input sequence is often compressed into a single, fixed-size context vector by the encoder. This creates an information bottleneck, making it difficult for the decoder to access specific, relevant details from the input when generating later parts of the output sequence.

  2. Solution: Attention allows the decoder, at each step of generating an output, to dynamically look back at all parts of the encoder's output (or its own previous outputs). It computes a set of 'attention weights' to determine which input parts are most relevant for the current output step and creates a weighted average of these parts as a dynamic context vector. This provides a flexible way to access information without a fixed-size bottleneck.

  3. Usage in this paper: Attention is the core building block of the Transformer, replacing recurrence and convolutions entirely. It is used in three distinct ways:

    • Self-attention in the encoder: Input tokens attend to other input tokens to build richer representations.
    • Masked self-attention in the decoder: Output tokens attend to previous output tokens to maintain the auto-regressive property during generation.
    • Cross-attention between encoder and decoder: The decoder attends to the encoder's output, mimicking traditional attention mechanisms to align input and output.

4. Residual Connections (Skip Connections)

  1. Problem: As neural networks get deeper (i.e., have many layers), they become very difficult to train effectively. A common issue is the 'vanishing gradient' problem, where gradients shrink exponentially as they are backpropagated through many layers, preventing weights in early layers from updating and learning.

  2. Solution: Residual (or 'skip') connections add the input of a layer (or a block of layers) directly to its output. Mathematically, if F(x) is the output of a layer, the residual connection makes the output x + F(x). This creates a direct path for the gradient to flow through the network, mitigating the vanishing gradient problem and allowing for the training of much deeper models without significant performance degradation.

  3. Usage in this paper: Residual connections are employed extensively throughout the Transformer. They are used around every sub-layer: the two sub-layers in each encoder layer and the three sub-layers in each decoder layer. The paper specifies the operation as LayerNorm(x + Sublayer(x)).

5. Layer Normalization

  1. Problem: During training, the distribution of each layer's inputs changes as the parameters of the previous layers change. This phenomenon, called 'internal covariate shift', can slow down training, make it unstable, and require careful initialization and lower learning rates.

  2. Solution: Layer Normalization stabilizes the distributions by normalizing the inputs to a layer. For each training example and for each layer, it computes the mean and variance across all features (or hidden units) for that single example and uses them to rescale the inputs. This helps to speed up and stabilize training, making deep networks less sensitive to initialization and more robust.

  3. Usage in this paper: Layer Normalization is applied after each residual connection in both the encoder and decoder layers. It is a critical component for stabilizing the training of the deep Transformer architecture, especially given its post-normalization placement (LayerNorm(x + Sublayer(x))).

6. Label Smoothing

  1. Problem: When training a classification model with a softmax output and one-hot labels (e.g., [0, 1, 0]), the model is encouraged to make its predictions extremely confident, pushing one logit to a very high positive value and the others to very low negative values. This can lead to over-fitting, poor calibration (overestimating probabilities), and reduced generalization, especially if the training data contains noise or mislabeled examples.

  2. Solution: Label smoothing replaces the hard 0 and 1 targets with soft targets. For example, a target of [0, 1, 0] might become [0.05, 0.9, 0.05], where 0.05 is epsilon_ls / (num_classes - 1) and 0.9 is 1 - epsilon_ls. This discourages the model from becoming overconfident, forces it to learn a more robust internal representation, and improves generalization.

  3. Usage in this paper: Label smoothing with a value of epsilon_ls = 0.1 is used as a regularization technique during training of the Transformer. The paper notes that while this technique slightly increases perplexity (a measure of how well the model predicts a sample), it consistently improves accuracy and the BLEU score (a common metric for machine translation quality).
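
A minimal sketch of the smoothing described above; the helper name is illustrative, and it follows this section's convention of spreading epsilon_ls evenly over the other classes:

import torch

def smooth_labels(targets: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    # targets: (batch,) integer class indices -> (batch, num_classes) soft targets.
    one_hot = torch.zeros(targets.size(0), num_classes)
    one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
    return one_hot * (1.0 - eps) + (eps / (num_classes - 1)) * (1.0 - one_hot)

print(smooth_labels(torch.tensor([1]), num_classes=3))  # tensor([[0.0500, 0.9000, 0.0500]])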

Section 4

Implementation Map

Implementation-Oriented Walkthrough

The generated implementation map below is rendered exactly as code, preserving assumptions and provenance markers.

# Component: Add & Norm
# Provenance: paper-stated
# Assumption: The `eps` parameter for `nn.LayerNorm` is assumed to be `1e-6`, which is a common default for numerical stability in PyTorch's `LayerNorm` implementation.
# Assumption: The `d_model` parameter, representing the dimensionality of the model (and thus the feature dimension for LayerNorm), is inferred as a necessary input for `nn.LayerNorm` to specify the `normalized_shape` in the Transformer architecture.
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """
    Implements the Add & Norm component as described in the paper.
    This corresponds to the post-norm architecture: LayerNorm(x + Sublayer(x)).
    """
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        # INFERRED: d_model is the dimensionality of the model, required for LayerNorm.
        #           It represents the feature dimension of the input to LayerNorm.
        # ASSUMED: eps is a small value added to the variance for numerical stability in LayerNorm.
        #          A common default is 1e-6 or 1e-5 in PyTorch's LayerNorm.
        self.norm = nn.LayerNorm(d_model, eps=eps) # Eq. (N) - Layer Normalization

    def forward(self, x: torch.Tensor, sublayer_output: torch.Tensor) -> torch.Tensor:
        # Eq. (N) - Residual connection: x + Sublayer(x)
        # The paper describes "Add & Norm" as applying LayerNorm after the residual connection.
        # The ambiguity resolution 'layernorm_placement' confirms this exact order: `LayerNorm(x + Sublayer(x))`
        return self.norm(x + sublayer_output)

# Component: Decoder Layer
# Provenance: paper-stated
# Assumption: d_k = d_v = d_model / num_heads, as is standard for Transformer attention.
# Assumption: Bias terms are initialized to zero, which is a common practice.
# Assumption: Dropout is applied to the attention weights before multiplying with V inside MultiHeadAttention.
# Assumption: Mask values are 0 for masked positions, 1 for unmasked positions.
# Assumption: d_model must be divisible by num_heads for d_k to be an integer.
# Assumption: bias=True for all linear layers based on ambiguity resolution 'bias_in_linear_layers'.
# Assumption: Using nn.init.xavier_uniform_ as it's a standard choice for Xavier initialization. The ambiguity resolution 'weight_initialization' mentions "variance-scaling initializer similar to Xavier uniform, scaling by (d_in + d_out) / 2". nn.init.xavier_uniform_ scales by sqrt(6 / (fan_in + fan_out)), which is a common form of Xavier.
# Assumption: Using a large negative number (-1e9) to effectively zero out masked attention scores after softmax for numerical stability.
# Assumption: Using nn.init.kaiming_uniform_ as it's a standard choice for Kaiming initialization. The ambiguity resolution 'weight_initialization' states "Kaiming (He) initialization" for FFN. For the second layer of FFN, while it doesn't have a ReLU *after* it, it's common practice to use Kaiming for both layers in the FFN block if the first layer uses ReLU.
# Assumption: Dropout is applied to the final output of the FFN sub-layer, as per ambiguity resolution 'dropout_placement_ffn'.
# Assumption: LayerNorm placement is 'post-norm' as per ambiguity resolution 'layernorm_placement': LayerNorm(x + Sublayer(x)).
# Assumption: Dropout for the residual connection is applied after the sub-layer output, before adding to the input. This is distinct from internal dropouts within MultiHeadAttention or FFN.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Helper module: MultiHeadAttention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout_rate: float):
        super().__init__()
        # ASSUMED: d_k = d_v = d_model / num_heads, as is standard for Transformer attention.
        # INFERRED: d_model must be divisible by num_heads for d_k to be an integer.
        if d_model % num_heads != 0:
            raise ValueError(f"d_model ({d_model}) must be divisible by num_heads ({num_heads})")
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.d_model = d_model

        # Linear projections for Q, K, V
        # INFERRED: bias=True for all linear layers based on ambiguity resolution 'bias_in_linear_layers'.
        self.q_linear = nn.Linear(d_model, d_model, bias=True)
        self.k_linear = nn.Linear(d_model, d_model, bias=True)
        self.v_linear = nn.Linear(d_model, d_model, bias=True)

        # Output linear projection
        # INFERRED: bias=True for all linear layers based on ambiguity resolution 'bias_in_linear_layers'.
        self.out_linear = nn.Linear(d_model, d_model, bias=True)

        # ASSUMED: Dropout is applied to the attention weights before multiplying with V.
        self.dropout = nn.Dropout(dropout_rate)

        self._reset_parameters() # For weight initialization

    def _reset_parameters(self):
        # Weight initialization: Xavier (Glorot) for attention projections.
        # INFERRED: Using nn.init.xavier_uniform_ as it's a standard choice for Xavier initialization.
        # The ambiguity resolution 'weight_initialization' mentions "variance-scaling initializer similar to Xavier uniform, scaling by (d_in + d_out) / 2".
        # nn.init.xavier_uniform_ scales by sqrt(6 / (fan_in + fan_out)), which is a common form of Xavier.
        nn.init.xavier_uniform_(self.q_linear.weight)
        nn.init.xavier_uniform_(self.k_linear.weight)
        nn.init.xavier_uniform_(self.v_linear.weight)
        nn.init.xavier_uniform_(self.out_linear.weight)

        # ASSUMED: Bias terms are initialized to zero, which is a common practice.
        if self.q_linear.bias is not None:
            nn.init.constant_(self.q_linear.bias, 0.)
        if self.k_linear.bias is not None:
            nn.init.constant_(self.k_linear.bias, 0.)
        if self.v_linear.bias is not None:
            nn.init.constant_(self.v_linear.bias, 0.)
        if self.out_linear.bias is not None:
            nn.init.constant_(self.out_linear.bias, 0.)

    def forward(self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, mask: torch.Tensor = None):
        batch_size = query.size(0)

        # 1) Linear projections and split into heads
        # Eq. (1) (implicitly, as part of h_i = Attention(QW_Q, KW_K, VW_V))
        q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.k_linear(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.v_linear(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # 2) Scaled Dot-Product Attention
        # Eq. (1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # Apply mask (for self-attention, it's look-ahead mask; for cross-attention, it's padding mask)
            # ASSUMED: Mask values are 0 for masked positions, 1 for unmasked positions.
            # INFERRED: Using a large negative number (-1e9) to effectively zero out masked attention scores after softmax.
            scores = scores.masked_fill(mask == 0, -1e9)

        attn_weights = F.softmax(scores, dim=-1) # Eq. (1)
        attn_weights = self.dropout(attn_weights) # ASSUMED: Dropout applied to attention weights before multiplying with V

        context = torch.matmul(attn_weights, v) # Eq. (1)

        # 3) Concatenate heads and apply final linear layer
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.out_linear(context) # Eq. (1)

        return output, attn_weights

# Helper module: PositionwiseFeedForward
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout_rate: float):
        super().__init__()
        # INFERRED: bias=True for all linear layers based on ambiguity resolution 'bias_in_linear_layers'.
        self.w_1 = nn.Linear(d_model, d_ff, bias=True)
        self.w_2 = nn.Linear(d_ff, d_model, bias=True)
        # INFERRED: Dropout is applied to the final output of the FFN sub-layer, as per ambiguity resolution 'dropout_placement_ffn'.
        self.dropout = nn.Dropout(dropout_rate)

        self._reset_parameters() # For weight initialization

    def _reset_parameters(self):
        # Weight initialization: Kaiming (He) for FFN with ReLU activations.
        # INFERRED: Using nn.init.kaiming_uniform_ as it's a standard choice for Kaiming initialization.
        # The ambiguity resolution 'weight_initialization' states "Kaiming (He) initialization" for FFN.
        nn.init.kaiming_uniform_(self.w_1.weight, nonlinearity='relu')
        # For the second layer, while it doesn't have a ReLU *after* it, it's common practice
        # to use Kaiming for both layers in the FFN block if the first layer uses ReLU.
        nn.init.kaiming_uniform_(self.w_2.weight, nonlinearity='relu')

        # ASSUMED: Bias terms are initialized to zero, which is a common practice.
        if self.w_1.bias is not None:
            nn.init.constant_(self.w_1.bias, 0.)
        if self.w_2.bias is not None:
            nn.init.constant_(self.w_2.bias, 0.)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (3)
        # FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
        # Dropout applied to the output of FFN before residual connection, as per ambiguity resolution 'dropout_placement_ffn'.
        return self.dropout(self.w_2(F.relu(self.w_1(x))))


class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout_rate: float, eps: float):
        super().__init__()

        # Masked Multi-Head Self-Attention sub-layer
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout_rate)
        # INFERRED: LayerNorm placement is 'post-norm' as per ambiguity resolution 'layernorm_placement': LayerNorm(x + Sublayer(x)).
        self.self_attn_norm = nn.LayerNorm(d_model, eps=eps)
        # INFERRED: Dropout for the residual connection is applied after the sub-layer output, before adding to the input.
        self.self_attn_residual_dropout = nn.Dropout(dropout_rate)

        # Multi-Head Encoder-Decoder Attention sub-layer
        self.cross_attn = MultiHeadAttention(d_model, num_heads, dropout_rate)
        # INFERRED: LayerNorm placement is 'post-norm' as per ambiguity resolution 'layernorm_placement'.
        self.cross_attn_norm = nn.LayerNorm(d_model, eps=eps)
        # INFERRED: Dropout for the residual connection.
        self.cross_attn_residual_dropout = nn.Dropout(dropout_rate)

        # Feed-Forward Network sub-layer
        # The FFN module itself includes the dropout as per ambiguity resolution 'dropout_placement_ffn'.
        self.ffn = PositionwiseFeedForward(d_model, d_ff, dropout_rate)
        # INFERRED: LayerNorm placement is 'post-norm' as per ambiguity resolution 'layernorm_placement'.
        self.ffn_norm = nn.LayerNorm(d_model, eps=eps)
        # No separate residual dropout here, as the FFN module already applies it internally as per 'dropout_placement_ffn'.

    def forward(self,
                x: torch.Tensor,
                encoder_output: torch.Tensor,
                src_mask: torch.Tensor,
                tgt_mask: torch.Tensor) -> torch.Tensor:

        # Masked Multi-Head Self-Attention sub-layer
        # Post-norm architecture: LayerNorm(x + Sublayer(x))
        # Compute the sub-layer output directly on x (no pre-normalization).
        self_attn_output, _ = self.self_attn(x, x, x, tgt_mask) # Q, K, V, mask
        # Apply dropout to the sub-layer output, add the residual, then normalize.
        x = self.self_attn_norm(x + self.self_attn_residual_dropout(self_attn_output)) # Eq. (2)

        # Multi-Head Encoder-Decoder Attention sub-layer
        # Post-norm architecture: LayerNorm(x + Sublayer(x))
        cross_attn_output, _ = self.cross_attn(x, encoder_output, encoder_output, src_mask) # Q, K, V, mask
        x = self.cross_attn_norm(x + self.cross_attn_residual_dropout(cross_attn_output)) # Eq. (2)

        # Feed-Forward Network sub-layer
        # Post-norm architecture: LayerNorm(x + Sublayer(x))
        # The FFN module already applies dropout to its output as per 'dropout_placement_ffn'.
        ffn_output = self.ffn(x)
        x = self.ffn_norm(x + ffn_output) # Eq. (2)

        return x

# Component: Decoder Stack
# Provenance: paper-stated
# Assumption: d_model: TODO: Specify model dimension (e.g., 512).
# Assumption: num_heads: TODO: Specify number of attention heads (e.g., 8).
# Assumption: d_ff: TODO: Specify feed-forward inner dimension (e.g., 2048).
# Assumption: dropout_rate: TODO: Specify dropout rate (e.g., 0.1).
# Assumption: N: TODO: Specify number of decoder layers (e.g., 6).
# Assumption: eps: TODO: Specify epsilon for LayerNorm (e.g., 1e-6).
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Helper: Scaled Dot-Product Attention
class ScaledDotProductAttention(nn.Module):
    """
    Computes scaled dot-product attention.
    """
    def __init__(self, dropout_rate):
        super().__init__()
        self.dropout = nn.Dropout(dropout_rate)
        # INFERRED: Dropout is applied to the attention weights before multiplying with V.
        # This is standard practice in the original Transformer paper.

    def forward(self, query, key, value, mask=None):
        # Eq. (1)
        d_k = query.size(-1)
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k) # Eq. (1)

        if mask is not None:
            # INFERRED: Use a large negative number for masked positions to ensure softmax outputs zero.
            scores = scores.masked_fill(mask == 0, -1e9)

        p_attn = F.softmax(scores, dim=-1) # Eq. (1)
        p_attn = self.dropout(p_attn) # INFERRED: Dropout applied to attention probabilities.

        return torch.matmul(p_attn, value), p_attn

# Helper: Multi-Head Attention
class MultiHeadAttention(nn.Module):
    """
    Implements Multi-Head Attention as described in the paper.
    """
    def __init__(self, d_model, num_heads, dropout_rate):
        super().__init__()
        # ASSUMED: d_model, num_heads, dropout_rate are provided as hyperparameters.
        self.d_model = d_model
        self.num_heads = num_heads
        # INFERRED: d_k (dimension per head) is d_model // num_heads.
        # This must be an integer.
        # Eq. (2)
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads

        # Linear projections for Q, K, V, and output.
        # Eq. (3)
        # bias_in_linear_layers: Include bias terms for all linear transformations.
        self.linears = nn.ModuleList([nn.Linear(d_model, d_model, bias=True) for _ in range(4)])
        # INFERRED: The first three linears are for W_Q, W_K, W_V, and the last one is for W_O.

        self.attention = ScaledDotProductAttention(dropout_rate)
        # INFERRED: Output dropout for the MultiHeadAttention sub-layer is handled by SublayerConnection.

        self._reset_parameters()

    def _reset_parameters(self):
        # weight_initialization: Xavier (Glorot) initialization for attention projections.
        # The tensor2tensor library used a variance-scaling initializer similar to Xavier uniform.
        for i in range(4):
            nn.init.xavier_uniform_(self.linears[i].weight)
            if self.linears[i].bias is not None:
                nn.init.constant_(self.linears[i].bias, 0.) # INFERRED: Bias initialized to zero.

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # 1) Do all the linear projections in batch from d_model => num_heads x d_k
        # Eq. (3)
        query, key, value = [
            l(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            for l, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = self.attention(query, key, value, mask=mask) # Eq. (1)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model) # Eq. (4)
        return self.linears[-1](x) # Eq. (4)

# Helper: Position-wise Feed-Forward Networks
class PositionwiseFeedForward(nn.Module):
    """
    Implements the position-wise feed-forward network.
    """
    def __init__(self, d_model, d_ff, dropout_rate):
        super().__init__()
        # ASSUMED: d_model, d_ff, dropout_rate are provided as hyperparameters.
        # Eq. (5)
        # bias_in_linear_layers: Include bias terms for all linear transformations.
        self.w_1 = nn.Linear(d_model, d_ff, bias=True)
        self.w_2 = nn.Linear(d_ff, d_model, bias=True)
        # dropout_placement_ffn: Apply dropout only to the final output of the FFN sub-layer,
        # i.e., `Dropout(FFN(x))`, before the residual connection. This means the dropout
        # is handled by the SublayerConnection, not internally within this module.

        self._reset_parameters()

    def _reset_parameters(self):
        # weight_initialization: Kaiming (He) initialization for models with ReLU activations (like the FFN).
        nn.init.kaiming_uniform_(self.w_1.weight, nonlinearity='relu')
        if self.w_1.bias is not None:
            nn.init.constant_(self.w_1.bias, 0.) # INFERRED: Bias initialized to zero.

        # weight_initialization: For the second linear layer, Xavier (Glorot) is a standard choice.
        nn.init.xavier_uniform_(self.w_2.weight)
        if self.w_2.bias is not None:
            nn.init.constant_(self.w_2.bias, 0.) # INFERRED: Bias initialized to zero.

    def forward(self, x):
        # Eq. (5)
        # INFERRED: ReLU activation is used as per the paper.
        return self.w_2(F.relu(self.w_1(x)))

# Helper: SublayerConnection (Add & Norm)
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer normalization.
    Implements the post-norm architecture: LayerNorm(x + Dropout(Sublayer(x))).
    """
    def __init__(self, d_model, dropout_rate, eps=1e-6):
        super().__init__()
        # ASSUMED: d_model, dropout_rate, eps are provided as hyperparameters.
        self.norm = nn.LayerNorm(d_model, eps=eps)
        self.dropout = nn.Dropout(dropout_rate)
        # layernorm_placement: Implement the post-norm architecture: LayerNorm(x + Sublayer(x)).
        # This means LayerNorm is applied *after* the residual connection and dropout.

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        # layernorm_placement: Post-norm architecture: LayerNorm(x + Sublayer(x))
        # The dropout is applied to the sublayer output before adding to the residual.
        return self.norm(x + self.dropout(sublayer(x)))

# Decoder Layer
class DecoderLayer(nn.Module):
    """
    One layer of the decoder.
    Consists of masked multi-head self-attention, encoder-decoder attention, and a feed-forward network.
    Each sub-layer is followed by a residual connection and layer normalization.
    """
    def __init__(self, d_model, num_heads, d_ff, dropout_rate, eps=1e-6):
        super().__init__()
        # ASSUMED: d_model, num_heads, d_ff, dropout_rate, eps are provided as hyperparameters.
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout_rate)
        self.src_attn = MultiHeadAttention(d_model, num_heads, dropout_rate) # Encoder-Decoder Attention
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout_rate)
        self.sublayer = nn.ModuleList([SublayerConnection(d_model, dropout_rate, eps) for _ in range(3)])
        # INFERRED: Three sublayers in a decoder layer as per the paper's architecture.

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follows Figure 1 (right) for connections."
        # Masked Multi-Head Self-Attention
        # Eq. (1)
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))

        # Multi-Head Encoder-Decoder Attention
        # Eq. (1)
        # INFERRED: Query comes from the decoder, Key/Value come from the encoder output (memory).
        x = self.sublayer[1](x, lambda x: self.src_attn(x, memory, memory, src_mask))

        # Position-wise Feed-Forward Network
        # Eq. (5)
        x = self.sublayer[2](x, self.feed_forward)
        return x

# Decoder Stack
class DecoderStack(nn.Module):
    """
    A stack of N identical decoder layers.
    """
    def __init__(self, N, d_model, num_heads, d_ff, dropout_rate, eps=1e-6):
        super().__init__()
        # ASSUMED: N (number of layers), d_model, num_heads, d_ff, dropout_rate, eps are provided as hyperparameters.
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout_rate, eps)
            for _ in range(N)
        ])
        # INFERRED: A final LayerNorm is often applied after the last decoder layer
        # before the final linear projection. Under the post-norm architecture each
        # layer already ends in LayerNorm, so this extra norm is largely redundant;
        # it is essential only in pre-norm variants.
        self.norm = nn.LayerNorm(d_model, eps=eps)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

Section 5

Missing Details

weight_initialization: Weight Initialization Strategy

  • Type: missing_training_detail
  • Section: 3. Model Architecture
  • Ambiguous point: The paper does not specify how the weights of the various linear layers (in multi-head attention and feed-forward networks) and embedding layers are initialized.
  • Implementation consequence: Improper weight initialization can lead to training instability, such as exploding or vanishing gradients, or slow convergence. Different initialization schemes (e.g., Xavier/Glorot, Kaiming/He) can significantly impact the final model performance. Without this detail, reproducing the results is difficult.
  • Agent resolution: A common and effective strategy for models with ReLU activations (like the FFN) is Kaiming (He) initialization. For other layers, Xavier (Glorot) initialization is a standard choice. The tensor2tensor library, mentioned in the paper, used a variance-scaling initializer similar to Xavier uniform, scaling by (d_in + d_out) / 2.
  • Confidence: 0.5

layernorm_placement: Layer Normalization Placement (Pre-Norm vs. Post-Norm)

  • Type: underspecified_architecture
  • Section: 3.1 Encoder and Decoder Stacks
  • Ambiguous point: The paper states the output of each sub-layer is LayerNorm(x + Sublayer(x)). This is known as 'post-norm'.
  • Implementation consequence: Post-norm architectures, as described, can be difficult to train without a careful learning rate warmup, as the gradients can vanish or explode at the beginning of training for deep stacks. Later research has shown that 'pre-norm' (x + Sublayer(LayerNorm(x))) leads to more stable training and often removes the need for a slow learning rate warmup. Implementing post-norm exactly as described might make it harder to train the model from scratch.
  • Agent resolution: Implement the post-norm architecture as described, LayerNorm(x + Sublayer(x)), and ensure the learning rate schedule with warmup (warmup_steps = 4000) is also implemented exactly, as it is critical for stabilizing the training of this architecture.
  • Confidence: 0.5
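
Because this resolution leans on the warmup schedule, a sketch of it may be useful. The formula below is the one given in Section 5.3 of the paper, paired with the Adam settings from the hyperparameter registry:

def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    # linear warmup for the first warmup_steps updates, then inverse-square-root decay.
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Used with Adam(beta1=0.9, beta2=0.98, eps=1e-9), updating the rate every optimizer step.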

bias_in_linear_layers: Use of Bias in Linear Layers

  • Type: underspecified_architecture
  • Section: 3.2.2 Multi-Head Attention & 3.3 Position-wise Feed-Forward Networks
  • Ambiguous point: The paper's formula for the Position-wise Feed-Forward Network, max(0, xW1 + b1)W2 + b2, explicitly includes bias terms (b1, b2). However, the description of Multi-Head Attention does not mention if the projection matrices W_Q, W_K, W_V, and W_O have corresponding bias terms.
  • Implementation consequence: If biases are incorrectly added or omitted from the attention projection layers, the parameter count of the model will be different, and the representational capacity of the attention heads could be affected. This could lead to a failure to replicate the reported performance.
  • Agent resolution: The standard implementation, and the one used in the official tensor2tensor library, is to include bias terms for all linear transformations, including the attention projections (W_Q, W_K, W_V, W_O) and the feed-forward layers.
  • Confidence: 0.5

positional_encoding_max_length: Maximum Sequence Length for Positional Encoding

  • Type: missing_hyperparameter
  • Section: 3.5 Positional Encoding
  • Ambiguous point: The paper describes a sinusoidal formula for positional encodings which can theoretically handle any sequence length. However, in practice, these are typically pre-computed into a fixed-size matrix for efficiency. The maximum length of this matrix is not specified.
  • Implementation consequence: If the pre-computed matrix is too small, the model will fail at inference time if given a sequence longer than the maximum length it was trained on. If it's unnecessarily large, it will consume excess memory. The choice of max length affects the model's ability to generalize to longer sequences, which the paper claims is a benefit of the sinusoidal method.
  • Agent resolution: Choose a maximum sequence length that is larger than any sequence encountered during training and typical for the task. A common choice is 512 or 1024 for machine translation, or 2048 for longer-form text tasks. The original implementation often used a default of 512 or set it based on the longest sequence in the training data.
  • Confidence: 0.5

dropout_placement_ffn: Dropout Placement within FFN

  • Type: underspecified_architecture
  • Section: 5.4 Regularization
  • Ambiguous point: The paper states dropout is applied 'to the output of each sub-layer'. For the FFN sub-layer, this means after the second linear transformation. It is not specified if dropout is also applied inside the FFN, for example, after the ReLU activation.
  • Implementation consequence: Adding an extra dropout layer inside the FFN would change the regularization scheme and could affect model performance and convergence. Many popular implementations (e.g., in PyTorch's nn.Transformer) do add dropout after the activation within the FFN, which would be a deviation from the paper's description.
  • Agent resolution: Follow the paper's description strictly: apply dropout only to the final output of the FFN sub-layer, i.e., Dropout(FFN(x)), before the residual connection. Do not add a separate dropout layer inside the FFN.
  • Confidence: 0.5

embedding_weight_sharing: Scope of Embedding Weight Sharing

  • Type: underspecified_architecture
  • Section: 3.4 Embeddings and Softmax
  • Ambiguous point: The paper states: 'we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation'. This could mean all three (input embedding, output embedding, pre-softmax linear) share one matrix, or that the output embedding and pre-softmax linear share one, and the input embedding is separate.
  • Implementation consequence: If all three matrices are shared, the model's parameter count is reduced, but it forces the input and output token representations into the same space, which might not be optimal. The more common practice is to only tie the weights of the output embedding and the pre-softmax linear layer, as they both map from the model's hidden dimension to vocabulary logits/embeddings.
  • Agent resolution: Implement weight sharing between the output embedding layer and the pre-softmax linear transformation. The input embedding layer should have its own separate weight matrix. This is the most common interpretation and implementation of this technique.
  • Confidence: 0.5
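
A sketch of this resolution in PyTorch. Tying is done only on the output side; dropping the bias on the tied projection is an extra assumption here, since an Embedding has no bias to share:

import torch.nn as nn

d_model, vocab_size = 512, 37000
src_embed = nn.Embedding(vocab_size, d_model)           # separate input embedding
tgt_embed = nn.Embedding(vocab_size, d_model)           # output (target) embedding
generator = nn.Linear(d_model, vocab_size, bias=False)  # pre-softmax linear transformation
generator.weight = tgt_embed.weight                     # tie the two weight matrices
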
Section 6

Training Recipe

Hyperparameter Registry

| name | value | source | status | suggested_default |
| --- | --- | --- | --- | --- |
| N (Encoder/Decoder Layers) | 6 | 3.1 Encoder and Decoder Stacks | paper-stated | |
| d_model (Model Dimension) | 512 | 3.1 Encoder and Decoder Stacks | paper-stated | |
| h (Number of Attention Heads) | 8 | 3.2.2 Multi-Head Attention | paper-stated | |
| d_k (Key Dimension) | 64 | 3.2.2 Multi-Head Attention | paper-stated | |
| d_v (Value Dimension) | 64 | 3.2.2 Multi-Head Attention | paper-stated | |
| d_ff (Feed-Forward Inner Dimension) | 2048 | 3.3 Position-wise Feed-Forward Networks | paper-stated | |
| P_drop (Dropout Rate, base model) | 0.1 | 5.4 Regularization | paper-stated | |
| epsilon_ls (Label Smoothing) | 0.1 | 5.4 Regularization | paper-stated | |
| Optimizer | Adam | 5.3 Optimizer | paper-stated | |
| Adam beta1 | 0.9 | 5.3 Optimizer | paper-stated | |
| Adam beta2 | 0.98 | 5.3 Optimizer | paper-stated | |
| Adam epsilon | 1e-9 | 5.3 Optimizer | paper-stated | |
| warmup_steps (Learning Rate Schedule) | 4000 | 5.3 Optimizer | paper-stated | |
| EN-DE Vocabulary Size | 37000 | 5.1 Training Data and Batching | paper-stated | |
| EN-FR Vocabulary Size | 32000 | 5.1 Training Data and Batching | paper-stated | |
| Batch Size (Tokens per batch) | 25000 source tokens and 25000 target tokens | 5.1 Training Data and Batching | paper-stated | |
| Training Steps (base model) | 100000 | 5.2 Hardware and Schedule | paper-stated | |
| Training Steps (big model) | 300000 | 5.2 Hardware and Schedule | paper-stated | |
| Hardware | 8 NVIDIA P100 GPUs | 5.2 Hardware and Schedule | paper-stated | |
| Beam Size (Inference) | 4 | 6.1 Machine Translation | paper-stated | |
| Length Penalty alpha (Inference) | 0.6 | 6.1 Machine Translation | paper-stated | |
| Positional Encoding Wavelength Max | 10000 | 3.5 Positional Encoding | paper-stated | |
| N (big model) | 6 | Table 3: Variations on the Transformer architecture | paper-stated | |
| d_model (big model) | 1024 | Table 3: Variations on the Transformer architecture | paper-stated | |
| d_ff (big model) | 4096 | Table 3: Variations on the Transformer architecture | paper-stated | |
| h (big model) | 16 | Table 3: Variations on the Transformer architecture | paper-stated | |
| d_k (big model) | — | Table 3: Variations on the Transformer architecture | inferred | 64 |
| d_v (big model) | — | Table 3: Variations on the Transformer architecture | inferred | 64 |
| P_drop (big model, EN-DE) | 0.3 | Table 3: Variations on the Transformer architecture | paper-stated | |
| P_drop (big model, EN-FR) | 0.1 | 6.1 Machine Translation | paper-stated | |
| N (Constituency Parsing) | 4 | 6.3 English Constituency Parsing | paper-stated | |
| d_model (Constituency Parsing) | 1024 | 6.3 English Constituency Parsing | paper-stated | |
| WSJ-only Vocabulary Size (Parsing) | 16000 | 6.3 English Constituency Parsing | paper-stated | |
| Semi-supervised Vocabulary Size (Parsing) | 32000 | 6.3 English Constituency Parsing | paper-stated | |
| Beam Size (Parsing) | 21 | 6.3 English Constituency Parsing | paper-stated | |
| Length Penalty alpha (Parsing) | 0.3 | 6.3 English Constituency Parsing | paper-stated | |
| Max Output Length (Inference) | input length + 50 | 6.1 Machine Translation | paper-stated | |
| Max Output Length (Parsing) | input length + 300 | 6.3 English Constituency Parsing | paper-stated | |
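
For convenience, the base-model rows above can be collected into one config object; a sketch with illustrative names:

from dataclasses import dataclass

@dataclass
class TransformerBaseConfig:
    # Base-model values from the hyperparameter registry above.
    N: int = 6
    d_model: int = 512
    h: int = 8
    d_k: int = 64
    d_v: int = 64
    d_ff: int = 2048
    p_drop: float = 0.1
    epsilon_ls: float = 0.1
    warmup_steps: int = 4000
    beam_size: int = 4
    length_penalty_alpha: float = 0.6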

Sample Artifacts