What It Does
Previously, computer models for language understanding could only process text in one direction, such as reading a sentence left to right. As a result, they often failed to capture full context, especially for tasks like question answering, where understanding the whole sentence and its relationship to others is key. This paper introduces BERT, a method for training language models to read text bidirectionally, considering the words to both the left and right of each position simultaneously. BERT learns this deep understanding through two training exercises: it predicts randomly hidden words within a sentence using all the surrounding words, and it judges whether one sentence logically follows another. This bidirectional training gives BERT a much richer, more complete picture of language, so it can be easily adapted to a wide variety of tasks, such as answering questions or classifying text, with significantly better performance than previous methods.
The Mechanism
The BERT framework operates in two distinct phases: pre-training and fine-tuning. The core mechanism involves preparing a specialized input representation, processing it through a deep bidirectional Transformer encoder, and then using the output for either general language understanding tasks (pre-training) or specific downstream tasks (fine-tuning).
Step 1: Input Representation
To handle a variety of downstream tasks, BERT requires a unified and rich input format that can represent either a single sentence or a pair of sentences (e.g., Question-Paragraph pairs). This is achieved by constructing an input embedding for each token from the sum of three distinct embeddings, a process visualized for fine-tuning tasks in Figure 4.
- Tokenization: The input text is first tokenized using a WordPiece tokenizer (Section 3), which breaks words into common sub-word units. This helps manage vocabulary size and handle out-of-vocabulary words. Two special tokens are added:
  - [CLS]: A special classification token inserted at the beginning of every sequence. Its final hidden state is used as the aggregate sequence representation for classification tasks.
  - [SEP]: A separator token used to distinguish between different sentences, such as separating a question from a paragraph in SQuAD.
- Embedding Summation: Each token in the input sequence is converted into a vector by summing three learned embeddings:
  - Token Embeddings: These represent the meaning of the token itself, mapping each token in the 30,000-word vocabulary to a vector of hidden size H (e.g., 768 for BERT<sub>BASE</sub>).
  - Segment Embeddings: These distinguish between the sentences in a pair. A learned embedding for "Sentence A" is added to every token in the first sentence, and a learned embedding for "Sentence B" is added to every token in the second. This is crucial for the Next Sentence Prediction pre-training task and for sentence-pair fine-tuning tasks like Question Answering, as shown in Figure 4 (a, c).
  - Position Embeddings: The core self-attention mechanism of the Transformer is permutation-invariant, meaning it has no inherent sense of token order. To counteract this, learned position embeddings are added to each token to encode its position in the sequence. This is a key prerequisite for the model to understand sentence structure (A01).

The resulting summed vector for each token serves as the input to the main model.
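As a minimal sketch (token IDs and sizes are illustrative, matching the BERT-base values in the Hyperparameter Registry), the three-way summation looks like this:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len, num_segments = 30000, 768, 512, 2
token_emb = nn.Embedding(vocab_size, hidden_size)
segment_emb = nn.Embedding(num_segments, hidden_size)
position_emb = nn.Embedding(max_len, hidden_size)

input_ids = torch.tensor([[5, 2023, 2003, 6]])    # toy IDs: [CLS] this is [SEP]
segment_ids = torch.zeros_like(input_ids)         # every token is in sentence A
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)  # torch.Size([1, 4, 768]) -- one H-dimensional vector per token
```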
Step 2: Bidirectional Processing via Transformer Encoder
The sequence of input embeddings is fed into the core of the BERT model: a multi-layer bidirectional Transformer encoder. The architecture consists of a stack of identical layers (L=12 for BERT<sub>BASE</sub>, L=24 for BERT<sub>LARGE</sub>) (Section 3).
Each layer transforms the sequence of vectors using two main sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. The key innovation of BERT lies in its application of the self-attention mechanism.
- Prerequisite: Self-Attention: Self-attention allows the model to compute a token's representation by weighing the influence of all other tokens in the sequence.
- BERT's Bidirectionality: Unlike unidirectional models such as GPT, which mask future tokens so that attention flows only left to right, BERT's self-attention allows every token to attend to every other token in the sequence, to its left and right, in every layer. This is why BERT is described as "deeply bidirectional." This architectural choice is necessary to build a comprehensive, context-aware representation of each token, which is critical for tasks that require a holistic understanding of the entire input.
After passing through all L layers, the encoder outputs a sequence of final hidden states, T_i ∈ R^H, for each input token i. The final hidden state corresponding to the [CLS] token is denoted as C ∈ R^H. These output vectors are then used for the pre-training tasks.
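To make the contrast concrete, here is a minimal sketch (not the paper's code) comparing BERT's unrestricted attention pattern with a GPT-style causal mask:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)   # raw query-key scores for one head

# BERT: no structural mask -- every token attends to all positions.
bert_probs = torch.softmax(scores, dim=-1)

# GPT-style unidirectional attention: token i attends only to positions <= i.
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
gpt_probs = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)

print(bert_probs[2])  # nonzero weight on all 5 positions (left and right)
print(gpt_probs[2])   # zero weight on positions 3 and 4 (the future)
```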
Step 3: Unsupervised Pre-training
To make the bidirectional encoder learn meaningful language representations, it is pre-trained on a large unlabeled corpus using two novel unsupervised tasks trained jointly (Section 3.1). The total training loss is the unweighted sum of the losses from these two tasks (A07).
- Task #1: Masked Language Model (MLM)
  - Motivation: A standard language-model objective (predicting the next word) is inherently unidirectional. To train a bidirectional model, a different objective is needed.
  - Mechanism: 15% of the input tokens are randomly selected for prediction. Of these selected tokens:
    - 80% are replaced with a special [MASK] token.
    - 10% are replaced with a random token from the vocabulary.
    - 10% are left unchanged.
    This 80/10/10 strategy mitigates the mismatch between pre-training, which sees [MASK] tokens, and fine-tuning, which does not. The model's objective is to predict the original token from its final hidden state T_i, which is conditioned on context from both directions. An MLM Head (a simple classification layer over the vocabulary) is placed on top of the Transformer's output to compute this prediction. See the combined sketch after this list.
- Task #2: Next Sentence Prediction (NSP)
  - Motivation: Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), require an understanding of the relationships between sentences. This is not directly captured by language modeling alone.
  - Mechanism: The model is presented with two sentences, A and B. For 50% of the training examples, B is the actual sentence that follows A in the original text; for the other 50%, B is a random sentence sampled from the corpus (A03). The model must predict whether B is the true next sentence. This binary classification is trained on the [CLS] token's final hidden state C, which is passed to a simple NSP Head, forcing the [CLS] representation to capture the relationship between the two input sentences.
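The sketch below ties the two pre-training tasks together: it builds one toy example with the 50/50 NSP pairing and the 15% selection with 80/10/10 masking described above. The helper name, the toy corpus, and the use of Python's `random` module are illustrative assumptions, not the paper's data pipeline:

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def make_pretraining_example(sent_a, next_sent, corpus, vocab):
    """Build one toy MLM+NSP example (hypothetical helper, for illustration)."""
    # NSP: 50% of the time keep the true next sentence (IsNext = 1),
    # otherwise sample a random sentence from the corpus (NotNext = 0).
    if random.random() < 0.5:
        sent_b, is_next = next_sent, 1
    else:
        sent_b, is_next = random.choice(corpus), 0

    tokens = [CLS] + sent_a + [SEP] + sent_b + [SEP]
    inputs, labels = [], []
    for tok in tokens:
        # Select 15% of non-special tokens for prediction.
        if tok in (CLS, SEP) or random.random() >= 0.15:
            inputs.append(tok)
            labels.append(None)                      # not predicted
            continue
        labels.append(tok)                           # MLM target: original token
        r = random.random()
        if r < 0.8:
            inputs.append(MASK)                      # 80%: replace with [MASK]
        elif r < 0.9:
            inputs.append(random.choice(vocab))      # 10%: random vocabulary token
        else:
            inputs.append(tok)                       # 10%: keep unchanged
    return inputs, labels, is_next

corpus = [["dogs", "bark", "loudly"], ["rain", "fell", "overnight"]]
vocab = ["the", "cat", "sat", "down", "dogs", "bark"]
print(make_pretraining_example(["the", "cat"], ["sat", "down"], corpus, vocab))
```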
Step 4: Adaptation for Fine-Tuning
Once pre-training is complete, the MLM and NSP heads are discarded. The pre-trained BERT parameters provide a powerful starting point for a wide range of downstream tasks, requiring only the addition of a small, task-specific output layer. As illustrated in Figure 4, the same pre-trained model can be adapted with minimal changes.
- For Sentence-level Classification: For tasks like sentiment analysis or NLI, a single linear classification layer is added on top of the BERT model. The final hidden state of the [CLS] token, C, is fed into this layer to produce classification logits, which are then passed through a softmax function (Figure 4 a, b).
- For Token-level Tasks (e.g., SQuAD Question Answering): For tasks that require predicting a span of text, the final hidden states of all tokens, T_i, are used. As shown in Figure 4 (c), two new vectors are introduced during fine-tuning: a start vector S and an end vector E. The probability of token i being the start of the answer span is:

  P_i = e^(S · T_i) / Σ_j e^(S · T_j)

  Where:
  - P_i is the probability of token i being the start of the answer.
  - S ∈ R^H is a learnable start-of-span vector.
  - T_i ∈ R^H is the final hidden state of the i-th token from BERT.
  - S · T_i is a dot product, yielding a scalar score for token i being the start.
  - The denominator is a softmax normalization over all tokens j in the paragraph, producing a probability distribution.

  A similar calculation with the end vector E gives the probability distribution for the end of the answer span. The model is trained to predict the correct start and end indices. This approach lets BERT be adapted to complex token-level tasks with minimal architectural modification; a short sketch follows.
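A minimal sketch of this span scoring, assuming the encoder's final hidden states `sequence_output` of shape (batch, seq_len, H). Implementing S and E together as one two-output linear layer is a common convention, not something the paper spells out:

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Toy SQuAD-style span head: scores every token as a span start/end."""
    def __init__(self, hidden_size: int):
        super().__init__()
        # Rows of this weight matrix play the roles of S and E.
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output: torch.Tensor):
        # sequence_output: (batch, seq_len, hidden_size) -- the T_i vectors.
        logits = self.qa_outputs(sequence_output)      # S·T_i and E·T_i per token
        start_logits, end_logits = logits.unbind(dim=-1)
        # Softmax over the sequence dimension yields P_i for every position.
        return start_logits.softmax(dim=-1), end_logits.softmax(dim=-1)

head = SpanHead(hidden_size=768)
t = torch.randn(2, 32, 768)                            # dummy encoder output
start_probs, end_probs = head(t)
print(start_probs.shape)                               # torch.Size([2, 32])
```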
Prerequisites
Here is a dependency-ordered list of concepts foundational to understanding the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding":
This section outlines the fundamental concepts and technologies that form the bedrock of BERT. Understanding these prerequisites is crucial for grasping BERT's innovations and its impact on Natural Language Processing.
1. Word Embeddings / Tokenization
- Problem: Neural networks require numerical inputs, but human language consists of discrete text. How can we convert words and sentences into a format that a machine learning model can process meaningfully?
- Solution: Tokenization is the process of breaking raw text into smaller units called tokens (words, sub-words, or characters). Each unique token is then mapped to a dense numerical vector called an embedding. These embeddings are typically learned during training, allowing them to capture semantic relationships (e.g., words with similar meanings have similar embedding vectors).
- Usage in this paper: BERT uses a sub-word tokenizer called WordPiece to handle out-of-vocabulary words and manage vocabulary size. Its input representation for each token is the sum of three learned embeddings: a token embedding (representing the token itself), a segment embedding (indicating which sentence the token belongs to), and a position embedding (denoting its position within the sequence). A toy sub-word tokenization sketch follows.
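The sketch below implements greedy longest-match-first sub-word splitting, the rule WordPiece applies at inference time; the tiny vocabulary is an illustrative assumption (BERT's real 30,000-piece vocabulary is learned from data):

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into sub-word pieces, longest match first (WordPiece-style).

    Continuation pieces carry a '##' prefix, as in BERT's vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:                       # shrink until a piece matches
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                     # cannot decompose this word
        pieces.append(piece)
        start = end
    return pieces

vocab = {"play", "##ing", "un", "##believ", "##able"}
print(greedy_subword_tokenize("playing", vocab))        # ['play', '##ing']
print(greedy_subword_tokenize("unbelievable", vocab))   # ['un', '##believ', '##able']
```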
2. Unsupervised Learning
- Problem: Training powerful machine learning models often requires vast amounts of labeled data, which is expensive, time-consuming, and often impractical to obtain for every task. How can we leverage the enormous quantities of readily available unlabeled data (like text on the internet) to build effective models?
- Solution: Unsupervised learning trains models on tasks that require no manual labels. Instead, labels are generated directly from the input data itself, a technique often called self-supervised learning. For example, in text, one can mask a word and ask the model to predict it, using the original word as the 'label'. This forces the model to learn the underlying patterns and structure of the data.
- Usage in this paper: BERT's entire pre-training phase is a prime example of unsupervised learning. Both the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks generate their own labels from raw input text, enabling BERT to learn rich, general-purpose language representations from a massive unlabeled corpus without human annotation.
3. Language Modeling
- Problem: How can we train a machine to learn the statistical patterns, grammatical rules, and semantic nuances of a language from raw text without explicit linguistic annotations or task-specific labels?
- Solution: Language modeling is a fundamental NLP task in which a model is trained to predict the probability of a sequence of words. Traditionally, this means predicting the next word given the preceding words (e.g., given "The cat sat on the", predict "mat"). By performing this prediction over vast amounts of text, the model implicitly learns about syntax, semantics, and context.
- Usage in this paper: BERT introduces a novel approach to language modeling called the Masked Language Model (MLM). Unlike traditional unidirectional language models, MLM randomly masks a percentage of input tokens and trains the model to predict the original masked tokens. This forces the model to learn context from both the left and right sides of a word simultaneously, leading to deeply bidirectional representations.
4. Self-Attention Mechanism
- Problem: When interpreting a specific word in a sentence, its meaning is often heavily influenced by other words, potentially far away. For example, in "The animal didn't cross the street because it was too tired," "it" refers to "the animal." Traditional sequential models (like RNNs) struggle to efficiently capture these long-range dependencies and contextual relationships across an entire sentence.
- Solution: Self-attention is a mechanism that lets a model dynamically weigh the importance of all other words in an input sequence when processing a single word. For each word, it computes attention scores against every other word in the sequence (including itself). The word's final representation is then a weighted sum of all word representations, with weights determined by these scores, so the model can focus on the most relevant context regardless of position.
- Usage in this paper: Self-attention is the fundamental building block of each Transformer layer within BERT. Because it allows each token to attend to all other tokens in the input sequence (both left and right), it is the core mechanism behind BERT's "deeply bidirectional" representations, crucial for understanding context from all directions. A minimal numeric sketch follows.
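A minimal single-head sketch (omitting the learned query/key/value projections of a real Transformer layer) shows the core weighted-sum computation:

```python
import torch

def toy_self_attention(x):
    """x: (seq_len, d) token vectors. Returns contextualized vectors.

    Each output row is a weighted sum over ALL rows of x, with weights
    given by scaled dot-product similarity -- so every token's new
    representation can draw on every other token, near or far.
    """
    d = x.size(-1)
    scores = x @ x.T / d ** 0.5        # (seq_len, seq_len) attention scores
    weights = scores.softmax(dim=-1)   # each row is a distribution over tokens
    return weights @ x

x = torch.randn(6, 8)                  # 6 tokens, 8-dimensional
print(toy_self_attention(x).shape)     # torch.Size([6, 8])
```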
5. Transformer Architecture
- Problem: Recurrent Neural Networks (RNNs) and their variants (LSTMs, GRUs) process sequential data one token at a time. This sequential nature makes them slow to train (limited parallelization) and makes it inherently difficult to capture dependencies between words that are far apart, as information degrades over long sequences.
- Solution: The Transformer architecture overcomes these limitations by eschewing recurrence and convolutions entirely. It processes all tokens in a sequence simultaneously using the self-attention mechanism, allowing massive parallelization during training. By directly relating any two words in the sequence through self-attention, regardless of their distance, the Transformer effectively captures long-range dependencies.
- Usage in this paper: BERT's core architecture is a multi-layer Transformer encoder. The paper leverages the Transformer's ability to process sequences in parallel and its powerful self-attention mechanism to create deep bidirectional representations, which are essential for its state-of-the-art performance across various NLP tasks.
6. Pre-training and Fine-tuning Paradigm
- Problem: Training very large neural networks from scratch for every specific NLP task (sentiment analysis, question answering, named entity recognition) requires an immense amount of task-specific labeled data, which is expensive and difficult to obtain, creating a bottleneck for diverse applications.
- Solution: This paradigm has two stages. First, a large model is 'pre-trained' on a massive, easily available unlabeled dataset (e.g., all of Wikipedia and BooksCorpus) using general, self-supervised tasks such as language modeling; this teaches the model a broad understanding of the language's structure, semantics, and context. Second, the pre-trained model is 'fine-tuned' by continuing training on a much smaller, task-specific labeled dataset: a small task-specific output layer is added, and all parameters (both the pre-trained ones and the new layer) are updated with a low learning rate, adapting the model's general knowledge to the downstream task.
- Usage in this paper: This two-stage methodology is the core contribution and operational principle of BERT. The model is pre-trained on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks using a vast text corpus. Subsequently, this single pre-trained model is fine-tuned with minimal architectural changes on 11 downstream NLP tasks, achieving new state-of-the-art results and demonstrating the power of this transfer learning approach. A schematic fine-tuning loop follows.
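As a schematic of the second stage, the loop below fine-tunes a pre-trained encoder and a fresh classification head end to end. It assumes the `BertModel` and `ClassificationHead` sketched in the Implementation Map below; `train_loader` and the plain Adam optimizer (the paper uses Adam with warmup and linear decay) are placeholders:

```python
import torch

def fine_tune(bert, head, train_loader, epochs=3, lr=2e-5):
    """Schematic fine-tuning loop; hyperparameters follow the Section A.3 ranges."""
    params = list(bert.parameters()) + list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for input_ids, attention_mask, token_type_ids, labels in train_loader:
            _, pooled = bert(input_ids, attention_mask, token_type_ids)
            loss = loss_fn(head(pooled), labels)   # all parameters get gradients
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```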
Implementation Map
The generated implementation map below is rendered exactly as code, preserving assumptions and provenance markers.
# Component: BERT Model (Multi-layer Transformer Encoder)
# Provenance: paper-stated
# Assumption: A01: Position embeddings are implemented as a learned lookup table of size (max_sequence_length, hidden_size), as specified in the resolved ambiguity. They are trained from scratch.
# Assumption: The specific values for hyperparameters like `vocab_size`, `hidden_size`, `num_attention_heads`, `num_hidden_layers`, `intermediate_size`, `max_position_embeddings`, `type_vocab_size`, `hidden_dropout_prob`, `attention_probs_dropout_prob`, and `layer_norm_eps` are not provided in the prompt. The code is designed to accept these as arguments. For a typical BERT-base model, these would be, for example: `vocab_size=30522`, `hidden_size=768`, `num_attention_heads=12`, `num_hidden_layers=12`, `intermediate_size=3072`, `max_position_embeddings=512`, `type_vocab_size=2`, `hidden_dropout_prob=0.1`, `attention_probs_dropout_prob=0.1`, `layer_norm_eps=1e-12`.
import torch
import torch.nn as nn
import math
class BertEmbeddings(nn.Module):
"""
Construct the embeddings from word, position and token_type embeddings.
"""
def __init__(self, vocab_size, hidden_size, max_position_embeddings, type_vocab_size, hidden_dropout_prob, layer_norm_eps):
super().__init__()
self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
# ASSUMED: A01 - Position embeddings are a learned lookup table.
self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
self.LayerNorm = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
self.dropout = nn.Dropout(hidden_dropout_prob)
# INFERRED: Position IDs are usually generated internally if not provided.
# This is a common practice in HuggingFace Transformers based on the original BERT implementation.
self.register_buffer("position_ids", torch.arange(max_position_embeddings).expand((1, -1)))
def forward(self, input_ids=None, token_type_ids=None, position_ids=None):
if input_ids is None:
            # input_ids is required; fail fast with a clear error.
            raise ValueError("input_ids must be provided for BertEmbeddings.")
input_shape = input_ids.size()
seq_length = input_shape[1]
if position_ids is None:
# INFERRED: If position_ids are not provided, generate them based on sequence length.
position_ids = self.position_ids[:, :seq_length]
if token_type_ids is None:
# INFERRED: If token_type_ids are not provided, assume all zeros (segment A).
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=input_ids.device)
words_embeddings = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = words_embeddings + position_embeddings + token_type_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
class BertSelfAttention(nn.Module):
def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob):
super().__init__()
if hidden_size % num_attention_heads != 0:
raise ValueError(
f"The hidden size ({hidden_size}) is not a multiple of the number of attention "
f"heads ({num_attention_heads})"
)
self.num_attention_heads = num_attention_heads
self.attention_head_size = hidden_size // num_attention_heads
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(hidden_size, self.all_head_size)
self.key = nn.Linear(hidden_size, self.all_head_size)
self.value = nn.Linear(hidden_size, self.all_head_size)
self.dropout = nn.Dropout(attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3) # (batch_size, num_heads, seq_len, head_size)
def forward(self, hidden_states, attention_mask=None):
query_layer = self.query(hidden_states)
key_layer = self.key(hidden_states)
value_layer = self.value(hidden_states)
query_layer = self.transpose_for_scores(query_layer)
key_layer = self.transpose_for_scores(key_layer)
value_layer = self.transpose_for_scores(value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) # Eq. (1)
attention_scores = attention_scores / math.sqrt(self.attention_head_size) # Eq. (1) scaling
if attention_mask is not None:
# Apply the attention mask (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.functional.softmax(attention_scores, dim=-1) # Eq. (2)
# This is actually dropping out entire tokens to attend to, which might
# make more sense for attention modules than dropping individual attention
# scores.
attention_probs = self.dropout(attention_probs)
context_layer = torch.matmul(attention_probs, value_layer) # Eq. (3)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
return context_layer
class BertSelfOutput(nn.Module):
def __init__(self, hidden_size, hidden_dropout_prob, layer_norm_eps):
super().__init__()
self.dense = nn.Linear(hidden_size, hidden_size)
self.LayerNorm = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
self.dropout = nn.Dropout(hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor) # Residual connection + LayerNorm
return hidden_states
class BertAttention(nn.Module):
def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob, layer_norm_eps):
super().__init__()
self.self = BertSelfAttention(hidden_size, num_attention_heads, attention_probs_dropout_prob)
self.output = BertSelfOutput(hidden_size, hidden_dropout_prob, layer_norm_eps)
def forward(self, hidden_states, attention_mask=None):
self_output = self.self(hidden_states, attention_mask)
attention_output = self.output(self_output, hidden_states)
return attention_output
class BertIntermediate(nn.Module):
def __init__(self, hidden_size, intermediate_size):
super().__init__()
self.dense = nn.Linear(hidden_size, intermediate_size)
# INFERRED: BERT uses GELU activation function, as per the original implementation.
self.intermediate_act_fn = nn.GELU()
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states
class BertOutput(nn.Module):
def __init__(self, intermediate_size, hidden_size, hidden_dropout_prob, layer_norm_eps):
super().__init__()
self.dense = nn.Linear(intermediate_size, hidden_size)
self.LayerNorm = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
self.dropout = nn.Dropout(hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor) # Residual connection + LayerNorm
return hidden_states
class BertLayer(nn.Module):
def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob, layer_norm_eps, intermediate_size):
super().__init__()
self.attention = BertAttention(hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob, layer_norm_eps)
self.intermediate = BertIntermediate(hidden_size, intermediate_size)
self.output = BertOutput(intermediate_size, hidden_size, hidden_dropout_prob, layer_norm_eps)
def forward(self, hidden_states, attention_mask=None):
attention_output = self.attention(hidden_states, attention_mask)
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
class BertEncoder(nn.Module):
def __init__(self, num_hidden_layers, hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob, layer_norm_eps, intermediate_size):
super().__init__()
self.layer = nn.ModuleList([
BertLayer(
hidden_size,
num_attention_heads,
attention_probs_dropout_prob,
hidden_dropout_prob,
layer_norm_eps,
intermediate_size
)
for _ in range(num_hidden_layers)
])
def forward(self, hidden_states, attention_mask=None):
for i, layer_module in enumerate(self.layer):
hidden_states = layer_module(hidden_states, attention_mask)
return hidden_states
class BertModel(nn.Module):
"""
The bare BERT Model transformer outputting raw hidden-states without any specific head on top.
"""
def __init__(self, vocab_size, hidden_size, num_attention_heads, num_hidden_layers,
intermediate_size, max_position_embeddings, type_vocab_size,
hidden_dropout_prob, attention_probs_dropout_prob, layer_norm_eps):
super().__init__()
self.embeddings = BertEmbeddings(
vocab_size, hidden_size, max_position_embeddings, type_vocab_size,
hidden_dropout_prob, layer_norm_eps
)
self.encoder = BertEncoder(
num_hidden_layers, hidden_size, num_attention_heads,
attention_probs_dropout_prob, hidden_dropout_prob, layer_norm_eps,
intermediate_size
)
# INFERRED: Pooler layer for sequence classification tasks, often used for [CLS] token output.
# This is part of the standard BERT architecture, though not strictly "encoder" output.
# It's a linear layer followed by tanh activation, as per the original BERT implementation.
self.pooler = nn.Linear(hidden_size, hidden_size)
self.pooler_activation = nn.Tanh()
self.init_weights()
def init_weights(self):
# INFERRED: Standard BERT weight initialization, typically a truncated normal distribution.
self.apply(self._init_weights)
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, (nn.Linear, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=0.02)
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and module.bias is not None:
module.bias.data.zero_()
def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None):
if input_ids is None:
            # input_ids is required; fail fast with a clear error.
            raise ValueError("input_ids must be provided for BertModel.")
# We create a 3D attention mask from a 2D tensor mask.
# Sizes are [batch_size, 1, 1, to_seq_length]
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
# This attention mask is more simple than the triangular one used in causal attention
# used in OpenAI GPT, we just need to prepare the broadcast here.
if attention_mask is None:
attention_mask = torch.ones_like(input_ids)
# Extended attention mask for broadcasting
# (batch_size, 1, 1, seq_length)
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw attention scores in the self-attention layer,
# the softmax will be nearly zero for the masked positions.
extended_attention_mask = extended_attention_mask.to(dtype=self.embeddings.word_embeddings.weight.dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
embedding_output = self.embeddings(
input_ids=input_ids,
position_ids=position_ids,
token_type_ids=token_type_ids
)
encoder_output = self.encoder(
embedding_output,
attention_mask=extended_attention_mask
)
# Pooler output for [CLS] token
# INFERRED: The pooler takes the hidden state of the first token ([CLS])
# and applies a linear layer followed by a Tanh activation.
first_token_tensor = encoder_output[:, 0]
pooled_output = self.pooler(first_token_tensor)
pooled_output = self.pooler_activation(pooled_output)
return encoder_output, pooled_output
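# ASSUMED: Usage sketch, illustrative only -- instantiate with the BERT-base
# values listed in the assumption above and run a dummy batch to verify shapes.
if __name__ == "__main__":
    model = BertModel(
        vocab_size=30522, hidden_size=768, num_attention_heads=12,
        num_hidden_layers=12, intermediate_size=3072,
        max_position_embeddings=512, type_vocab_size=2,
        hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
        layer_norm_eps=1e-12,
    )
    input_ids = torch.randint(0, 30522, (2, 16))        # (batch, seq_len)
    encoder_output, pooled_output = model(input_ids)
    print(encoder_output.shape)  # torch.Size([2, 16, 768]) -- T_i per token
    print(pooled_output.shape)   # torch.Size([2, 768])     -- pooled [CLS]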
# Component: Classification Head (Fine-tuning)
# Provenance: inferred
# Assumption: The architecture of the classification head (dropout layer followed by a linear layer) is inferred based on common practices for fine-tuning BERT-like models for sequence classification, as no specific architecture was detailed in the provided context.
# Assumption: A default `dropout_prob` of 0.1 is assumed if not explicitly provided, consistent with typical BERT fine-tuning configurations.
# Assumption: The `hidden_size` parameter, representing the BERT model's hidden state dimension, is a placeholder and must be provided from the specific BERT model configuration (e.g., 768 for BERT-base).
# Assumption: The `num_labels` parameter, representing the number of classes for the downstream task, is a placeholder and must be provided based on the specific task requirements (e.g., 2 for binary classification).
import torch
import torch.nn as nn
class ClassificationHead(nn.Module):
"""
Classification Head for fine-tuning BERT on sequence classification tasks.
It takes the pooled output (usually the [CLS] token's representation)
and projects it to the number of output labels.
"""
def __init__(self, hidden_size: int, num_labels: int, dropout_prob: float = None):
super().__init__()
# INFERRED: A dropout layer is typically applied before the final classification layer
# for regularization during fine-tuning, as seen in official BERT implementations.
# ASSUMED: If dropout_prob is not provided, a common default of 0.1 is used,
# consistent with BERT's pre-training and fine-tuning practices.
self.dropout = nn.Dropout(dropout_prob if dropout_prob is not None else 0.1)
# INFERRED: A linear layer is used to project the hidden state of the [CLS] token
# to the number of output classes for the specific classification task.
self.classifier = nn.Linear(hidden_size, num_labels)
# TODO: hidden_size - The dimension of the BERT model's hidden states.
# This value is typically 768 for BERT-base and 1024 for BERT-large.
# It must be provided from the BERT model configuration. # ASSUMED
# TODO: num_labels - The number of classes for the specific downstream classification task.
# This value is task-dependent (e.g., 2 for binary classification, N for multi-class). # ASSUMED
def forward(self, pooled_output: torch.Tensor) -> torch.Tensor:
"""
Forward pass for the classification head.
Args:
pooled_output (torch.Tensor): The pooled output from the BERT model,
typically the final hidden state of the [CLS] token.
Shape: (batch_size, hidden_size)
Returns:
torch.Tensor: Logits for each class. Shape: (batch_size, num_labels)
"""
# Apply dropout for regularization
x = self.dropout(pooled_output)
# Project the hidden state to the number of output labels
logits = self.classifier(x)
return logits
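# ASSUMED: Usage sketch, illustrative only -- pair the head with BertModel's
# pooled [CLS] output for a binary task such as sentiment classification.
if __name__ == "__main__":
    head = ClassificationHead(hidden_size=768, num_labels=2)
    pooled = torch.randn(4, 768)     # stand-in for BertModel's pooled output
    print(head(pooled).shape)        # torch.Size([4, 2]) -- per-class logits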
# Component: Final Hidden States (T_i)
# Provenance: inferred
# Assumption: Standard BERT architecture components are used as described in the paper.
# Assumption: A01: The position embeddings are a learned lookup table of size (max_sequence_length, hidden_size), e.g., (512, 768). They are trained from scratch along with the rest of the model. For sequences longer than 512, a common strategy is to truncate the input, though this is not specified in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# ASSUMED: Standard BERT architecture components are used as described in the paper.
# INFERRED: The final hidden states (T_i) are the output of the last layer of the Transformer encoder.
def gelu(x):
    """
    Exact GELU computed via the error function; the original Google BERT code
    uses a tanh-based approximation, which agrees closely with this form.
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))  # Eq. (GELU)
class BertEmbeddings(nn.Module):
"""
Construct the embeddings from word, position and token_type embeddings.
"""
def __init__(self):
super().__init__()
self.vocab_size = TODO_VOCAB_SIZE # INFERRED: From BERT pre-training, e.g., 30522 for uncased base.
self.hidden_size = TODO_HIDDEN_SIZE # INFERRED: From BERT base config, e.g., 768.
self.max_position_embeddings = TODO_MAX_POSITION_EMBEDDINGS # ASSUMED: A01, e.g., 512.
self.type_vocab_size = TODO_TYPE_VOCAB_SIZE # INFERRED: From BERT config, usually 2 (segment A, segment B).
self.hidden_dropout_prob = TODO_HIDDEN_DROPOUT_PROB # INFERRED: From BERT config, e.g., 0.1.
self.layer_norm_eps = TODO_LAYER_NORM_EPS # INFERRED: From BERT config, e.g., 1e-12.
self.word_embeddings = nn.Embedding(self.vocab_size, self.hidden_size)
self.position_embeddings = nn.Embedding(self.max_position_embeddings, self.hidden_size) # ASSUMED: A01
self.token_type_embeddings = nn.Embedding(self.type_vocab_size, self.hidden_size)
self.LayerNorm = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps) # Eq. (Layer Normalization)
self.dropout = nn.Dropout(self.hidden_dropout_prob)
def forward(self, input_ids=None, token_type_ids=None, position_ids=None):
seq_length = input_ids.size(1)
if position_ids is None:
position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
if token_type_ids is None:
token_type_ids = torch.zeros_like(input_ids)
words_embeddings = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = words_embeddings + position_embeddings + token_type_embeddings # Eq. (Embedding Summation)
embeddings = self.LayerNorm(embeddings) # Eq. (Layer Normalization)
embeddings = self.dropout(embeddings)
return embeddings
class BertSelfAttention(nn.Module):
def __init__(self):
super().__init__()
self.hidden_size = TODO_HIDDEN_SIZE # INFERRED: From BERT base config, e.g., 768.
self.num_attention_heads = TODO_NUM_ATTENTION_HEADS # INFERRED: From BERT base config, e.g., 12.
self.attention_probs_dropout_prob = TODO_ATTENTION_PROBS_DROPOUT_PROB # INFERRED: From BERT config, e.g., 0.1.
self.attention_head_size = self.hidden_size // self.num_attention_heads # INFERRED: Standard calculation.
self.all_head_size = self.num_attention_heads * self.attention_head_size # INFERRED: Standard calculation.
self.query = nn.Linear(self.hidden_size, self.all_head_size)
self.key = nn.Linear(self.hidden_size, self.all_head_size)
self.value = nn.Linear(self.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(self.attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3) # (batch_size, num_heads, seq_len, head_dim)
def forward(self, hidden_states, attention_mask=None):
query_layer = self.query(hidden_states)
key_layer = self.key(hidden_states)
value_layer = self.value(hidden_states)
query_layer = self.transpose_for_scores(query_layer)
key_layer = self.transpose_for_scores(key_layer)
value_layer = self.transpose_for_scores(value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) # Eq. (Scaled Dot-Product Attention - part 1)
        attention_scores = attention_scores / math.sqrt(self.attention_head_size) # Eq. (Scaled Dot-Product Attention - part 2)
if attention_mask is not None:
# Apply the attention mask (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask # Eq. (Masking for attention scores)
# Normalize the attention scores to probabilities.
        attention_probs = F.softmax(attention_scores, dim=-1) # Eq. (Softmax)
attention_probs = self.dropout(attention_probs)
context_layer = torch.matmul(attention_probs, value_layer) # Eq. (Scaled Dot-Product Attention - part 3)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
return context_layer
class BertSelfOutput(nn.Module):
def __init__(self):
super().__init__()
self.hidden_size = TODO_HIDDEN_SIZE # INFERRED: From BERT base config, e.g., 768.
self.hidden_dropout_prob = TODO_HIDDEN_DROPOUT_PROB # INFERRED: From BERT config, e.g., 0.1.
self.layer_norm_eps = TODO_LAYER_NORM_EPS # INFERRED: From BERT config, e.g., 1e-12.
self.dense = nn.Linear(self.hidden_size, self.hidden_size)
self.LayerNorm = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps) # Eq. (Layer Normalization)
self.dropout = nn.Dropout(self.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor) # Eq. (Residual Connection + Layer Normalization)
return hidden_states
class BertAttention(nn.Module):
def __init__(self):
super().__init__()
self.self = BertSelfAttention()
self.output = BertSelfOutput()
def forward(self, hidden_states, attention_mask=None):
self_output = self.self(hidden_states, attention_mask)
attention_output = self.output(self_output, hidden_states)
return attention_output
class BertIntermediate(nn.Module):
def __init__(self):
super().__init__()
self.hidden_size = TODO_HIDDEN_SIZE # INFERRED: From BERT base config, e.g., 768.
self.intermediate_size = TODO_INTERMEDIATE_SIZE # INFERRED: From BERT base config, e.g., 3072.
self.hidden_act = TODO_HIDDEN_ACT # INFERRED: From BERT config, e.g., 'gelu'.
self.dense = nn.Linear(self.hidden_size, self.intermediate_size)
if self.hidden_act == "gelu":
self.intermediate_act_fn = gelu
elif self.hidden_act == "relu":
self.intermediate_act_fn = F.relu
else:
raise ValueError(f"Unsupported activation function: {self.hidden_act}")
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states) # Eq. (Activation Function)
return hidden_states
class BertOutput(nn.Module):
def __init__(self):
super().__init__()
self.hidden_size = TODO_HIDDEN_SIZE # INFERRED: From BERT base config, e.g., 768.
self.intermediate_size = TODO_INTERMEDIATE_SIZE # INFERRED: From BERT base config, e.g., 3072.
self.hidden_dropout_prob = TODO_HIDDEN_DROPOUT_PROB # INFERRED: From BERT config, e.g., 0.1.
self.layer_norm_eps = TODO_LAYER_NORM_EPS # INFERRED: From BERT config, e.g., 1e-12.
self.dense = nn.Linear(self.intermediate_size, self.hidden_size)
self.LayerNorm = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps) # Eq. (Layer Normalization)
self.dropout = nn.Dropout(self.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor) # Eq. (Residual Connection + Layer Normalization)
return hidden_states
class BertLayer(nn.Module):
def __init__(self):
super().__init__()
self.attention = BertAttention()
self.intermediate = BertIntermediate()
self.output = BertOutput()
def forward(self, hidden_states, attention_mask=None):
attention_output = self.attention(hidden_states, attention_mask)
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
class BertEncoder(nn.Module):
def __init__(self):
super().__init__()
self.num_hidden_layers = TODO_NUM_HIDDEN_LAYERS # INFERRED: From BERT base config, e.g., 12.
self.layer = nn.ModuleList([BertLayer() for _ in range(self.num_hidden_layers)])
def forward(self, hidden_states, attention_mask=None):
# The paper states "The final hidden state T_i for each input token i"
# This implies the output of the last layer.
for i, layer_module in enumerate(self.layer):
hidden_states = layer_module(hidden_states, attention_mask)
return hidden_states # This is T_i
class BertModel(nn.Module):
"""
The bare BERT Model transformer outputting raw hidden-states (T_i) without any specific head on top.
"""
def __init__(self):
super().__init__()
self.embeddings = BertEmbeddings()
self.encoder = BertEncoder()
def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None):
if attention_mask is None:
attention_mask = torch.ones_like(input_ids)
if token_type_ids is None:
token_type_ids = torch.zeros_like(input_ids)
# We create a 3D attention mask from a 2D tensor mask.
# Sizes are (batch_size, 1, 1, to_seq_length)
# So we can broadcast to (batch_size, num_heads, from_seq_length, to_seq_length)
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# This effectively masks out attention to padded tokens by setting their scores to a very small number.
extended_attention_mask = extended_attention_mask.to(dtype=self.embeddings.word_embeddings.weight.dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 # Eq. (Masking for attention scores)
embedding_output = self.embeddings(
input_ids=input_ids,
position_ids=position_ids,
token_type_ids=token_type_ids
)
encoder_outputs = self.encoder(
embedding_output,
attention_mask=extended_attention_mask
)
# The final hidden states (T_i) are the output of the BertEncoder
final_hidden_states = encoder_outputs
return final_hidden_states
Missing Details
A01: Position Embedding Implementation
- Type: underspecified_architecture
- Section: 3 BERT, Input/Output Representations
- Ambiguous point: The paper states that position embeddings are learned, not the fixed sinusoidal embeddings from the original Transformer paper. However, it does not specify how they are learned, their maximum length, or how the model would handle sequences longer than the 512 tokens seen during pre-training.
- Implementation consequence: If an implementer assumes a fixed maximum length (e.g., 512) for the learned position embeddings, the model will fail on longer sequences at inference time. If the embeddings are not initialized or trained correctly, it could degrade model performance, as positional information is critical for the self-attention mechanism.
- Agent resolution: Assume the position embeddings are a learned lookup table of size (max_sequence_length, hidden_size), e.g., (512, 768). They are trained from scratch along with the rest of the model. For sequences longer than 512, a common strategy is to truncate the input, though this is not specified in the paper.
- Confidence: 0.5
A02: Tokenizer Vocabulary Generation
- Type: missing_training_detail
- Section: 3 BERT, Input/Output Representations
- Ambiguous point: The paper specifies a 30,000 token WordPiece vocabulary but does not detail its creation process. It's unclear what corpus was used to train the tokenizer, whether it was cased or uncased for pre-training, or what other configuration settings were used.
- Implementation consequence: Using a different vocabulary or tokenizer settings would create a complete mismatch with the released pre-trained weights, making it impossible to replicate the paper's results. The model's performance is highly sensitive to the tokenization scheme.
- Agent resolution: The official BERT implementation released by Google uses a specific vocabulary file (vocab.txt). It is standard practice to use this provided file. The base model is uncased, and a separate cased model was also released. The pre-training described for the main results likely used the uncased vocabulary.
- Confidence: 0.5
A03: NSP Random Sentence Sampling Strategy
- Type: missing_training_detail
- Section: 3.1 Pre-training BERT, Task #2: Next Sentence Prediction (NSP)
- Ambiguous point: For the Next Sentence Prediction task, the paper states that 50% of the time, sentence B is a 'random sentence from the corpus'. It does not specify the sampling strategy: is the random sentence from the same document, a different document, or the entire corpus? Are there constraints on its length?
- Implementation consequence: The difficulty of the NSP task depends heavily on this sampling strategy. If random sentences are always from different documents, the model might learn to solve the task using simple topic differences, rather than learning about coherence and logical flow. This could make the pre-training less effective for downstream NLI tasks.
- Agent resolution: The official implementation samples random sentences from the entire corpus, not just the same document. A sentence is chosen at random, and there are no explicit constraints other than it not being the true next sentence.
- Confidence: 0.5
A04: SQuAD v2.0 No-Answer Threshold (τ)
- Type: missing_hyperparameter
- Section: 4.3 SQuAD v2.0
- Ambiguous point: For SQuAD v2.0, the model predicts a no-answer response if the score of the best non-null span is not greater than the no-answer span score by a threshold τ. The paper states τ is 'selected on the dev set to maximize F1' but does not provide the value of τ or the search procedure.
- Implementation consequence: Without the value of τ, the exact F1 score on the SQuAD v2.0 dev and test sets cannot be replicated. Different values of τ will produce a different precision/recall trade-off for answerable vs. unanswerable questions, leading to different results.
- Agent resolution: The value of τ must be found by running inference on the development set with the fine-tuned model and searching for the threshold that maximizes the F1 score. A common approach is to iterate through the score differences observed on the dev set and pick the one that yields the best F1; a sketch follows below.
- Confidence: 0.5
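A sketch of that search, under the resolution above; the example fields and the injected `f1_at_tau` metric are illustrative assumptions:

```python
def find_best_tau(examples, f1_at_tau):
    """Grid-search the SQuAD v2.0 no-answer threshold tau on dev predictions.

    Each example is assumed to carry `best_non_null_score` and `null_score`;
    predict "no answer" when best_non_null_score - null_score <= tau.
    `f1_at_tau(examples, tau)` is a caller-supplied F1 evaluation.
    """
    candidates = sorted(ex.best_non_null_score - ex.null_score for ex in examples)
    best_tau, best_f1 = 0.0, float("-inf")
    for tau in candidates:   # only observed score gaps can change predictions
        f1 = f1_at_tau(examples, tau)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau
```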
A05: TriviaQA Augmentation Details for SQuAD
- Type: missing_training_detail
- Section: 4.2 SQuAD v1.1, Footnote 12
- Ambiguous point: The best SQuAD v1.1 model was first fine-tuned on TriviaQA. The paper provides minimal details on this intermediate step: 'first 400 tokens in documents, that contain at least one of the provided possible answers'. Key details like the learning rate, number of epochs, and batch size for this phase are missing.
- Implementation consequence: The state-of-the-art SQuAD v1.1 results are not reproducible without these crucial hyperparameters. The performance boost from this intermediate training step is significant, and incorrect settings could lead to worse results or negative transfer.
- Agent resolution: Assume the same fine-tuning hyperparameters as the main SQuAD task (e.g., LR=5e-5, Batch=32) and train for a similar number of epochs (e.g., 2-3). This is a reasonable starting point for experimentation.
- Confidence: 0.5
A06: Details of Random Restarts
- Type: missing_training_detail
- Section: 4.1 GLUE
- Ambiguous point: For BERT_LARGE on GLUE, the authors 'ran several random restarts and selected the best model on the Dev set'. The paper does not specify how many restarts 'several' is, nor what was re-randomized (data shuffling, classifier layer initialization, or both).
- Implementation consequence: The reported GLUE scores for BERT_LARGE might be the result of cherry-picking from an unknown number of runs. This makes it difficult to assess the model's expected performance and variance. An implementer might get a lower score with a single run and incorrectly assume their implementation is flawed.
- Agent resolution: Implementers should be aware that fine-tuning can be unstable. A common practice is to run experiments with 3 to 5 different random seeds and report the mean and standard deviation. To replicate the paper's 'best' score, one would need to run multiple trials and select the best-performing one on the dev set.
- Confidence: 0.5
A07: Pre-training Loss Weighting
- Type: ambiguous_loss_function
- Section: A.2 Pre-training Procedure
- Ambiguous point: The total pre-training loss is described as 'the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.' It is not explicitly stated if these two loss components are weighted equally (i.e., weight of 1.0 for each) or if there is some other weighting scheme.
- Implementation consequence: If the two losses are not weighted equally, the model's focus during pre-training would shift. For example, a higher weight on NSP might make the model better at sentence-pair tasks at the expense of token-level understanding. An incorrect implementation of the loss function would lead to a differently optimized pre-trained model.
- Agent resolution: Assume the losses are added with equal weight (1.0). The loss function is Loss = Loss_MLM + Loss_NSP.
- Confidence: 0.5
Training Recipe
Hyperparameter Registry
| name | value | source | status | suggested_default |
|---|---|---|---|---|
| BERT_BASE: Number of Layers (L) | 12 | 3 BERT, Model Architecture | paper-stated | |
| BERT_BASE: Hidden Size (H) | 768 | 3 BERT, Model Architecture | paper-stated | |
| BERT_BASE: Number of Attention Heads (A) | 12 | 3 BERT, Model Architecture | paper-stated | |
| BERT_BASE: Total Parameters | 110M | 3 BERT, Model Architecture | paper-stated | |
| BERT_BASE: Feed-forward/Filter Size | 3072 | 3 BERT, Model Architecture, Footnote 3 | paper-stated | |
| BERT_LARGE: Number of Layers (L) | 24 | 3 BERT, Model Architecture | paper-stated | |
| BERT_LARGE: Hidden Size (H) | 1024 | 3 BERT, Model Architecture | paper-stated | |
| BERT_LARGE: Number of Attention Heads (A) | 16 | 3 BERT, Model Architecture | paper-stated | |
| BERT_LARGE: Total Parameters | 340M | 3 BERT, Model Architecture | paper-stated | |
| BERT_LARGE: Feed-forward/Filter Size | 4096 | 3 BERT, Model Architecture, Footnote 3 | paper-stated | |
| Vocabulary Size | 30000 | 3 BERT, Input/Output Representations | paper-stated | |
| Tokenizer | WordPiece | 3 BERT, Input/Output Representations | paper-stated | |
| Pre-training: Masking Percentage (MLM) | 0.15 | 3.1 Pre-training BERT, Task #1: Masked LM | paper-stated | |
| Pre-training: Masking Strategy - [MASK] token | 0.8 | 3.1 Pre-training BERT, Task #1: Masked LM | paper-stated | |
| Pre-training: Masking Strategy - Random token | 0.1 | 3.1 Pre-training BERT, Task #1: Masked LM | paper-stated | |
| Pre-training: Masking Strategy - Unchanged token | 0.1 | 3.1 Pre-training BERT, Task #1: Masked LM | paper-stated | |
| Pre-training: Next Sentence Label Ratio (IsNext/NotNext) | 0.5 | 3.1 Pre-training BERT, Task #2: Next Sentence Prediction (NSP) | paper-stated | |
| Pre-training: Max Sequence Length | 512 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Batch Size | 256 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Total Steps | 1000000 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Number of Epochs | 40 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Optimizer | Adam | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Learning Rate | 1e-4 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Adam beta_1 | 0.9 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Adam beta_2 | 0.999 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: L2 Weight Decay | 0.01 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Learning Rate Warmup Steps | 10000 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Learning Rate Decay Schedule | linear | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Dropout Probability | 0.1 | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Activation Function | GELU | A.2 Pre-training Procedure | paper-stated | |
| Pre-training: Sequence Length Schedule | 90% steps at length 128, 10% steps at length 512 | A.2 Pre-training Procedure | paper-stated | |
| Fine-tuning: Dropout Probability | 0.1 | A.3 Fine-tuning Procedure | paper-stated | |
| Fine-tuning: Batch Size (Search Range) | [16, 32] | A.3 Fine-tuning Procedure | paper-stated | |
| Fine-tuning: Learning Rate (Search Range) | [5e-5, 3e-5, 2e-5] | A.3 Fine-tuning Procedure | paper-stated | |
| Fine-tuning: Number of Epochs (Search Range) | [2, 3, 4] | A.3 Fine-tuning Procedure | paper-stated | |
| Fine-tuning (GLUE): Batch Size | 32 | 4.1 GLUE | paper-stated | |
| Fine-tuning (GLUE): Number of Epochs | 3 | 4.1 GLUE | paper-stated | |
| Fine-tuning (GLUE): Learning Rate (Selected from) | [5e-5, 4e-5, 3e-5, 2e-5] | 4.1 GLUE | paper-stated | |
| Fine-tuning (SQuAD v1.1): Number of Epochs | 3 | 4.2 SQuAD v1.1 | paper-stated | |
| Fine-tuning (SQuAD v1.1): Learning Rate | 5e-5 | 4.2 SQuAD v1.1 | paper-stated | |
| Fine-tuning (SQuAD v1.1): Batch Size | 32 | 4.2 SQuAD v1.1 | paper-stated | |
| Fine-tuning (SQuAD v2.0): Number of Epochs | 2 | 4.3 SQuAD v2.0 | paper-stated | |
| Fine-tuning (SQuAD v2.0): Learning Rate | 5e-5 | 4.3 SQuAD v2.0 | paper-stated | |
| Fine-tuning (SQuAD v2.0): Batch Size | 48 | 4.3 SQuAD v2.0 | paper-stated | |
| Fine-tuning (SWAG): Number of Epochs | 3 | 4.4 SWAG | paper-stated | |
| Fine-tuning (SWAG): Learning Rate | 2e-5 | 4.4 SWAG | paper-stated | |
| Fine-tuning (SWAG): Batch Size | 16 | 4.4 SWAG | paper-stated | |
| Feature-based NER: BiLSTM Layers | 2 | 5.3 Feature-based Approach with BERT | paper-stated | |
| Feature-based NER: BiLSTM Hidden Size | 768 | 5.3 Feature-based Approach with BERT | paper-stated |