This is a detailed blog post on how to build a transformer from scratch, expanding on the post I published on Medium.
I'll walk you through my journey of building a transformer from scratch, covering all the details of the model and training it to translate pseudo-code into C++ code. This project combines natural language processing (NLP) with code generation, and I'm excited to share the details with you.
Inspiration and Resources
My work is inspired by Umar Jamil's video series on building a transformer from scratch, "Coding a Transformer from scratch on PyTorch, with full explanation, training and inference". I also leaned on the Udacity course "Introduction to Deep Learning with PyTorch" to strengthen my understanding of PyTorch.
Project Overview
The goal was to create a model that could understand the logic expressed in simple pseudo-code and convert it into functional C++ code. This is a challenging task that requires the model to understand both natural language and programming syntax.
I utilized Kaggle Notebooks for this project, taking advantage of the generous GPU resources.
Model Architecture
- Encoder: Processes the input pseudo-code and creates a representation of its meaning.
- Decoder: Takes the encoder's output and generates the corresponding C++ code.
- Attention Mechanism: Allows the model to focus on the most relevant parts of the input when generating the output.
Given the relatively small sequence lengths in my dataset (mostly under 300 characters), I was able to reduce the model’s size, which meant less computing power was needed. The main changes I made to reduce size are:
- Sequence Length: 300
- Model Dimension (d_model): 256
- Number of Encoder/Decoder Layers (N): 3
- Number of Attention Heads (h): 4
I kept the feed-forward network size (d_ff) at 2048, as in the original paper.
Here’s the PyTorch code for the core components of the Transformer model:
Input Embeddings, Converting Words to Numbers
This is where the input words are transformed into numerical representations that the model can work with.
import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # Scale the embeddings by sqrt(d_model), as in the original paper
        return self.embedding(x) * math.sqrt(self.d_model)
- nn.Embedding(vocab_size, d_model): This creates a lookup table. Each word in the vocabulary (vocab_size) gets assigned a vector of size d_model.
- d_model: A hyperparameter controlling the size of these vectors (e.g., 256 in our case).
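As a quick sanity check, here is a minimal usage sketch (the batch size, sequence length, and vocabulary size are illustrative values, not the project's actual data pipeline):

import torch

embed = InputEmbeddings(d_model=256, vocab_size=30522)  # hypothetical vocabulary size
token_ids = torch.randint(0, 30522, (2, 300))           # (batch_size, seq_len) of token IDs
out = embed(token_ids)
print(out.shape)  # torch.Size([2, 300, 256])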
Positional Encoding, Adding Word Order Information
Transformers don’t inherently know the order of words. Positional encoding adds this information.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)
        # Precompute a (seq_len, d_model) matrix of positional encodings
        pe = torch.zeros(seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices get sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices get cosine
        pe = pe.unsqueeze(0)  # add a batch dimension: (1, seq_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the (non-trainable) positional encodings to the embeddings
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)
The code calculates fixed positional encodings using sine and cosine functions. These encodings are added to the word embeddings. The specific formulas create unique patterns for each position in the sequence.
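Concretely, these are the formulas from the original paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal usage sketch (the shapes are illustrative):

pos_enc = PositionalEncoding(d_model=256, seq_len=300, dropout=0.1)
embedded = torch.zeros(2, 300, 256)   # stand-in for the output of InputEmbeddings
encoded = pos_enc(embedded)
print(encoded.shape)  # torch.Size([2, 300, 256]), same shape with position information added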
Multi-Head Attention Block, Focusing on Relevant Words
This is the core of the Transformer, enabling it to weigh the importance of different words when processing a sequence.
class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.h = h
        assert d_model % h == 0, "d_model is not divisible by h"
        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        # Scaled dot-product attention
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        query = self.w_q(q)
        key = self.w_k(k)
        value = self.w_v(v)
        # Split into h heads: (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        # Concatenate heads back: (batch, h, seq_len, d_k) -> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
        return self.w_o(x)
- w_q, w_k, w_v: These are linear layers that transform the input into query, key, and value vectors. Think of the query as “what am I looking for?”, the key as “what do I contain?”, and the value as “what information do I have to offer?”.
- attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k): This calculates the attention scores by taking the dot product of the query and key vectors. The division by sqrt(d_k) scales the scores to prevent them from becoming too large.
- attention_scores.softmax(dim=-1): Applies the softmax function to normalize the attention scores into probabilities.
- (attention_scores @ value): calculates a weighted sum of the value vectors, where the weights are the attention scores.
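To make the shapes concrete, here is a minimal self-attention call on a toy batch (the sizes are illustrative, and no mask is applied):

mha = MultiHeadAttentionBlock(d_model=256, h=4, dropout=0.1)
x = torch.rand(2, 300, 256)    # (batch, seq_len, d_model)
out = mha(x, x, x, mask=None)  # self-attention: query, key, and value are all x
print(out.shape)               # torch.Size([2, 300, 256])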
Feed Forward Block, Adding Complexity
This block adds non-linearity and allows the model to learn more complex representations.
class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
Two linear layers with a ReLU activation in between. The first linear layer expands the dimension from d_model to d_ff (e.g., 2048). The ReLU adds non-linearity. The second linear layer projects back to d_model.
Encoder Block & Decoder Block, Layering the Processing
These blocks combine the attention and feed-forward mechanisms into a single layer.
The Encoder block
class EncoderBlock(nn.Module):
    def __init__(self, features: int, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x
The Decoder block
class DecoderBlock(nn.Module):
    def __init__(self,
                 features: int,
                 self_attention_block: MultiHeadAttentionBlock,
                 cross_attention_block: MultiHeadAttentionBlock,
                 feed_forward_block: FeedForwardBlock,
                 dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList(
            [ResidualConnection(features, dropout) for _ in range(3)]
        )

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x
- These blocks wrap the attention and feed-forward sub-layers in residual connections (the ResidualConnection helper, sketched below), which help gradients flow more easily during training and prevent vanishing gradients. The decoder block has cross-attention to attend to the encoder's output and self-attention to attend to its own input.
- src_mask and tgt_mask: These masks prevent the model from attending to padding tokens or, in the decoder, to future tokens (see the mask sketch below).
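Two helpers referenced above aren't defined in this post: the ResidualConnection wrapper and the mask construction. Here is a minimal sketch of what they could look like; I'm assuming a pre-norm arrangement built on nn.LayerNorm and a standard lower-triangular causal mask, so treat this as illustrative rather than the exact project code:

class ResidualConnection(nn.Module):
    # Sketch: normalize, run the sub-layer (attention or feed-forward),
    # apply dropout, then add the result back to the input.
    def __init__(self, features: int, dropout: float) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(features)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))


def causal_mask(size: int):
    # Sketch: a (1, size, size) mask that is 1 on and below the diagonal,
    # so position i can only attend to positions <= i (used for tgt_mask,
    # typically combined with the padding mask in the dataset).
    return torch.tril(torch.ones(1, size, size, dtype=torch.int))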
Projection Layer, Converting to Output Probabilities
This layer maps the final hidden state to a probability distribution over the target vocabulary.
class ProjectionLayer(nn.Module):
    def __init__(self, d_model, vocab_size) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return self.proj(x)
- nn.Linear(d_model, vocab_size): This linear layer maps the d_model-dimensional output to a vector with a size equal to the target vocabulary. A softmax function (often applied implicitly during loss calculation) converts this to a probability distribution.
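As a small illustration of what happens with these logits at inference time (the shapes and vocabulary size below are hypothetical), greedy decoding simply takes the argmax over the vocabulary dimension:

proj = ProjectionLayer(d_model=256, vocab_size=32100)  # vocabulary size is illustrative
decoder_output = torch.rand(2, 300, 256)               # stand-in for the decoder's output
logits = proj(decoder_output)                          # (batch, seq_len, vocab_size)
predicted_ids = logits.argmax(dim=-1)                  # greedy token choice per position
print(logits.shape, predicted_ids.shape)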
This focused breakdown should give you a clearer picture of the key components and their role in the Transformer model, along with relevant code snippets.
Key Concepts
- d_model: The dimension of the embedding vectors. Think of it as the number of features used to represent each word or token.
- N: The number of layers in the encoder and decoder. More layers can capture more complex relationships, but also increase the model’s complexity.
- h: The number of attention heads. Multiple heads allow the model to attend to different parts of the input in parallel.
- Dropout: A technique to prevent overfitting by randomly setting some neurons to zero during training.
- Feed Forward Network: A neural network that processes data in a single direction, without forming cycles.
Tokenization
Instead of building a tokenizer from scratch, I used the AutoTokenizer class from Hugging Face’s Transformers library. This saved me a lot of time and effort, especially considering the complexities of tokenizing code, which includes handling brackets and other syntax elements.
- Source Tokenizer: bert-base-uncased (for pseudo-code) — Popular NLP tokenizer
- Target Tokenizer: Salesforce/codet5-small (for C++ code) — Specialized for code
from transformers import AutoTokenizer
tokenizer_src = AutoTokenizer.from_pretrained("bert-base-uncased") # NLP tokenizer
tokenizer_tgt = AutoTokenizer.from_pretrained("Salesforce/codet5-small") # Code tokenizer
So, what is tokenization?
Tokenization is the process of breaking down text (or code) into smaller units called “tokens”. These tokens are then converted into numerical representations that the model can understand.
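For example, here is what tokenizing a single pseudo-code line with the source tokenizer looks like (the exact token IDs depend on the tokenizer's vocabulary):

from transformers import AutoTokenizer

tokenizer_src = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer_src("put 3 in x")
print(encoded["input_ids"])  # token IDs, including the special [CLS] and [SEP] tokens
print(tokenizer_src.convert_ids_to_tokens(encoded["input_ids"]))  # back to readable tokens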
Why Use Pre-trained Tokenizers?
Pre-trained tokenizers have been trained on vast amounts of data, allowing them to:
- Handle a wide range of vocabulary.
- Understand common patterns and structures in the language.
- Save you the effort of creating a tokenizer from scratch.
Training and Results
I trained the model on a dataset of pseudo-code and corresponding C++ code. After training, the model was able to generate C++ code from pseudo-code with reasonable accuracy.
1. Initialization & Setup:
import torch
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm
import os
# TensorBoard writer
writer = SummaryWriter('runs/tmodel')
# Initialize device and model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.DataParallel(model, device_ids=[0, 1])  # Wrap model for multi-GPU training
model.to(device)
# Optimizer and Loss Function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
- Device Selection: device = torch.device("cuda" if torch.cuda.is_available() else "cpu") checks if a GPU is available and sets the device accordingly. Training on a GPU is much faster.
- Multi-GPU Training: model = torch.nn.DataParallel(model, device_ids=[0, 1]) wraps the model to distribute training across multiple GPUs, if available. device_ids specifies which GPUs to use. Note that the DataParallel wrapper affects how you access the underlying model later (using model.module).
- Optimizer: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4) creates an Adam optimizer, which updates the model's weights during training. model.parameters() specifies the parameters the optimizer should update, and lr=1e-4 is the learning rate (how much the weights are adjusted in each step).
- Loss Function: loss_fn = torch.nn.CrossEntropyLoss() defines the loss function, which measures the difference between the model's predictions and the actual targets. Cross-entropy loss is commonly used for classification tasks.
2. Checkpoint Loading (Resuming Training):
checkpoint_path = "latest_checkpoint.pt"
initial_epoch = 0
global_step = 0

if os.path.exists(checkpoint_path):
    checkpoint = torch.load(checkpoint_path)
    model.module.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    initial_epoch = checkpoint['epoch'] + 1
    global_step = checkpoint['global_step']
    print(f"Resuming training from epoch {initial_epoch}, global step {global_step}")
This section checks if a checkpoint file exists (latest_checkpoint.pt). If so, it loads the model’s weights, the optimizer’s state, and the training epoch and global step from the checkpoint, allowing you to resume training from where you left off.
3. The Main Training Loop:
num_epochs = 10

for epoch in range(initial_epoch, num_epochs):
    torch.cuda.empty_cache()  # Clear cache before each epoch
    model.train()  # Set the model to training mode
    batch_iterator = tqdm(train_dataloader, desc=f"Processing Epoch {epoch:02d}")  # progress bar

    for batch in batch_iterator:
        encoder_input = batch['encoder_input'].to(device)
        decoder_input = batch['decoder_input'].to(device)
        encoder_mask = batch['encoder_mask'].to(device)
        decoder_mask = batch['decoder_mask'].to(device)

        # Forward pass
        encoder_output = model.module.encode(encoder_input, encoder_mask)
        decoder_output = model.module.decode(encoder_output, encoder_mask, decoder_input, decoder_mask)
        proj_output = model.module.project(decoder_output)

        # Compare output with labels
        label = batch['label'].to(device)
        loss = loss_fn(proj_output.view(-1, tokenizer_tgt.vocab_size), label.view(-1))
        batch_iterator.set_postfix({"loss": f"{loss.item():6.3f}"})

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

        global_step += 1
- Epoch Loop: The outer loop runs for num_epochs, processing the entire training dataset multiple times to help the model learn better.
- Batch Loop: Inside each epoch, the inner loop iterates over batches from train_dataloader, allowing efficient training and gradient updates.
- Move Data to Device: Each batch's data (inputs, masks, labels) is moved to the selected device (GPU or CPU) for faster computation.
- Forward Pass: encoder_output = model.module.encode(encoder_input, encoder_mask) has the encoder process the input sequence and produce a representation; decoder_output = model.module.decode(encoder_output, encoder_mask, decoder_input, decoder_mask) has the decoder use the encoder's output and its own input to generate predictions; proj_output = model.module.project(decoder_output) converts the decoder outputs to scores over the target vocabulary.
- Loss Calculation: loss = loss_fn(proj_output.view(-1, tokenizer_tgt.vocab_size), label.view(-1)) computes the difference between predicted and actual outputs, reshaping the tensors as needed (a small optional refinement is sketched right after this list).
- Backpropagation: loss.backward() computes gradients for all model parameters; optimizer.step() updates the model weights using the gradients; optimizer.zero_grad(set_to_none=True) clears the gradients for the next iteration, saving memory.
- Global Step: global_step += 1 increments the training step counter, useful for logging and checkpointing.
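One optional refinement to the loss setup (not used in the snippet above, just worth knowing): if the label tensors are padded out to seq_len, you can tell the loss to ignore the padding positions so they don't contribute to the gradient. A sketch, assuming the target tokenizer exposes a pad_token_id:

# Optional: ignore padded label positions when computing the loss
pad_id = tokenizer_tgt.pad_token_id
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)  # label smoothing is also optional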
4. Validation & Checkpoint Saving:
run_validation(model.module, val_dataloader, tokenizer_src, tokenizer_tgt, seq_len, device, lambda msg: batch_iterator.write(msg), global_step, writer)
checkpoint_data = {
    'epoch': epoch,
    'model_state_dict': model.module.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'global_step': global_step
}
torch.save(checkpoint_data, checkpoint_path)
print(f"Checkpoint saved at epoch {epoch}")
- Validation: run_validation(…) runs the validation loop (not shown in this snippet) to evaluate the model's performance on a validation dataset. This helps monitor for overfitting; a minimal sketch of the greedy decoding at the heart of it follows below.
- Checkpoint Saving: A checkpoint is saved after each epoch. This includes the model's weights (model_state_dict), the optimizer's state (optimizer_state_dict), and the current epoch and global step.
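run_validation itself isn't shown in this post; at its core it greedily decodes the validation inputs and compares them with the expected C++ code. Here is a minimal sketch of the greedy-decoding part, assuming the target tokenizer exposes bos_token_id and eos_token_id (your special-token setup and mask construction may differ):

def greedy_decode(model, src, src_mask, tokenizer_tgt, max_len, device):
    # Start from the beginning-of-sequence token and repeatedly append
    # the most probable next token until EOS or max_len is reached.
    encoder_output = model.encode(src, src_mask)
    decoder_input = torch.tensor([[tokenizer_tgt.bos_token_id]], device=device)
    while decoder_input.size(1) < max_len:
        size = decoder_input.size(1)
        tgt_mask = torch.tril(torch.ones(1, size, size, dtype=torch.int, device=device))
        out = model.decode(encoder_output, src_mask, decoder_input, tgt_mask)
        logits = model.project(out[:, -1])             # logits for the last position only
        next_token = logits.argmax(dim=-1, keepdim=True)
        decoder_input = torch.cat([decoder_input, next_token], dim=1)
        if next_token.item() == tokenizer_tgt.eos_token_id:
            break
    return decoder_input.squeeze(0)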
5. TensorBoard Logging:
writer = SummaryWriter('runs/tmodel')
writer.add_scalar('train loss', loss.item(), global_step)
writer.flush()
Logging: writer.add_scalar('train loss', loss.item(), global_step) logs the training loss to TensorBoard, allowing you to visualize the training progress. writer.flush() ensures the data is written to disk.
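With the logs in place, you can watch the loss curve during training by running tensorboard --logdir runs/tmodel and opening the address it prints in your browser.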
Examples
Example 1:
Pseudo-code:
put 3 in x
C++ code:
int x = 3;
Example 2:
Pseudo-code:
if y > x
put 3 in x
else
put 4 in y
C++ Code:
if ( y > x ) { int x = 3 ; } else { int y = 4 ; }
Try it yourself
I’ve created a Streamlit app where you can test the model with your own pseudo-code: Text 2 Code
Conclusion
Building a transformer from scratch and training it to translate pseudo-code to C++ was a challenging but rewarding experience. I learned a lot about NLP, code generation, and the transformer architecture. I hope this blog post has inspired you to explore the exciting world of deep learning!