
Transformer-based Models : Mathematics of Computation

Objectives

For a transformer-based model such as GPT-2 or Llama-2, how do we:

  1. Count the total number of parameters?
  2. Estimate the memory required for training and for inference? (a rough estimate is sketched after this list)
  3. Visualize the internal network architecture?
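
Objective 2 is not worked out step by step below, but a back-of-the-envelope estimate follows directly from the parameter count. The sketch below is a minimal assumption-laden helper (estimate_memory_gb is hypothetical, not part of the notebook): it assumes roughly 2 bytes per parameter for fp16 inference and roughly 16 bytes per parameter for mixed-precision Adam training (weights + gradients + fp32 optimizer states), and it ignores activations, the KV cache and framework overhead; see Transformer Math 101 in the references for the full accounting.

# Rough memory estimate from the parameter count alone (weights-only sketch).
def estimate_memory_gb(num_params: int,
                       inference_bytes_per_param: float = 2.0,   # fp16 weights
                       training_bytes_per_param: float = 16.0):  # weights + grads + Adam states
    gb = 1e9
    return (num_params * inference_bytes_per_param / gb,
            num_params * training_bytes_per_param / gb)

# GPT-2 small has 124,439,808 parameters (derived step by step below)
print(estimate_memory_gb(124_439_808))  # ≈ (0.25, 1.99) GB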

How to perform the mathematics of computation for a transformer-based model from Hugging Face?

Step-0 : Setup

!pip install transformers torch

# import the GPT-2 model class
from transformers import GPT2Model
# load the pretrained gpt2 weights from the Hugging Face Hub
model = GPT2Model.from_pretrained('gpt2')

Step-1 : Visualize the architecture and count the total parameters of the model.

def count_params(model, is_human: bool = False):
    # sum over trainable parameters only (requires_grad=True)
    params: int = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{params / 1e6:.2f}M" if is_human else params

print(model)
print("Total # of params:", count_params(model, is_human=True))

Step-2 : Count the parameters layer by layer

  1. Embedding layer : token embedding (wte) + position embedding (wpe)
V: int = model.config.vocab_size   # vocabulary size
E: int = model.config.n_embd       # embedding (hidden) dimension
P: int = model.config.n_positions  # maximum sequence length
expected_wte = V * E               # token embedding table
expected_wpe: int = P * E          # position embedding table
print(f"wte | Expected: {expected_wte}")
print(f"wte | True:     {count_params(model._modules['wte'])}")
print(f"wpe | Expected: {expected_wpe}")
print(f"wpe | True:     {count_params(model._modules['wpe'])}")
  2. Transformer block : Normalization Layer (ln_1)
expected_ln_1 = 2 * E
print(f"ln_1 | Expected: {expected_ln_1}")
print(f"ln_1 | True:     {count_params(model._modules['h'][0].ln_1)}")

Self-Attention

expected_c_attn = E * (3 * E) + (3 * E)  # fused Q, K, V projection: weight + bias
expected_c_proj = E * E + E              # output projection: weight + bias
expected_attn_dropout = 0                # dropout has no parameters
expected_resid_dropout = 0               # dropout has no parameters
expected_attn = expected_c_attn + expected_c_proj + expected_attn_dropout + expected_resid_dropout
print(f"c_attn | Expected: {expected_c_attn}")
print(f"c_attn | True:     {count_params(model._modules['h'][0].attn.c_attn)}")
print(f"c_proj | Expected: {expected_c_proj}")
print(f"c_proj | True:     {count_params(model._modules['h'][0].attn.c_proj)}")
print(f"attn_dropout | Expected: {expected_attn_dropout}")
print(f"attn_dropout | True:     {count_params(model._modules['h'][0].attn.attn_dropout)}")
print(f"resid_dropout | Expected: {expected_resid_dropout}")
print(f"resid_dropout | True:     {count_params(model._modules['h'][0].attn.resid_dropout)}")
print(f"attn | Expected: {expected_attn}")
print(f"attn | True:     {count_params(model._modules['h'][0].attn)}")

Normalization Layer

expected_ln_2 = 2 * E
print(f"ln_2 | Expected: {expected_ln_2}")
print(f"ln_2 | True:     {count_params(model._modules['h'][0].ln_2)}")

Multi-Layer Perceptron

H: int = 4 * E               # MLP hidden (intermediate) size
expected_c_fc = E * H + H    # up-projection: weight + bias
expected_c_proj = H * E + E  # down-projection: weight + bias
expected_act = 0             # activation (GELU) has no parameters
expected_dropout = 0         # dropout has no parameters
expected_mlp = expected_c_fc + expected_c_proj + expected_act + expected_dropout
print(f"c_fc | Expected: {expected_c_fc}")
print(f"c_fc | True:     {count_params(model._modules['h'][0].mlp.c_fc)}")
print(f"c_proj | Expected: {expected_c_proj}")
print(f"c_proj | True:     {count_params(model._modules['h'][0].mlp.c_proj)}")
print(f"act | Expected: {expected_act}")
print(f"act | True:     {count_params(model._modules['h'][0].mlp.act)}")
print(f"dropout | Expected: {expected_dropout}")
print(f"dropout | True:     {count_params(model._modules['h'][0].mlp.dropout)}")
print(f"mlp | Expected: {expected_mlp}")
print(f"mlp | True:     {count_params(model._modules['h'][0].mlp)}")

Final Normalization Layer

expected_ln_f = 2 * E
print(f"ln_f | Expected: {expected_ln_f}")
print(f"ln_f | True:     {count_params(model._modules['ln_f'])}")

Step-3 : Complete parameter formula of the transformer

L: int = model.config.n_layer  # number of transformer blocks (12 for gpt2)
expected_block = expected_ln_1 + expected_attn + expected_ln_2 + expected_mlp
expected_total = expected_wte + expected_wpe + L * expected_block + expected_ln_f
print(f"total | Expected: {expected_total}")
print(f"total | True:     {count_params(model)}")

References

  1. Transformer Math 101 | EleutherAI Blog
  2. Transformer Math (Part 1) - Counting Model Parameters
  3. The Annotated Transformer
  4. The Illustrated Transformer – Jay Alammar