Transformer-Based Models : Mathematics of Computation
Objectives
For a transformer-based model like GPT-2 or Llama-2 …
- How do we count the total number of parameters?
- How much memory is needed at training time and at inference time? (a rough sketch follows this list)
- How do we visualize the internal network architecture?
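The memory question follows almost directly from the parameter count. Below is a very rough sketch, not a precise recipe: it assumes fp32 weights and a plain Adam optimizer (about 4 bytes per parameter for inference, about 16 bytes per parameter for training) and ignores activations, the KV cache, and framework overhead.

```python
def rough_memory_gib(num_params: int) -> None:
    """Back-of-the-envelope memory estimate.

    Assumes fp32 weights: ~4 bytes/param for inference and ~16 bytes/param
    for training (weights + gradients + two Adam moment buffers).
    Activations, KV cache, and framework overhead are ignored.
    """
    gib = 1024 ** 3
    print(f"inference ~{num_params * 4 / gib:.2f} GiB, "
          f"training ~{num_params * 16 / gib:.2f} GiB")

# GPT-2 small has ~124.4M parameters (counted step by step below)
rough_memory_gib(124_439_808)
```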
How to perform the mathematics of computation for a transformer-based model from Hugging Face?
Step-0 : Setup
```
!pip install transformers torch
```

```python
# import the model class
from transformers import GPT2Model

# load the pretrained GPT-2 (small) weights from the Hugging Face Hub
model = GPT2Model.from_pretrained('gpt2')
```
Step-1 : Visualize the architecture and count the total parameters of the model
```python
def count_params(model, is_human: bool = False):
    """Count trainable parameters; optionally format the count in millions."""
    params: int = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{params / 1e6:.2f}M" if is_human else params

# print the nested module tree, then the total trainable-parameter count
print(model)
print("Total # of params:", count_params(model, is_human=True))
```
Step-2 : Count the parameters of the model layer by layer
- Embedding layers : Token embedding (wte) + Position embedding (wpe)
```python
V: int = model.config.vocab_size   # vocabulary size
E: int = model.config.n_embd       # embedding / hidden dimension
P: int = model.config.n_positions  # maximum sequence length

expected_wte = V * E       # token embedding table
expected_wpe: int = P * E  # learned position embedding table

print(f"wte | Expected: {expected_wte}")
print(f"wte | True: {count_params(model._modules['wte'])}")
print(f"wpe | Expected: {expected_wpe}")
print(f"wpe | True: {count_params(model._modules['wpe'])}")
```
- Transformer block : First normalization layer (ln_1)
```python
expected_ln_1 = 2 * E

print(f"ln_1 | Expected: {expected_ln_1}")
print(f"ln_1 | True: {count_params(model._modules['h'][0].ln_1)}")
```
- Transformer block : Self-attention (attn)
```python
# c_attn: fused query/key/value projection (weight of size E x 3E plus bias of size 3E)
expected_c_attn = E * (3 * E) + (3 * E)
# c_proj: output projection back to the embedding dimension
expected_c_proj = E * E + E
# dropout layers hold no parameters
expected_attn_dropout = 0
expected_resid_dropout = 0
expected_attn = expected_c_attn + expected_c_proj + expected_attn_dropout + expected_resid_dropout

print(f"c_attn | Expected: {expected_c_attn}")
print(f"c_attn | True: {count_params(model._modules['h'][0].attn.c_attn)}")
print(f"c_proj | Expected: {expected_c_proj}")
print(f"c_proj | True: {count_params(model._modules['h'][0].attn.c_proj)}")
print(f"attn_dropout | Expected: {expected_attn_dropout}")
print(f"attn_dropout | True: {count_params(model._modules['h'][0].attn.attn_dropout)}")
print(f"resid_dropout | Expected: {expected_resid_dropout}")
print(f"resid_dropout | True: {count_params(model._modules['h'][0].attn.resid_dropout)}")
print(f"attn | Expected: {expected_attn}")
print(f"attn | True: {count_params(model._modules['h'][0].attn)}")
```
- Transformer block : Second normalization layer (ln_2)
```python
expected_ln_2 = 2 * E

print(f"ln_2 | Expected: {expected_ln_2}")
print(f"ln_2 | True: {count_params(model._modules['h'][0].ln_2)}")
```
- Transformer block : Multi-layer perceptron (mlp)
```python
# the feed-forward hidden size is 4x the embedding dimension in GPT-2
H: int = 4 * E
# c_fc expands from E to H, c_proj projects back from H to E
expected_c_fc = E * H + H
expected_c_proj = H * E + E
# activation and dropout layers hold no parameters
expected_act = 0
expected_dropout = 0
expected_mlp = expected_c_fc + expected_c_proj + expected_act + expected_dropout

print(f"c_fc | Expected: {expected_c_fc}")
print(f"c_fc | True: {count_params(model._modules['h'][0].mlp.c_fc)}")
print(f"c_proj | Expected: {expected_c_proj}")
print(f"c_proj | True: {count_params(model._modules['h'][0].mlp.c_proj)}")
print(f"act | Expected: {expected_act}")
print(f"act | True: {count_params(model._modules['h'][0].mlp.act)}")
print(f"dropout | Expected: {expected_dropout}")
print(f"dropout | True: {count_params(model._modules['h'][0].mlp.dropout)}")
print(f"mlp | Expected: {expected_mlp}")
print(f"mlp | True: {count_params(model._modules['h'][0].mlp)}")
```
- Final normalization layer (ln_f)
```python
expected_ln_f = 2 * E

print(f"ln_f | Expected: {expected_ln_f}")
print(f"ln_f | True: {count_params(model._modules['ln_f'])}")
```
Step-3 : Complete formula for the transformer