
nanoGPT

public · master

The simplest, fastest repository for training/finetuning medium-sized GPTs.

AI Architecture


Trigger

User runs 'python train.py config/train_shakespeare_char.py'

1
Trigger · Python

Parse configuration

Configurator loads config/train_shakespeare_char.py and merges with command-line arguments to set hyperparameters like n_layer=6, n_head=6, n_embd=384, block_size=256.

configurator.parse_args · exec
argparse
train.py · configurator.py · config/train_shakespeare_char.py
2
Processing · Python

Prepare Shakespeare dataset

Data preparation script downloads the tiny Shakespeare text, tokenizes it at the character level, and writes train.bin (~1M tokens) and val.bin (~111K tokens) as uint16 numpy arrays that train.py later reads back via np.memmap.

prepare · numpy.memmap
numpy · requests
data/shakespeare_char/prepare.py
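A minimal stand-in for the prepare step; the file names and uint16 encoding match the real script, but the helper function and the tiny sample text are illustrative:

```python
import os
import tempfile
import numpy as np

def prepare_char_dataset(text, out_dir):
    """Character-level prep in the spirit of data/shakespeare_char/prepare.py:
    build a char vocabulary, encode to uint16 ids, split 90/10 into
    train/val, and dump both splits as raw .bin files."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
    itos = {i: ch for i, ch in enumerate(chars)}   # id -> char
    ids = np.array([stoi[c] for c in text], dtype=np.uint16)
    n = int(0.9 * len(ids))                        # 90% train, 10% val
    ids[:n].tofile(os.path.join(out_dir, "train.bin"))
    ids[n:].tofile(os.path.join(out_dir, "val.bin"))
    return stoi, itos

out_dir = tempfile.mkdtemp()
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"
stoi, itos = prepare_char_dataset(text, out_dir)
# Reading back through a memmap, the way train.py consumes these files:
train = np.memmap(os.path.join(out_dir, "train.bin"), dtype=np.uint16, mode="r")
```

Using raw uint16 files plus np.memmap keeps the whole dataset out of RAM; train.py just indexes into the memmap per batch.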
3
Processing · PyTorch

Initialize GPT model

Creates the GPT instance from model.py with the specified architecture (6 layers, 6 heads, 384-dim embeddings, 256-token context). Weights are initialized GPT-2 style in GPT._init_weights: normal(mean 0, std 0.02) for Linear and Embedding layers, with a scaled-down init for residual projections.

GPT.__init__ · GPT._init_weights
torch · torch.nn
train.py · model.py
4
Processing · PyTorch 2.0

Compile model

Uses PyTorch 2.0 torch.compile() to optimize model for faster execution, reducing iteration time from ~250ms to ~135ms through kernel fusion and other optimizations.

torch.compile
torch
train.py
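In train.py the step itself is essentially `model = torch.compile(model)` behind a config flag. As a rough, framework-free illustration of what kernel fusion means, compare an eager-style GELU that materializes named intermediates (analogous to separate kernel launches) with a single-expression version (both are numpy stand-ins, not PyTorch internals, and they must agree numerically):

```python
import numpy as np

def gelu_unfused(x):
    """Eager-mode flavor: each op produces an intermediate array, roughly
    analogous to one kernel launch per elementwise op before compilation."""
    a = np.power(x, 3)
    b = x + 0.044715 * a
    c = np.tanh(np.sqrt(2.0 / np.pi) * b)
    return 0.5 * x * (1.0 + c)

def gelu_fused(x):
    """'Fused' flavor: one pass over the data with no named intermediates,
    roughly what fusing the elementwise chain into one kernel achieves."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3.0, 3.0, 1024)
```

torch.compile does this kind of fusion (and more) automatically over the traced graph, which is where the reduction in per-iteration time comes from.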
5
Processing · PyTorch

Configure AdamW optimizer

Sets up AdamW optimizer with weight decay, separating parameters into decay and no-decay groups. Configures learning rate schedule with warmup and cosine decay.

model.configure_optimizers · torch.optim.AdamW
torch.optim
train.py · model.py
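The grouping rule in configure_optimizers is dimensional: parameters with 2 or more dimensions (matmul weights, embeddings) get weight decay, 1-D parameters (LayerNorm weights, biases) do not. A sketch of that rule over parameter shapes (the helper and the "lm_head.bias" entry are hypothetical; nanoGPT's lm_head has no bias):

```python
def split_decay_groups(named_shapes, weight_decay=0.1):
    """nanoGPT-style decay grouping: >= 2-D params are decayed, 1-D are not
    (illustrative helper over {name: shape} pairs, not the real API)."""
    decay = [n for n, shape in named_shapes.items() if len(shape) >= 2]
    no_decay = [n for n, shape in named_shapes.items() if len(shape) < 2]
    return [
        {"params": sorted(decay), "weight_decay": weight_decay},
        {"params": sorted(no_decay), "weight_decay": 0.0},
    ]

named_shapes = {
    "transformer.wte.weight": (65, 384),           # embedding -> decayed
    "transformer.h.0.ln_1.weight": (384,),         # LayerNorm -> not decayed
    "transformer.h.0.attn.c_attn.weight": (1152, 384),
    "lm_head.bias": (384,),                        # hypothetical 1-D param
}
groups = split_decay_groups(named_shapes)
```

In the real code these groups (holding actual tensors, not names) are passed straight to torch.optim.AdamW along with the learning rate and betas.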
6
Database · PyTorch

Load training batch

Loads a batch of sequences from the train.bin memory-mapped file, extracting random block_size-length chunks as inputs (X) and the same chunks shifted one position as targets (Y) for next-token prediction.

get_batch · torch.from_numpy
numpy · torch
train.py
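The sampling logic can be sketched on a plain numpy array (the real get_batch reads a np.memmap, converts with torch.from_numpy, and moves the tensors to the GPU; this stand-in keeps everything in numpy):

```python
import numpy as np

def get_batch(data, block_size, batch_size, rng):
    """Sketch of train.py's get_batch: pick random offsets, take a
    block_size window as input x, and the same window shifted one
    position right as target y."""
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix])
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix])  # shifted by 1
    return x, y

rng = np.random.default_rng(0)
data = np.arange(1000, dtype=np.uint16)   # toy "token" stream 0,1,2,...
x, y = get_batch(data, block_size=256, batch_size=4, rng=rng)
```

Because the toy stream is just 0,1,2,..., every target equals its input plus one, which makes the shift easy to verify.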
7
Processing · PyTorch

Forward pass through GPT

Input tokens pass through embedding layer, then 6 transformer blocks (each with multi-head attention and MLP), producing logits for next token prediction across vocabulary.

GPT.forward · Block.forward · CausalSelfAttention.forward
torch · torch.nn
model.py
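The heart of each block, causal self-attention, can be sketched in numpy for a single sequence (no batch dimension, no dropout, random small weights; shapes and masking mirror CausalSelfAttention.forward but this is an illustration, not model.py's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, Wqkv, Wo, n_head):
    """Single-sequence sketch: x is (T, C), Wqkv is (C, 3C), Wo is (C, C)."""
    T, C = x.shape
    hd = C // n_head
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)               # each (T, C)
    q, k, v = [a.reshape(T, n_head, hd).transpose(1, 0, 2) for a in (q, k, v)]
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hd)           # (n_head, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)       # True above diagonal
    att = np.where(mask, -np.inf, att)                     # forbid attending ahead
    y = softmax(att) @ v                                   # (n_head, T, hd)
    return y.transpose(1, 0, 2).reshape(T, C) @ Wo         # re-merge heads

rng = np.random.default_rng(0)
T, C, n_head = 8, 16, 4
x = rng.normal(size=(T, C))
Wqkv = rng.normal(size=(C, 3 * C)) * 0.02
Wo = rng.normal(size=(C, C)) * 0.02
out = causal_self_attention(x, Wqkv, Wo, n_head)

# Causality check: perturbing the LAST token must not change earlier outputs.
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
out_perturbed = causal_self_attention(x_perturbed, Wqkv, Wo, n_head)
```

The triangular mask is exactly what makes the model autoregressive: position t can only attend to positions 0..t.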
8
Parallel
Processing · PyTorch

Compute cross-entropy loss

Calculates cross-entropy loss between predicted logits and actual next tokens. Loss is averaged across batch and sequence dimensions.

F.cross_entropy
torch.nn.functional
train.py · model.py
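The same computation in numpy (an illustrative re-implementation, not F.cross_entropy itself). A useful sanity check: with uniform logits over the 65-character vocabulary, the loss is ln(65) ≈ 4.17, which is roughly what an untrained model reports:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy over all (batch, time) positions, matching what
    F.cross_entropy computes on the flattened logits in GPT.forward."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    B, T, V = logits.shape
    # Pick out the log-probability assigned to each true next token.
    picked = log_probs.reshape(B * T, V)[np.arange(B * T), targets.reshape(-1)]
    return -picked.mean()

B, T, V = 2, 4, 65
logits = np.zeros((B, T, V))              # uniform predictions
targets = np.zeros((B, T), dtype=int)
loss = cross_entropy(logits, targets)     # expect ln(65)
```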
Processing · PyTorch

Backward pass

Computes gradients via backpropagation through all model parameters. Gradient scaling is applied if using mixed precision training.

loss.backward · scaler.scale
torch
train.py
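The first gradient autograd propagates from the loss is the classic softmax-minus-onehot expression. A numpy sketch with a finite-difference check, the standard way to validate a backward pass (single example, vocabulary of 7, all names illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_loss(logits, target):
    """Cross-entropy of one example: -log p(target)."""
    return -np.log(softmax(logits)[target])

def ce_grad(logits, target):
    """Analytic gradient of cross-entropy w.r.t. the logits:
    softmax(logits) - onehot(target)."""
    g = softmax(logits)
    g[target] -= 1.0
    return g

rng = np.random.default_rng(0)
logits = rng.normal(size=7)
target = 3
analytic = ce_grad(logits, target)

# Central finite differences: perturb each logit by +/- eps.
eps = 1e-6
numeric = np.array([
    (ce_loss(logits + eps * np.eye(7)[i], target)
     - ce_loss(logits - eps * np.eye(7)[i], target)) / (2 * eps)
    for i in range(7)
])
```

The gradient also sums to zero (softmax sums to 1, minus the single 1 in the onehot), a quick invariant worth checking.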
9
Parallel
Processing · PyTorch

Clip gradients

Clips the global gradient norm to the grad_clip setting (1.0 by default) to prevent exploding gradients and stabilize training.

torch.nn.utils.clip_grad_norm_
torch.nn.utils
train.py
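Global-norm clipping means one scale factor for all gradients, computed from the norm over every parameter jointly. A numpy sketch of the behavior of torch.nn.utils.clip_grad_norm_ (the helper is illustrative; the epsilon guard mirrors the real implementation's division safety):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """If the norm over ALL gradients exceeds max_norm, scale every
    gradient down by the same factor; returns (grads, pre-clip norm)."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total

# Two fake parameter gradients with a deliberately huge joint norm.
grads = [np.full(10, 3.0), np.full(5, -4.0)]
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
```

The pre-clip norm here is sqrt(10·9 + 5·16) = sqrt(170) ≈ 13.0, so every gradient is scaled by about 1/13.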
Processing · PyTorch

Update model weights

Optimizer step updates all model parameters based on computed gradients. Learning rate is adjusted according to warmup/decay schedule.

optimizer.step · optimizer.zero_grad · get_lr
torch.optim
train.py
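The schedule itself is pure math and easy to reproduce. A sketch of train.py's get_lr, linear warmup followed by cosine decay to min_lr (default values below are at shakespeare_char scale but are assumptions, not copied from the config; the (it + 1)/(warmup + 1) numerator reflects the PR #578 fix so the learning rate is non-zero at iteration 0):

```python
import math

def get_lr(it, learning_rate=1e-3, min_lr=1e-4,
           warmup_iters=100, lr_decay_iters=5000):
    """Warmup + cosine decay, following the shape of train.py's get_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)   # linear warmup
    if it > lr_decay_iters:
        return min_lr                                          # floor after decay
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))      # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

Each iteration, train.py computes this value, writes it into every optimizer param_group, then calls optimizer.step() and zeroes gradients for the next iteration.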
10
Processing · PyTorch

Validation evaluation

Periodically evaluates model on validation set (val.bin) using estimate_loss function. Computes average loss over eval_iters batches without gradient computation.

estimate_loss · model.eval · torch.no_grad
torch
train.py
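The shape of estimate_loss in pure Python. The real version flips the model to eval() mode, runs under torch.no_grad(), and restores train() mode afterward; here `sample_loss(split)` is a stand-in for "draw a batch from that split, forward the model, return loss.item()":

```python
import statistics

def estimate_loss(sample_loss, eval_iters=20):
    """Average the loss over eval_iters fresh batches per split,
    mirroring the structure of train.py's estimate_loss."""
    out = {}
    for split in ("train", "val"):
        out[split] = statistics.fmean(sample_loss(split) for _ in range(eval_iters))
    return out

# Deterministic stand-in losses, purely for illustration:
losses = estimate_loss(lambda split: 1.0 if split == "train" else 1.5)
```

Averaging over many batches matters because a single batch's loss is noisy; eval_iters controls that smoothing/cost trade-off.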
11
External · wandb

Log metrics to W&B

Logs training loss, validation loss, learning rate, iteration time, and other metrics to Weights & Biases for experiment tracking and visualization.

wandb.log
wandb
train.py
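The logging call is a flat dict per eval interval. A minimal, entirely hypothetical stand-in class with the same `.log(dict, step=...)` shape (useful when running with wandb_log=False or offline; the metric keys below are modeled on train.py's wandb.log call, the values are made up for illustration):

```python
class MetricLogger:
    """Hypothetical drop-in for the wandb.log call pattern: records each
    metrics dict with its step instead of sending it anywhere."""
    def __init__(self):
        self.history = []

    def log(self, metrics, step=None):
        self.history.append({"step": step, **metrics})

run = MetricLogger()
run.log({"iter": 0, "train/loss": 4.17, "val/loss": 4.17, "lr": 1e-5}, step=0)
run.log({"iter": 250, "train/loss": 2.31, "val/loss": 2.49, "lr": 1e-3}, step=250)
```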
12
Processing · PyTorch

Save model checkpoint

Periodically saves the model state_dict, optimizer state, config, and iteration count to out_dir. The checkpoint with the best validation loss is kept for later inference/sampling.

torch.save
torch
train.py
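The checkpoint is a plain dict handed to torch.save. A sketch of that structure using pickle so it runs without torch (field names are modeled on train.py's checkpoint dict, which also stores the model_args; the helper itself is illustrative):

```python
import os
import pickle
import tempfile

def save_checkpoint(path, model_state, optimizer_state, config,
                    iter_num, best_val_loss):
    """Bundle everything needed to resume training into one dict and
    serialize it (the real code uses torch.save on ckpt.pt in out_dir)."""
    ckpt = {
        "model": model_state,           # model.state_dict() in the real code
        "optimizer": optimizer_state,   # optimizer.state_dict()
        "config": config,
        "iter_num": iter_num,
        "best_val_loss": best_val_loss,
    }
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint(path, {"wte": [0.0]}, {"state": {}}, {"n_layer": 6}, 1000, 1.47)

# Resuming (or sample.py) loads the same dict back:
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Storing the optimizer state alongside the weights is what makes init_from='resume' pick up training exactly where it stopped.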

Analyzed 2/17/2026, 3:14:49 PM

Languages
Python 100.0%
Recent Commits
  • Update README to mention nanochat and deprecation (Andrej · 4mo ago)
  • Merge pull request #578 from devin-open-source/devin/1733728337-fix-warmup-lr (Andrej · 15mo ago)
  • fix: ensure non-zero learning rate during warmup at iteration 0 (Devin AI · 15mo ago)
  • Merge pull request #463 from goswamig/test1 (Andrej · 22mo ago)
  • Merge branch 'master' into test1 (Andrej · 22mo ago)
