nanoGPT
AI Architecture
Trigger
User runs 'python train.py config/train_shakespeare_char.py'
Parse configuration
Configurator loads config/train_shakespeare_char.py and merges with command-line arguments to set hyperparameters like n_layer=6, n_head=6, n_embd=384, block_size=256.
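The override mechanism can be sketched in a few lines: nanoGPT's configurator.py executes the config file's assignments over the training script's defaults. This is an illustrative reduction (using a dict rather than the script's actual globals), not the real configurator:

```python
# Sketch of config merging: config-file assignments override script defaults.
# The real configurator.py execs the file against train.py's globals and also
# parses --key=value command-line overrides; this dict version is illustrative.
defaults = dict(n_layer=12, n_head=12, n_embd=768, block_size=1024)

config_text = """
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256
"""

config = dict(defaults)            # start from the script's defaults
exec(config_text, {}, config)      # run the config file's assignments over them

print(config["n_layer"], config["n_embd"])  # overridden values win
```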
Prepare Shakespeare dataset
Data preparation script downloads the Shakespeare text, tokenizes at the character level (a 65-symbol vocabulary of the unique characters), and writes train.bin (~1M tokens) and val.bin (~111K tokens) as raw uint16 files read back as memory-mapped numpy arrays.
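The preparation step reduces to: build a char-to-int table, encode the text, and dump the ids as raw uint16. A minimal sketch on a stand-in string (the real prepare.py downloads the full ~1.1M-character corpus and also pickles the vocab to meta.pkl):

```python
import os
import tempfile
import numpy as np

# Stand-in for the downloaded Shakespeare text.
text = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

chars = sorted(set(text))                      # vocabulary = unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

ids = np.array([stoi[c] for c in text], dtype=np.uint16)

# 90/10 train/val split, written as raw uint16 so they can be memory-mapped.
n = int(0.9 * len(ids))
outdir = tempfile.mkdtemp()
ids[:n].tofile(os.path.join(outdir, "train.bin"))
ids[n:].tofile(os.path.join(outdir, "val.bin"))

# Reading back via memmap avoids loading the whole dataset into RAM.
train = np.memmap(os.path.join(outdir, "train.bin"), dtype=np.uint16, mode="r")
```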
Initialize GPT model
Creates GPT instance from model.py with the specified architecture (6 layers, 6 heads, 384 embedding dim, 256 context). Weights are initialized GPT-2 style: normal with mean 0 and std 0.02, with residual projection weights scaled down by 1/sqrt(2·n_layer).
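The initialization scheme can be sketched numerically (a numpy illustration of the distributions, not the model's actual _init_weights hook; 65 is the char-level vocab size):

```python
import numpy as np

# GPT-2-style init sketch: Linear/Embedding weights ~ N(0, 0.02), and the
# per-block residual projections scaled by 1/sqrt(2 * n_layer) so the
# residual stream's variance stays controlled as depth grows.
rng = np.random.default_rng(0)
n_layer, n_embd, vocab_size = 6, 384, 65

w_embed = rng.normal(0.0, 0.02, size=(vocab_size, n_embd))  # token embedding
w_proj = rng.normal(0.0, 0.02 / np.sqrt(2 * n_layer),       # residual c_proj
                    size=(n_embd, n_embd))
```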
Compile model
Uses PyTorch 2.0's torch.compile() to optimize the model for faster execution via kernel fusion and other graph-level optimizations, cutting iteration time roughly from ~250ms to ~135ms.
Configure AdamW optimizer
Sets up the AdamW optimizer with weight decay, separating parameters into decay and no-decay groups (2D weight matrices decay; biases and LayerNorm parameters do not). Configures a learning rate schedule with linear warmup followed by cosine decay.
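The schedule has three regimes: linear warmup, cosine decay, then a constant floor. A pure-Python sketch mirroring nanoGPT's get_lr(); the constants are the train_shakespeare_char defaults quoted from memory, so treat them as assumptions. The (it+1)/(warmup_iters+1) warmup form keeps the LR non-zero at iteration 0:

```python
import math

# Assumed train_shakespeare_char defaults.
learning_rate = 1e-3    # peak LR after warmup
warmup_iters = 100
lr_decay_iters = 5000
min_lr = 1e-4           # floor, typically learning_rate / 10

def get_lr(it):
    if it < warmup_iters:                        # 1) linear warmup, non-zero at it=0
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:                      # 2) constant floor after decay
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```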
Load training batch
Loads batch of sequences from train.bin memory-mapped file. Extracts random block_size chunks as input (X) and shifted targets (Y) for next-token prediction.
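The sampling logic can be sketched in plain numpy (the real get_batch() also moves tensors to the GPU; data here stands in for the memory-mapped train.bin):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 65, size=10_000).astype(np.uint16)  # fake train.bin
block_size, batch_size = 256, 12

# Pick batch_size random start offsets, then slice out contiguous chunks.
ix = rng.integers(0, len(data) - block_size, size=batch_size)
x = np.stack([data[i : i + block_size] for i in ix])          # inputs
y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix])  # targets, shifted by one
```

Each target token y[b, t] is the token that follows x[b, t] in the data, which is exactly the next-token-prediction objective.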
Forward pass through GPT
Input tokens pass through embedding layer, then 6 transformer blocks (each with multi-head attention and MLP), producing logits for next token prediction across vocabulary.
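The core of each block is causal self-attention. A shape-level numpy sketch of a single head (the real model wraps this in multi-head projections, layernorms, residual connections, and an MLP per block):

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 8, 16                          # sequence length, head dimension
q = rng.standard_normal((T, C))       # queries
k = rng.standard_normal((T, C))       # keys
v = rng.standard_normal((T, C))       # values

att = q @ k.T / np.sqrt(C)                                    # (T, T) scores
att = np.where(np.tril(np.ones((T, T))) == 1, att, -np.inf)   # causal mask
att = np.exp(att - att.max(axis=-1, keepdims=True))           # stable softmax
att /= att.sum(axis=-1, keepdims=True)                        # rows sum to 1
out = att @ v                                                 # (T, C) output
```

The causal mask zeroes attention to future positions, so position t's output depends only on tokens 0..t.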
Compute cross-entropy loss
Calculates cross-entropy loss between predicted logits and actual next tokens. Loss is averaged across batch and sequence dimensions.
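The loss computation flattens logits to (batch·seq, vocab) and averages the negative log-likelihood of the true tokens. A numpy sketch of what F.cross_entropy computes here:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean NLL of `targets` under softmax(logits); logits: (N, V), targets: (N,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((4 * 8, 65))     # (batch * block_size, vocab)
targets = rng.integers(0, 65, size=4 * 8)
loss = cross_entropy(logits, targets)
```

A useful sanity check: with uniform logits the loss equals ln(vocab_size), which is why an untrained 65-char model starts near ln(65) ≈ 4.17.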
Backward pass
Computes gradients via backpropagation through all model parameters. Gradient scaling is applied if using mixed precision training.
Clip gradients
Clips the global gradient norm to grad_clip (default 1.0) to prevent exploding gradients and stabilize training.
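Clipping by global norm means: compute one norm over all gradients together, and if it exceeds the threshold, scale every gradient by the same factor. A numpy sketch of what torch.nn.utils.clip_grad_norm_ does:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Scale all gradients uniformly so their combined L2 norm <= max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)      # small eps avoids overshoot
        grads = [g * scale for g in grads]
    return grads, total
```

Because all gradients share one scale factor, clipping preserves the gradient's direction and only limits its magnitude.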
Update model weights
Optimizer step updates all model parameters based on computed gradients. Learning rate is adjusted according to warmup/decay schedule.
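Per parameter, optimizer.step() applies decoupled weight decay and then the Adam moment update. A single-tensor numpy sketch; the betas (0.9, 0.95) and weight decay 0.1 mirror nanoGPT's defaults as I recall them:

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update for parameter p with gradient g at step t (1-based)."""
    p = p * (1 - lr * wd)                  # decoupled weight decay
    m = b1 * m + (1 - b1) * g              # first moment (EMA of gradients)
    v = b2 * v + (1 - b2) * g * g          # second moment (EMA of squared grads)
    m_hat = m / (1 - b1 ** t)              # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v
```

The "decoupled" part is the key difference from Adam with L2 regularization: decay shrinks the weights directly rather than being folded into the gradient.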
Validation evaluation
Periodically evaluates model on validation set (val.bin) using estimate_loss function. Computes average loss over eval_iters batches without gradient computation.
Log metrics to W&B
Logs training loss, validation loss, learning rate, iteration time, and other metrics to Weights & Biases for experiment tracking and visualization.
Save model checkpoint
Periodically saves model state_dict, optimizer state, config, and iteration count to out_dir. Best validation loss checkpoint is kept for inference.
Analyzed 2/17/2026, 3:14:49 PM

Recent commits
- Update README to mention nanochat and deprecation (Andrej · 4mo ago)
- Merge pull request #578 from devin-open-source/devin/1733728337-fix-warmup-lr (Andrej · 15mo ago)
- fix: ensure non-zero learning rate during warmup at iteration 0 (Devin AI · 15mo ago)
- Merge pull request #463 from goswamig/test1 (Andrej · 22mo ago)
- Merge branch 'master' into test1 (Andrej · 22mo ago)