Understanding how model dimension (d) and sequence position (m) affect matrix sizes
Model | Model Dim (d) | Heads | Head Dim | Layers | Context Length | Parameters |
---|---|---|---|---|---|---|
GPT-4 | ~12,288 | ~96 | 128 | ~120 | 128K | ~1.7T |
Claude Sonnet 4 | ~5,120 | 40 | 128 | ~64 | 200K to 1M | ~200B |
Gemini 2.5 Pro | ~8,192 | 64 | 128 | ~80 | 2M | ~500B |
DeepSeek V3 | 7,168 | 128 | 128 | 61 | 128K | 671B |
LLaMA 3.1 70B | 8,192 | 64 | 128 | 80 | 128K | 70.6B |
Qwen 2.5 72B | 8,192 | 64 | 128 | 80 | 128K | 72.7B |
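d sets the shape of the learned transformation matrices (the parameters): in standard multi-head attention each projection weight is d × d, regardless of input length.
m, the number of token positions in the sequence, sets the shape of the resulting activation tensors (the per-sequence computations): queries, keys, and values are m × d, and each head's attention score matrix is m × m.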
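
To make the distinction concrete, here is a minimal sketch that computes the shapes involved in one attention layer. It assumes plain multi-head attention, with no grouped-query or latent attention (which some of the models in the table actually use), so treat the numbers as illustrative rather than exact. The helper `attention_shapes` is hypothetical, and the two values of m are arbitrary; d, head count, and head dimension come from the LLaMA 3.1 70B row.

```python
def attention_shapes(d, n_heads, head_dim, m):
    """Return shapes for one plain multi-head attention layer.

    Hypothetical helper for illustration only: it ignores grouped-query
    attention, biases, and batching.
    """
    proj_dim = n_heads * head_dim
    return {
        "W_Q": (d, proj_dim),         # learned weight: depends on d, not on m
        "Q": (m, proj_dim),           # activations: one row per token position
        "scores": (n_heads, m, m),    # attention scores: grow quadratically in m
    }


# d, heads, and head dim taken from the LLaMA 3.1 70B row; m is arbitrary.
short = attention_shapes(d=8192, n_heads=64, head_dim=128, m=1024)
long = attention_shapes(d=8192, n_heads=64, head_dim=128, m=4096)

for name in ("W_Q", "Q", "scores"):
    print(f"{name:7s} m=1024: {short[name]}   m=4096: {long[name]}")
```

The `W_Q` shape is identical in both runs, while `Q` grows linearly and `scores` quadratically with m, which is the split between parameters and per-sequence tensors described above.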