🎯 Q, K, V Matrix Sizes: d vs. m

Understanding how model dimension (d) and sequence length (m, the number of tokens) affect matrix sizes

๐Ÿ›๏ธ Major Model Architectures Comparison

| Model | Model Dim (d) | Heads | Head Dim | Layers | Context Length | Parameters |
|---|---|---|---|---|---|---|
| GPT-4 | 12,288 | 96 | 128 | 120 | 128K | ~1.7T |
| Claude Sonnet 4 | ~5,120 | 40 | 128 | ~64 | 200K → 1M | ~200B |
| Gemini 2.5 Pro | ~8,192 | 64 | 128 | ~80 | 2M | ~500B |
| DeepSeek V3 | 7,168 | 128 | 128 | 61 | 128K | 671B |
| LLaMA 3.1 70B | 8,192 | 64 | 128 | 80 | 128K | 70.6B |
| Qwen 2.5 72B | 8,192 | 64 | 128 | 80 | 128K | 72.7B |
📊 Key Observations:
• All modern models converge on head_dim = 128 (a sweet spot for attention-kernel efficiency on current hardware)
• Model dimension (d) = num_heads × head_dim in most cases; DeepSeek V3 in the table above is the exception (see the check script after this list)
• Context length varies by more than an order of magnitude (128K to 2M), yet the learned Q/K/V transformation matrices stay the same size!
• Parameter count grows with d², the number of layers, and the vocabulary size
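
One way to verify the second observation is to push the table's numbers through a quick script (the figures are copied from the table above; the output layout is mine):

```python
# Check the d == num_heads * head_dim identity for each model in the table.
models = {
    "GPT-4":           (12288,  96, 128),
    "Claude Sonnet 4":  (5120,  40, 128),
    "Gemini 2.5 Pro":   (8192,  64, 128),
    "DeepSeek V3":      (7168, 128, 128),
    "LLaMA 3.1 70B":    (8192,  64, 128),
    "Qwen 2.5 72B":     (8192,  64, 128),
}

for name, (d, heads, head_dim) in models.items():
    tag = "ok" if heads * head_dim == d else "exception"
    print(f"{name:<16} {heads:>3} x {head_dim} = {heads * head_dim:>6,}  vs  d = {d:>6,}  ({tag})")
```

Running it flags DeepSeek V3 as the lone exception: its multi-head latent attention (MLA) decouples the head count from the model dimension.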

🔧 Matrix Size Calculator
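
A minimal Python stand-in for the calculator, assuming standard multi-head attention with one square d × d projection each for Q, K, and V (the qkv_sizes function and its output format are illustrative, not from any particular codebase):

```python
def qkv_sizes(d: int, num_heads: int, m: int) -> None:
    """Print weight sizes (set by d alone) and activation sizes (set by m)."""
    head_dim = d // num_heads
    weight_params = 3 * d * d                 # W_Q, W_K, W_V: each d x d, learned
    score_entries = num_heads * m * m         # one m x m score matrix per head
    print(f"weights : W_Q, W_K, W_V each {d} x {d}  ({weight_params:,} params, fixed)")
    print(f"tensors : Q, K, V each {m} x {d}  (= {num_heads} heads x {m} x {head_dim})")
    print(f"scores  : {num_heads} x {m} x {m}  ({score_entries:,} entries)")

qkv_sizes(d=8192, num_heads=64, m=1000)       # LLaMA 3.1 70B-style settings
```

Changing m only moves the last two lines of output; the first line, the learned parameters, never budges.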

๐Ÿ“ Step-by-Step Matrix Construction

🎯 The Key Insight

d determines the transformation matrix sizes (learned parameters, fixed once training ends)

m determines the resulting tensor sizes (per-sequence computations that grow with the input)
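
A small demonstration of that split, using a single hypothetical projection W_Q:

```python
import numpy as np

d = 512
W_Q = np.zeros((d, d))               # learned parameters: shape fixed by d alone

for m in (4, 1000):                  # only the per-sequence tensor tracks m
    Q = np.zeros((m, d)) @ W_Q
    print(f"m = {m:>4}: W_Q stays {W_Q.shape}, Q becomes {Q.shape}")
```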

🔄 Context Length Comparison

🔵 Short Context (4 tokens): Q, K, V are each 4 × d, and each head's attention score matrix is only 4 × 4.

🔴 Long Context (1000 tokens): Q, K, V are each 1000 × d, and each head's score matrix balloons to 1000 × 1000, all produced by the same d × d weights.
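
The contrast in concrete numbers, assuming fp16 (2 bytes per element) with d = 8192 and 64 heads from the LLaMA/Qwen rows of the table (a back-of-envelope sketch, not a full memory model):

```python
d, num_heads, bytes_per = 8192, 64, 2               # fp16, table-sized model

for m in (4, 1000):
    qkv_mb   = 3 * m * d * bytes_per / 1e6          # Q, K, V activations
    score_mb = num_heads * m * m * bytes_per / 1e6  # one m x m matrix per head
    print(f"m = {m:>4}: Q/K/V = {qkv_mb:8.3f} MB, scores = {score_mb:8.3f} MB")
```

At m = 4 the scores are negligible (about 2 KB), but at m = 1000 they reach 128 MB per layer, already outweighing the 49 MB of Q/K/V activations; the quadratic term overtakes the linear one near m = 3d/num_heads ≈ 384 for these settings.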

💡 Memory & Computation Analysis
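
The analysis boils down to a few scaling rules per layer, collected here as a sketch (standard multi-head attention assumed; FLOP counts use the usual 2 × multiply-accumulate convention):

```python
def per_layer_costs(d: int, num_heads: int, m: int) -> dict:
    """Rough per-layer attention costs; weight terms depend only on d."""
    return {
        "qkv_weight_params": 3 * d * d,          # fixed, independent of m
        "qkv_activations":   3 * m * d,          # linear in m
        "score_entries":     num_heads * m * m,  # quadratic in m
        "proj_flops":        6 * m * d * d,      # three (m, d) @ (d, d) matmuls
        "attn_flops":        4 * m * m * d,      # Q @ K^T plus weights @ V
    }

for m in (4, 1000):
    print(f"m = {m}:", per_layer_costs(d=8192, num_heads=64, m=m))
```

Weight storage never changes with context length; activations grow linearly and score matrices quadratically in m, which is the d-versus-m story of this whole section in one function.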