🎯 Q, K, V Matrix Sizes: d vs m Correlation

Understanding how model dimension (d) and sequence position (m) affect matrix sizes

🏛️ Major Model Architectures Comparison

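Model	Model Dim (d)	Heads	Head Dim	Layers	Context Length (tokens)	Parameters
GPT-4	~12,288	~96	~128	~120	128K	~1.7T
Claude Sonnet 4	~5,120	~40	~128	~64	200K → 1M	~200B
Gemini 2.5 Pro	~8,192	~64	~128	~80	1M	~500B
DeepSeek V3	7,168	128	128	61	128K	671B (MoE, 37B active)
LLaMA 3.1 70B	8,192	64	128	80	128K	70.6B
Qwen 2.5 72B	8,192	64	128	80	128K	72.7B

Values marked ~ are unofficial estimates: OpenAI, Anthropic, and Google have not published these architectures. The GPT-4 model dimension and head count shown are the published GPT-3 175B values, so treat that row as a placeholder rather than a confirmed specification.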

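📊 Key Observations:

• Every model in the table uses head_dim = 128 (a common sweet spot for efficiency)

• Model dimension (d) = num_heads × head_dim for most rows; DeepSeek V3 is the exception (128 × 128 = 16,384, while d = 7,168)

• Context length varies dramatically, but the Q/K/V transformation matrices stay the same size, because their shape depends only on d

• For dense models, parameter count grows roughly with d² per layer, multiplied by the number of layers, plus a d × vocabulary-size term for the embeddings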
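
The relation between d, head count, and head dimension can be checked directly against the table. The sketch below is only an illustration: the tuples are copied from the table above (including the rows whose values are estimates), and the helper name heads_match_model_dim is made up for this example.

```python
# (model, d, num_heads, head_dim) copied from the comparison table.
# Several of these rows are estimates, not published specifications.
MODELS = [
    ("GPT-4", 12288, 96, 128),
    ("Claude Sonnet 4", 5120, 40, 128),
    ("Gemini 2.5 Pro", 8192, 64, 128),
    ("DeepSeek V3", 7168, 128, 128),
    ("LLaMA 3.1 70B", 8192, 64, 128),
    ("Qwen 2.5 72B", 8192, 64, 128),
]


def heads_match_model_dim(d, num_heads, head_dim):
    """Return True when d equals num_heads * head_dim."""
    return d == num_heads * head_dim


for name, d, num_heads, head_dim in MODELS:
    product = num_heads * head_dim
    status = "matches" if heads_match_model_dim(d, num_heads, head_dim) else "differs"
    print(f"{name:16s} d={d:6d}  heads x head_dim={product:6d}  {status}")
```

Only the DeepSeek V3 row differs, which is why the rule holds "in most cases" rather than always.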

🔧 Interactive Matrix Size Calculator

Model Dimension (d):

Sequence Length (max m+1):

Head Dimension:
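
As a static stand-in for the calculator, the three inputs above can be turned into shapes with a few lines of Python. This is a minimal sketch that assumes standard multi-head attention with d = num_heads × head_dim (no grouped-query or latent-attention variants); the function name qkv_shapes and the dictionary labels are invented for this example.

```python
def qkv_shapes(d, seq_len, head_dim):
    """Shapes for standard multi-head attention (assumes d = num_heads * head_dim)."""
    if d % head_dim != 0:
        raise ValueError("d must be divisible by head_dim")
    num_heads = d // head_dim
    return {
        "num_heads": num_heads,
        # Learned parameters: depend only on d (and head_dim for the per-head view).
        "W_Q, W_K, W_V (each)": (d, d),
        "per-head weight slice": (d, head_dim),
        # Per-sequence tensors: depend on seq_len = m + 1.
        "Q, K, V (each)": (seq_len, d),
        "per-head Q, K, V": (seq_len, head_dim),
        "attention scores per head": (seq_len, seq_len),
    }


for label, value in qkv_shapes(d=8192, seq_len=1000, head_dim=128).items():
    print(f"{label:28s} {value}")
```

Changing seq_len alters only the last three entries; changing d alters the weight shapes as well.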

📐 Step-by-Step Matrix Construction

🎯 The Key Insight

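d determines the transformation matrix sizes (learned parameters): in standard multi-head attention, W_Q, W_K, and W_V are each d × d, whatever the input length

m determines the resulting tensor sizes (per-sequence computations): for a sequence of m+1 tokens, Q, K, and V are each (m+1) × d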
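
A toy projection makes the split concrete. The sketch below uses a deliberately tiny d = 8 and plain Python lists instead of a tensor library; the helper names (random_matrix, matmul, shape) are illustrative, and the random values stand in for trained weights and real token embeddings.

```python
import random


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random floats as nested lists."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x p)."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def shape(m):
    return (len(m), len(m[0]))


rng = random.Random(0)
d = 8                           # toy model dimension
W_Q = random_matrix(d, d, rng)  # created once, reused for every sequence

for seq_len in (4, 10):         # seq_len = m + 1
    X = random_matrix(seq_len, d, rng)  # stand-in token embeddings
    Q = matmul(X, W_Q)                  # Q = X W_Q
    print(f"seq_len={seq_len:2d}  W_Q shape={shape(W_Q)}  Q shape={shape(Q)}")
```

The same W_Q serves both sequences, while the row count of Q follows the number of tokens.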

🔄 Context Length Comparison

🔵 Short Context (4 tokens)

🔴 Long Context (1000 tokens)
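
To put numbers on the two cases, the sketch below counts elements for a 4-token and a 1000-token sequence in a single attention layer. It assumes d = 8,192 and head_dim = 128 (the LLaMA 3.1 70B row of the table), standard multi-head attention, and one full score matrix per head; the function name attention_element_counts is hypothetical.

```python
def attention_element_counts(d, seq_len, head_dim):
    """Element counts for one attention layer (assumes standard multi-head attention)."""
    num_heads = d // head_dim
    return {
        "weights": 3 * d * d,                     # W_Q, W_K, W_V: independent of seq_len
        "qkv": 3 * seq_len * d,                   # Q, K, V: linear in seq_len
        "scores": num_heads * seq_len * seq_len,  # QK^T per head: quadratic in seq_len
    }


D, HEAD_DIM = 8192, 128
for seq_len in (4, 1000):
    c = attention_element_counts(D, seq_len, HEAD_DIM)
    print(
        f"seq_len={seq_len:5d}  weights={c['weights']:,}  "
        f"qkv={c['qkv']:,}  scores={c['scores']:,}"
    )
```

The weight count is identical in both cases, the Q/K/V tensors grow by a factor of 250, and the score matrices grow by a factor of 62,500.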

💡 Memory & Computation Analysis