Learn how the same transformer model behaves differently with various context lengths!
RoPE (Rotary Position Embedding) encodes position by rotating token embeddings in high-dimensional space. Instead of adding position vectors, it rotates them!
RoPE doesn't rotate individual dimensions - it rotates pairs of adjacent dimensions together as 2D coordinates (also called "2D chunks").
Each pair is rotated by an angle proportional to the token's position, with every pair using its own frequency. Let's see the complete transformation!
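Here's a minimal NumPy sketch of that transformation (the `rope_rotate` helper name, the 8-dimensional example vector, and the base of 10000 are illustrative choices, not details from this page): it pairs adjacent dimensions, gives each pair its own frequency, and rotates each pair by position times that frequency.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate an even-length embedding `x` by angles determined by `position`."""
    d = x.shape[0]
    # One frequency per adjacent pair of dimensions (2i, 2i+1).
    freqs = base ** (-np.arange(0, d, 2) / d)        # shape (d/2,)
    angles = position * freqs                        # rotation angle for each pair
    cos, sin = np.cos(angles), np.sin(angles)

    pairs = x.reshape(-1, 2)                         # treat each pair as a 2D point
    x0, x1 = pairs[:, 0], pairs[:, 1]
    rotated = np.stack([x0 * cos - x1 * sin,         # standard 2D rotation
                        x0 * sin + x1 * cos], axis=-1)
    return rotated.reshape(d)

# The same 8-dimensional vector at two positions.
x = np.random.default_rng(0).standard_normal(8)
print(rope_rotate(x, position=0))   # position 0: no rotation, vector unchanged
print(rope_rotate(x, position=5))   # position 5: each pair rotated by 5 * its frequency
```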
When we compute attention scores (Q · K^T), the rotations combine so that the score between two tokens depends only on their relative distance, not on their absolute positions - this is how RoPE creates relative position awareness.
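The sketch below checks that property numerically (it reuses the hypothetical `rope_rotate` helper from the previous sketch and made-up query/key vectors): two query/key pairs at the same relative offset produce the same score, no matter where they sit in the sequence.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    # Same hypothetical helper as in the previous sketch.
    d = x.shape[0]
    angles = position * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(angles), np.sin(angles)
    p = x.reshape(-1, 2)
    return np.stack([p[:, 0] * cos - p[:, 1] * sin,
                     p[:, 0] * sin + p[:, 1] * cos], axis=-1).reshape(d)

rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (3) at two very different absolute positions.
score_near = rope_rotate(q, 10)  @ rope_rotate(k, 7)
score_far  = rope_rotate(q, 103) @ rope_rotate(k, 100)
print(np.isclose(score_near, score_far))   # True: only the offset m - n matters
```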
Let's see how RoPE behaves with longer sequences and why context extension becomes challenging.
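As a rough numeric illustration (a NumPy sketch; the 2048-token training length and 64-dimensional head are assumed example values, not figures from this page), the slowest-rotating pair sweeps only a small arc of angles during training, so positions far beyond that range produce angles the model has never encountered.

```python
import numpy as np

d, base, train_len = 64, 10000.0, 2048        # assumed example values
freqs = base ** (-np.arange(0, d, 2) / d)     # per-pair rotation frequencies

slowest = freqs[-1]                           # lowest-frequency (slowest) pair
print(f"slowest pair, angle at last trained position ({train_len - 1}): "
      f"{(train_len - 1) * slowest:.3f} rad")
print(f"slowest pair, angle at position 8191: {8191 * slowest:.3f} rad")
# Beyond the training range, the slow pairs enter angle regions never seen in
# training; extension tricks such as position interpolation rescale positions
# to keep these angles within the trained range.
```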
As the context length grows beyond the training length: