šŸŒ€ RoPE: Rotary Position Embedding Tutorial

Learn how the same transformer model behaves differently with various context lengths!

šŸ“š Introduction: Understanding Basic RoPE

What is RoPE?

RoPE (Rotary Position Embedding) encodes position by rotating token embeddings in high-dimensional space. Instead of adding position vectors to the embeddings, it rotates the embeddings themselves!

Key Formula:
Īø_i = 10000^(-2i/d), where i is the dimension-pair index and d is the embedding dimension
For position m: rotate pair i by angle m Ɨ Īø_i
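
A minimal sketch of this formula in NumPy (d = 8 and the position m = 5 are illustrative choices; 10000 is the conventional base):

```python
import numpy as np

d = 8                             # embedding dimension (illustrative)
i = np.arange(d // 2)             # pair index i = 0, 1, 2, 3
theta = 10000.0 ** (-2 * i / d)   # Īø_i = 10000^(-2i/d)

m = 5                             # token position (illustrative)
angles = m * theta                # each pair rotates by m Ɨ Īø_i

print(theta)    # [1.0, 0.1, 0.01, 0.001]: pair 0 spins fastest, pair 3 slowest
print(angles)   # [5.0, 0.5, 0.05, 0.005] radians at position 5
```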

šŸ“ Step 1: Understanding Dimension Pairs

What are Dimension Pairs?

RoPE doesn't rotate individual dimensions - it rotates pairs of adjacent dimensions together as 2D coordinates (also called "2D chunks").

šŸŽÆ Visual Example: 8D Embedding → 4 Dimension Pairs

Original 8D embedding:
[xā‚€, x₁, xā‚‚, xā‚ƒ, xā‚„, xā‚…, x₆, x₇]

Grouped into dimension pairs:
• Pair 0: (xā‚€, x₁) ← dimensions 0 and 1
• Pair 1: (xā‚‚, xā‚ƒ) ← dimensions 2 and 3
• Pair 2: (xā‚„, xā‚…) ← dimensions 4 and 5
• Pair 3: (x₆, x₇) ← dimensions 6 and 7

Each pair = 2D coordinates that get rotated!

šŸ”„ Why Pairs?
RoPE applies 2D rotations - you need exactly 2 coordinates to rotate in a plane:

Think of each pair as coordinates on a 2D plane:
• (xā‚€, x₁) = point on plane 0
• (xā‚‚, xā‚ƒ) = point on plane 1
• (xā‚„, xā‚…) = point on plane 2
• etc.

Each plane gets rotated by a different amount based on position!
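
To make the pairing concrete, here is a small sketch that groups a toy 8D vector into four pairs and rotates each pair as a 2D point (the angle values are made up for illustration):

```python
import numpy as np

x = np.arange(8, dtype=float)            # toy embedding [x0, x1, ..., x7]
pairs = x.reshape(4, 2)                  # pair j = (x_{2j}, x_{2j+1})
angles = np.array([0.8, 0.4, 0.2, 0.1])  # illustrative rotation angle per pair

rotated = np.empty_like(pairs)
for j, (a, b) in enumerate(pairs):
    c, s = np.cos(angles[j]), np.sin(angles[j])
    rotated[j] = [a * c - b * s,         # standard 2D rotation of the
                  a * s + b * c]         # point (a, b) in plane j

print(rotated.reshape(-1))               # flattened back to an 8D vector
```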

šŸ“ Step 2: Token Embeddings & RoPE Application

Complete RoPE Process

RoPE works by rotating pairs of embedding dimensions based on position. Let's see the complete transformation!
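
As a sketch of the complete transformation, assuming the adjacent-pair convention from Step 1 and base 10000 (some libraries instead pair dimension i with dimension i + d/2):

```python
import numpy as np

def apply_rope(x, m, base=10000.0):
    """Rotate a d-dimensional embedding x (d even) for token position m."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)   # per-pair frequency Īø_i
    angle = m * theta                              # per-pair rotation angle
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin      # first coordinate of each pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos      # second coordinate of each pair
    return out

q = np.random.randn(8)
print(apply_rope(q, m=0))   # position 0: all angles are 0, so q is unchanged
print(apply_rope(q, m=3))   # position 3: every pair rotated by 3·θ_i
```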

šŸ”§ Step 3: Interactive RoPE Calculator
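
A tiny calculator sketch you can run yourself to get the rotation angle for any position and dimension pair (the function name and the d = 8 default are illustrative):

```python
import numpy as np

def rope_angle(position, pair_index, d=8, base=10000.0):
    """Rotation angle (radians) applied to one dimension pair at a given position."""
    theta = base ** (-2 * pair_index / d)
    return position * theta

print(rope_angle(7, 0))   # 7.0   -- the fastest pair has already wrapped past 2Ļ€
print(rope_angle(7, 3))   # 0.007 -- the slowest pair has barely moved
```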

šŸŽÆ Step 4: Context Length Impact

Short Context (4 tokens)

Long Context (8 tokens)

āš ļø Key Insight: Notice how the same Q, K, V transformation matrices work for both contexts, but the rotational patterns differ!

🧮 Step 5: Attention Score Calculation

How RoPE Affects Attention

When we compute attention scores (Q Ɨ K^T), the rotations applied to queries and keys combine so that each score depends only on the relative offset between their positions - this is how RoPE creates relative position awareness.
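
A quick numerical check of that property (a self-contained sketch that re-implements the rotation from Step 2 with the same adjacent-pair convention):

```python
import numpy as np

def apply_rope(x, m, base=10000.0):
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    c, s = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset (m - n = 2) at two different absolute positions:
score_a = apply_rope(q, 5) @ apply_rope(k, 3)
score_b = apply_rope(q, 9) @ apply_rope(k, 7)
print(np.isclose(score_a, score_b))   # True: the score depends only on the offset
```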

šŸ“ˆ Step 6: Long Context Challenges (Up to 128 Tokens)

Testing RoPE at Scale

Let's see how RoPE behaves with longer sequences and why context extension becomes challenging.
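
A sketch of what happens to the rotation angles at larger positions (d = 8 and the sampled positions are illustrative):

```python
import numpy as np

d = 8
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)

for pos in (16, 64, 128):
    turns = pos * theta / (2 * np.pi)   # full revolutions of each pair at this position
    print(pos, np.round(turns, 3))

# By position 128, pair 0 has completed ~20 full turns while pair 3 has
# covered only ~2% of a single turn.  Positions beyond the training length
# push those slow pairs into angles the model never saw, which is the core
# difficulty explored in the next step.
```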

šŸ”¬ Step 7: Context Extension Challenges

The Problem with Longer Contexts

As context length grows beyond the lengths seen during training: