Learn how the same transformer model behaves differently with various context lengths!
RoPE (Rotary Position Embedding) encodes position by rotating token embeddings in high-dimensional space. Instead of adding position vectors, it rotates them!
RoPE doesn't rotate individual dimensions - it rotates pairs of adjacent dimensions together as 2D coordinates (also called "2D chunks").
Each pair is rotated by an angle proportional to the token's position, with every pair using its own frequency. Let's see the complete transformation!
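Here's a minimal NumPy sketch of that transformation (the `rope_rotate` helper name, the 8-dimensional example vector, and the base of 10000 are illustrative choices, not details from this page): it pairs adjacent dimensions, gives each pair its own frequency, and rotates each pair by position times that frequency.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate an even-length embedding `x` by angles determined by `position`."""
    d = x.shape[0]
    # One frequency per adjacent pair of dimensions (2i, 2i+1).
    freqs = base ** (-np.arange(0, d, 2) / d)        # shape (d/2,)
    angles = position * freqs                        # rotation angle for each pair
    cos, sin = np.cos(angles), np.sin(angles)

    pairs = x.reshape(-1, 2)                         # treat each pair as a 2D point
    x0, x1 = pairs[:, 0], pairs[:, 1]
    rotated = np.stack([x0 * cos - x1 * sin,         # standard 2D rotation
                        x0 * sin + x1 * cos], axis=-1)
    return rotated.reshape(d)

# The same 8-dimensional vector at two positions.
x = np.random.default_rng(0).standard_normal(8)
print(rope_rotate(x, position=0))   # position 0: no rotation, vector unchanged
print(rope_rotate(x, position=5))   # position 5: each pair rotated by 5 * its frequency
```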
When we compute attention scores (Q · K^T), the rotations combine so that the score between two tokens depends only on their relative distance, not on their absolute positions - this is how RoPE creates relative position awareness.
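The sketch below checks that property numerically (it reuses the hypothetical `rope_rotate` helper from the previous sketch and made-up query/key vectors): two query/key pairs at the same relative offset produce the same score, no matter where they sit in the sequence.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    # Same hypothetical helper as in the previous sketch.
    d = x.shape[0]
    angles = position * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(angles), np.sin(angles)
    p = x.reshape(-1, 2)
    return np.stack([p[:, 0] * cos - p[:, 1] * sin,
                     p[:, 0] * sin + p[:, 1] * cos], axis=-1).reshape(d)

rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (3) at two very different absolute positions.
score_near = rope_rotate(q, 10)  @ rope_rotate(k, 7)
score_far  = rope_rotate(q, 103) @ rope_rotate(k, 100)
print(np.isclose(score_near, score_far))   # True: only the offset m - n matters
```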
Let's see how RoPE behaves with longer sequences and why context extension becomes challenging.
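As a rough numeric illustration (a NumPy sketch; the 2048-token training length and 64-dimensional head are assumed example values, not figures from this page), the slowest-rotating pair sweeps only a small arc of angles during training, so positions far beyond that range produce angles the model has never encountered.

```python
import numpy as np

d, base, train_len = 64, 10000.0, 2048        # assumed example values
freqs = base ** (-np.arange(0, d, 2) / d)     # per-pair rotation frequencies

slowest = freqs[-1]                           # lowest-frequency (slowest) pair
print(f"slowest pair, angle at last trained position ({train_len - 1}): "
      f"{(train_len - 1) * slowest:.3f} rad")
print(f"slowest pair, angle at position 8191: {8191 * slowest:.3f} rad")
# Beyond the training range, the slow pairs enter angle regions never seen in
# training; extension tricks such as position interpolation rescale positions
# to keep these angles within the trained range.
```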
As the context length grows beyond the training length: