šŸŽÆ Context Length Impact: Training vs Inference

Explore how models trained on long contexts perform on shorter sequences with step-by-step mathematical analysis

šŸ”„ RoPE Frequency Calculation: Fixed vs Dynamic

Understanding What Changes and What Doesn't

A common question: As conversations get longer, are RoPE frequencies recalculated?

šŸ”‘ Answer: NO! RoPE frequencies (the Īø values) are computed ONCE from the head dimension and the base frequency, and they NEVER change during inference. As the conversation grows, only the position index m increases; the frequencies themselves stay fixed.
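A minimal sketch of this point, assuming a small head dimension d = 8 and the common RoPE base of 10000 (both illustrative values, not taken from any specific model): the Īø values are computed once from the dimension, and extending the context only extends the position index.

```python
import numpy as np

d = 8          # head dimension (must be even); assumed for illustration
base = 10000.0 # standard RoPE base frequency

# Frequencies are computed ONCE from the head dimension.
# They do not depend on sequence length or current context size.
theta = base ** (-2.0 * np.arange(d // 2) / d)   # shape: (d/2,)

def rope_angles(seq_len):
    """Rotation angles for positions 0..seq_len-1: angle[m, i] = m * theta[i]."""
    positions = np.arange(seq_len)
    return np.outer(positions, theta)            # shape: (seq_len, d/2)

short = rope_angles(4)     # a short conversation
long_ = rope_angles(4096)  # a much longer one

# Same theta in both cases; the long context's first rows are
# exactly the short context's angles.
assert np.allclose(long_[:4], short)
```

Note that growing the context from 4 to 4096 tokens adds new rows of angles but leaves every existing row, and the Īø vector itself, untouched.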

šŸ“š Step-by-Step Mathematical Analysis

Understanding the Question

We want to understand: If a model is trained on very long contexts, how does it perform on much shorter contexts?

šŸ¤” Key Question: Do the learned parameters (W_Q, W_K, W_V, token embeddings) still work well when a short sequence uses only a small prefix of the positional encodings the model saw during training?

šŸ” Mathematical Proof with Concrete Examples

šŸ“Š Performance Comparison