A step-by-step mathematical look at how models trained on long contexts perform on much shorter sequences
A common question: As conversations get longer, are RoPE frequencies recalculated?
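To make the question concrete, here is a minimal sketch of the standard RoPE parameterization (function names and the NumPy implementation are illustrative, not tied to any particular library). The point it shows: the inverse frequencies are computed once from the head dimension and base alone, and the sequence length only enters through the position indices.

```python
import numpy as np

def rope_inverse_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: theta_i = base^(-2i/d).

    Nothing here depends on the sequence length, so the frequencies
    stay fixed no matter how long the conversation gets.
    """
    i = np.arange(0, head_dim, 2)           # paired dimensions (0, 2, 4, ...)
    return base ** (-i / head_dim)           # shape: (head_dim // 2,)

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angles m * theta_i for positions m = 0 .. seq_len - 1."""
    inv_freq = rope_inverse_frequencies(head_dim, base)
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)     # shape: (seq_len, head_dim // 2)

# The frequencies are identical whether we build angles for 128 or 4096 tokens;
# a longer context only adds rows (larger positions m), it never changes theta_i.
short = rope_angles(seq_len=128, head_dim=64)
long_ = rope_angles(seq_len=4096, head_dim=64)
assert np.allclose(short, long_[:128])
```

So the frequencies themselves are not recalculated as the conversation grows; only the position index applied to them increases.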
We want to understand: If a model is trained on very long contexts, how does it perform on much shorter contexts?
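As a framing for the analysis that follows (assuming the standard RoPE parameterization), the quantity to track is the per-position rotation angle

$$\theta_{m,i} = m \cdot b^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1,$$

where $m$ is the token position, $d$ the head dimension, and $b$ the base (commonly $10000$). A short context only ever uses small values of $m$, i.e. a subset of the angles encountered during long-context training.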