Now that you understand how images become patch tokens, let's explore the core mechanism that makes Vision Transformers so powerful: attention. This is where the magic happens: every patch can "see" and relate to every other patch from the very first layer.
Think of attention as a search engine for image patches. When processing a patch (like "cat's eye"), the model searches through all other patches to find relevant information (like "cat's face", "whiskers", "fur texture"). The attention mechanism determines how much each patch should contribute to understanding the current patch.
Before diving into the interactive demo, let's understand the scaling term (√dk) in the attention formula. Although it acts like a fixed softmax temperature, it isn't the "creativity temperature" you might know from ChatGPT - it serves a completely different purpose: dividing by √dk keeps the dot-product scores at roughly unit variance, so the softmax doesn't saturate as the head dimension grows.
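To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name and random inputs are illustrative, not from any particular library; the key detail is the division by √dk before the softmax.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) patch-to-patch scores
    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                   # weighted aggregation + map

rng = np.random.default_rng(0)
n, d = 196, 64                                    # 196 patches, one 64-dim head
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)                         # (196, 64) (196, 196)
```

Each row of `w` is one patch's attention distribution over all 196 patches, and each row sums to 1 - exactly the softmax-normalized weights the demo below visualizes.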
Instructions: Click on any patch to see how it attends to all other patches. Watch the Query×Key computation, softmax normalization, and final weighted aggregation!
"What am I looking for?"
"What do I contain?"
"What info do I provide?"
Single-head attention is like looking at an image with one eye. Multi-head attention is like having multiple specialized visual systems - one for detecting edges, another for colors, another for spatial relationships, etc. Each "head" learns to focus on different aspects of the image.
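The "multiple specialized visual systems" idea maps directly onto code: project the patch tokens once, split the feature dimension into heads, run attention in each head independently, then concatenate. This is a hedged NumPy sketch with made-up random weights, using ViT-Base sizes (768 dims, 12 heads of 64).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Minimal multi-head self-attention over patch tokens X: (n, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    # Project, then split the feature dim into heads: (heads, n, d_head)
    def split(W):
        return (X @ W).reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    heads = softmax(scores) @ V                            # per-head outputs
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # re-merge heads
    return concat @ Wo                                     # output projection

rng = np.random.default_rng(0)
n, d_model, h = 196, 768, 12                # ViT-Base: 12 heads x 64 dims
X = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4)]
out = multi_head_attention(X, *Ws, num_heads=h)
print(out.shape)                            # (196, 768)
```

Note that each head sees only a 64-dimensional slice, so twelve heads cost roughly the same as one full-width head - the specialization comes almost for free.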
Vision Transformers learn progressively more sophisticated attention patterns as information flows through layers. Early layers focus on local patterns, middle layers discover object parts and spatial relationships, and late layers develop global semantic understanding.
| Layer Range | Attention Focus | Typical Patterns | Function |
|---|---|---|---|
| Early (0-3) | Local neighborhoods | Adjacent patches, edges | Low-level feature detection |
| Middle (4-8) | Object parts | Spatially related regions | Part-whole relationships |
| Late (9-11) | Global semantics | Semantic similarity | Scene understanding |
This is where Vision Transformers truly shine. While CNNs gradually expand their receptive field through layers, ViTs have global receptive fields from layer 1. Every patch can immediately access information from every other patch in the image.
Attention's global connectivity comes at a cost: quadratic complexity in sequence length. For a 224×224 image with 16×16 patches, that's 196 patches, requiring a 196×196 attention matrix per head. Let's analyze the real-world implications.
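A quick back-of-envelope calculation shows how fast the attention matrix grows with resolution. The helper below is illustrative (my own naming), assuming 16×16 patches, 12 heads, and float32 activations, and counting only the attention matrices for one layer's forward pass.

```python
# Cost of storing the attention matrices at different image resolutions,
# assuming 16x16 patches, 12 heads, float32 (4 bytes per entry).
def attention_matrix_cost(image_size, patch_size=16, heads=12, bytes_per=4):
    n = (image_size // patch_size) ** 2          # number of patches
    entries = heads * n * n                      # one n x n matrix per head
    return n, entries * bytes_per

for size in (224, 384, 1024):
    n, mem = attention_matrix_cost(size)
    print(f"{size}x{size}: {n:5d} patches, {mem / 1e6:8.1f} MB per layer")
```

At 224×224 this is a modest ~1.8 MB per layer, at 384×384 it is ~15.9 MB, and at 1024×1024 (4096 patches) it balloons to ~805 MB - doubling the resolution roughly quadruples the patch count and sixteen-folds the attention memory.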
One of the most powerful aspects of attention mechanisms is their interpretability. Unlike CNN feature maps, attention weights directly show us what the model is "looking at" when making decisions.
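One common way to turn raw attention weights into an interpretable relevance map is attention rollout (Abnar & Zuidema, 2020): average the heads, re-add the identity to account for the residual connection, and multiply the per-layer maps together. The sketch below is a hedged NumPy version run on random (not real) attention maps, using 197 tokens (196 patches plus a [CLS] token).

```python
import numpy as np

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_rollout(attn_per_layer):
    """Combine per-layer attention maps into token-to-token relevance.

    attn_per_layer: list of (heads, n, n) softmax attention weights.
    """
    n = attn_per_layer[0].shape[-1]
    rollout = np.eye(n)
    for A in attn_per_layer:
        A = A.mean(axis=0)                   # average over heads
        A = 0.5 * A + 0.5 * np.eye(n)        # re-add the residual path
        A = A / A.sum(axis=-1, keepdims=True)  # keep rows stochastic
        rollout = A @ rollout                # compose with earlier layers
    return rollout                           # (n, n) relevance map

rng = np.random.default_rng(0)
layers = [row_softmax(rng.standard_normal((12, 197, 197))) for _ in range(12)]
r = attention_rollout(layers)
print(r.shape)                               # (197, 197)
```

In practice you would feed in the model's actual attention tensors; the [CLS] row of `r` reshaped to 14×14 gives the familiar "what is the model looking at" heatmap.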
Moving from research to production requires careful attention to computational constraints. Here are key optimization strategies for deploying attention-based vision models at scale.
| Optimization | Memory Savings | Speed Improvement | Accuracy Impact |
|---|---|---|---|
| Mixed Precision | ~50% | 1.5-2x | Minimal |
| Gradient Checkpointing | ~75% | 0.8x (slower) | None |
| Attention Sparsity | 10-30% | 1.2-1.5x | Small |
| Linear Attention | ~60% | 2-3x | Moderate |
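The ~50% figure for mixed precision in the table comes directly from storage sizes: half-precision floats take 2 bytes instead of 4. This tiny NumPy demonstration (illustrative only; real training frameworks handle the casting and loss scaling for you) makes the savings concrete for one layer's attention matrices.

```python
import numpy as np

# One layer's attention matrices for 196 patches and 12 heads,
# stored in fp32 vs fp16.
n, heads = 196, 12
attn_fp32 = np.zeros((heads, n, n), dtype=np.float32)
attn_fp16 = np.zeros((heads, n, n), dtype=np.float16)

print(attn_fp32.nbytes, attn_fp16.nbytes)    # fp16 uses exactly half the bytes
```

Gradient checkpointing trades the other way: it frees activation memory by recomputing attention during the backward pass, which is why the table shows large memory savings but a slowdown (~0.8x).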