CLIP (Contrastive Language-Image Pre-training) revolutionized AI by learning to connect images and text in a shared embedding space. Unlike traditional vision models trained on fixed categories, CLIP learns from natural language descriptions, enabling zero-shot classification, image search with text, and the foundation for modern multimodal AI systems like GPT-4V and DALL-E.
CLIP uses a "dual encoder" architecture with separate but parallel processing towers for images and text. Both encoders map their inputs to the same high-dimensional space where similar concepts cluster together.
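The dual-encoder idea can be sketched in miniature. This is a toy illustration, not CLIP's actual architecture: the tower sizes, layer counts, and dimensions here are made up, but the structure (two separate towers whose outputs land in one shared, normalized space) is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDualEncoder(nn.Module):
    """Toy dual encoder: two separate towers projecting into one shared space."""
    def __init__(self, image_dim=512, text_dim=256, embed_dim=64):
        super().__init__()
        # Each modality gets its own processing tower...
        self.image_tower = nn.Sequential(
            nn.Linear(image_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
        self.text_tower = nn.Sequential(
            nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, image_feats, text_feats):
        # ...but both outputs live in the same embed_dim-dimensional space,
        # L2-normalized so they can be compared directly via dot products
        img = F.normalize(self.image_tower(image_feats), dim=-1)
        txt = F.normalize(self.text_tower(text_feats), dim=-1)
        return img, txt

encoder = TinyDualEncoder()
img, txt = encoder(torch.randn(4, 512), torch.randn(4, 256))
print(img.shape, txt.shape)  # both torch.Size([4, 64]): directly comparable
```

Because both towers end in the same `embed_dim` and both outputs are unit-normalized, `img @ txt.T` immediately gives a matrix of cross-modal cosine similarities.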
The Big Idea: CLIP's revolutionary insight is creating a shared mathematical space where both images and text live as vectors. Think of it like a universal translator that converts both "a photo of a dog" and an actual dog photo into the same mathematical language - vectors of numbers that can be directly compared.
The math in action: normalization and temperature scaling together determine the final predictions. This exact computation happens inside CLIP millions of times during training.
The Problem: Without normalization, a vector like [1000, 2000, 3000] would have much higher similarity scores than [1, 2, 3] even if they point in the exact same direction!
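A quick check of the two vectors from the example above makes the problem concrete: raw dot products are dominated by magnitude, while L2 normalization makes similarity depend only on direction.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (its direction is unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

big = [1000, 2000, 3000]
small = [1, 2, 3]
query = [1, 2, 3]

# Raw dot products: magnitude dominates, big "wins" by a factor of 1000
print(dot(big, query), dot(small, query))  # 14000 14

# After normalization both vectors score identically:
# same direction, same similarity (cosine similarity = 1.0)
print(dot(l2_normalize(big), l2_normalize(query)))
print(dot(l2_normalize(small), l2_normalize(query)))
```

This is why CLIP divides every embedding by its norm before computing similarities: the dot product of two unit vectors is exactly their cosine similarity.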
What it does: Temperature (τ) controls how "confident" or "uncertain" the model's predictions become. It's like adjusting the contrast on a photo: lower temperature = higher contrast (more decisive), higher temperature = lower contrast (more uncertain).
The same similarity scores can produce very different probability distributions at different temperatures, which is crucial for CLIP's training stability.
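The effect is easy to verify directly. Below, the same three hypothetical similarity scores are pushed through a temperature-scaled softmax; the score values are made up for illustration, but τ = 0.07 is the temperature CLIP actually initializes with.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Softmax over scores divided by temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

sims = [0.3, 0.2, 0.1]  # hypothetical cosine similarities

for t in (1.0, 0.07, 0.01):
    probs = softmax_with_temperature(sims, t)
    print(f"tau={t}: {[round(p, 3) for p in probs]}")
# tau=1.0  -> nearly uniform (low contrast)
# tau=0.07 -> clearly peaked on the first score
# tau=0.01 -> essentially all mass on the first score (high contrast)
```

Dividing by a small τ stretches the gaps between scores before the softmax, which is why low temperature sharpens the distribution.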
CLIP learns by seeing millions of (image, text) pairs from the internet and learning to maximize similarity between correct pairs while minimizing similarity between incorrect pairs. This creates a rich, semantic embedding space.
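This objective can be sketched in miniature with a hand-written similarity matrix. The numbers below are invented for illustration: in a batch of three (image, text) pairs, the correct matches sit on the diagonal, and the symmetric cross-entropy loss rewards exactly that pattern.

```python
import torch
import torch.nn.functional as F

# Hypothetical 3x3 similarity matrix for a batch of 3 (image, text) pairs.
# Row i = image i's similarity to every text; the correct text is on the diagonal.
similarities = torch.tensor([
    [0.9, 0.1, 0.2],   # image 0 matches text 0
    [0.0, 0.8, 0.1],   # image 1 matches text 1
    [0.2, 0.1, 0.7],   # image 2 matches text 2
])

temperature = 0.07
logits = similarities / temperature
labels = torch.arange(3)  # the "correct" column for row i is i itself

# Symmetric contrastive loss: image->text (rows) and text->image (columns)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
print(loss.item())  # small, because the diagonal already dominates each row
```

Maximizing the diagonal while minimizing the off-diagonal entries is precisely "maximize similarity between correct pairs, minimize it between incorrect ones".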
CLIP was trained on 400 million (image, text) pairs collected from the internet. This massive, diverse dataset is key to CLIP's remarkable generalization abilities.
CLIP can classify images into categories it has never explicitly seen during training. Instead of learning fixed categories, it learned the relationship between visual concepts and language. Give it any text description, and it can find matching images!
CLIP achieved unprecedented zero-shot performance, often matching or exceeding supervised models trained specifically on target datasets.
CLIP creates a rich embedding space where semantically similar images and text cluster together. This space is the foundation for all of CLIP's capabilities.
import torch
import clip
from PIL import Image
# Load model and preprocessing
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Load and preprocess image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
# Tokenize text
text = clip.tokenize(["a dog", "a cat", "a car"]).to(device)
# Get features
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize features
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Calculate similarities; scale by 100 (CLIP's learned logit scale)
# before the softmax, otherwise the probabilities come out nearly uniform
similarities = (image_features @ text_features.T).squeeze(0)
probs = (100.0 * similarities).softmax(dim=-1)
print(f"Probabilities: {probs}")
for i, text_input in enumerate(["a dog", "a cat", "a car"]):
    print(f"{text_input}: {probs[i].item():.3f}")
import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def zero_shot_classify(image_path, classes, template="A photo of a {}"):
    """Perform zero-shot classification on an image."""
    # Load and preprocess image
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # Create text prompts for each class
    text_prompts = [template.format(cls) for cls in classes]
    text = clip.tokenize(text_prompts).to(device)
    with torch.no_grad():
        # Get features
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Cosine similarities; scale by 100 (CLIP's learned logit scale)
        # before the softmax so the probabilities are meaningful
        similarities = (image_features @ text_features.T).squeeze(0)
        probs = (100.0 * similarities).softmax(dim=-1)
    # Return results sorted by probability
    results = []
    for i, cls in enumerate(classes):
        results.append({
            'class': cls,
            'probability': probs[i].item(),
            'similarity': similarities[i].item()
        })
    return sorted(results, key=lambda x: x['probability'], reverse=True)
# Example usage
classes = ['dog', 'cat', 'bird', 'car', 'airplane', 'ship']
results = zero_shot_classify('test_image.jpg', classes)
print("Zero-shot classification results:")
for result in results:
print(f"{result['class']}: {result['probability']:.3f}")
import torch
import clip
from PIL import Image
import os
from pathlib import Path
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
class CLIPSearchEngine:
    def __init__(self, image_dir):
        self.image_dir = Path(image_dir)
        self.image_features = []
        self.image_paths = []
        self.build_index()

    def build_index(self):
        """Build a searchable index of image features."""
        print("Building CLIP index...")
        for img_path in self.image_dir.glob("*.jpg"):
            try:
                image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
                with torch.no_grad():
                    features = model.encode_image(image)
                    features /= features.norm(dim=-1, keepdim=True)
                self.image_features.append(features.cpu())
                self.image_paths.append(img_path)
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
        if not self.image_features:
            raise RuntimeError(f"No .jpg images found in {self.image_dir}")
        self.image_features = torch.cat(self.image_features, dim=0)
        print(f"Indexed {len(self.image_paths)} images")

    def search(self, query_text, top_k=5):
        """Search for images matching a text query."""
        # Encode query text
        text = clip.tokenize([query_text]).to(device)
        with torch.no_grad():
            text_features = model.encode_text(text)
            text_features /= text_features.norm(dim=-1, keepdim=True)
        # Compare against the whole index at once; the index lives on CPU,
        # so move the query features there too
        similarities = (self.image_features @ text_features.cpu().T).squeeze(1)
        top_indices = similarities.topk(min(top_k, len(self.image_paths))).indices
        results = []
        for idx in top_indices:
            results.append({
                'path': self.image_paths[idx],
                'similarity': similarities[idx].item()
            })
        return results
# Example usage
search_engine = CLIPSearchEngine("./image_database/")
# Search for images
results = search_engine.search("a beautiful sunset over mountains", top_k=10)
print("Search results:")
for result in results:
print(f"{result['path']}: {result['similarity']:.3f}")
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import clip
class CLIPLoss(nn.Module):
    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, image_features, text_features):
        # Normalize features
        image_features = F.normalize(image_features, dim=1)
        text_features = F.normalize(text_features, dim=1)
        # Calculate similarity matrix
        similarities = torch.matmul(image_features, text_features.T) / self.temperature
        # Correct pairs lie on the diagonal, so the label for row i is i
        batch_size = similarities.size(0)
        labels = torch.arange(batch_size).to(similarities.device)
        # Symmetric contrastive loss: image->text and text->image
        loss_img_to_text = F.cross_entropy(similarities, labels)
        loss_text_to_img = F.cross_entropy(similarities.T, labels)
        return (loss_img_to_text + loss_text_to_img) / 2

def train_clip_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch_idx, (images, texts) in enumerate(dataloader):
        images, texts = images.to(device), texts.to(device)
        # Forward pass
        image_features = model.encode_image(images)
        text_features = model.encode_text(texts)
        # Calculate loss
        loss = criterion(image_features, text_features)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        if batch_idx % 100 == 0:
            print(f'Batch {batch_idx}, Loss: {loss.item():.4f}')
    return total_loss / len(dataloader)
# Training setup
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
criterion = CLIPLoss(temperature=0.07)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
# Note: You'll need to implement your own dataset class
# that returns (image, text) pairs
# train_dataset = CustomCLIPDataset(...)
# train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
print("CLIP training setup complete!")
print("Remember: CLIP training requires massive datasets (400M+ pairs)")
print("Consider fine-tuning on smaller domain-specific datasets instead")