CLIP — Contrastive Language-Image Pretraining

TL;DR

Introduced by Radford et al. at OpenAI in 'Learning Transferable Visual Models From Natural Language Supervision' (arXiv:2103.00020, February 2021) and released under MIT licence.
Two encoders — one image (ViT or ResNet), one text (transformer) — trained contrastively so matching pairs land close in a shared embedding space and mismatched pairs land far apart.
Trained on the proprietary WIT dataset of 400 million image-text pairs scraped from the web. Enabled true zero-shot image classification: rank class-name text embeddings against image embeddings, pick the highest similarity.
Architectural foundation for the multimodal era: underpins image search, Stable Diffusion's text conditioning, open-vocabulary detection (directly in OWL-ViT; GLIP and Grounding DINO apply the same language-image alignment idea with different text encoders), and the vision tower of most multimodal LLMs.
Yobibyte's multimodal recipes route image inputs through CLIP-family vision towers; Yobitel customers building image-search products on Yobitel NeoCloud lean on OpenCLIP and SigLIP variants for the embedding layer.

Overview

CLIP — Contrastive Language-Image Pre-training — is the architecture pattern that brought vision and language into a shared embedding space at web scale. Before CLIP, transferable vision models were trained on ImageNet with a fixed 1000-class label vocabulary; anything outside that vocabulary required fine-tuning on labelled examples. CLIP swapped the supervision signal: instead of class labels, it used natural-language captions paired with images, scraped from the public web at scale (OpenAI's proprietary WIT dataset, 400 million pairs).

The training objective is symmetric contrastive — for each batch of N image-text pairs, compute an N x N similarity matrix of image and text embeddings, then apply a cross-entropy loss that pushes the diagonal up and the off-diagonal down. The model learns, without ever seeing an explicit class label, that 'a photo of a dog' should embed near a photo of a dog and far from a photo of a saxophone. At inference, classification becomes ranking: encode the candidate class names as text, encode the input image, and pick the highest cosine similarity.

This entry helps you decide when a CLIP-family encoder is the right tool — image search, zero-shot classification, multimodal LLM vision tower, generative model conditioning — and which variant (original CLIP, OpenCLIP, SigLIP, EVA-CLIP, DFN-CLIP) to pick for your workload. Yobibyte's multimodal recipes route image inputs through CLIP-family vision towers without exposing the implementation choice to the workspace customer; Yobitel customers building image-search and content-moderation products on NeoCloud typically self-deploy OpenCLIP or SigLIP behind Triton. Both consumption paths reduce to the same architectural pattern this entry describes.

How it works

CLIP is two encoders and one shared embedding space. The image encoder — either a modified ResNet (50, 101, 50x4, 50x16, 50x64) or a Vision Transformer (ViT-B/32, B/16, L/14, L/14@336px) — maps an image to a feature vector. The text encoder — a 12-layer transformer with 8 attention heads, 512-dim embeddings, BPE tokenisation, max sequence length 77 — maps a tokenised caption to a feature vector. Both vectors are then linearly projected to a shared embedding dimension (typically 512 or 768) and L2-normalised. Cosine similarity in that shared space is the model's notion of image-text alignment.
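The projection-and-normalise step can be sketched in NumPy. This is a minimal illustration, not the reference implementation: the random matrices stand in for trained projection weights, the random feature vectors stand in for encoder outputs, and the widths (768 for the image tower, 512 for the text tower, 512 shared) are ViT-B/32-style sizes chosen for illustration. The names W_image, W_text and embed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ViT-B/32-style sizes; real checkpoints differ by variant.
IMAGE_WIDTH, TEXT_WIDTH, EMBED_DIM = 768, 512, 512

# Random stand-ins for the two learned linear projections (hypothetical names).
W_image = rng.normal(scale=IMAGE_WIDTH ** -0.5, size=(IMAGE_WIDTH, EMBED_DIM))
W_text = rng.normal(scale=TEXT_WIDTH ** -0.5, size=(TEXT_WIDTH, EMBED_DIM))


def embed(features: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Project encoder features into the shared space and L2-normalise them."""
    projected = features @ projection
    return projected / np.linalg.norm(projected, axis=-1, keepdims=True)


# Random stand-ins for encoder outputs: 2 images and 3 captions.
image_features = rng.normal(size=(2, IMAGE_WIDTH))
text_features = rng.normal(size=(3, TEXT_WIDTH))

image_emb = embed(image_features, W_image)
text_emb = embed(text_features, W_text)

# Both sets are unit-norm, so the dot product is cosine similarity.
similarity = image_emb @ text_emb.T
print(similarity.shape)  # (2, 3): one score per image-caption pair
```

Because the two towers only meet at this final dot product, image and text embeddings can be computed and cached independently, which is what makes the pattern practical for retrieval.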

Training uses symmetric InfoNCE. For each minibatch of N pairs, the model computes an N x N similarity matrix S where S[i, j] is the dot product of image embedding i with text embedding j, divided by a learned temperature tau. The loss is the average of two cross-entropy terms, one over rows (image-to-text) and one over columns (text-to-image), each treating the diagonal as the positive class and the off-diagonal entries as in-batch negatives. In the original recipe tau is initialised at 0.07 and the logit scale 1 / tau is clamped at 100, so in practice it settles around tau = 0.01 (logit_scale = log(1 / tau) ≈ 4.6).
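To see why the temperature matters, the sketch below applies a softmax to the same three cosine similarities at tau = 1 and at tau = 0.01. The similarity values are made up for illustration, and softmax here is a small local helper rather than a library call.

```python
import math

import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()


# Made-up cosine similarities between one image and three captions.
cosine = np.array([0.30, 0.25, 0.20])

flat = softmax(cosine / 1.0)    # tau = 1: close to uniform
sharp = softmax(cosine / 0.01)  # tau = 0.01: the best match dominates

# The stored parameter is the log of the inverse temperature.
logit_scale = math.log(1 / 0.01)

print(flat.round(3), sharp.round(3), round(logit_scale, 2))
```

Cosine similarities live in a narrow range, so without a small tau the softmax could barely separate a correct caption from an incorrect one.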

The single mathematical line that captures all of it is the per-pair logit: z_ij = (image_i · text_j) / tau, with the softmax-normalised cross-entropy applied across each row and each column. Everything else — the choice of encoder, the choice of preprocessing, the resolution — is engineering scaffolding around this single objective.
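The objective fits in a few lines. The following is a minimal NumPy sketch of the symmetric loss described above, assuming a fixed tau rather than a learned one and toy random embeddings; clip_loss and its helpers are illustrative names, not functions from any CLIP library.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def clip_loss(image_emb: np.ndarray, text_emb: np.ndarray, tau: float = 0.01) -> float:
    """Symmetric InfoNCE over a batch of N matched image-text pairs."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / tau  # z_ij
    diag = np.arange(logits.shape[0])
    # Row-wise softmax: each image must pick out its own caption.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Column-wise softmax: each caption must pick out its own image.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (loss_i2t + loss_t2i))


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
image_emb = text_emb + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs

aligned = clip_loss(image_emb, text_emb)
shuffled = clip_loss(image_emb[::-1], text_emb)  # every pair mismatched
print(aligned, shuffled)
```

The aligned batch should score a much lower loss than the shuffled one, and a batch where every embedding is identical sits at log(N), the chance level for an N-way choice.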

# Zero-shot classification with CLIP: the canonical demonstration.
# Requires the openai/CLIP reference package (imported as `clip`).
import clip
import torch
from PIL import Image

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Preprocess one image into a batch of size 1 and tokenise the candidate prompts.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalise so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity times the learned logit scale (1 / tau), softmaxed over labels.
    logits = (image_features @ text_features.T) * model.logit_scale.exp()
    probs = logits.softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))

Tip: Prompt wording matters. 'a photo of a {label}' beats a bare '{label}', and an ensemble of several templates ('a photo of a {label}', 'a picture of a {label}', 'a close-up of a {label}') beats any single prompt. Zero-shot accuracy is sensitive enough to the template that prompt engineering became its own micro-discipline.
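Prompt ensembling is usually done in embedding space: encode every template for a class, normalise, average, and normalise again. The sketch below shows only that bookkeeping. It assumes a hypothetical stand-in encoder, toy_encode_text, which returns a deterministic pseudo-random vector per prompt; in a real pipeline that function would be replaced by the model's text encoder.

```python
import zlib

import numpy as np

TEMPLATES = [
    "a photo of a {label}",
    "a picture of a {label}",
    "a close-up of a {label}",
]


def toy_encode_text(prompt: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for a text encoder: one fixed random vector per prompt."""
    rng = np.random.default_rng(zlib.crc32(prompt.encode("utf-8")))
    return rng.normal(size=dim)


def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def build_zero_shot_classifier(labels, templates=TEMPLATES, encode=toy_encode_text):
    """Return one unit-norm embedding per label, averaged over all templates."""
    class_embeddings = []
    for label in labels:
        prompts = [template.format(label=label) for template in templates]
        per_prompt = normalize(np.stack([encode(p) for p in prompts]))
        class_embeddings.append(normalize(per_prompt.mean(axis=0)))
    return np.stack(class_embeddings)


classifier = build_zero_shot_classifier(["dog", "cat", "bird"])
print(classifier.shape)  # (3, 16)
```

The resulting matrix plays the role of text_features in the zero-shot example: it is computed once per label set and reused for every image.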

Variants and architectural choices

Every encoder family that followed CLIP made the same broad bet — dual encoders, shared embedding, contrastive loss — and tuned the data, the loss or the backbone. The table below is the practical decision matrix for 2026 deployments. All are dual-encoder; all produce a shared embedding space; all support zero-shot classification and embedding-based retrieval.

Variant	Loss	Data	When to choose
OpenAI CLIP	Symmetric InfoNCE	WIT 400M (proprietary)	Reproducibility against published results; pinned legacy systems.
OpenCLIP	Symmetric InfoNCE	LAION-2B / LAION-5B (open)	Default open CLIP for new work; many trained scales available.
SigLIP / SigLIP 2	Per-pair sigmoid	WebLI (proprietary)	Better scaling, small-batch friendly, multimodal LLM vision tower.
EVA-CLIP / EVA-02-CLIP	Symmetric InfoNCE on EVA backbone	Public mix	Largest open backbones (up to 18B); strongest zero-shot.
DFN5B-CLIP	Symmetric InfoNCE	DFN (filtered web)	Best open zero-shot when data quality matters more than quantity.
Chinese-CLIP / MetaCLIP-multilingual	Symmetric InfoNCE	Multilingual web	Non-English retrieval workloads.

Note: For new image-search or retrieval work in 2026 the default Yobitel recommendation is SigLIP 2 (better scaling, current public weights, smaller batch requirements). For research-reproduction or content-moderation workloads where ImageNet-style zero-shot accuracy is the bar, EVA-02-CLIP or DFN5B-CLIP win.
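The 'per-pair sigmoid' entry in the table is the main departure from CLIP's objective. As a rough NumPy sketch: every image-text pair in the batch becomes an independent binary classification, with label +1 on the diagonal and -1 elsewhere, so no batch-wide softmax normalisation is needed. The scale t and bias b are fixed here at the initial values reported for SigLIP (t = 10, b = -10) and would be learned in practice; sigmoid_loss is an illustrative name.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def sigmoid_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                 t: float = 10.0, b: float = -10.0) -> float:
    """Pairwise sigmoid loss: each (image, text) pair is its own binary problem."""
    n = image_emb.shape[0]
    logits = t * (l2_normalize(image_emb) @ l2_normalize(text_emb).T) + b
    labels = 2.0 * np.eye(n) - 1.0  # +1 for matched pairs, -1 for all others
    # -log(sigmoid(x)) == log(1 + exp(-x)) == logaddexp(0, -x)
    return float(np.logaddexp(0.0, -labels * logits).sum() / n)


matched = sigmoid_loss(np.eye(3), np.eye(3))
mismatched = sigmoid_loss(np.roll(np.eye(3), 1, axis=0), np.eye(3))
print(matched, mismatched)
```

Because each term depends only on its own pair, the loss does not need the full N x N softmax denominator, which is the usual explanation for SigLIP's friendlier behaviour at small batch sizes.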

When to use vs alternatives

CLIP-family encoders are the right answer when you need joint vision-language understanding from a frozen feature extractor. They are the wrong answer when your task is vision-only with dense spatial structure, or when you need understanding richer than coarse alignment.

Use CLIP for — image search by text query, zero-shot classification across an open vocabulary, content moderation against natural-language policies, conditioning generative models (Stable Diffusion, Imagen, video generators), aesthetic scoring, vision tower for multimodal LLMs.
Use DINOv2 instead for — vision-only dense prediction, semantic segmentation, depth estimation, instance retrieval where text queries are irrelevant. DINOv2's self-supervised features have stronger patch-level spatial structure.
Use a captioning model (BLIP-2, InternVL) instead for — generating textual descriptions of images. CLIP scores existing text against images; it does not produce text.
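Use a region-grounded model (Grounding DINO, OWL-ViT) instead for — open-vocabulary object detection. These align text with image regions rather than whole images and add a detection head; OWL-ViT builds on CLIP encoders, while Grounding DINO pairs a BERT text encoder with a DINO detector.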
Use a multimodal LLM (LLaVA, Qwen-VL, InternVL) instead for — visual reasoning, document understanding, complex VQA. CLIP cannot reason about composition or perform multi-step inference.

Trade-offs and known limitations

CLIP's strengths are also its limits. The contrastive objective rewards coarse alignment, which is exactly what makes it transferable, but it leaves several systematic blind spots.

Short text — the text encoder is capped at 77 tokens, including the start and end markers. Longer captions are truncated; long-form retrieval needs alternative encoders (e.g., T5-based) or chunked CLIP with score aggregation.
Fine-grained discrimination — CLIP is strong on coarse categories ('a dog' vs 'a saxophone') and weaker on closely related ones ('a Cardigan Welsh Corgi' vs 'a Pembroke Welsh Corgi'). Fine-tuned or domain-specific encoders win here.
Compositional understanding — given 'a red cube on top of a blue sphere', CLIP often confuses which attribute belongs to which object. It tends to treat the caption as a bag of words rather than respecting word order and spatial composition.
Web bias — training data inherits the biases of public web scrapes. Demographic, geographic and aesthetic bias is documented and consequential for any downstream content-moderation deployment.
Resolution — most variants train at 224 or 336 pixels; details below that resolution can be lost. SigLIP 2 So400m at 384 or 448 is the modern higher-resolution choice.
Adversarial fragility — typographic attacks (a sticker reading 'iPod' on an apple) can fool CLIP into classifying the apple as an iPod. Documented; do not assume robustness on adversarial inputs.
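For the 77-token cap, the chunk-and-aggregate workaround can be sketched as follows. The context length counts the start and end tokens, so each window holds at most 75 content tokens. The sketch covers only the bookkeeping: it assumes token ids come from the model's own tokeniser and chunk embeddings from its text encoder, both omitted here, and the function names are hypothetical. Whether max or mean aggregation works better is workload-dependent.

```python
import numpy as np

CONTEXT_LENGTH = 77  # CLIP's text context, including start and end tokens


def chunk_token_ids(token_ids, context_length=CONTEXT_LENGTH):
    """Split token ids into windows that leave room for start and end tokens."""
    window = context_length - 2
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


def aggregate_score(image_emb, chunk_embs, how="max"):
    """Combine per-chunk cosine similarities into one caption-level score."""
    sims = chunk_embs @ image_emb  # assumes all embeddings are unit-norm
    return float(sims.max() if how == "max" else sims.mean())


chunks = chunk_token_ids(list(range(200)))
print([len(c) for c in chunks])  # [75, 75, 50]

# Toy unit-norm embeddings standing in for encoder outputs.
image_emb = np.array([1.0, 0.0])
chunk_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(aggregate_score(image_emb, chunk_embs, "max"))  # 1.0
print(aggregate_score(image_emb, chunk_embs, "mean"))
```

Max aggregation rewards a caption if any one passage matches the image; mean aggregation rewards captions that match throughout.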

Practical implementation notes

CLIP-family encoders are deployed today through a small set of well-maintained libraries. The choice between them is usually driven by which variant you are running and which serving framework owns the rest of your stack.

OpenAI CLIP (openai/CLIP) — MIT-licensed reference implementation. Useful for legacy reproducibility; rarely the right choice for new deployments.
OpenCLIP (mlfoundations/open_clip) — community-maintained, broad checkpoint catalogue (LAION-2B, LAION-5B, DataComp, DFN). The de facto default for open CLIP work.
Hugging Face Transformers — CLIPModel, SiglipModel, Owlv2Model and friends. The right choice when CLIP sits inside a larger pipeline that already uses Transformers.
ONNX / TensorRT export — both encoders are dense transformers that export cleanly to ONNX and TensorRT FP16. Production deployments behind Triton typically export both encoders separately and compose them in an ensemble.
Vector database integration — CLIP embeddings (512 to 1024 dim) drop directly into Milvus, Qdrant, Weaviate, pgvector, or any cosine-similarity index. Normalise embeddings before indexing.
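The 'normalise before indexing' advice is easy to check with a brute-force search. The sketch below is a NumPy stand-in for what a vector database does internally, not the API of any of the products listed; build_index and search are hypothetical names and the three-dimensional vectors are toy data.

```python
import numpy as np


def build_index(embeddings) -> np.ndarray:
    """L2-normalise embeddings once, at indexing time."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)


def search(index: np.ndarray, query, k: int = 3):
    """Return the top-k (row, cosine similarity) pairs for one query embedding."""
    query = np.asarray(query, dtype=np.float32)
    query = query / np.linalg.norm(query)
    scores = index @ query  # dot product of unit vectors is cosine similarity
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy 3-dimensional "embeddings"; real CLIP vectors have 512 to 1024 dimensions.
index = build_index([[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]])
hits = search(index, [0.0, 5.0, 0.0], k=2)
print(hits)
```

With unit-norm vectors, inner-product and cosine indexes return the same ranking, so either metric can be configured in the database.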

Warning: OpenAI's original CLIP repository has not been actively maintained since 2022. New deployments should use OpenCLIP, SigLIP via Hugging Face, or EVA-CLIP for current weights, better tooling and continued security updates.

Where this fits in the Yobitel stack

CLIP-family encoders show up in two places on Yobitel infrastructure. Inside Yobibyte, multimodal recipes route image inputs through a CLIP-family vision tower — the platform picks the specific variant (SigLIP 2 by default for new workloads, OpenCLIP or EVA-CLIP when the customer pins a specific checkpoint) based on the workspace's stated task and SLO. Customers consuming a multimodal endpoint through Yobibyte do not see the encoder choice; they see the embedding endpoint and the downstream classifier or LLM.

Outside Yobibyte, Yobitel NeoCloud customers building image-search and content-moderation products typically self-deploy OpenCLIP or SigLIP behind Triton on L4 or L40S, with embeddings indexed in Milvus, Qdrant or pgvector. The encoder fits on a single accelerator; throughput scales linearly with replicas. For sovereign workloads, the NeoCloud UK region runs the same OpenCLIP checkpoints under NCSC OFFICIAL alignment.

Where CLIP relevance comes up in adjacent Yobitel surfaces: Grounding DINO (open-vocabulary detection used in MediQuery research pipelines) applies the same text-image alignment idea at region level, using a BERT text encoder rather than CLIP's; SAM 2 prompts can be composed with CLIP-based class proposals for fully-automated segmentation; generative video and image workspaces on Yobibyte condition their backbones on CLIP or T5 text embeddings. InferenceBench tracks SigLIP and OpenCLIP encoder throughput-per-dollar across Yobitel and peer providers, which is useful when weighing a self-managed deployment against a managed Yobibyte alternative.

Yobibyte — multimodal recipes route image inputs through CLIP-family vision towers.
Yobitel NeoCloud — L4 / L40S / H100 capacity for self-deployed OpenCLIP and SigLIP.
InferenceBench — public encoder throughput-per-dollar tracking.
SigLIP and DINOv2 — the two most common encoder companions to CLIP in 2026 stacks.
