TL;DR
- Introduced by Ravi et al. at Meta FAIR in 'SAM 2: Segment Anything in Images and Videos' (arXiv:2408.00714, 29 July 2024). Code and weights are released under Apache 2.0; the accompanying SA-V dataset (~51K videos, ~643K masklets) is released separately under CC BY 4.0.
- Extends the original Segment Anything Model (Kirillov et al., 2023) to video by adding a memory bank and memory-attention block that propagate object identity across frames, while treating images as single-frame video to keep one model architecture.
- Four backbone variants — Tiny / Small / Base+ / Large — built on Hiera (Hierarchical ViT) image encoders. Tiny runs at video frame rates on L4; Large delivers the highest-quality masks on L40S / H100.
- Ships two predictors out of the box — `SAM2ImagePredictor` for images and `SAM2VideoPredictor` for video — both prompted by point clicks, bounding boxes or coarse masks on any frame.
- Yobitel's MediQuery uses SAM 2 as the segmentation engine for radiology annotation. Yobibyte exposes SAM 2 as a managed image- and video-segmentation endpoint, sized automatically per workspace SLO across Yobitel NeoCloud capacity.
Overview
SAM 2 is the second-generation Segment Anything model from Meta FAIR. The first SAM (Kirillov et al., arXiv:2304.02643, April 2023) introduced the idea of a promptable foundation model for image segmentation — a single model that, given a sparse prompt (a point, a box, a coarse mask, or a text embedding), would return a high-quality mask for any object in any image. It was trained on SA-1B (11M images, 1.1B masks) and instantly became the default segmentation engine for annotation tooling, medical-imaging labelling, and many CV preprocessing pipelines.
SAM 2, released on 29 July 2024, generalises the same promptable paradigm to video. The core insight is that a video is just a sequence of images that share object identity over time. SAM 2 adds a memory bank that stores feature representations from previously prompted and predicted frames, and a memory-attention block that lets the per-frame decoder condition on that history. Prompt the object once on one frame and SAM 2 tracks the corresponding masklet through the rest of the clip; correct it mid-clip with a positive or negative click and the corrected state propagates forward.
On Yobitel infrastructure SAM 2 powers two production surfaces. Inside MediQuery — Yobitel's clinical-imaging AI application — SAM 2 is the segmentation engine clinicians use to annotate radiology series for downstream analytics and model training. On Yobibyte, SAM 2 is exposed as a managed image- and video-segmentation endpoint that customers consume from a workspace without touching CUDA or Triton. This entry helps you stand up SAM 2 in production — picking the right variant, flags and SKU for interactive annotation, offline batch masking or live video tracking — whether you are running raw upstream on your own cluster or routing through Yobibyte's managed endpoint.
Quick start
The example below installs SAM 2, loads the Large variant, and demonstrates two flows: a single-image segmentation given a point click, and a video segmentation given a click on frame 0 with propagation through the rest of the clip. Both flows run on a single H100, L40S or L4 with appropriate VRAM headroom.
```python
# 1. Install (from Meta's official package)
# pip install "sam2 @ git+https://github.com/facebookresearch/sam2"

# 2. Image segmentation
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.build_sam import build_sam2
from PIL import Image
import numpy as np

sam2 = build_sam2(
    config_file="configs/sam2_hiera_l.yaml",
    ckpt_path="checkpoints/sam2_hiera_large.pt",
    device="cuda",
)
predictor = SAM2ImagePredictor(sam2)
image = np.array(Image.open("xray.png").convert("RGB"))
predictor.set_image(image)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[512, 384]]),  # click on the object
    point_labels=np.array([1]),           # 1 = foreground
    multimask_output=True,
)

# 3. Video segmentation with cross-frame propagation
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(
    config_file="configs/sam2_hiera_l.yaml",
    ckpt_path="checkpoints/sam2_hiera_large.pt",
)
state = video_predictor.init_state(video_path="clip.mp4")

# Click on the object in frame 0
_, _, _ = video_predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[320, 240]]),
    labels=np.array([1]),
)

# Propagate the masklet through the rest of the video
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(state):
    pass  # write per-frame masks to disk / queue
```
How it works
SAM 2 is a six-component model wrapped in a per-frame inference loop with cross-frame memory. The image encoder (Hiera, a hierarchical vision transformer pre-trained as a masked autoencoder) processes each input frame into multi-scale feature maps. A memory-attention block lets the current frame cross-attend to features stored from earlier prompted and predicted frames. The prompt encoder turns user input (points, boxes, mask hints) into prompt embeddings, and a lightweight transformer mask decoder produces three candidate masks per object plus IoU confidence scores. A memory encoder then distils each predicted mask and its frame features into compact entries for the memory bank.
The memory bank is the architectural piece that makes video work. It is a FIFO buffer of recent frame features augmented with the features of prompted frames — small, typically a handful of frames worth — but enough to preserve object identity across motion, occlusion and brief out-of-view periods. When a user adds a corrective click mid-clip, the corrected frame enters the memory bank and the propagation continues from that corrected state. The result is interactive: every refinement is immediately reflected forward through the rest of the video.
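The retention policy in this paragraph can be illustrated with a toy data structure. This is a minimal sketch, not SAM 2's implementation: the class name `ToyMemoryBank`, its methods and the window size are assumptions made for the example, and strings stand in for feature tensors.

```python
from collections import deque


class ToyMemoryBank:
    """Toy policy: recent frames are FIFO, prompted frames are always kept."""

    def __init__(self, max_recent=6):
        # The oldest unprompted entry is dropped automatically once the deque is full.
        self.recent = deque(maxlen=max_recent)
        # Prompted (or corrected) frames are retained regardless of age.
        self.prompted = {}

    def add(self, frame_idx, features, prompted=False):
        if prompted:
            self.prompted[frame_idx] = features
        else:
            self.recent.append((frame_idx, features))

    def context(self):
        """Return the (frame_idx, features) pairs the current frame would attend to."""
        merged = dict(self.recent)
        merged.update(self.prompted)
        return sorted(merged.items())


bank = ToyMemoryBank(max_recent=3)
bank.add(0, "feat-0", prompted=True)   # initial click on frame 0
for i in range(1, 8):
    bank.add(i, f"feat-{i}")           # propagated frames; only the last 3 survive
bank.add(8, "feat-8", prompted=True)   # corrective click mid-clip
context_frames = [idx for idx, _ in bank.context()]
```

In this toy run the context holds the two prompted frames plus the three most recent propagated frames, which is the behaviour the paragraph describes: the seed and any corrections stay available while ordinary frames age out.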
For images, SAM 2 short-circuits the video machinery — the memory bank is empty, the memory-attention block is bypassed, and the model behaves like a faster, higher-quality version of original SAM. This is why the same model can serve both `SAM2ImagePredictor` and `SAM2VideoPredictor` from one weights file. On Yobibyte's managed endpoint, the platform picks the predictor based on the input type (single image vs MP4 / RTSP feed) so workspace customers do not need to think about it.
- Image encoder — Hiera ViT (T / S / B+ / L) with MAE pretraining. Produces multi-scale per-frame features.
- Memory attention — cross-attends current-frame features with the memory bank's stored features.
- Prompt encoder — sparse (points, boxes) and dense (mask) prompt encoders, inherited from SAM.
- Mask decoder — lightweight transformer decoder emitting three candidate masks plus IoU scores.
- Memory bank — FIFO buffer of recent frames plus prompted-frame features; small but sufficient for identity tracking.
- Memory encoder — distils mask + image features into compact memory tokens before storage.
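`multimask_output=True` returns three candidates, each with a predicted-IoU score. Pick the highest-scoring candidate for the model's best single estimate. The score measures expected mask quality rather than granularity, so when you want a coarser whole-object selection, compare the candidates' mask areas and take the largest instead of relying on the score.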
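Candidate selection can be sketched as a small helper. This is an illustrative sketch under stated assumptions: `pick_candidate` and its `prefer` values are hypothetical names, and choosing the largest-area mask for a whole-object selection is one reasonable heuristic rather than a rule from the SAM 2 package. For masks held as a NumPy array of shape `(N, H, W)`, the areas could be computed as `masks.reshape(len(masks), -1).sum(axis=1)`.

```python
def pick_candidate(scores, areas, prefer="quality"):
    """Return the index of the candidate mask to keep.

    scores: predicted-IoU score per candidate.
    areas:  foreground pixel count per candidate.
    prefer: "quality" keeps the highest predicted IoU,
            "whole_object" keeps the largest mask.
    """
    if len(scores) == 0 or len(scores) != len(areas):
        raise ValueError("scores and areas must be non-empty and the same length")
    if prefer == "quality":
        key = scores
    elif prefer == "whole_object":
        key = areas
    else:
        raise ValueError(f"unknown preference: {prefer!r}")
    return max(range(len(key)), key=lambda i: key[i])


# Illustrative values only, not real model output.
example_scores = [0.91, 0.84, 0.77]
example_areas = [1200, 5400, 300]
best = pick_candidate(example_scores, example_areas)
whole = pick_candidate(example_scores, example_areas, prefer="whole_object")
```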
Reference and specifications
The table below is the canonical reference for the four SAM 2 variants and the most-used predictor flags. Parameter counts and FPS are from Meta's published numbers and Yobitel lab measurements on TensorRT-exported engines; treat throughput as planning anchors, not contractual.
| Variant | Image encoder | Total params | Use |
|---|---|---|---|
| SAM 2 Tiny | Hiera-T | ~39M | Edge devices, interactive annotation, mobile |
| SAM 2 Small | Hiera-S | ~46M | L4 video frame rates, lightweight server |
| SAM 2 Base+ | Hiera-B+ | ~81M | Production segmentation on L40S, MediQuery default |
| SAM 2 Large | Hiera-L | ~224M | Highest quality, offline batch annotation, H100 / L40S |
All four variants share the memory bank, prompt encoder and mask decoder design — only the Hiera image encoder scales. Real-time interactive segmentation on H100 is comfortable across all variants.
Workload patterns
Three deployment shapes cover the bulk of production SAM 2 use: interactive annotation behind a UI, offline batch masking across a video archive, and live tracking on streamed video. Each pattern targets a different latency budget, batch size and quality bar. These are also the three shapes Yobibyte automates for managed customers — the flags below are what a team running raw upstream signs up to hand-tune; on Yobibyte the workspace SLO derives them.
- Pattern A: interactive annotation. Single user, sub-200 ms response to each click. Use SAM 2 Base+ on L40S or SAM 2 Large on H100 with `multimask_output=True`.
- Pattern B: batch masking across a video archive. Throughput-optimised; per-clip latency is irrelevant. Pre-extract frames to local SSD and run SAM 2 Large with `propagate_in_video` in long sweeps.
- Pattern C: live RTSP tracking. Single-stream propagation behind a lightweight gateway; use SAM 2 Small or Base+ for sustained 25-30 FPS on L4 / L40S.
```python
# Pattern A: interactive annotation flow (sub-200 ms per click)
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.build_sam import build_sam2

sam2 = build_sam2("configs/sam2_hiera_b+.yaml",
                  "checkpoints/sam2_hiera_base_plus.pt",
                  device="cuda")
predictor = SAM2ImagePredictor(sam2)
predictor.set_image(image)  # encode once per image; `image` is the RGB array from the UI
# every subsequent click reuses the cached image embedding
masks, scores, _ = predictor.predict(
    point_coords=clicks,  # (N, 2) array of (x, y) clicks collected so far
    point_labels=labels,  # (N,) array: 1 = foreground, 0 = background
    multimask_output=True,
)

# Pattern B: offline batch masking across a video archive
import os
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(
    "configs/sam2_hiera_l.yaml",
    "checkpoints/sam2_hiera_large.pt",
)
for clip in os.listdir("archive/"):
    state = video_predictor.init_state(video_path=f"archive/{clip}",
                                       offload_video_to_cpu=True)
    # seed object from a pre-computed bounding box per clip
    # (`seed_box` and `save_mask` are user-supplied, not part of sam2)
    video_predictor.add_new_points_or_box(state, frame_idx=0,
                                          obj_id=1, box=seed_box[clip])
    for fi, oid, ml in video_predictor.propagate_in_video(state):
        save_mask(clip, fi, ml)

# Pattern C: live RTSP tracking with a fixed memory window
# NOTE: upstream init_state() reads a JPEG frame directory or an MP4 file;
# passing an RTSP URL assumes the gateway exposes the stream in a form it can read.
state = video_predictor.init_state(video_path="rtsp://camera/01",
                                   async_loading_frames=True,
                                   offload_state_to_cpu=False)
```
`set_image()` and `init_state()` both run the encoder, which is the expensive part of SAM 2. For interactive workflows, encode once and re-predict on every click; for batch workflows, parallelise across multiple workers, not multiple encode calls.
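One way to parallelise a batch sweep across workers is to split the archive up front so that each worker owns a disjoint set of clips. The helper below is a hypothetical sketch (the name `shard_clips` and the round-robin policy are assumptions, not part of the `sam2` package); each worker would then build its own video predictor and run the Pattern B loop over its shard.

```python
def shard_clips(clips, num_workers):
    """Round-robin clips into one list per worker.

    Sorting first makes the assignment deterministic across runs,
    so a restarted worker picks up the same clips.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards = [[] for _ in range(num_workers)]
    for i, clip in enumerate(sorted(clips)):
        shards[i % num_workers].append(clip)
    return shards


shards = shard_clips(["c.mp4", "a.mp4", "b.mp4", "e.mp4", "d.mp4"], num_workers=2)
```

Round-robin ignores clip length; if durations vary widely, sorting by duration and assigning the longest clips first would balance the workers better.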
Sizing and capacity planning
SAM 2 sizing is governed by three quantities: encoder cost per frame, memory-bank size per active object, and concurrent stream / user count. The table below assumes FP16 inference on Yobitel-deployable accelerators at 1024x1024 image resolution; FP8 on Hopper roughly doubles per-stream throughput. Mask decode is cheap; the encoder is the bottleneck.
VRAM budgets are dominated by activations and the memory bank for video. SAM 2 Large with a 64-frame memory window holds roughly 4-6 GB of active state per video stream; SAM 2 Small holds roughly 1 GB. Multi-object tracking multiplies this by the number of tracked objects per frame. For annotation tooling on H100, SAM 2 Large supports 50+ concurrent annotators per GPU. For live RTSP tracking, plan one accelerator per 4-8 streams at SAM 2 Base+ depending on resolution.
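The VRAM arithmetic in this paragraph can be turned into a rough planning helper. This is a sketch under stated assumptions: `streams_per_gpu` is a hypothetical function, the 4 GB reserve for weights and workspace is a placeholder to tune per deployment, and the result bounds VRAM only (encoder throughput is a separate limit). The per-stream state figures are the ones quoted above; the card sizes are the standard 24 GB (L4) and 48 GB (L40S).

```python
def streams_per_gpu(vram_gb, state_gb_per_object, objects_per_stream=1,
                    weights_and_workspace_gb=4.0):
    """Estimate how many video streams fit in VRAM.

    Per-stream state scales linearly with the number of tracked objects.
    """
    usable = vram_gb - weights_and_workspace_gb
    per_stream = state_gb_per_object * objects_per_stream
    if usable <= 0 or per_stream <= 0:
        return 0
    return int(usable // per_stream)


# SAM 2 Large at the upper 6 GB estimate on a 48 GB L40S
large_on_l40s = streams_per_gpu(48, 6.0)
# Same card, two tracked objects per stream
large_two_objects = streams_per_gpu(48, 6.0, objects_per_stream=2)
# SAM 2 Small at roughly 1 GB per stream on a 24 GB L4
small_on_l4 = streams_per_gpu(24, 1.0)
```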
| Variant | L4 (single stream FPS) | L40S (single stream FPS) | H100 (single stream FPS) |
|---|---|---|---|
| SAM 2 Tiny | ~42 FPS | ~95 FPS | ~140 FPS |
| SAM 2 Small | ~32 FPS | ~78 FPS | ~120 FPS |
| SAM 2 Base+ | ~22 FPS | ~55 FPS | ~85 FPS |
| SAM 2 Large | ~9 FPS | ~24 FPS | ~42 FPS |
On Yobitel NeoCloud, the canonical SAM 2 deployment is two-tier — an annotation tier on L40S / H100 for interactive response, and a batch propagation tier on L4 fleets for offline archive masking. InferenceBench publishes per-variant throughput-per-dollar against peer providers.
Limits and quotas
Raw SAM 2 imposes no hard limits beyond available VRAM. The values below are operational ceilings observed in production and the corresponding defaults enforced on Yobibyte managed SAM 2 endpoints.
| Limit | Raw upstream | Yobibyte managed default | Notes |
|---|---|---|---|
| Max image resolution | VRAM-bound | 4096 x 4096 | Higher resolutions billed as a separate workspace tier. |
| Max video resolution | VRAM-bound | 1920 x 1080 (Full HD) | Sample to lower resolution for higher object counts. |
| Max objects tracked per video | GPU-bound | 32 | Memory bank grows linearly with object count. |
| Max memory window | GPU-bound | 64 frames | Longer windows improve identity recovery but cost VRAM. |
| Max concurrent annotators per endpoint | Worker-bound | 50 (H100), 20 (L40S) | Yobibyte autoscales replicas above the threshold. |
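A client can check a video request against the managed defaults in the table before submitting it. The sketch below is hypothetical: `MANAGED_DEFAULTS` and `check_video_request` are illustrative names, not part of any Yobibyte SDK, and the values are copied from the table above. It treats the resolution cap as landscape width by height, which is a simplifying assumption.

```python
MANAGED_DEFAULTS = {
    "max_video_size": (1920, 1080),  # width, height
    "max_objects": 32,
    "max_memory_window": 64,         # frames
}


def check_video_request(width, height, num_objects, memory_window):
    """Return a list of problems; an empty list means the request fits the defaults."""
    problems = []
    max_w, max_h = MANAGED_DEFAULTS["max_video_size"]
    if width > max_w or height > max_h:
        problems.append(f"resolution {width}x{height} exceeds {max_w}x{max_h}")
    if num_objects > MANAGED_DEFAULTS["max_objects"]:
        problems.append(f"{num_objects} objects exceeds {MANAGED_DEFAULTS['max_objects']}")
    if memory_window > MANAGED_DEFAULTS["max_memory_window"]:
        problems.append(
            f"memory window {memory_window} exceeds {MANAGED_DEFAULTS['max_memory_window']}"
        )
    return problems


ok_request = check_video_request(1920, 1080, num_objects=4, memory_window=64)
bad_request = check_video_request(3840, 2160, num_objects=40, memory_window=64)
```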
Observability
Production SAM 2 monitoring focuses on encode time, memory-bank size and mask-quality scores rather than raw model latency. The minimum useful Prometheus surface for a self-managed deployment:
- Per-frame encode latency — histogram. Spikes indicate encoder contention or VRAM pressure.
- Per-click predict latency — histogram for interactive flows; target p95 below 200 ms.
- Memory bank size (frames) and (bytes) — per active video session.
- Mask IoU score distribution — sliding-window histogram; leftward drift signals model degradation or domain shift.
- Active sessions — gauge per endpoint; drives autoscaling decisions.
- Standard GPU telemetry — DCGM exporter for SM occupancy, memory pressure, NVLink utilisation.
On Yobibyte the same metrics surface in the workspace dashboard without additional instrumentation; on self-hosted Triton, scrape `nv_inference_request_duration_us` alongside per-frame encode timing emitted from the SAM 2 wrapper.
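For a self-managed wrapper, the p95 click-latency target can be checked in-process before it is exported to a metrics backend. This is a minimal standard-library sketch; `ClickLatencyWindow` and its defaults are assumptions made for the example, and a production deployment would more likely publish a Prometheus histogram and compute the percentile in the query layer.

```python
from collections import deque
from statistics import quantiles


class ClickLatencyWindow:
    """Sliding window of per-click predict latencies with a p95 check."""

    def __init__(self, size=500, target_ms=200.0):
        self.samples = deque(maxlen=size)  # oldest samples drop off automatically
        self.target_ms = target_ms

    def observe(self, latency_ms):
        self.samples.append(float(latency_ms))

    def p95(self):
        """Return the 95th percentile, or None with fewer than two samples."""
        if len(self.samples) < 2:
            return None
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        return quantiles(self.samples, n=20, method="inclusive")[-1]

    def within_target(self):
        value = self.p95()
        return value is None or value <= self.target_ms


window = ClickLatencyWindow(target_ms=200.0)
for latency in range(1, 101):  # synthetic latencies of 1..100 ms
    window.observe(latency)
healthy = window.within_target()
```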
Cost and FinOps
Most SAM 2 cost comes from accelerator-hours; the model weights and the SA-V dataset are free. The ranges below are Yobitel NeoCloud on-demand reference prices for common SAM 2 deployment SKUs; reserved pricing is materially lower at 12+ month commitment. All figures USD; treat as planning anchors.
| SKU | Yobitel NeoCloud on-demand | Right-sized SAM 2 deployment |
|---|---|---|
| NVIDIA L4 | ~$0.85 / GPU / hour | Batch masking with SAM 2 Tiny / Small; 4-8 live RTSP streams per GPU |
| NVIDIA L40S | ~$2.40 / GPU / hour | Annotation tier with SAM 2 Base+ for ~20 concurrent annotators |
| NVIDIA H100 SXM5 | ~$3.80 / GPU / hour | SAM 2 Large interactive; ~50 concurrent annotators per GPU |
| Yobibyte managed endpoint | Per workspace tier | Pay per request / per minute of video processed, no capacity planning |
For low-volume annotation workloads (under 1,000 clicks / day), Yobibyte's per-request pricing is typically cheaper than reserving a dedicated L40S. For sustained 24/7 video archive masking, reserved L4 capacity wins.
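To compare SKUs for batch masking, the on-demand prices can be combined with the single-stream FPS figures from the sizing table into a cost per million frames. The helper below is a back-of-envelope sketch: `usd_per_million_frames` is a hypothetical name, and it assumes one stream keeps the GPU fully busy at the quoted FPS, which real pipelines rarely sustain once decode and I/O are included.

```python
def usd_per_million_frames(gpu_usd_per_hour, fps):
    """Cost of processing one million frames at a sustained frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    frames_per_hour = fps * 3600
    return gpu_usd_per_hour / frames_per_hour * 1_000_000


# SAM 2 Small on L4: ~$0.85 / hour at ~32 FPS
l4_small = usd_per_million_frames(0.85, 32)
# SAM 2 Large on H100: ~$3.80 / hour at ~42 FPS
h100_large = usd_per_million_frames(3.80, 42)
```

Under these assumptions the L4 with SAM 2 Small comes to roughly $7.4 per million frames and the H100 with SAM 2 Large to roughly $25 per million frames, so the choice comes down to whether the archive needs Large-quality masks.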
Security and compliance
SAM 2's permissive licence makes the legal posture simple — the operational concerns are data residency and privacy. Inputs to SAM 2 are typically images or video that may contain identifiable people, sensitive medical content, or operational footage.
- Licence — SAM 2 weights and code are Apache 2.0. No copyleft concerns, suitable for closed-source SaaS.
- Data residency — for NCSC OFFICIAL workloads, pin the endpoint to the Yobitel UK Sovereign region; for HIPAA / patient imaging, use MediQuery's dedicated SAM 2 deployment, which routes through HIPAA-aligned Yobitel infrastructure.
- Input redaction — pre-blur or pre-redact faces before feeding video to SAM 2 if downstream consumers do not need identifiable subjects.
- Audit trail — Yobibyte logs per-prompt input hashes, predicted mask hashes, and predictor variant; sufficient evidence for most internal audit requirements.
- Model supply chain — pin a specific commit of the `sam2` package in production and verify checkpoint SHA against Meta's published values.
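Checkpoint verification from the supply-chain item can be done with the standard library before the weights are loaded. This is a minimal sketch: `sha256_of_file` and `verify_checkpoint` are illustrative helper names, and the expected digest must come from whatever trusted source your team pins, since no value is given here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256):
    """Raise ValueError if the file's SHA-256 does not match the pinned value."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return actual
```

Calling `verify_checkpoint("checkpoints/sam2_hiera_large.pt", pinned_digest)` at service start-up, with `pinned_digest` read from configuration, makes a swapped or corrupted checkpoint fail fast rather than at first inference.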
Migration and alternatives
SAM 2 has clearly displaced the original SAM for general use; the meaningful alternatives in 2026 are domain-specific or different problem framings.
| Option | Licence | When to choose |
|---|---|---|
| SAM 2 (this entry) | Apache 2.0 | Default open promptable segmentation; image and video. |
| Original SAM | Apache 2.0 | Only if you are pinned to legacy SAM weights for reproducibility. |
| Mask R-CNN | Apache 2.0 | Closed-vocabulary instance segmentation where classes are known and stable. |
| YOLO11-seg (Ultralytics) | AGPL-3.0 / Enterprise | Real-time, single-stage instance segmentation without a prompt UI. |
| Grounding DINO + SAM 2 | Mixed | Text-prompted segmentation pipelines — Grounding DINO yields boxes that prompt SAM 2. |
| Yobibyte managed SAM 2 | Yobitel Service Terms | Skip the runtime; consume a hosted segmentation endpoint with SLA. |
For MediQuery-grade clinical annotation, the right answer is SAM 2 inside MediQuery rather than raw SAM 2. The HIPAA posture, audit logging and clinician-grade UI are not part of the upstream package.
Troubleshooting
The failure modes below cover roughly 80 percent of SAM 2 production tickets observed in Yobitel Managed Operations runbooks.
| Symptom | Likely cause | Remediation |
|---|---|---|
| Masklet drifts after occlusion | Memory bank window too short | Increase `max_obj_ptrs_in_encoder` (default 16) and `max_cond_frames_in_attn`. |
| Mask quality drops over time in video | Object appearance shifted beyond seed frame | Add a refinement click at the drift point; SAM 2 propagates the correction forward. |
| VRAM OOM on long video | `offload_video_to_cpu=False` with large clip | Set `offload_video_to_cpu=True` and `offload_state_to_cpu=True` for clips > 30 s. |
| Per-click latency spikes | `set_image()` re-running on each click | Cache the predictor's image state; only call `set_image()` when the image changes. |
| Masks under-segment (whole object missed) | Click placed on wrong scale | Use `multimask_output=True` and pick the higher-area candidate; or add a box prompt. |
| Tracking IDs swap between objects | Two visually similar objects in close proximity | Add a negative click on the wrong target to force separation; or track each object independently. |
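The per-click latency row (re-running `set_image()` on every click) can be guarded against with a thin wrapper that re-encodes only when the image changes. This is a hedged sketch: `EncodeOnceSession` and the caller-supplied `image_key` are assumptions made for the example; the wrapper relies only on the wrapped object exposing `set_image()` and `predict()`, as `SAM2ImagePredictor` does.

```python
class EncodeOnceSession:
    """Calls set_image() only when the image key changes between clicks."""

    def __init__(self, predictor):
        self.predictor = predictor
        self._current_key = None

    def predict(self, image, image_key, **prompt):
        # image_key is any stable identifier for the image,
        # for example its file path or a content hash.
        if image_key != self._current_key:
            self.predictor.set_image(image)  # expensive: runs the image encoder
            self._current_key = image_key
        return self.predictor.predict(**prompt)  # cheap: reuses the cached embedding
```

A UI handler would keep one `EncodeOnceSession` per annotator and pass the same `image_key` for every click on the same image, so only the first click pays the encoder cost.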
Where it fits in the Yobitel stack
SAM 2 sits inside two Yobitel surfaces. In MediQuery, SAM 2 is the segmentation engine clinicians use to annotate radiology series — CT slices, MR sequences, pathology slides — for downstream analytics and model training. The MediQuery deployment runs SAM 2 Base+ on L40S behind a HIPAA-aligned audit log and the clinician-facing UI; it is not a generic SAM 2 endpoint but a clinical product built on top of one. On Yobibyte, SAM 2 is exposed as a managed image- and video-segmentation endpoint: customers submit images or video to a workspace and Yobibyte picks the variant, the placement and the concurrency to meet the workspace SLO.
Yobitel NeoCloud provides the underlying L4 / L40S / H100 capacity for both surfaces and for any self-managed SAM 2 deployment customers prefer to run themselves. InferenceBench tracks SAM 2 throughput-per-dollar across Yobitel and peer providers for the four variants and three workload shapes — the same data informs Yobibyte's internal placement decisions via Omniscient Compute.
- [MediQuery](/products/ai-applications/mediquery) — clinical SAM 2 deployment for radiology annotation.
- [Yobibyte](/products/yobibyte) — managed SAM 2 image and video segmentation endpoint.
- [Yobitel NeoCloud](/services/neocloud) — L4 / L40S / H100 capacity for self-managed deployments.
- [InferenceBench](/products/inferencebench) — public SAM 2 throughput-per-dollar tracking.