SAM 2 — Segment Anything Model 2

TL;DR

Introduced by Ravi et al. at Meta FAIR in 'SAM 2: Segment Anything in Images and Videos' (arXiv:2408.00714, 29 July 2024), with code and checkpoints released under Apache 2.0 alongside the SA-V dataset (~51K videos, ~643K masklets).
Extends the original Segment Anything Model (Kirillov et al., 2023) to video by adding a memory bank and memory-attention block that propagate object identity across frames, while treating images as single-frame video to keep one model architecture.
Four backbone variants — Tiny / Small / Base+ / Large — built on Hiera (Hierarchical ViT) image encoders. Tiny runs at video frame rates on L4; Large delivers the highest-quality masks on L40S / H100.
Ships two predictors out of the box — `SAM2ImagePredictor` for images and `SAM2VideoPredictor` for video — both prompted by point clicks, bounding boxes or coarse masks on any frame.
Yobitel's MediQuery uses SAM 2 as the segmentation engine for radiology annotation. Yobibyte exposes SAM 2 as a managed image- and video-segmentation endpoint, sized automatically per workspace SLO across Yobitel NeoCloud capacity.

Overview

SAM 2 is the second-generation Segment Anything model from Meta FAIR. The first SAM (Kirillov et al., arXiv:2304.02643, April 2023) introduced the idea of a promptable foundation model for image segmentation: a single model that, given a sparse prompt (a point, a box or a coarse mask, with free-form text explored in the paper only as a proof of concept), returns a high-quality mask for the indicated object in any image. It was trained on SA-1B (11M images, 1.1B masks) and quickly became a common default segmentation engine for annotation tooling, medical-imaging labelling, and many CV preprocessing pipelines.

SAM 2, released on 29 July 2024, generalises the same promptable paradigm to video. The core insight is that a video is just a sequence of images that share object identity over time. SAM 2 adds a memory bank that stores feature representations from previously prompted and predicted frames, and a memory-attention block that lets the per-frame decoder condition on that history. Prompt the object once on one frame and SAM 2 tracks the corresponding masklet through the rest of the clip; correct it mid-clip with a positive or negative click and the corrected state propagates forward.

On Yobitel infrastructure SAM 2 powers two production surfaces. Inside MediQuery — Yobitel's clinical-imaging AI application — SAM 2 is the segmentation engine clinicians use to annotate radiology series for downstream analytics and model training. On Yobibyte, SAM 2 is exposed as a managed image- and video-segmentation endpoint that customers consume from a workspace without touching CUDA or Triton. This entry helps you stand up SAM 2 in production — picking the right variant, flags and SKU for interactive annotation, offline batch masking or live video tracking — whether you are running raw upstream on your own cluster or routing through Yobibyte's managed endpoint.

Quick start

The example below installs SAM 2, loads the Large variant, and demonstrates two flows: a single-image segmentation given a point click, and a video segmentation given a click on frame 0 with propagation through the rest of the clip. Both flows run on a single H100, L40S or L4 with appropriate VRAM headroom.

# 1. Install (from Meta's official package)
# pip install "sam2 @ git+https://github.com/facebookresearch/sam2"

# 2. Image segmentation
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.build_sam import build_sam2
from PIL import Image
import numpy as np

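# Config names and locations have changed between sam2 releases;
# match these paths to the commit you have pinned.
model = build_sam2(
    config_file="configs/sam2_hiera_l.yaml",
    ckpt_path="checkpoints/sam2_hiera_large.pt",
    device="cuda",
)
predictor = SAM2ImagePredictor(model)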

image = np.array(Image.open("xray.png").convert("RGB"))
predictor.set_image(image)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[512, 384]]),  # click on the object
    point_labels=np.array([1]),            # 1 = foreground
    multimask_output=True,
)

# 3. Video segmentation with cross-frame propagation
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(
    config_file="configs/sam2_hiera_l.yaml",
    ckpt_path="checkpoints/sam2_hiera_large.pt",
)

state = video_predictor.init_state(video_path="clip.mp4")

# Click on the object in frame 0
_, _, _ = video_predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[320, 240]]),
    labels=np.array([1]),
)

# Propagate the masklet through the rest of the video
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(state):
    pass  # write per-frame masks to disk / queue

How it works

SAM 2 is a six-component model wrapped in a per-frame inference loop with cross-frame memory. The image encoder (Hiera, a hierarchical vision transformer pre-trained as a masked autoencoder) processes each input frame into multi-scale feature maps. A memory-attention block lets the current frame cross-attend to features stored from earlier prompted and predicted frames. The prompt encoder turns user input (points, boxes, mask hints) into prompt embeddings, and a lightweight transformer mask decoder produces three candidate masks per object plus IoU confidence scores. A memory encoder then condenses each frame's prediction and image features into tokens for the memory bank, which later frames attend to.

The memory bank is the architectural piece that makes video work. It is a FIFO buffer of recent frame features augmented with the features of prompted frames — small, typically a handful of frames worth — but enough to preserve object identity across motion, occlusion and brief out-of-view periods. When a user adds a corrective click mid-clip, the corrected frame enters the memory bank and the propagation continues from that corrected state. The result is interactive: every refinement is immediately reflected forward through the rest of the video.
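As a rough mental model only, and not Meta's implementation, the memory bank described above can be pictured as a bounded FIFO of recent frames plus a separate store for prompted frames. The class name `ToyMemoryBank`, the capacity of three recent frames and the string placeholders standing in for feature tensors are all assumptions made for illustration.

```python
from collections import deque


class ToyMemoryBank:
    """Illustrative only: a FIFO of recent frames plus retained prompted frames."""

    def __init__(self, max_recent=6):
        # Oldest unprompted entries fall off automatically once the deque is full.
        self.recent = deque(maxlen=max_recent)
        # In this sketch, frames that received a user prompt live outside the FIFO.
        self.prompted = {}

    def add(self, frame_idx, features, prompted=False):
        if prompted:
            self.prompted[frame_idx] = features
        else:
            self.recent.append((frame_idx, features))

    def context(self):
        """Frame indices the current frame would attend to, oldest first."""
        recent_idx = [idx for idx, _ in self.recent]
        return sorted(set(self.prompted) | set(recent_idx))


bank = ToyMemoryBank(max_recent=3)
bank.add(0, "feat0", prompted=True)      # initial click on frame 0
for i in range(1, 6):
    bank.add(i, f"feat{i}")              # propagated frames
bank.add(6, "feat6", prompted=True)      # corrective click mid-clip
print(bank.context())
```

The point of the sketch is the eviction behaviour: frames 1 and 2 age out of the FIFO, while the prompted frames 0 and 6 stay available as anchors for the object's identity.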

For images, SAM 2 short-circuits the video machinery — the memory bank is empty, the memory-attention block is bypassed, and the model behaves like a faster, higher-quality version of original SAM. This is why the same model can serve both SAM2ImagePredictor and SAM2VideoPredictor from one weights file. On Yobibyte's managed endpoint, the platform picks the predictor based on the input type (single image vs MP4 / RTSP feed) so workspace customers do not need to think about it.

Image encoder — Hiera ViT (T / S / B+ / L) with MAE pretraining. Produces multi-scale per-frame features.
Memory attention — cross-attends current-frame features with the memory bank's stored features.
Prompt encoder — sparse (points, boxes) and dense (mask) prompt encoders, inherited from SAM.
Mask decoder — lightweight transformer decoder emitting three candidate masks plus IoU scores.
Memory bank — FIFO buffer of recent frames plus prompted-frame features; small but sufficient for identity tracking.
Memory encoder — distils mask + image features into compact memory tokens before storage.

Tip: multimask_output=True returns three candidates with predicted-IoU scores. Pick the highest-scoring candidate for the crispest output; when you want a coarser whole-object selection, pick the candidate with the largest mask area rather than the lowest score.
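One way to make candidate selection explicit is a small helper that chooses either by predicted IoU or by mask area. The helper name `pick_mask` and the synthetic 4x4 masks are assumptions for illustration; in a real flow `masks` and `scores` would come from `predictor.predict(..., multimask_output=True)`.

```python
import numpy as np


def pick_mask(masks, scores, prefer="score"):
    """Return (index, mask) of one candidate.

    prefer="score": highest predicted IoU.
    prefer="area":  largest foreground area (tends to be the whole object).
    """
    masks = np.asarray(masks)
    scores = np.asarray(scores)
    if prefer == "score":
        idx = int(np.argmax(scores))
    elif prefer == "area":
        areas = (masks > 0).reshape(len(masks), -1).sum(axis=1)
        idx = int(np.argmax(areas))
    else:
        raise ValueError(f"unknown preference: {prefer!r}")
    return idx, masks[idx]


# Synthetic stand-ins for predictor output: three 4x4 masks of growing extent.
masks = np.zeros((3, 4, 4), dtype=np.float32)
masks[0, :1, :1] = 1   # sub-part
masks[1, :2, :2] = 1   # part
masks[2, :, :] = 1     # whole object
scores = np.array([0.70, 0.92, 0.85])

best_idx, _ = pick_mask(masks, scores, prefer="score")
whole_idx, _ = pick_mask(masks, scores, prefer="area")
print(best_idx, whole_idx)
```

Selecting by area is a heuristic, not a guarantee that the largest candidate is the intended object, so an annotation UI would normally still let the user cycle through all three.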

Reference and specifications

The table below is the canonical reference for the four SAM 2 variants. Parameter counts follow Meta's published numbers; the throughput figures given under Sizing and capacity planning draw on Yobitel lab measurements on TensorRT-exported engines, so treat them as planning anchors, not contractual.

Variant	Image encoder	Total params	Use
SAM 2 Tiny	Hiera-T	~39M	Edge devices, interactive annotation, mobile
SAM 2 Small	Hiera-S	~46M	L4 video frame rates, lightweight server
SAM 2 Base+	Hiera-B+	~81M	Production segmentation on L40S, MediQuery default
SAM 2 Large	Hiera-L	~224M	Highest quality, offline batch annotation, H100 / L40S

Note: All four variants share the memory bank, prompt encoder and mask decoder design — only the Hiera image encoder scales. Real-time interactive segmentation on H100 is comfortable across all variants.

Workload patterns

Three deployment shapes cover the bulk of production SAM 2 use: interactive annotation behind a UI, offline batch masking across a video archive, and live tracking on streamed video. Each pattern targets a different latency budget, batch size and quality bar. These are also the three shapes Yobibyte automates for managed customers — the flags below are what a team running raw upstream signs up to hand-tune; on Yobibyte the workspace SLO derives them.

Pattern A: interactive annotation. Single user, sub-200 ms response to each click. Use SAM 2 Base+ on L40S or SAM 2 Large on H100 with multimask_output=True.
Pattern B: batch masking across a video archive. Throughput-optimised, with per-clip latency irrelevant. Pre-extract frames to local SSD and run SAM 2 Large with propagate_in_video in long sweeps.
Pattern C: live RTSP tracking. Single-stream propagation behind a lightweight gateway. Use SAM 2 Small on L4 or Base+ on L40S for sustained 25-30 FPS; per the sizing table, Base+ on L4 sits nearer 22 FPS.

# Pattern A: interactive annotation flow (sub-200 ms per click).
# image, clicks and labels are supplied by the annotation UI.
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.build_sam import build_sam2

model = build_sam2("configs/sam2_hiera_b+.yaml",
                   "checkpoints/sam2_hiera_base_plus.pt",
                   device="cuda")
predictor = SAM2ImagePredictor(model)
predictor.set_image(image)  # encode once per image
# every subsequent click reuses the cached image embedding
masks, scores, _ = predictor.predict(
    point_coords=clicks,
    point_labels=labels,
    multimask_output=True,
)

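# Pattern B: offline batch masking across a video archive.
# seed_box (clip name -> [x0, y0, x1, y1]) and save_mask() are placeholders
# for your own seeding and storage code.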
import os
from sam2.build_sam import build_sam2_video_predictor
predictor = build_sam2_video_predictor(
    "configs/sam2_hiera_l.yaml",
    "checkpoints/sam2_hiera_large.pt",
)
for clip in os.listdir("archive/"):
    state = predictor.init_state(video_path=f"archive/{clip}",
                                 offload_video_to_cpu=True)
    # seed object from a pre-computed bounding box per clip
    predictor.add_new_points_or_box(state, frame_idx=0,
                                    obj_id=1, box=seed_box[clip])
    for fi, oid, ml in predictor.propagate_in_video(state):
        save_mask(clip, fi, ml)

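# Pattern C: live tracking on streamed video.
# Upstream init_state() reads a video file or a directory of JPEG frames,
# not a live URL, so this assumes a gateway (not shown) that writes the RTSP
# feed to a local frame directory. "frames/camera01/" is a hypothetical path.
state = predictor.init_state(video_path="frames/camera01/",
                             async_loading_frames=True,
                             offload_state_to_cpu=False)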

Warning: set_image() runs the image encoder, which is the expensive part of SAM 2; in video, init_state() loads the frames and the encoder then runs on each frame as it is processed. For interactive workflows, encode once and re-predict on every click; for batch workflows, scale out by giving separate clips to separate workers rather than re-encoding the same frames.
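The encode-once rule can be enforced with a thin wrapper that only calls `set_image()` when the image actually changes. This is a sketch under stated assumptions: `CachedImageSession` and `FakePredictor` are hypothetical names, and the fake stands in for `SAM2ImagePredictor` so the example runs without a GPU.

```python
import hashlib


class CachedImageSession:
    """Calls predictor.set_image() only when the image bytes change."""

    def __init__(self, predictor):
        self.predictor = predictor
        self._current_key = None

    def predict(self, image_bytes, image_array, **prompt):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key != self._current_key:
            self.predictor.set_image(image_array)   # expensive: runs the encoder
            self._current_key = key
        return self.predictor.predict(**prompt)     # cheap: prompt + decoder only


class FakePredictor:
    """Stand-in for SAM2ImagePredictor that counts encoder invocations."""

    def __init__(self):
        self.encode_calls = 0

    def set_image(self, image):
        self.encode_calls += 1

    def predict(self, **prompt):
        return prompt


fake = FakePredictor()
session = CachedImageSession(fake)
for click in ([10, 20], [30, 40], [50, 60]):
    session.predict(b"image-A", image_array=None, point_coords=[click])
session.predict(b"image-B", image_array=None, point_coords=[[5, 5]])
print(fake.encode_calls)
```

Four predictions across two images trigger only two encoder runs. Hashing the raw bytes is one possible cache key; a UI that already has stable image IDs could use those instead.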

Sizing and capacity planning

SAM 2 sizing is governed by three quantities: encoder cost per frame, memory-bank size per active object, and concurrent stream / user count. The table below assumes FP16 inference on Yobitel-deployable accelerators at 1024x1024 image resolution; FP8 on Hopper roughly doubles per-stream throughput. Mask decode is cheap; the encoder is the bottleneck.

VRAM budgets are dominated by activations and the memory bank for video. SAM 2 Large with a 64-frame memory window holds roughly 4-6 GB of active state per video stream; SAM 2 Small holds roughly 1 GB. Multi-object tracking multiplies this by the number of tracked objects per frame. For annotation tooling on H100, SAM 2 Large supports 50+ concurrent annotators per GPU. For live RTSP tracking, plan one accelerator per 4-8 streams at SAM 2 Base+ depending on resolution.
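The VRAM reasoning above can be turned into a back-of-envelope estimate. The per-stream figures are the ones quoted in the paragraph (using the upper end of the 4-6 GB range for Large); the function name `max_streams` and the 4 GB reservation for weights, CUDA context and activations are assumptions to be replaced with measurements from your own deployment.

```python
import math

# Per-stream state in GB for one tracked object, from the figures quoted above.
STATE_GB_PER_OBJECT = {"large": 6.0, "small": 1.0}
# Published memory capacities of the three accelerators.
GPU_VRAM_GB = {"L4": 24, "L40S": 48, "H100": 80}


def max_streams(gpu, variant, objects_per_stream=1, reserved_gb=4.0):
    """Rough upper bound on concurrent video streams that fit in VRAM.

    reserved_gb is an assumed allowance for weights, CUDA context and
    activations; it is not a measured value.
    """
    usable = GPU_VRAM_GB[gpu] - reserved_gb
    per_stream = STATE_GB_PER_OBJECT[variant] * objects_per_stream
    return max(0, math.floor(usable / per_stream))


print(max_streams("L40S", "large"))                        # one object per stream
print(max_streams("L40S", "large", objects_per_stream=4))  # four objects per stream
print(max_streams("L4", "small", objects_per_stream=2))
```

This bounds streams by memory only; compute throughput is usually the tighter limit for live video, so treat the result as a ceiling rather than a target.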

Variant	L4 (single stream FPS)	L40S (single stream FPS)	H100 (single stream FPS)
SAM 2 Tiny	~42 FPS	~95 FPS	~140 FPS
SAM 2 Small	~32 FPS	~78 FPS	~120 FPS
SAM 2 Base+	~22 FPS	~55 FPS	~85 FPS
SAM 2 Large	~9 FPS	~24 FPS	~42 FPS

Note: On Yobitel NeoCloud, the canonical SAM 2 deployment is two-tier — an annotation tier on L40S / H100 for interactive response, and a batch propagation tier on L4 fleets for offline archive masking. InferenceBench publishes per-variant throughput-per-dollar against peer providers.
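For live video, the single-stream FPS table can be converted into a stream count by dividing the throughput budget by the frame rate each stream needs. This is a simplification that assumes throughput divides evenly across streams; the function name `streams_per_gpu` and the 0.8 headroom factor are assumptions.

```python
# Single-stream FPS from the table above (planning anchors, not guarantees).
SINGLE_STREAM_FPS = {
    "tiny":  {"L4": 42, "L40S": 95, "H100": 140},
    "small": {"L4": 32, "L40S": 78, "H100": 120},
    "base+": {"L4": 22, "L40S": 55, "H100": 85},
    "large": {"L4": 9,  "L40S": 24, "H100": 42},
}


def streams_per_gpu(variant, gpu, target_fps, headroom=0.8):
    """Streams one GPU can serve if throughput divides evenly across streams.

    headroom is an assumed safety factor; 0.8 keeps 20% of capacity spare.
    """
    budget = SINGLE_STREAM_FPS[variant][gpu] * headroom
    return int(budget // target_fps)


print(streams_per_gpu("small", "L4", target_fps=25))    # full-rate stream
print(streams_per_gpu("base+", "L40S", target_fps=5))   # stream sampled at 5 FPS
print(streams_per_gpu("large", "L4", target_fps=25))
```

Under these assumptions, multi-stream densities such as 4-8 streams per GPU are only reachable when each stream is sampled well below its native frame rate.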

Limits and quotas

Raw SAM 2 imposes no hard limits beyond available VRAM. The values below are operational ceilings observed in production and the corresponding defaults enforced on Yobibyte managed SAM 2 endpoints.

Limit	Raw upstream	Yobibyte managed default	Notes
Max image resolution	VRAM-bound	4096 x 4096	Higher resolutions billed as a separate workspace tier.
Max video resolution	VRAM-bound	1920 x 1080 (Full HD)	Sample to lower resolution for higher object counts.
Max objects tracked per video	GPU-bound	32	Memory bank grows linearly with object count.
Max memory window	GPU-bound	64 frames	Longer windows improve identity recovery but cost VRAM.
Max concurrent annotators per endpoint	Worker-bound	50 (H100), 20 (L40S)	Yobibyte autoscales replicas above the threshold.
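A client can check a job against the managed defaults before submitting it. The sketch below copies the limits from the table above; the names `ManagedLimits` and `check_video_request` are hypothetical and do not correspond to a published Yobibyte client API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedLimits:
    """Managed-endpoint defaults copied from the table above."""
    max_image_side: int = 4096
    max_video_width: int = 1920
    max_video_height: int = 1080
    max_objects: int = 32
    max_memory_frames: int = 64


def check_video_request(width, height, num_objects, memory_frames,
                        limits=ManagedLimits()):
    """Return a list of human-readable violations; empty means within limits."""
    problems = []
    if width > limits.max_video_width or height > limits.max_video_height:
        problems.append(f"resolution {width}x{height} exceeds "
                        f"{limits.max_video_width}x{limits.max_video_height}")
    if num_objects > limits.max_objects:
        problems.append(f"{num_objects} objects exceeds {limits.max_objects}")
    if memory_frames > limits.max_memory_frames:
        problems.append(f"memory window {memory_frames} exceeds "
                        f"{limits.max_memory_frames} frames")
    return problems


print(check_video_request(1920, 1080, num_objects=4, memory_frames=16))
print(check_video_request(3840, 2160, num_objects=40, memory_frames=64))
```

Returning every violation at once, rather than failing on the first, lets a caller fix resolution and object count in a single pass.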

Observability

Production SAM 2 monitoring focuses on encode time, memory-bank size and mask-quality scores rather than raw model latency. The minimum useful Prometheus surface for a self-managed deployment:

Per-frame encode latency — histogram. Spikes indicate GPU contention or VRAM pressure.
Per-click predict latency — histogram for interactive flows; target p95 below 200 ms.
Memory bank size (frames) and (bytes) — per active video session.
Mask IoU score distribution — sliding-window histogram; leftward drift signals model degradation or domain shift.
Active sessions — gauge per endpoint; drives autoscaling decisions.
Standard GPU telemetry — DCGM exporter for SM occupancy, memory pressure, NVLink utilisation.

Tip: On Yobibyte the same metrics surface in the workspace dashboard without additional instrumentation; on self-hosted Triton, scrape nv_inference_request_duration_us alongside per-frame encode timing emitted from the SAM 2 wrapper.
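Two of the signals listed above, per-click p95 latency and drift in the mask IoU score, can be prototyped with the standard library before wiring up a metrics exporter. This is a sketch only: `SlidingWindow` is a hypothetical helper, the sample values are invented, and the 0.85 IoU alert threshold is an assumption (the 200 ms target is the one stated above).

```python
from collections import deque
from statistics import mean, quantiles


class SlidingWindow:
    """Keeps the most recent `size` samples of one metric."""

    def __init__(self, size=500):
        self.samples = deque(maxlen=size)

    def observe(self, value):
        self.samples.append(float(value))

    def p95(self):
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        # Requires at least two samples.
        return quantiles(self.samples, n=20, method="inclusive")[-1]

    def mean(self):
        return mean(self.samples)


click_latency_ms = SlidingWindow()
iou_scores = SlidingWindow()

for ms in [80, 95, 110, 120, 130, 90, 85, 100, 105, 400]:
    click_latency_ms.observe(ms)
for score in [0.93, 0.91, 0.90, 0.72, 0.70]:
    iou_scores.observe(score)

breach = click_latency_ms.p95() > 200     # compare against the 200 ms target
drifting = iou_scores.mean() < 0.85       # 0.85 is an assumed alert threshold
print(breach, drifting)
```

In production these windows would normally be replaced by histogram metrics in the monitoring system, with the percentile and drift checks expressed as alert rules.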

Cost and FinOps

Most SAM 2 cost comes from accelerator-hours; the model weights and the SA-V dataset are free. The ranges below are Yobitel NeoCloud on-demand reference prices for common SAM 2 deployment SKUs; reserved pricing is materially lower at 12+ month commitment. All figures USD; treat as planning anchors.

SKU	Yobitel NeoCloud on-demand	Right-sized SAM 2 deployment
NVIDIA L4	~$0.85 / GPU / hour	Batch masking with SAM 2 Tiny / Small; 4-8 live RTSP streams per GPU
NVIDIA L40S	~$2.40 / GPU / hour	Annotation tier with SAM 2 Base+ for ~20 concurrent annotators
NVIDIA H100 SXM5	~$3.80 / GPU / hour	SAM 2 Large interactive; ~50 concurrent annotators per GPU
Yobibyte managed endpoint	Per workspace tier	Pay per request / per minute of video processed, no capacity planning

Note: For low-volume annotation workloads (under 1,000 clicks / day), Yobibyte's per-request pricing is typically cheaper than reserving a dedicated L40S. For sustained 24/7 video archive masking, reserved L4 capacity wins.
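The trade-off in the note can be framed as a break-even price per click. The sketch below uses the L40S on-demand reference price from the table as a stand-in for a dedicated GPU, since reserved prices are not listed here; the function name is hypothetical and no actual Yobibyte per-request price is assumed.

```python
L40S_USD_PER_HOUR = 2.40   # on-demand reference price from the table above


def breakeven_price_per_click(clicks_per_day,
                              gpu_usd_per_hour=L40S_USD_PER_HOUR,
                              hours_per_day=24):
    """Per-click price below which pay-per-request beats an always-on GPU."""
    daily_gpu_cost = gpu_usd_per_hour * hours_per_day
    return daily_gpu_cost / clicks_per_day


low_volume = breakeven_price_per_click(1_000)
high_volume = breakeven_price_per_click(50_000)
print(round(low_volume, 4), round(high_volume, 5))
```

At 1,000 clicks a day the always-on GPU costs several cents per click, and the break-even price falls in proportion as daily volume grows, which is why sustained workloads favour dedicated capacity.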

Security and compliance

SAM 2's permissive licence makes the legal posture simple — the operational concerns are data residency and privacy. Inputs to SAM 2 are typically images or video that may contain identifiable people, sensitive medical content, or operational footage.

Licence — SAM 2 weights and code are Apache 2.0. No copyleft concerns, suitable for closed-source SaaS.
Data residency — for NCSC OFFICIAL workloads, pin the endpoint to Yobitel UK Sovereign region; for HIPAA / patient imaging, use MediQuery's dedicated SAM 2 deployment which routes through HIPAA-aligned Yobitel infrastructure.
Input redaction — pre-blur or pre-redact faces before feeding video to SAM 2 if downstream consumers do not need identifiable subjects.
Audit trail — Yobibyte logs per-prompt input hashes, predicted mask hashes, and predictor variant; sufficient evidence for most internal audit requirements.
Model supply chain — pin a specific commit of the sam2 package in production and verify checkpoint SHA against Meta's published values.
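The supply-chain check in the last item can be automated at service start-up. This is a minimal sketch using only the standard library; the function names are hypothetical, and the expected digest is whatever value your team has pinned for the checkpoint (none is reproduced here).

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file so large checkpoints are never loaded into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256):
    """Raise ValueError if the checkpoint on disk does not match the pinned digest."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```

Calling `verify_checkpoint("checkpoints/sam2_hiera_large.pt", pinned_digest)` before building the model makes a tampered or truncated download fail fast instead of surfacing later as degraded masks.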

Migration and alternatives

SAM 2 has clearly displaced the original SAM for general use; the meaningful alternatives in 2026 are domain-specific or different problem framings.

Option	Licence	When to choose
SAM 2 (this entry)	Apache 2.0	Default open promptable segmentation; image and video.
Original SAM	Apache 2.0	Only if you are pinned to legacy SAM weights for reproducibility.
Mask R-CNN	Apache 2.0 (Detectron2 implementation)	Closed-vocabulary instance segmentation where classes are known and stable.
YOLO11-seg (Ultralytics)	AGPL-3.0 / Enterprise	Real-time, single-stage instance segmentation without a prompt UI.
Grounding DINO + SAM 2	Mixed	Text-prompted segmentation pipelines — Grounding DINO yields boxes that prompt SAM 2.
Yobibyte managed SAM 2	Yobitel Service Terms	Skip the runtime; consume a hosted segmentation endpoint with SLA.

Tip: For MediQuery-grade clinical annotation, the right answer is SAM 2 inside MediQuery rather than raw SAM 2. The HIPAA posture, audit logging and clinician-grade UI are not part of the upstream package.

Troubleshooting

The failure modes below cover roughly 80 percent of SAM 2 production tickets observed in Yobitel Managed Operations runbooks.

Symptom	Likely cause	Remediation
Masklet drifts after occlusion	Memory bank window too short	Increase `max_obj_ptrs_in_encoder` (default 16) and `max_cond_frames_in_attn`.
Mask quality drops over time in video	Object appearance shifted beyond seed frame	Add a refinement click at the drift point; SAM 2 propagates the correction forward.
VRAM OOM on long video	`offload_video_to_cpu=False` with large clip	Set `offload_video_to_cpu=True` and `offload_state_to_cpu=True` for clips > 30 s.
Per-click latency spikes	`set_image()` re-running on each click	Cache the predictor's image state; only call `set_image()` when the image changes.
Masks under-segment (whole object missed)	Click placed on wrong scale	Use `multimask_output=True` and pick the higher-area candidate; or add a box prompt.
Tracking IDs swap between objects	Two visually similar objects in close proximity	Add a negative click on the wrong target to force separation; or track each object independently.
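Several remediations above come down to sending the right mix of positive and negative clicks. The helper below packs clicks into the points and labels arrays used throughout this entry (label 1 for foreground, 0 for background); `build_click_prompt` is a hypothetical name and the coordinates are invented.

```python
import numpy as np


def build_click_prompt(positive, negative=()):
    """Pack (x, y) clicks into points/labels arrays.

    Label 1 marks foreground, label 0 marks background.
    """
    positive = list(positive)
    negative = list(negative)
    clicks = positive + negative
    if not clicks:
        raise ValueError("at least one click is required")
    points = np.asarray(clicks, dtype=np.float32).reshape(-1, 2)
    labels = np.array([1] * len(positive) + [0] * len(negative), dtype=np.int32)
    return points, labels


# Keep the object at (320, 240) and push the mask off a look-alike at (410, 250).
points, labels = build_click_prompt(positive=[(320, 240)], negative=[(410, 250)])
print(points.shape, labels.tolist())
```

The resulting arrays can be passed as `points=` and `labels=` to `add_new_points_or_box` on the frame where the drift or ID swap first appears, after which propagation continues from the corrected state.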

Where it fits in the Yobitel stack

SAM 2 sits inside two Yobitel surfaces. In MediQuery, SAM 2 is the segmentation engine clinicians use to annotate radiology series — CT slices, MR sequences, pathology slides — for downstream analytics and model training. The MediQuery deployment runs SAM 2 Base+ on L40S behind a HIPAA-aligned audit log and the clinician-facing UI; it is not a generic SAM 2 endpoint but a clinical product built on top of one. On Yobibyte, SAM 2 is exposed as a managed image- and video-segmentation endpoint: customers submit images or video to a workspace and Yobibyte picks the variant, the placement and the concurrency to meet the workspace SLO.

Yobitel NeoCloud provides the underlying L4 / L40S / H100 capacity for both surfaces and for any self-managed SAM 2 deployment customers prefer to run themselves. InferenceBench tracks SAM 2 throughput-per-dollar across Yobitel and peer providers for the four variants and three workload shapes — the same data informs Yobibyte's internal placement decisions via Omniscient Compute.

MediQuery — clinical SAM 2 deployment for radiology annotation.
Yobibyte — managed SAM 2 image and video segmentation endpoint.
Yobitel NeoCloud — L4 / L40S / H100 capacity for self-managed deployments.
InferenceBench — public SAM 2 throughput-per-dollar tracking.

References

SAM 2: Segment Anything in Images and Videos (Ravi et al., 2024) · arXiv:2408.00714
Segment Anything (Kirillov et al., 2023) · arXiv:2304.02643
SAM 2 GitHub · github.com/facebookresearch/sam2
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles (Ryali et al., 2023) · arXiv:2306.00989
