Whisper

TL;DR

Whisper is an encoder-decoder Transformer trained by OpenAI on weakly-supervised multilingual audio (680k hours for the original release, scaled to ~5M hours for Large-v3). MIT-licensed, weights on Hugging Face, 99-language transcription plus any-to-English translation in one set of weights (Radford et al., arXiv:2212.04356).
Eight size points ship: tiny (39M), base (74M), small (244M), medium (769M), large / large-v2 / large-v3 (~1.55B), and large-v3-turbo (~798M, Sep 2024) — a four-decoder-layer distillation that delivers near-large-v3 quality at roughly 8x the speed.
Production almost never runs upstream PyTorch. The default runtimes are faster-whisper (CTranslate2, ~4x throughput, FP16 / INT8), whisper.cpp (CPU / Apple Silicon / edge), WhisperX (word-level alignment + diarisation), Distil-Whisper (English-only distillation), and NVIDIA NIM / TensorRT-LLM Whisper engines for low-latency GPU serving.
Real-time factor (RTF) is the planning anchor. Large-v3 on a single H100 SXM5 runs at RTF ~0.03 (a 60-second clip transcribes in ~1.8 s); on L4 RTF is ~0.5 (~30 s for the same clip); large-v3-turbo lifts both numbers by 5-8x. A single L4 hosts dozens of concurrent batch streams cheaply; H100 is reserved for low-latency or very high concurrency.
Yobitel angle: MediQuery uses Whisper Large-v3 for clinical dictation on NeoCloud UK; Yobibyte exposes Whisper Large-v3 and Large-v3-turbo as managed transcription endpoints with autoscaling and prompt-billing; Yobitel Edge AI ships faster-whisper on Jetson AGX Orin for on-device transcription in air-gapped settings.

Overview

Whisper is an automatic speech recognition (ASR) model and the de facto open-weights baseline for speech-to-text since its September 2022 release. It is a vanilla encoder-decoder Transformer trained on a very large, noisy corpus of (audio, transcript) pairs scraped from the public web — 680,000 hours for the original release, expanded to roughly 5 million hours for Large-v3 (Radford et al., 'Robust Speech Recognition via Large-Scale Weak Supervision', arXiv:2212.04356). A single set of weights performs transcription in 99 languages, any-to-English translation, language identification, voice activity detection, and segment-level timestamp prediction; the active task is selected by special control tokens prepended to the decoder sequence.

Whisper is significant for two reasons. Architecturally, it demonstrated that weak supervision at internet scale beats clean small-corpus self-supervision (the wav2vec 2.0 paradigm) on robustness — the model handles accents, background noise, codec degradation, and domain shift dramatically better than its predecessors. Practically, MIT licensing of the weights and an excellent runtime ecosystem (faster-whisper, whisper.cpp, WhisperX, Distil-Whisper, NVIDIA NIM) made it the default starting point for every open-source transcription stack built after 2023 — from podcast indexing services to clinical scribing to subtitle generation.

The current production lineup is Large-v3 (November 2023, ~1.55B parameters, 128-bin mel input) and Large-v3-turbo (September 2024, ~798M parameters, four decoder layers, near-Large-v3 quality at roughly 8x the inference speed). On Yobitel NeoCloud, both models are first-class managed endpoints in Yobibyte; on Yobitel Edge AI, faster-whisper builds of medium / large-v3-turbo ship on Jetson AGX Orin for air-gapped on-device transcription. This entry helps you choose the right Whisper size and runtime for your workload, size GPU capacity using real-time factor as the planning anchor, decide between self-hosting and the Yobibyte managed alternative, and avoid the well-known failure modes — chiefly hallucination on silence and chunk-boundary drift in streaming.

Quick start

The example below installs the production runtime (faster-whisper, CTranslate2-backed) on any CUDA 12+ host, transcribes a meeting recording with VAD-filtered word-level timestamps, and shows the equivalent call against a Yobibyte-managed Whisper endpoint using the OpenAI-compatible audio API. Self-hosting takes care of upstream weights; the Yobibyte path takes care of weights, autoscaling, GPU allocation, regional residency and billing.

# 1. Install faster-whisper (the production default) and run Large-v3 on a GPU host
pip install "faster-whisper>=1.1.0"

# Single-shot CLI transcription via the openai-whisper reference CLI
pip install openai-whisper
whisper meeting.wav --model large-v3 --device cuda --language en --output_format srt

# 2. Production-shape Python using faster-whisper
python - <<'PY'
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",                # or "large-v3-turbo" for 5-8x speed at ~equal quality
    device="cuda",
    compute_type="float16",    # "int8_float16" on L4; "int8" for CPU
)

segments, info = model.transcribe(
    "meeting.wav",
    language="en",             # omit to auto-detect from first 30 s
    vad_filter=True,           # silero VAD — strongly recommended
    vad_parameters={"min_silence_duration_ms": 500},
    beam_size=5,
    word_timestamps=True,
    condition_on_previous_text=False,  # reduces hallucination drift
)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for s in segments:
    print(f"[{s.start:7.2f} -> {s.end:7.2f}] {s.text}")
PY

# 3. Same job via a Yobibyte-managed Whisper endpoint — OpenAI-compatible audio API
curl https://YOUR-WORKSPACE.yobibyte.yobitel.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $YOBIBYTE_API_KEY" \
    -F "model=whisper-large-v3" \
    -F "file=@meeting.wav" \
    -F "response_format=verbose_json" \
    -F "timestamp_granularities[]=word"

Tip: Always enable vad_filter and set condition_on_previous_text=False as your first two production flags. They together eliminate the majority of Whisper's notorious hallucination cases (repeated phrases during silence, drift on long files) with negligible quality cost.

How it works

Whisper is a vanilla encoder-decoder Transformer with no domain-specific tricks — its strength comes from data scale and a clever multi-task control-token scheme, not architectural novelty. The encoder consumes 30-second chunks of log-mel spectrogram (80 bins through Large-v2, 128 bins from Large-v3 onward), sampled at 16 kHz and hop 10 ms, padded with zeros if the audio is shorter. Two convolutional layers downsample, then sinusoidal position embeddings are added and the result passes through a stack of Transformer encoder blocks. The decoder is an autoregressive BPE language model conditioned on encoder states via cross-attention; it generates one token at a time until an end-of-transcript token or the 30-second window is exhausted.

Multi-task behaviour is selected by a small vocabulary of control tokens prepended to the decoder sequence. The pattern is <|startoftranscript|> <|language|> <|task|> <|notimestamps|>?. The language token (e.g. <|en|>, <|fr|>, <|zh|>) selects one of 99 languages or can be omitted to trigger language identification. The task token is either <|transcribe|> (output in the source language) or <|translate|> (output in English). Without <|notimestamps|>, the decoder also emits <|0.00|> … <|29.98|> timestamp tokens delimiting each segment. A <|nospeech|> token at the first decoder position signals empty audio. The same weights serve every combination.

For audio longer than 30 seconds — which is essentially all real workloads — Whisper is run in a sliding-window pattern. The reference implementation transcribes the first 30-second window, takes the timestamp of the last decoded segment, advances the window to that point, optionally conditions the next window's decoder on the previous transcript (the condition_on_previous_text flag), and repeats. This is the source of the model's reputation for hallucination on long files: a confident-but-wrong segment poisons the conditioning context for every subsequent window, producing repeating loops or invented content. Disabling previous-text conditioning is the standard mitigation; pairing with an external VAD that pre-segments speech is the production-grade fix.

Large-v3-turbo (September 2024) is a distillation of Large-v3 that reduces the decoder from 32 layers to 4 while keeping the encoder unchanged. Because decoding cost dominates inference (the encoder runs once per 30-second window; the decoder runs per token), the speedup is dramatic — roughly 8x faster end-to-end on a single GPU at WER within 1-2 percentage points of Large-v3 on most benchmarks. Turbo does not support translation, only transcription. For most production transcription workloads on Yobitel NeoCloud, Large-v3-turbo is now the recommended default; Large-v3 stays for translation and for the small set of languages where the quality gap is material.

Input: 30-second log-mel spectrogram, 80 bins (≤v2) or 128 bins (v3+), 16 kHz, 10 ms hop.
Encoder: convolutional downsample + N Transformer blocks (N = 4 / 6 / 12 / 24 / 32 by size).
Decoder: autoregressive BPE LM with cross-attention to encoder states; multilingual BPE tokeniser (~51,865 tokens including 99 language tokens and timestamp tokens).
Control tokens: <|startoftranscript|> <|lang|> <|task|> <|notimestamps|>? select task and language.
Long-form: sliding 30 s windows advanced by last decoded timestamp; conditioning on previous text is optional and a known hallucination source.
Turbo distillation: same encoder, 4-layer decoder, ~8x faster decode at near-equal quality (transcription only).

Reference and specifications

Eight size points are published. All are MIT-licensed with weights on Hugging Face under openai/whisper-*. The .en suffixes (tiny.en, base.en, small.en, medium.en) are English-only variants that drop the multilingual heads — slightly higher English accuracy at the same parameter count and useful when the workload is monolingual.

Size	Parameters	Encoder / Decoder layers	VRAM (FP16)	RTF on H100 SXM5	RTF on L4	Multilingual
tiny	39M	4 / 4	~1 GB	~0.003	~0.04	Yes (+ tiny.en)
base	74M	6 / 6	~1 GB	~0.005	~0.07	Yes (+ base.en)
small	244M	12 / 12	~2 GB	~0.010	~0.15	Yes (+ small.en)
medium	769M	24 / 24	~5 GB	~0.018	~0.30	Yes (+ medium.en)
large-v1 / v2	1.55B	32 / 32	~10 GB	~0.030	~0.50	Yes
large-v3	1.55B	32 / 32 (128-mel)	~10 GB	~0.030	~0.50	Yes
large-v3-turbo	~798M	32 / 4 (128-mel)	~6 GB	~0.005	~0.08	Yes (transcribe only)
distil-large-v3 (HF)	~756M	32 / 2 (En-only)	~6 GB	~0.004	~0.07	No (En)

Note: RTF figures are mid-range measurements with faster-whisper at FP16, beam_size=5, VAD on, batch=1. RTF scales near-linearly with batch on H100/H200/L40S up to the point KV cache saturates; on L4 it scales sublinearly because of memory bandwidth limits. Use these as planning anchors, not contractual SLOs.

Implementation variants

OpenAI's reference repository (openai/whisper) is rarely used in production — it is single-stream, pure PyTorch, and emits one Python process per file. The production ecosystem has converged on four runtimes plus NVIDIA's vendor-optimised engines. Yobibyte's managed Whisper endpoints sit on top of this same set, selected per-workload.

faster-whisper is the right default for almost everyone self-hosting on Yobitel NeoCloud — it covers GPU, FP16/INT8 quantisation, VAD integration, word-level timestamps and batch concurrency in a single pip install.
whisper.cpp is the default for Yobitel Edge AI deployments on Jetson AGX Orin and customer-owned edge boxes — its GGML quantised builds (Q4_K, Q5_K) fit Large-v3 into ~3-5 GB and run real-time on CPU.
WhisperX adds wav2vec 2.0-based forced alignment for accurate word-level timestamps plus pyannote diarisation — the right starting point for meeting / call-centre transcription that needs speaker labels.
Distil-Whisper trades multilingual coverage for speed; for English-only batch indexing the distil-large-v3 + Large-v3-turbo combination is hard to beat on $/hour transcribed.
NVIDIA NIM packages Whisper as a TensorRT-LLM + Triton container with prebuilt FP16 / FP8 engines for H100 / H200 / L40S / L4 — the right choice when you want low-latency real-time ASR on Hopper without building TensorRT engines yourself.

Runtime	Backend	Target hardware	Speedup vs reference	When to use
faster-whisper	CTranslate2 (C++ + CUDA / oneDNN)	NVIDIA GPU, CPU, AMD ROCm (preview)	~4x on GPU, ~10x on CPU	Default for self-hosted GPU serving; FP16 / INT8 / INT8_FLOAT16.
whisper.cpp	GGML (pure C/C++)	CPU, Apple Silicon (Metal), CUDA	~1-2x CPU, real-time on M-series	Edge, mobile, desktop; offline transcription.
WhisperX	faster-whisper + wav2vec 2.0 + pyannote	NVIDIA GPU	~2-3x with diarisation	When you need word-level timestamps and speaker labels in one pass.
Distil-Whisper	Transformers / faster-whisper	NVIDIA GPU	~6x at small WER cost	English-only batch workloads; podcast / call-centre indexing.
NVIDIA NIM Whisper	TensorRT-LLM + Triton	H100 / H200 / L40S / L4	~6-10x with batching	High-concurrency low-latency serving; enterprise NVAIE customers.
TensorRT-LLM Whisper	TensorRT-LLM engine builds	Hopper / Blackwell / Ada	~5-8x	Self-built engines when you have the TensorRT expertise.
openai/whisper (reference)	PyTorch	NVIDIA GPU, CPU	1x baseline	Research; never for production.

Workload patterns

Whisper shows up in four workload shapes across Yobitel customers, each with different runtime, hardware and flag choices. These are the same four shapes Yobibyte derives its managed endpoint configurations from.

Workload	Latency target	Runtime	Hardware	Key flags
Batch indexing (podcast / archive)	Throughput-bound, hours OK	faster-whisper or Distil-Whisper	L4 / L40S	FP16/INT8, beam=5, large-v3-turbo, batch=8-16.
Meeting transcription (post-call)	Minutes per hour of audio	WhisperX	L40S / L4	Large-v3 + wav2vec alignment + pyannote.
Clinical / call-centre near-real-time	1-3 s end-to-end	faster-whisper (chunked) or NIM Whisper	H100 / L40S	Large-v3-turbo, VAD-aligned 5-10 s chunks.
Streaming voice agent	Sub-300 ms first hypothesis	Whisper-streaming or native Conformer	L4 / L40S	Pair Whisper with a streaming front-end; or switch to Parakeet/Canary.
Edge / air-gapped dictation	Real-time on-device	whisper.cpp or faster-whisper	Jetson AGX Orin, edge CPU	INT8 / Q5_K, medium or large-v3-turbo.

Warning: Whisper is not natively a streaming model. For voice-agent latency budgets under 500 ms first-byte, do not try to make Whisper stream — switch to a streaming-native architecture (Conformer + RNN-T, Parakeet, Canary). Whisper-streaming and similar wrappers are workable for live captioning at 1-2 s latency but not for interactive voice UX. The Yobibyte real-time ASR endpoint defaults to Parakeet RNN-T for sub-300 ms streaming and reserves Whisper for batch and chunked near-real-time.

Sizing and capacity planning

Whisper sizing is governed by real-time factor (RTF — wall-clock seconds per second of audio). Throughput is 1 / RTF streams per GPU; cost per hour transcribed is (GPU hourly rate) x RTF. The figures below are mid-range observed with faster-whisper at FP16, beam_size=5, VAD on; the table uses Yobitel NeoCloud UK list pricing (June 2026) for the $/hour-of-audio column. Substitute spot rates for ~40-60 percent reduction.

Default cost-optimal SKU for batch transcription on Yobitel NeoCloud: L4 with large-v3-turbo INT8 — roughly $0.07 per hour of audio with comfortable concurrency.
Default cost-optimal SKU for high-volume near-real-time: H100 SXM5 with large-v3-turbo FP16 — roughly $0.017 per hour of audio at ~200 concurrent streams.
For translation workloads (any-to-English), Large-v3 is still required — Turbo dropped the translation head. Plan ~5x higher cost than transcription-only.
Long-file workloads (>1 hour) benefit from longer VAD-min-silence and disabled previous-text conditioning more than from any flag tuning.
Yobibyte managed Whisper endpoints bill per minute of audio transcribed at a flat USD rate, eliminating the need to size GPU capacity manually — useful when load is spiky.

Model	Runtime	GPU	RTF (batch=1)	Concurrent streams (RTF=1)	GPU rate ($/h)	$/hour transcribed
large-v3	faster-whisper FP16	1x H100 SXM5	~0.03	~30	$3.20	~$0.11
large-v3	faster-whisper FP16	1x L40S 48GB	~0.08	~12	$1.80	~$0.15
large-v3	faster-whisper INT8_F16	1x L4 24GB	~0.50	~2	$0.85	~$0.43
large-v3-turbo	faster-whisper FP16	1x H100 SXM5	~0.005	~200	$3.20	~$0.017
large-v3-turbo	faster-whisper FP16	1x L40S 48GB	~0.012	~80	$1.80	~$0.024
large-v3-turbo	faster-whisper INT8_F16	1x L4 24GB	~0.08	~12	$0.85	~$0.072
large-v3	NIM Whisper (TRT-LLM)	1x H100 SXM5	~0.005 (batched)	~200	$3.20	~$0.017
distil-large-v3	faster-whisper FP16	1x L4 24GB	~0.07	~14	$0.85	~$0.062
large-v3	whisper.cpp Q5_K	Jetson AGX Orin 64GB	~0.9	~1	Edge (CapEx)	n/a (on-device)
large-v3-turbo	whisper.cpp Q5_K	Jetson AGX Orin 64GB	~0.15	~6	Edge (CapEx)	n/a (on-device)

Limits and quotas

Whisper's hard limits come from its architecture; soft limits come from the chosen runtime. The constraints below are the ones that catch teams in production.

Limit	Default / Ceiling	Cause	Workaround
Audio window per pass	30 seconds	Trained context length	Chunk with VAD; never pass >30 s in one call.
Decoder generation length	224 tokens (default)	max_new_tokens / no_repeat_ngram	Raise --max-len; pre-segment audio.
Sample rate	16 kHz (resampled internally)	Mel feature extractor	Pre-resample to avoid CPU overhead on hot paths.
Languages supported	99 transcribe / 96 to-En translate	Training data coverage	Fine-tune or rescore for unsupported languages.
Speaker labels	Not built in	Single-speaker model	Combine with pyannote / WhisperX.
Streaming	Not native	30-second window architecture	Use Whisper-streaming, or switch to Parakeet / Canary.
Word-level timestamps (reference)	Approximate / unstable	Decoder timestamp tokens are segment-level	Use WhisperX (wav2vec forced alignment) or faster-whisper word_timestamps=True.
Diacritic / casing accuracy	Variable per language	BPE tokenisation + weak supervision	Domain fine-tune; post-process with a punctuation model.
Long-form hallucination	Documented failure mode	Conditioning on previous decoded text	Set condition_on_previous_text=False; use VAD.
Numerical / proper-noun recall	Mid-tier	Web-scraped training data	Use initial_prompt= to bias vocabulary; fine-tune on domain.

Warning: Whisper has a well-documented tendency to hallucinate during silence and low-information audio — entire invented sentences, often repeating. This is the single largest production failure mode. Mandatory mitigations: (1) VAD-filter before inference; (2) condition_on_previous_text=False; (3) reject segments where avg_logprob is below a threshold (typically -1.0); (4) reject segments where compression_ratio exceeds ~2.4 (repeated text inflates the ratio).

Observability

Production Whisper deployments need three metric families: throughput (streams / RTF), quality signals (avg_logprob and compression_ratio distributions), and audio-domain signals (VAD-detected silence ratio, language-ID distribution). faster-whisper and NIM Whisper both surface these natively; whisper.cpp emits them in JSON output. The Prometheus rules below are what Yobibyte uses internally and what Omniscient Compute scrapes from customer endpoints under the managed observability tier.

whisper:active_streams — concurrent in-flight transcriptions; alert when sustained at runtime capacity.
whisper:real_time_factor — moving average RTF; should sit at or below the planning anchor for the model+SKU.
whisper:avg_logprob — mean per-token log probability per segment; a sustained drop below -0.7 signals a domain shift or audio-quality regression.
whisper:compression_ratio — ratio of decoded text length to gzipped length; values above 2.4 indicate the decoder is looping or repeating (hallucination signal).
whisper:nospeech_prob — distribution of the no-speech probability emitted by the decoder; useful for tuning the VAD threshold.
whisper:language_detected_total{language=...} — counter of detected source languages; sudden distribution shifts are the most common production-incident leading indicator.
DCGM GPU utilisation and memory bandwidth — pair with RTF to distinguish encoder-bound (memory-bandwidth limited) from decoder-bound (FLOPs limited) regimes.

# Prometheus rules for a self-hosted faster-whisper deployment
groups:
  - name: whisper-sla
    interval: 30s
    rules:
      - alert: WhisperRTFRegression
        expr: avg_over_time(whisper:real_time_factor[10m]) > 0.6
        for: 10m
        labels: { severity: warning, team: speech }
        annotations:
          summary: "RTF >0.6 — capacity headroom gone or batching broken"

      - alert: WhisperHallucinationSignal
        expr: histogram_quantile(0.95,
                sum by (le) (rate(whisper:compression_ratio_bucket[10m]))) > 2.4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Compression ratio p95 >2.4 — repeated-text hallucination spike"

      - alert: WhisperLogProbCollapse
        expr: avg_over_time(whisper:avg_logprob[15m]) < -0.9
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Average log-prob collapsed — audio quality or domain regression"

      - alert: WhisperLanguageDrift
        expr: |
          (sum by (language) (rate(whisper:language_detected_total[1h])))
          / ignoring(language) group_left
          sum(rate(whisper:language_detected_total[1h])) > 0.3
          unless on(language) (label_replace(vector(1), "language", "$1", "language", "(en|es|fr|de|pt)"))
        for: 30m
        labels: { severity: info }
        annotations:
          summary: "Unexpected language >30 percent of traffic — workload shape changed"

Tip: compression_ratio is the single highest-signal hallucination detector. Reject or flag any segment with ratio >2.4 at the application layer; combined with VAD pre-segmentation it eliminates the majority of repeated-phrase hallucinations from your final transcript.

Cost and FinOps

Whisper cost economics are unusually clean: cost per hour transcribed equals GPU hourly rate times RTF, minus any batching gains. The table in 'Sizing' is the planning model. Three FinOps levers move the number materially.

Model choice: Large-v3-turbo is roughly 6-8x cheaper per hour transcribed than Large-v3 at near-equal quality for transcription. For new deployments, Turbo is the default; reserve Large-v3 for translation and the small set of languages with measurable quality gap.
SKU choice: L4 is the unit-cost winner for batch / asynchronous workloads; H100 wins for high-concurrency or low-latency workloads where RTF needs to stay below ~0.01. L40S is a useful middle ground.
Spot capacity: Yobitel NeoCloud spot rates run 40-60 percent below list. Whisper is a particularly good spot workload because every audio file is independent and re-runnable — no checkpoint state to preserve.
Self-host vs Yobibyte managed: self-hosting wins per-unit at sustained >40 percent GPU utilisation; Yobibyte's per-minute pricing wins for spiky or low-volume traffic where idle GPUs dominate cost. The break-even at June 2026 list pricing is roughly 18,000 minutes of audio per month per H100.
FOCUS 1.1 billing exports from Yobitel NeoCloud and Omniscient Compute tag Whisper inference by model, runtime and tenant, so $/hour-transcribed can be sliced by application or customer in any FinOps tool that consumes FOCUS.

Security and compliance

Whisper itself is a model — security and compliance posture is set entirely by the deployment surface around it. Three considerations matter in practice. First, audio data is often more sensitive than text data: voice carries biometric identifiers and incidental personal information that text transcripts do not. Treat raw audio storage as personal data under GDPR regardless of whether you store the transcript.

Second, model weights are public — the supply-chain risk is the runtime, not the weights. Pin runtime container digests, verify Hugging Face checksum on weight pulls, and run on a registry mirror inside the sovereign perimeter. Yobitel NeoCloud's private model registry mirrors Whisper weights into UK and EU regions for customers who require provenance-controlled inference.

Third, sovereign deployments — UK NCSC OFFICIAL and OFFICIAL-SENSITIVE, EU GDPR with data-residency commitments, US HIPAA for clinical audio — require that the audio never leaves the regulated region. Yobibyte's managed Whisper endpoints pin model serving and audio buffering to the workspace's stated sovereignty region (UK, EU, US). MediQuery's clinical dictation deployment runs Whisper in a Yobibyte UK Sovereign tenancy aligned to NCSC Cloud Security Principles, with audio held in volatile GPU memory only — no on-disk persistence beyond the transcript output that the customer chooses to retain.

Warning: Whisper transcripts can leak via three channels that ASR-naive teams miss: (1) prompt-injection through audio that triggers the model to follow spoken instructions (rare but documented); (2) initial_prompt= bias that exposes a previous tenant's vocabulary if the prompt is reused across tenants; (3) prefix-style caching at the gateway layer. Production multi-tenant deployments must enforce per-tenant request isolation and per-tenant initial_prompt scoping — Yobibyte does both by default.

Migration and alternatives

Most production migrations to Whisper come from one of four origins: a commercial cloud ASR (AWS Transcribe, Google Cloud Speech-to-Text, Azure Speech), an earlier open ASR (Kaldi, DeepSpeech, wav2vec 2.0 fine-tuned), a managed SaaS (AssemblyAI, Deepgram, Speechmatics), or — in 2026 — from older Whisper deployments still on Large-v2. Counter-migrations away from Whisper exist too, primarily to streaming-native architectures (Parakeet, Canary) when latency becomes the binding constraint.

vs AWS Transcribe / GCP Speech / Azure Speech: Whisper is competitive-to-better on accent and noise, dramatically better on low-resource languages, materially cheaper per hour when self-hosted, and unconstrained by hyperscaler data-residency assumptions. The break-even is roughly 1,000 hours of audio per month on a single L4.
vs Deepgram / AssemblyAI / Speechmatics: managed-SaaS competitors lead on streaming latency, diarisation polish, and SDK ergonomics; Whisper leads on multilingual breadth, raw weight access for fine-tuning, and sovereignty.
vs Conformer-based streaming (Parakeet, Canary, Google USM): Conformer wins for streaming latency budgets under 500 ms; Whisper wins for offline / chunked workloads where the 30-second window is acceptable and translation is needed.
Yobibyte managed alternative: a one-line endpoint URL change swaps self-hosted Whisper for a Yobibyte-managed endpoint that handles model rotation (Large-v2 → v3 → Turbo), autoscaling, regional residency pinning and FOCUS billing — useful when the operational overhead of self-hosting outweighs the per-unit cost gain.

From	Effort	Quality change	Cost change	Operational notes
AWS Transcribe (standard)	Low — replace SDK call	Comparable on EN; Whisper often wins on accent/noise	~30-70 percent lower	Sovereignty and data-residency control gained.
Google Cloud Speech-to-Text	Low	Comparable on EN; Whisper wins on low-resource languages	~30-60 percent lower	Lose Google's word-level confidence; gain MIT licence.
Azure Speech	Low	Comparable	~30-50 percent lower	Same as GCP migration shape.
AssemblyAI / Deepgram / Speechmatics	Low — same API surface	Variable; vendors lead on speaker diarisation polish	~50-80 percent lower (self-host)	Lose vendor diarisation; gain via WhisperX + pyannote.
Kaldi / DeepSpeech / older OSS	Medium	Substantial uplift on noisy/accented	Generally lower (modern GPU runtimes)	Lose Kaldi's WFST / LM rescoring options; gain robustness.
Whisper Large-v2 (in-place upgrade)	Trivial — change model id	1-3 WER points improvement on average	Equal	Re-validate domain-specific accuracy.
Whisper Large-v3 → Large-v3-turbo	Trivial	Within 1-2 WER points on most benchmarks	~6-8x cheaper	Loses translation head; transcription only.
Whisper → Parakeet / Canary (streaming)	Medium	Comparable WER, much lower latency	Often lower (smaller models)	When sub-300 ms streaming latency is the requirement.
Self-hosted Whisper → Yobibyte managed	Low — change endpoint URL	Equal (same weights)	Equal at high utilisation; cheaper at low utilisation	Gain regional residency, autoscaling, FOCUS billing.

Troubleshooting

Six failure modes account for the vast majority of Whisper production incidents. The runbook below covers them.

Symptom	Likely cause	First check	Fix
Transcript contains invented repeated phrases	Hallucination on silence / low-info audio	compression_ratio per segment	Enable VAD; condition_on_previous_text=False; reject ratio >2.4.
Transcript drifts to a different topic mid-file	Previous-text conditioning poisoned context	condition_on_previous_text setting	Set to False; re-run.
Wrong language detected	Auto-detect ran on 30 s of silence or music	info.language_probability	Force --language; or detect on a VAD-filtered sample.
Word-level timestamps drift	Decoder timestamp tokens are segment-level only	Runtime used for timestamps	Switch to WhisperX (wav2vec forced alignment) or faster-whisper word_timestamps=True.
GPU memory OOM with batch >1	KV cache + activations exceed VRAM at batch	GPU memory at peak	Lower batch_size; use INT8_FLOAT16 on L4; or move to L40S/H100.
RTF much higher than planning anchor	Python overhead, missing CUDA graphs, or CPU-bound VAD	DCGM GPU utilisation	Use faster-whisper not reference; ensure CUDA build; offload VAD to dedicated CPU.
Domain-specific terms transcribed wrong	Vocabulary not in training distribution	Distribution of misrecognised terms	Use initial_prompt= to seed vocabulary; or domain fine-tune.
Latency spikes on long files	Sliding-window restarts and re-prefill	Per-window latency log	Pre-segment with VAD into <30 s windows; batch.
Translation output is poor	Used large-v3-turbo (no translation head)	Model id	Switch to large-v3 for translation; turbo is transcribe-only.
Punctuation / casing inconsistent across segments	BPE decoding + weak training signal	Per-segment punctuation distribution	Post-process with a dedicated punctuation/casing model.

Tip: When debugging an in-production Whisper deployment, always grab avg_logprob, compression_ratio, no_speech_prob and language_probability from the faster-whisper segment output and log them alongside the transcript. Without these four numbers it is essentially impossible to distinguish 'the audio was bad' from 'the model misbehaved' after the fact. Yobibyte's managed endpoints log all four by default in the audit trail.

Where Whisper fits in the Yobitel stack

Whisper is the workhorse speech recognition model across the Yobitel platform. It appears in four distinct places. First, MediQuery — Yobitel's clinical dictation and ambient-scribing AI Application — uses Whisper Large-v3 for the clinical narrative path on Yobitel NeoCloud UK, with the deployment held to NCSC Cloud Security Principles and clinician-validated transcripts wired into the EHR connector. Second, Yobibyte exposes Whisper Large-v3 and Large-v3-turbo as managed transcription endpoints under the OpenAI-compatible audio API; customers point existing OpenAI-SDK clients at their Yobibyte workspace URL and get sovereign-region inference with per-minute billing and FOCUS-conformant cost export.

Third, Yobitel Edge AI ships faster-whisper builds and whisper.cpp Q5_K builds on Jetson AGX Orin and customer-owned edge boxes for on-device transcription in air-gapped settings (defence, regulated healthcare, broadcast field kit). Fourth, Omniscient Compute and InferenceBench publish ongoing Whisper benchmark scores — RTF, WER on FLEURS / LibriSpeech / domain corpora, cost per hour transcribed — across NVIDIA H100, H200, L40S, L4 and Blackwell SKUs so customers planning capacity have current measured anchors rather than vendor marketing numbers.

If you are choosing between self-hosting Whisper on Yobitel NeoCloud GPU instances and consuming it as a Yobibyte managed endpoint, the deciding factor is usually sustained utilisation: self-hosting wins per-unit above roughly 40 percent GPU utilisation; Yobibyte's per-minute pricing wins below that. Either way, the model, weights and behaviour are identical — Yobibyte is a thinner consumption surface, not a different product.

MediQuery clinical dictation — Whisper Large-v3 on Yobibyte UK Sovereign, NCSC OFFICIAL alignment.
Yobibyte managed Whisper endpoints — Large-v3 and Large-v3-turbo, OpenAI-compatible audio API, UK / EU / US regions.
Yobitel Edge AI — faster-whisper / whisper.cpp on Jetson AGX Orin and edge CPUs.
InferenceBench — published RTF, WER and $/hour-transcribed across NeoCloud SKUs.
Omniscient Compute — Whisper inference billed under FOCUS 1.1 with model, runtime and tenant tags.

References

Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2022) · arXiv
openai/whisper (reference implementation) · GitHub
Whisper Large-v3 model card · Hugging Face
Whisper Large-v3-turbo model card · Hugging Face
SYSTRAN/faster-whisper (production runtime) · GitHub
ggerganov/whisper.cpp (edge runtime) · GitHub
WhisperX (forced alignment + diarisation) · GitHub
Distil-Whisper (Hugging Face distillation) · GitHub
NVIDIA NIM for Whisper · NVIDIA

TL;DR

Whisper is an encoder-decoder Transformer trained by OpenAI on weakly-supervised multilingual audio (680k hours for the original release, scaled to ~5M hours for Large-v3). MIT-licensed, weights on Hugging Face, 99-language transcription plus any-to-English translation in one set of weights (Radford et al., arXiv:2212.04356).
Eight size points ship: tiny (39M), base (74M), small (244M), medium (769M), large / large-v2 / large-v3 (~1.55B), and large-v3-turbo (~798M, Sep 2024) — a four-decoder-layer distillation that delivers near-large-v3 quality at roughly 8x the speed.
Production almost never runs upstream PyTorch. The default runtimes are faster-whisper (CTranslate2, ~4x throughput, FP16 / INT8), whisper.cpp (CPU / Apple Silicon / edge), WhisperX (word-level alignment + diarisation), Distil-Whisper (English-only distillation), and NVIDIA NIM / TensorRT-LLM Whisper engines for low-latency GPU serving.
Real-time factor (RTF) is the planning anchor. Large-v3 on a single H100 SXM5 runs at RTF ~0.03 (a 60-second clip transcribes in ~1.8 s); on L4 RTF is ~0.5 (~30 s for the same clip); large-v3-turbo lifts both numbers by 5-8x. A single L4 hosts dozens of concurrent batch streams cheaply; H100 is reserved for low-latency or very high concurrency.
Yobitel angle: MediQuery uses Whisper Large-v3 for clinical dictation on NeoCloud UK; Yobibyte exposes Whisper Large-v3 and Large-v3-turbo as managed transcription endpoints with autoscaling and prompt-billing; Yobitel Edge AI ships faster-whisper on Jetson AGX Orin for on-device transcription in air-gapped settings.

Overview

Quick start

# 1. Install faster-whisper (the production default) and run Large-v3 on a GPU host
pip install "faster-whisper>=1.1.0"

# Single-shot CLI transcription via the openai-whisper reference CLI
pip install openai-whisper
whisper meeting.wav --model large-v3 --device cuda --language en --output_format srt

# 2. Production-shape Python using faster-whisper
python - <<'PY'
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",                # or "large-v3-turbo" for 5-8x speed at ~equal quality
    device="cuda",
    compute_type="float16",    # "int8_float16" on L4; "int8" for CPU
)

segments, info = model.transcribe(
    "meeting.wav",
    language="en",             # omit to auto-detect from first 30 s
    vad_filter=True,           # silero VAD — strongly recommended
    vad_parameters={"min_silence_duration_ms": 500},
    beam_size=5,
    word_timestamps=True,
    condition_on_previous_text=False,  # reduces hallucination drift
)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for s in segments:
    print(f"[{s.start:7.2f} -> {s.end:7.2f}] {s.text}")
PY

# 3. Same job via a Yobibyte-managed Whisper endpoint — OpenAI-compatible audio API
curl https://YOUR-WORKSPACE.yobibyte.yobitel.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $YOBIBYTE_API_KEY" \
    -F "model=whisper-large-v3" \
    -F "file=@meeting.wav" \
    -F "response_format=verbose_json" \
    -F "timestamp_granularities[]=word"

Tip: Always enable vad_filter and set condition_on_previous_text=False as your first two production flags. They together eliminate the majority of Whisper's notorious hallucination cases (repeated phrases during silence, drift on long files) with negligible quality cost.

How it works

Input: 30-second log-mel spectrogram, 80 bins (≤v2) or 128 bins (v3+), 16 kHz, 10 ms hop.
Encoder: convolutional downsample + N Transformer blocks (N = 4 / 6 / 12 / 24 / 32 by size).
Decoder: autoregressive BPE LM with cross-attention to encoder states; multilingual BPE tokeniser (~51,865 tokens including 99 language tokens and timestamp tokens).
Control tokens: <|startoftranscript|> <|lang|> <|task|> <|notimestamps|>? select task and language.
Long-form: sliding 30 s windows advanced by last decoded timestamp; conditioning on previous text is optional and a known hallucination source.
Turbo distillation: same encoder, 4-layer decoder, ~8x faster decode at near-equal quality (transcription only).

Reference and specifications

Size	Parameters	Encoder / Decoder layers	VRAM (FP16)	RTF on H100 SXM5	RTF on L4	Multilingual
tiny	39M	4 / 4	~1 GB	~0.003	~0.04	Yes (+ tiny.en)
base	74M	6 / 6	~1 GB	~0.005	~0.07	Yes (+ base.en)
small	244M	12 / 12	~2 GB	~0.010	~0.15	Yes (+ small.en)
medium	769M	24 / 24	~5 GB	~0.018	~0.30	Yes (+ medium.en)
large-v1 / v2	1.55B	32 / 32	~10 GB	~0.030	~0.50	Yes
large-v3	1.55B	32 / 32 (128-mel)	~10 GB	~0.030	~0.50	Yes
large-v3-turbo	~798M	32 / 4 (128-mel)	~6 GB	~0.005	~0.08	Yes (transcribe only)
distil-large-v3 (HF)	~756M	32 / 2 (En-only)	~6 GB	~0.004	~0.07	No (En)

Note: RTF figures are mid-range measurements with faster-whisper at FP16, beam_size=5, VAD on, batch=1. RTF scales near-linearly with batch on H100/H200/L40S up to the point KV cache saturates; on L4 it scales sublinearly because of memory bandwidth limits. Use these as planning anchors, not contractual SLOs.

Implementation variants

faster-whisper is the right default for almost everyone self-hosting on Yobitel NeoCloud — it covers GPU, FP16/INT8 quantisation, VAD integration, word-level timestamps and batch concurrency in a single pip install.
whisper.cpp is the default for Yobitel Edge AI deployments on Jetson AGX Orin and customer-owned edge boxes — its GGML quantised builds (Q4_K, Q5_K) fit Large-v3 into ~3-5 GB and run real-time on CPU.
WhisperX adds wav2vec 2.0-based forced alignment for accurate word-level timestamps plus pyannote diarisation — the right starting point for meeting / call-centre transcription that needs speaker labels.
Distil-Whisper trades multilingual coverage for speed; for English-only batch indexing the distil-large-v3 + Large-v3-turbo combination is hard to beat on $/hour transcribed.
NVIDIA NIM packages Whisper as a TensorRT-LLM + Triton container with prebuilt FP16 / FP8 engines for H100 / H200 / L40S / L4 — the right choice when you want low-latency real-time ASR on Hopper without building TensorRT engines yourself.

Runtime	Backend	Target hardware	Speedup vs reference	When to use
faster-whisper	CTranslate2 (C++ + CUDA / oneDNN)	NVIDIA GPU, CPU, AMD ROCm (preview)	~4x on GPU, ~10x on CPU	Default for self-hosted GPU serving; FP16 / INT8 / INT8_FLOAT16.
whisper.cpp	GGML (pure C/C++)	CPU, Apple Silicon (Metal), CUDA	~1-2x CPU, real-time on M-series	Edge, mobile, desktop; offline transcription.
WhisperX	faster-whisper + wav2vec 2.0 + pyannote	NVIDIA GPU	~2-3x with diarisation	When you need word-level timestamps and speaker labels in one pass.
Distil-Whisper	Transformers / faster-whisper	NVIDIA GPU	~6x at small WER cost	English-only batch workloads; podcast / call-centre indexing.
NVIDIA NIM Whisper	TensorRT-LLM + Triton	H100 / H200 / L40S / L4	~6-10x with batching	High-concurrency low-latency serving; enterprise NVAIE customers.
TensorRT-LLM Whisper	TensorRT-LLM engine builds	Hopper / Blackwell / Ada	~5-8x	Self-built engines when you have the TensorRT expertise.
openai/whisper (reference)	PyTorch	NVIDIA GPU, CPU	1x baseline	Research; never for production.

Workload patterns

Workload	Latency target	Runtime	Hardware	Key flags
Batch indexing (podcast / archive)	Throughput-bound, hours OK	faster-whisper or Distil-Whisper	L4 / L40S	FP16/INT8, beam=5, large-v3-turbo, batch=8-16.
Meeting transcription (post-call)	Minutes per hour of audio	WhisperX	L40S / L4	Large-v3 + wav2vec alignment + pyannote.
Clinical / call-centre near-real-time	1-3 s end-to-end	faster-whisper (chunked) or NIM Whisper	H100 / L40S	Large-v3-turbo, VAD-aligned 5-10 s chunks.
Streaming voice agent	Sub-300 ms first hypothesis	Whisper-streaming or native Conformer	L4 / L40S	Pair Whisper with a streaming front-end; or switch to Parakeet/Canary.
Edge / air-gapped dictation	Real-time on-device	whisper.cpp or faster-whisper	Jetson AGX Orin, edge CPU	INT8 / Q5_K, medium or large-v3-turbo.

Warning: Whisper is not natively a streaming model. For voice-agent latency budgets under 500 ms first-byte, do not try to make Whisper stream — switch to a streaming-native architecture (Conformer + RNN-T, Parakeet, Canary). Whisper-streaming and similar wrappers are workable for live captioning at 1-2 s latency but not for interactive voice UX. The Yobibyte real-time ASR endpoint defaults to Parakeet RNN-T for sub-300 ms streaming and reserves Whisper for batch and chunked near-real-time.

Sizing and capacity planning

Default cost-optimal SKU for batch transcription on Yobitel NeoCloud: L4 with large-v3-turbo INT8 — roughly $0.07 per hour of audio with comfortable concurrency.
Default cost-optimal SKU for high-volume near-real-time: H100 SXM5 with large-v3-turbo FP16 — roughly $0.017 per hour of audio at ~200 concurrent streams.
For translation workloads (any-to-English), Large-v3 is still required — Turbo dropped the translation head. Plan ~5x higher cost than transcription-only.
Long-file workloads (>1 hour) benefit from longer VAD-min-silence and disabled previous-text conditioning more than from any flag tuning.
Yobibyte managed Whisper endpoints bill per minute of audio transcribed at a flat USD rate, eliminating the need to size GPU capacity manually — useful when load is spiky.

Model	Runtime	GPU	RTF (batch=1)	Concurrent streams (RTF=1)	GPU rate ($/h)	$/hour transcribed
large-v3	faster-whisper FP16	1x H100 SXM5	~0.03	~30	$3.20	~$0.11
large-v3	faster-whisper FP16	1x L40S 48GB	~0.08	~12	$1.80	~$0.15
large-v3	faster-whisper INT8_F16	1x L4 24GB	~0.50	~2	$0.85	~$0.43
large-v3-turbo	faster-whisper FP16	1x H100 SXM5	~0.005	~200	$3.20	~$0.017
large-v3-turbo	faster-whisper FP16	1x L40S 48GB	~0.012	~80	$1.80	~$0.024
large-v3-turbo	faster-whisper INT8_F16	1x L4 24GB	~0.08	~12	$0.85	~$0.072
large-v3	NIM Whisper (TRT-LLM)	1x H100 SXM5	~0.005 (batched)	~200	$3.20	~$0.017
distil-large-v3	faster-whisper FP16	1x L4 24GB	~0.07	~14	$0.85	~$0.062
large-v3	whisper.cpp Q5_K	Jetson AGX Orin 64GB	~0.9	~1	Edge (CapEx)	n/a (on-device)
large-v3-turbo	whisper.cpp Q5_K	Jetson AGX Orin 64GB	~0.15	~6	Edge (CapEx)	n/a (on-device)

Limits and quotas

Whisper's hard limits come from its architecture; soft limits come from the chosen runtime. The constraints below are the ones that catch teams in production.

Limit	Default / Ceiling	Cause	Workaround
Audio window per pass	30 seconds	Trained context length	Chunk with VAD; never pass >30 s in one call.
Decoder generation length	224 tokens (default)	max_new_tokens / no_repeat_ngram	Raise --max-len; pre-segment audio.
Sample rate	16 kHz (resampled internally)	Mel feature extractor	Pre-resample to avoid CPU overhead on hot paths.
Languages supported	99 transcribe / 96 to-En translate	Training data coverage	Fine-tune or rescore for unsupported languages.
Speaker labels	Not built in	Single-speaker model	Combine with pyannote / WhisperX.
Streaming	Not native	30-second window architecture	Use Whisper-streaming, or switch to Parakeet / Canary.
Word-level timestamps (reference)	Approximate / unstable	Decoder timestamp tokens are segment-level	Use WhisperX (wav2vec forced alignment) or faster-whisper word_timestamps=True.
Diacritic / casing accuracy	Variable per language	BPE tokenisation + weak supervision	Domain fine-tune; post-process with a punctuation model.
Long-form hallucination	Documented failure mode	Conditioning on previous decoded text	Set condition_on_previous_text=False; use VAD.
Numerical / proper-noun recall	Mid-tier	Web-scraped training data	Use initial_prompt= to bias vocabulary; fine-tune on domain.

Warning: Whisper has a well-documented tendency to hallucinate during silence and low-information audio — entire invented sentences, often repeating. This is the single largest production failure mode. Mandatory mitigations: (1) VAD-filter before inference; (2) condition_on_previous_text=False; (3) reject segments where avg_logprob is below a threshold (typically -1.0); (4) reject segments where compression_ratio exceeds ~2.4 (repeated text inflates the ratio).

Observability

whisper:active_streams — concurrent in-flight transcriptions; alert when sustained at runtime capacity.
whisper:real_time_factor — moving average RTF; should sit at or below the planning anchor for the model+SKU.
whisper:avg_logprob — mean per-token log probability per segment; a sustained drop below -0.7 signals a domain shift or audio-quality regression.
whisper:compression_ratio — ratio of decoded text length to gzipped length; values above 2.4 indicate the decoder is looping or repeating (hallucination signal).
whisper:nospeech_prob — distribution of the no-speech probability emitted by the decoder; useful for tuning the VAD threshold.
whisper:language_detected_total{language=...} — counter of detected source languages; sudden distribution shifts are the most common production-incident leading indicator.
DCGM GPU utilisation and memory bandwidth — pair with RTF to distinguish encoder-bound (memory-bandwidth limited) from decoder-bound (FLOPs limited) regimes.

# Prometheus rules for a self-hosted faster-whisper deployment
groups:
  - name: whisper-sla
    interval: 30s
    rules:
      - alert: WhisperRTFRegression
        expr: avg_over_time(whisper:real_time_factor[10m]) > 0.6
        for: 10m
        labels: { severity: warning, team: speech }
        annotations:
          summary: "RTF >0.6 — capacity headroom gone or batching broken"

      - alert: WhisperHallucinationSignal
        expr: histogram_quantile(0.95,
                sum by (le) (rate(whisper:compression_ratio_bucket[10m]))) > 2.4
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Compression ratio p95 >2.4 — repeated-text hallucination spike"

      - alert: WhisperLogProbCollapse
        expr: avg_over_time(whisper:avg_logprob[15m]) < -0.9
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Average log-prob collapsed — audio quality or domain regression"

      - alert: WhisperLanguageDrift
        expr: |
          (sum by (language) (rate(whisper:language_detected_total[1h])))
          / ignoring(language) group_left
          sum(rate(whisper:language_detected_total[1h])) > 0.3
          unless on(language) (label_replace(vector(1), "language", "$1", "language", "(en|es|fr|de|pt)"))
        for: 30m
        labels: { severity: info }
        annotations:
          summary: "Unexpected language >30 percent of traffic — workload shape changed"

Tip: compression_ratio is the single highest-signal hallucination detector. Reject or flag any segment with ratio >2.4 at the application layer; combined with VAD pre-segmentation it eliminates the majority of repeated-phrase hallucinations from your final transcript.

Cost and FinOps

Model choice: Large-v3-turbo is roughly 6-8x cheaper per hour transcribed than Large-v3 at near-equal quality for transcription. For new deployments, Turbo is the default; reserve Large-v3 for translation and the small set of languages with measurable quality gap.
SKU choice: L4 is the unit-cost winner for batch / asynchronous workloads; H100 wins for high-concurrency or low-latency workloads where RTF needs to stay below ~0.01. L40S is a useful middle ground.
Spot capacity: Yobitel NeoCloud spot rates run 40-60 percent below list. Whisper is a particularly good spot workload because every audio file is independent and re-runnable — no checkpoint state to preserve.
Self-host vs Yobibyte managed: self-hosting wins per-unit at sustained >40 percent GPU utilisation; Yobibyte's per-minute pricing wins for spiky or low-volume traffic where idle GPUs dominate cost. The break-even at June 2026 list pricing is roughly 18,000 minutes of audio per month per H100.
FOCUS 1.1 billing exports from Yobitel NeoCloud and Omniscient Compute tag Whisper inference by model, runtime and tenant, so $/hour-transcribed can be sliced by application or customer in any FinOps tool that consumes FOCUS.

Security and compliance

Warning: Whisper transcripts can leak via three channels that ASR-naive teams miss: (1) prompt-injection through audio that triggers the model to follow spoken instructions (rare but documented); (2) initial_prompt= bias that exposes a previous tenant's vocabulary if the prompt is reused across tenants; (3) prefix-style caching at the gateway layer. Production multi-tenant deployments must enforce per-tenant request isolation and per-tenant initial_prompt scoping — Yobibyte does both by default.

Migration and alternatives

vs AWS Transcribe / GCP Speech / Azure Speech: Whisper is competitive-to-better on accent and noise, dramatically better on low-resource languages, materially cheaper per hour when self-hosted, and unconstrained by hyperscaler data-residency assumptions. The break-even is roughly 1,000 hours of audio per month on a single L4.
vs Deepgram / AssemblyAI / Speechmatics: managed-SaaS competitors lead on streaming latency, diarisation polish, and SDK ergonomics; Whisper leads on multilingual breadth, raw weight access for fine-tuning, and sovereignty.
vs Conformer-based streaming (Parakeet, Canary, Google USM): Conformer wins for streaming latency budgets under 500 ms; Whisper wins for offline / chunked workloads where the 30-second window is acceptable and translation is needed.
Yobibyte managed alternative: a one-line endpoint URL change swaps self-hosted Whisper for a Yobibyte-managed endpoint that handles model rotation (Large-v2 → v3 → Turbo), autoscaling, regional residency pinning and FOCUS billing — useful when the operational overhead of self-hosting outweighs the per-unit cost gain.

From	Effort	Quality change	Cost change	Operational notes
AWS Transcribe (standard)	Low — replace SDK call	Comparable on EN; Whisper often wins on accent/noise	~30-70 percent lower	Sovereignty and data-residency control gained.
Google Cloud Speech-to-Text	Low	Comparable on EN; Whisper wins on low-resource languages	~30-60 percent lower	Lose Google's word-level confidence; gain MIT licence.
Azure Speech	Low	Comparable	~30-50 percent lower	Same as GCP migration shape.
AssemblyAI / Deepgram / Speechmatics	Low — same API surface	Variable; vendors lead on speaker diarisation polish	~50-80 percent lower (self-host)	Lose vendor diarisation; gain via WhisperX + pyannote.
Kaldi / DeepSpeech / older OSS	Medium	Substantial uplift on noisy/accented	Generally lower (modern GPU runtimes)	Lose Kaldi's WFST / LM rescoring options; gain robustness.
Whisper Large-v2 (in-place upgrade)	Trivial — change model id	1-3 WER points improvement on average	Equal	Re-validate domain-specific accuracy.
Whisper Large-v3 → Large-v3-turbo	Trivial	Within 1-2 WER points on most benchmarks	~6-8x cheaper	Loses translation head; transcription only.
Whisper → Parakeet / Canary (streaming)	Medium	Comparable WER, much lower latency	Often lower (smaller models)	When sub-300 ms streaming latency is the requirement.
Self-hosted Whisper → Yobibyte managed	Low — change endpoint URL	Equal (same weights)	Equal at high utilisation; cheaper at low utilisation	Gain regional residency, autoscaling, FOCUS billing.

Troubleshooting

Six failure modes account for the vast majority of Whisper production incidents. The runbook below covers them.

Symptom	Likely cause	First check	Fix
Transcript contains invented repeated phrases	Hallucination on silence / low-info audio	compression_ratio per segment	Enable VAD; condition_on_previous_text=False; reject ratio >2.4.
Transcript drifts to a different topic mid-file	Previous-text conditioning poisoned context	condition_on_previous_text setting	Set to False; re-run.
Wrong language detected	Auto-detect ran on 30 s of silence or music	info.language_probability	Force --language; or detect on a VAD-filtered sample.
Word-level timestamps drift	Decoder timestamp tokens are segment-level only	Runtime used for timestamps	Switch to WhisperX (wav2vec forced alignment) or faster-whisper word_timestamps=True.
GPU memory OOM with batch >1	KV cache + activations exceed VRAM at batch	GPU memory at peak	Lower batch_size; use INT8_FLOAT16 on L4; or move to L40S/H100.
RTF much higher than planning anchor	Python overhead, missing CUDA graphs, or CPU-bound VAD	DCGM GPU utilisation	Use faster-whisper not reference; ensure CUDA build; offload VAD to dedicated CPU.
Domain-specific terms transcribed wrong	Vocabulary not in training distribution	Distribution of misrecognised terms	Use initial_prompt= to seed vocabulary; or domain fine-tune.
Latency spikes on long files	Sliding-window restarts and re-prefill	Per-window latency log	Pre-segment with VAD into <30 s windows; batch.
Translation output is poor	Used large-v3-turbo (no translation head)	Model id	Switch to large-v3 for translation; turbo is transcribe-only.
Punctuation / casing inconsistent across segments	BPE decoding + weak training signal	Per-segment punctuation distribution	Post-process with a dedicated punctuation/casing model.

Tip: When debugging an in-production Whisper deployment, always grab avg_logprob, compression_ratio, no_speech_prob and language_probability from the faster-whisper segment output and log them alongside the transcript. Without these four numbers it is essentially impossible to distinguish 'the audio was bad' from 'the model misbehaved' after the fact. Yobibyte's managed endpoints log all four by default in the audit trail.

Where Whisper fits in the Yobitel stack

MediQuery clinical dictation — Whisper Large-v3 on Yobibyte UK Sovereign, NCSC OFFICIAL alignment.
Yobibyte managed Whisper endpoints — Large-v3 and Large-v3-turbo, OpenAI-compatible audio API, UK / EU / US regions.
Yobitel Edge AI — faster-whisper / whisper.cpp on Jetson AGX Orin and edge CPUs.
InferenceBench — published RTF, WER and $/hour-transcribed across NeoCloud SKUs.
Omniscient Compute — Whisper inference billed under FOCUS 1.1 with model, runtime and tenant tags.

References

Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2022) · arXiv
openai/whisper (reference implementation) · GitHub
Whisper Large-v3 model card · Hugging Face
Whisper Large-v3-turbo model card · Hugging Face
SYSTRAN/faster-whisper (production runtime) · GitHub
ggerganov/whisper.cpp (edge runtime) · GitHub
WhisperX (forced alignment + diarisation) · GitHub
Distil-Whisper (Hugging Face distillation) · GitHub
NVIDIA NIM for Whisper · NVIDIA

Whisper

Overview

Quick start

How it works

Reference and specifications

Implementation variants

Workload patterns

Sizing and capacity planning

Limits and quotas

Observability

Cost and FinOps

Security and compliance

Migration and alternatives

Troubleshooting

Where Whisper fits in the Yobitel stack

References

Browse all entries

Deploy on Yobibyte

Whisper

Overview

Quick start

How it works

Reference and specifications

Implementation variants

Workload patterns

Sizing and capacity planning

Limits and quotas

Observability

Cost and FinOps

Security and compliance

Migration and alternatives

Troubleshooting

Where Whisper fits in the Yobitel stack

References

Browse all entries

Deploy on Yobibyte