TL;DR
- Whisper is an encoder-decoder Transformer trained by OpenAI on weakly-supervised multilingual audio (680k hours for the original release, scaled to ~5M hours for Large-v3). MIT-licensed, weights on Hugging Face, 99-language transcription plus any-to-English translation in one set of weights (Radford et al., arXiv:2212.04356).
- Eight size points ship: tiny (39M), base (74M), small (244M), medium (769M), large / large-v2 / large-v3 (~1.55B), and large-v3-turbo (~798M, Sep 2024) — a four-decoder-layer distillation that delivers near-large-v3 quality at roughly 8x the speed.
- Production almost never runs upstream PyTorch. The default runtimes are faster-whisper (CTranslate2, ~4x throughput, FP16 / INT8), whisper.cpp (CPU / Apple Silicon / edge), WhisperX (word-level alignment + diarisation), Distil-Whisper (English-only distillation), and NVIDIA NIM / TensorRT-LLM Whisper engines for low-latency GPU serving.
- Real-time factor (RTF) is the planning anchor. Large-v3 on a single H100 SXM5 runs at RTF ~0.03 (a 60-second clip transcribes in ~1.8 s); on L4 RTF is ~0.5 (~30 s for the same clip); large-v3-turbo lifts both numbers by 5-8x. A single L4 hosts dozens of concurrent batch streams cheaply; H100 is reserved for low-latency or very high concurrency.
- Yobitel angle: MediQuery uses Whisper Large-v3 for clinical dictation on NeoCloud UK; Yobibyte exposes Whisper Large-v3 and Large-v3-turbo as managed transcription endpoints with autoscaling and prompt-billing; Yobitel Edge AI ships faster-whisper on Jetson AGX Orin for on-device transcription in air-gapped settings.
Overview#
Whisper is an automatic speech recognition (ASR) model and the de facto open-weights baseline for speech-to-text since its September 2022 release. It is a vanilla encoder-decoder Transformer trained on a very large, noisy corpus of (audio, transcript) pairs scraped from the public web — 680,000 hours for the original release, expanded to roughly 5 million hours for Large-v3 (Radford et al., 'Robust Speech Recognition via Large-Scale Weak Supervision', arXiv:2212.04356). A single set of weights performs transcription in 99 languages, any-to-English translation, language identification, voice activity detection, and segment-level timestamp prediction; the active task is selected by special control tokens prepended to the decoder sequence.
Whisper is significant for two reasons. Architecturally, it demonstrated that weak supervision at internet scale beats clean small-corpus self-supervision (the wav2vec 2.0 paradigm) on robustness — the model handles accents, background noise, codec degradation, and domain shift dramatically better than its predecessors. Practically, MIT licensing of the weights and an excellent runtime ecosystem (faster-whisper, whisper.cpp, WhisperX, Distil-Whisper, NVIDIA NIM) made it the default starting point for every open-source transcription stack built after 2023 — from podcast indexing services to clinical scribing to subtitle generation.
The current production lineup is Large-v3 (November 2023, ~1.55B parameters, 128-bin mel input) and Large-v3-turbo (September 2024, ~798M parameters, four decoder layers, near-Large-v3 quality at roughly 8x the inference speed). On Yobitel NeoCloud, both models are first-class managed endpoints in Yobibyte; on Yobitel Edge AI, faster-whisper builds of medium / large-v3-turbo ship on Jetson AGX Orin for air-gapped on-device transcription. This entry helps you choose the right Whisper size and runtime for your workload, size GPU capacity using real-time factor as the planning anchor, decide between self-hosting and the Yobibyte managed alternative, and avoid the well-known failure modes — chiefly hallucination on silence and chunk-boundary drift in streaming.
Quick start#
The example below installs the production runtime (faster-whisper, CTranslate2-backed) on any CUDA 12+ host, transcribes a meeting recording with VAD-filtered word-level timestamps, and shows the equivalent call against a Yobibyte-managed Whisper endpoint using the OpenAI-compatible audio API. Self-hosting takes care of upstream weights; the Yobibyte path takes care of weights, autoscaling, GPU allocation, regional residency and billing.
# 1. Install faster-whisper (the production default) and run Large-v3 on a GPU host
pip install "faster-whisper>=1.1.0"
# Single-shot CLI transcription via the openai-whisper reference CLI
pip install openai-whisper
whisper meeting.wav --model large-v3 --device cuda --language en --output_format srt
# 2. Production-shape Python using faster-whisper
python - <<'PY'
from faster_whisper import WhisperModel
model = WhisperModel(
"large-v3", # or "large-v3-turbo" for 5-8x speed at ~equal quality
device="cuda",
compute_type="float16", # "int8_float16" on L4; "int8" for CPU
)
segments, info = model.transcribe(
"meeting.wav",
language="en", # omit to auto-detect from first 30 s
vad_filter=True, # silero VAD — strongly recommended
vad_parameters={"min_silence_duration_ms": 500},
beam_size=5,
word_timestamps=True,
condition_on_previous_text=False, # reduces hallucination drift
)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for s in segments:
print(f"[{s.start:7.2f} -> {s.end:7.2f}] {s.text}")
PY
# 3. Same job via a Yobibyte-managed Whisper endpoint — OpenAI-compatible audio API
curl https://YOUR-WORKSPACE.yobibyte.yobitel.com/v1/audio/transcriptions \
-H "Authorization: Bearer $YOBIBYTE_API_KEY" \
-F "model=whisper-large-v3" \
-F "file=@meeting.wav" \
-F "response_format=verbose_json" \
-F "timestamp_granularities[]=word"Always enable vad_filter and set condition_on_previous_text=False as your first two production flags. They together eliminate the majority of Whisper's notorious hallucination cases (repeated phrases during silence, drift on long files) with negligible quality cost.
How it works#
Whisper is a vanilla encoder-decoder Transformer with no domain-specific tricks — its strength comes from data scale and a clever multi-task control-token scheme, not architectural novelty. The encoder consumes 30-second chunks of log-mel spectrogram (80 bins through Large-v2, 128 bins from Large-v3 onward), sampled at 16 kHz and hop 10 ms, padded with zeros if the audio is shorter. Two convolutional layers downsample, then sinusoidal position embeddings are added and the result passes through a stack of Transformer encoder blocks. The decoder is an autoregressive BPE language model conditioned on encoder states via cross-attention; it generates one token at a time until an end-of-transcript token or the 30-second window is exhausted.
Multi-task behaviour is selected by a small vocabulary of control tokens prepended to the decoder sequence. The pattern is `<|startoftranscript|> <|language|> <|task|> <|notimestamps|>?`. The language token (e.g. `<|en|>`, `<|fr|>`, `<|zh|>`) selects one of 99 languages or can be omitted to trigger language identification. The task token is either `<|transcribe|>` (output in the source language) or `<|translate|>` (output in English). Without `<|notimestamps|>`, the decoder also emits `<|0.00|>` … `<|29.98|>` timestamp tokens delimiting each segment. A `<|nospeech|>` token at the first decoder position signals empty audio. The same weights serve every combination.
For audio longer than 30 seconds — which is essentially all real workloads — Whisper is run in a sliding-window pattern. The reference implementation transcribes the first 30-second window, takes the timestamp of the last decoded segment, advances the window to that point, optionally conditions the next window's decoder on the previous transcript (the `condition_on_previous_text` flag), and repeats. This is the source of the model's reputation for hallucination on long files: a confident-but-wrong segment poisons the conditioning context for every subsequent window, producing repeating loops or invented content. Disabling previous-text conditioning is the standard mitigation; pairing with an external VAD that pre-segments speech is the production-grade fix.
Large-v3-turbo (September 2024) is a distillation of Large-v3 that reduces the decoder from 32 layers to 4 while keeping the encoder unchanged. Because decoding cost dominates inference (the encoder runs once per 30-second window; the decoder runs per token), the speedup is dramatic — roughly 8x faster end-to-end on a single GPU at WER within 1-2 percentage points of Large-v3 on most benchmarks. Turbo does not support translation, only transcription. For most production transcription workloads on Yobitel NeoCloud, Large-v3-turbo is now the recommended default; Large-v3 stays for translation and for the small set of languages where the quality gap is material.
- Input: 30-second log-mel spectrogram, 80 bins (≤v2) or 128 bins (v3+), 16 kHz, 10 ms hop.
- Encoder: convolutional downsample + N Transformer blocks (N = 4 / 6 / 12 / 24 / 32 by size).
- Decoder: autoregressive BPE LM with cross-attention to encoder states; multilingual BPE tokeniser (~51,865 tokens including 99 language tokens and timestamp tokens).
- Control tokens: `<|startoftranscript|>` `<|lang|>` `<|task|>` `<|notimestamps|>?` select task and language.
- Long-form: sliding 30 s windows advanced by last decoded timestamp; conditioning on previous text is optional and a known hallucination source.
- Turbo distillation: same encoder, 4-layer decoder, ~8x faster decode at near-equal quality (transcription only).
Reference and specifications#
Eight size points are published. All are MIT-licensed with weights on Hugging Face under `openai/whisper-*`. The `.en` suffixes (tiny.en, base.en, small.en, medium.en) are English-only variants that drop the multilingual heads — slightly higher English accuracy at the same parameter count and useful when the workload is monolingual.
| Size | Parameters | Encoder / Decoder layers | VRAM (FP16) | RTF on H100 SXM5 | RTF on L4 | Multilingual |
|---|---|---|---|---|---|---|
| tiny | 39M | 4 / 4 | ~1 GB | ~0.003 | ~0.04 | Yes (+ tiny.en) |
| base | 74M | 6 / 6 | ~1 GB | ~0.005 | ~0.07 | Yes (+ base.en) |
| small | 244M | 12 / 12 | ~2 GB | ~0.010 | ~0.15 | Yes (+ small.en) |
| medium | 769M | 24 / 24 | ~5 GB | ~0.018 | ~0.30 | Yes (+ medium.en) |
| large-v1 / v2 | 1.55B | 32 / 32 | ~10 GB | ~0.030 | ~0.50 | Yes |
| large-v3 | 1.55B | 32 / 32 (128-mel) | ~10 GB | ~0.030 | ~0.50 | Yes |
| large-v3-turbo | ~798M | 32 / 4 (128-mel) | ~6 GB | ~0.005 | ~0.08 | Yes (transcribe only) |
| distil-large-v3 (HF) | ~756M | 32 / 2 (En-only) | ~6 GB | ~0.004 | ~0.07 | No (En) |
RTF figures are mid-range measurements with faster-whisper at FP16, beam_size=5, VAD on, batch=1. RTF scales near-linearly with batch on H100/H200/L40S up to the point KV cache saturates; on L4 it scales sublinearly because of memory bandwidth limits. Use these as planning anchors, not contractual SLOs.
Implementation variants#
OpenAI's reference repository (`openai/whisper`) is rarely used in production — it is single-stream, pure PyTorch, and emits one Python process per file. The production ecosystem has converged on four runtimes plus NVIDIA's vendor-optimised engines. Yobibyte's managed Whisper endpoints sit on top of this same set, selected per-workload.
- faster-whisper is the right default for almost everyone self-hosting on Yobitel NeoCloud — it covers GPU, FP16/INT8 quantisation, VAD integration, word-level timestamps and batch concurrency in a single pip install.
- whisper.cpp is the default for Yobitel Edge AI deployments on Jetson AGX Orin and customer-owned edge boxes — its GGML quantised builds (Q4_K, Q5_K) fit Large-v3 into ~3-5 GB and run real-time on CPU.
- WhisperX adds wav2vec 2.0-based forced alignment for accurate word-level timestamps plus pyannote diarisation — the right starting point for meeting / call-centre transcription that needs speaker labels.
- Distil-Whisper trades multilingual coverage for speed; for English-only batch indexing the distil-large-v3 + Large-v3-turbo combination is hard to beat on $/hour transcribed.
- NVIDIA NIM packages Whisper as a TensorRT-LLM + Triton container with prebuilt FP16 / FP8 engines for H100 / H200 / L40S / L4 — the right choice when you want low-latency real-time ASR on Hopper without building TensorRT engines yourself.
| Runtime | Backend | Target hardware | Speedup vs reference | When to use |
|---|---|---|---|---|
| faster-whisper | CTranslate2 (C++ + CUDA / oneDNN) | NVIDIA GPU, CPU, AMD ROCm (preview) | ~4x on GPU, ~10x on CPU | Default for self-hosted GPU serving; FP16 / INT8 / INT8_FLOAT16. |
| whisper.cpp | GGML (pure C/C++) | CPU, Apple Silicon (Metal), CUDA | ~1-2x CPU, real-time on M-series | Edge, mobile, desktop; offline transcription. |
| WhisperX | faster-whisper + wav2vec 2.0 + pyannote | NVIDIA GPU | ~2-3x with diarisation | When you need word-level timestamps and speaker labels in one pass. |
| Distil-Whisper | Transformers / faster-whisper | NVIDIA GPU | ~6x at small WER cost | English-only batch workloads; podcast / call-centre indexing. |
| NVIDIA NIM Whisper | TensorRT-LLM + Triton | H100 / H200 / L40S / L4 | ~6-10x with batching | High-concurrency low-latency serving; enterprise NVAIE customers. |
| TensorRT-LLM Whisper | TensorRT-LLM engine builds | Hopper / Blackwell / Ada | ~5-8x | Self-built engines when you have the TensorRT expertise. |
| openai/whisper (reference) | PyTorch | NVIDIA GPU, CPU | 1x baseline | Research; never for production. |
Workload patterns#
Whisper shows up in four workload shapes across Yobitel customers, each with different runtime, hardware and flag choices. These are the same four shapes Yobibyte derives its managed endpoint configurations from.
| Workload | Latency target | Runtime | Hardware | Key flags |
|---|---|---|---|---|
| Batch indexing (podcast / archive) | Throughput-bound, hours OK | faster-whisper or Distil-Whisper | L4 / L40S | FP16/INT8, beam=5, large-v3-turbo, batch=8-16. |
| Meeting transcription (post-call) | Minutes per hour of audio | WhisperX | L40S / L4 | Large-v3 + wav2vec alignment + pyannote. |
| Clinical / call-centre near-real-time | 1-3 s end-to-end | faster-whisper (chunked) or NIM Whisper | H100 / L40S | Large-v3-turbo, VAD-aligned 5-10 s chunks. |
| Streaming voice agent | Sub-300 ms first hypothesis | Whisper-streaming or native Conformer | L4 / L40S | Pair Whisper with a streaming front-end; or switch to Parakeet/Canary. |
| Edge / air-gapped dictation | Real-time on-device | whisper.cpp or faster-whisper | Jetson AGX Orin, edge CPU | INT8 / Q5_K, medium or large-v3-turbo. |
Whisper is not natively a streaming model. For voice-agent latency budgets under 500 ms first-byte, do not try to make Whisper stream — switch to a streaming-native architecture (Conformer + RNN-T, Parakeet, Canary). Whisper-streaming and similar wrappers are workable for live captioning at 1-2 s latency but not for interactive voice UX. The Yobibyte real-time ASR endpoint defaults to Parakeet RNN-T for sub-300 ms streaming and reserves Whisper for batch and chunked near-real-time.
Sizing and capacity planning#
Whisper sizing is governed by real-time factor (RTF — wall-clock seconds per second of audio). Throughput is `1 / RTF` streams per GPU; cost per hour transcribed is `(GPU hourly rate) x RTF`. The figures below are mid-range observed with faster-whisper at FP16, beam_size=5, VAD on; the table uses Yobitel NeoCloud UK list pricing (June 2026) for the $/hour-of-audio column. Substitute spot rates for ~40-60 percent reduction.
- Default cost-optimal SKU for batch transcription on Yobitel NeoCloud: L4 with large-v3-turbo INT8 — roughly $0.07 per hour of audio with comfortable concurrency.
- Default cost-optimal SKU for high-volume near-real-time: H100 SXM5 with large-v3-turbo FP16 — roughly $0.017 per hour of audio at ~200 concurrent streams.
- For translation workloads (any-to-English), Large-v3 is still required — Turbo dropped the translation head. Plan ~5x higher cost than transcription-only.
- Long-file workloads (>1 hour) benefit from longer VAD-min-silence and disabled previous-text conditioning more than from any flag tuning.
- Yobibyte managed Whisper endpoints bill per minute of audio transcribed at a flat USD rate, eliminating the need to size GPU capacity manually — useful when load is spiky.
| Model | Runtime | GPU | RTF (batch=1) | Concurrent streams (RTF=1) | GPU rate ($/h) | $/hour transcribed |
|---|---|---|---|---|---|---|
| large-v3 | faster-whisper FP16 | 1x H100 SXM5 | ~0.03 | ~30 | $3.20 | ~$0.11 |
| large-v3 | faster-whisper FP16 | 1x L40S 48GB | ~0.08 | ~12 | $1.80 | ~$0.15 |
| large-v3 | faster-whisper INT8_F16 | 1x L4 24GB | ~0.50 | ~2 | $0.85 | ~$0.43 |
| large-v3-turbo | faster-whisper FP16 | 1x H100 SXM5 | ~0.005 | ~200 | $3.20 | ~$0.017 |
| large-v3-turbo | faster-whisper FP16 | 1x L40S 48GB | ~0.012 | ~80 | $1.80 | ~$0.024 |
| large-v3-turbo | faster-whisper INT8_F16 | 1x L4 24GB | ~0.08 | ~12 | $0.85 | ~$0.072 |
| large-v3 | NIM Whisper (TRT-LLM) | 1x H100 SXM5 | ~0.005 (batched) | ~200 | $3.20 | ~$0.017 |
| distil-large-v3 | faster-whisper FP16 | 1x L4 24GB | ~0.07 | ~14 | $0.85 | ~$0.062 |
| large-v3 | whisper.cpp Q5_K | Jetson AGX Orin 64GB | ~0.9 | ~1 | Edge (CapEx) | n/a (on-device) |
| large-v3-turbo | whisper.cpp Q5_K | Jetson AGX Orin 64GB | ~0.15 | ~6 | Edge (CapEx) | n/a (on-device) |
Limits and quotas#
Whisper's hard limits come from its architecture; soft limits come from the chosen runtime. The constraints below are the ones that catch teams in production.
| Limit | Default / Ceiling | Cause | Workaround |
|---|---|---|---|
| Audio window per pass | 30 seconds | Trained context length | Chunk with VAD; never pass >30 s in one call. |
| Decoder generation length | 224 tokens (default) | max_new_tokens / no_repeat_ngram | Raise --max-len; pre-segment audio. |
| Sample rate | 16 kHz (resampled internally) | Mel feature extractor | Pre-resample to avoid CPU overhead on hot paths. |
| Languages supported | 99 transcribe / 96 to-En translate | Training data coverage | Fine-tune or rescore for unsupported languages. |
| Speaker labels | Not built in | Single-speaker model | Combine with pyannote / WhisperX. |
| Streaming | Not native | 30-second window architecture | Use Whisper-streaming, or switch to Parakeet / Canary. |
| Word-level timestamps (reference) | Approximate / unstable | Decoder timestamp tokens are segment-level | Use WhisperX (wav2vec forced alignment) or faster-whisper word_timestamps=True. |
| Diacritic / casing accuracy | Variable per language | BPE tokenisation + weak supervision | Domain fine-tune; post-process with a punctuation model. |
| Long-form hallucination | Documented failure mode | Conditioning on previous decoded text | Set condition_on_previous_text=False; use VAD. |
| Numerical / proper-noun recall | Mid-tier | Web-scraped training data | Use initial_prompt= to bias vocabulary; fine-tune on domain. |
Whisper has a well-documented tendency to hallucinate during silence and low-information audio — entire invented sentences, often repeating. This is the single largest production failure mode. Mandatory mitigations: (1) VAD-filter before inference; (2) condition_on_previous_text=False; (3) reject segments where avg_logprob is below a threshold (typically -1.0); (4) reject segments where compression_ratio exceeds ~2.4 (repeated text inflates the ratio).
Observability#
Production Whisper deployments need three metric families: throughput (streams / RTF), quality signals (avg_logprob and compression_ratio distributions), and audio-domain signals (VAD-detected silence ratio, language-ID distribution). faster-whisper and NIM Whisper both surface these natively; whisper.cpp emits them in JSON output. The Prometheus rules below are what Yobibyte uses internally and what Omniscient Compute scrapes from customer endpoints under the managed observability tier.
- whisper:active_streams — concurrent in-flight transcriptions; alert when sustained at runtime capacity.
- whisper:real_time_factor — moving average RTF; should sit at or below the planning anchor for the model+SKU.
- whisper:avg_logprob — mean per-token log probability per segment; a sustained drop below -0.7 signals a domain shift or audio-quality regression.
- whisper:compression_ratio — ratio of decoded text length to gzipped length; values above 2.4 indicate the decoder is looping or repeating (hallucination signal).
- whisper:nospeech_prob — distribution of the no-speech probability emitted by the decoder; useful for tuning the VAD threshold.
- whisper:language_detected_total{language=...} — counter of detected source languages; sudden distribution shifts are the most common production-incident leading indicator.
- DCGM GPU utilisation and memory bandwidth — pair with RTF to distinguish encoder-bound (memory-bandwidth limited) from decoder-bound (FLOPs limited) regimes.
# Prometheus rules for a self-hosted faster-whisper deployment
groups:
- name: whisper-sla
interval: 30s
rules:
- alert: WhisperRTFRegression
expr: avg_over_time(whisper:real_time_factor[10m]) > 0.6
for: 10m
labels: { severity: warning, team: speech }
annotations:
summary: "RTF >0.6 — capacity headroom gone or batching broken"
- alert: WhisperHallucinationSignal
expr: histogram_quantile(0.95,
sum by (le) (rate(whisper:compression_ratio_bucket[10m]))) > 2.4
for: 10m
labels: { severity: warning }
annotations:
summary: "Compression ratio p95 >2.4 — repeated-text hallucination spike"
- alert: WhisperLogProbCollapse
expr: avg_over_time(whisper:avg_logprob[15m]) < -0.9
for: 15m
labels: { severity: warning }
annotations:
summary: "Average log-prob collapsed — audio quality or domain regression"
- alert: WhisperLanguageDrift
expr: |
(sum by (language) (rate(whisper:language_detected_total[1h])))
/ ignoring(language) group_left
sum(rate(whisper:language_detected_total[1h])) > 0.3
unless on(language) (label_replace(vector(1), "language", "$1", "language", "(en|es|fr|de|pt)"))
for: 30m
labels: { severity: info }
annotations:
summary: "Unexpected language >30 percent of traffic — workload shape changed"compression_ratio is the single highest-signal hallucination detector. Reject or flag any segment with ratio >2.4 at the application layer; combined with VAD pre-segmentation it eliminates the majority of repeated-phrase hallucinations from your final transcript.
Cost and FinOps#
Whisper cost economics are unusually clean: cost per hour transcribed equals GPU hourly rate times RTF, minus any batching gains. The table in 'Sizing' is the planning model. Three FinOps levers move the number materially.
- Model choice: Large-v3-turbo is roughly 6-8x cheaper per hour transcribed than Large-v3 at near-equal quality for transcription. For new deployments, Turbo is the default; reserve Large-v3 for translation and the small set of languages with measurable quality gap.
- SKU choice: L4 is the unit-cost winner for batch / asynchronous workloads; H100 wins for high-concurrency or low-latency workloads where RTF needs to stay below ~0.01. L40S is a useful middle ground.
- Spot capacity: Yobitel NeoCloud spot rates run 40-60 percent below list. Whisper is a particularly good spot workload because every audio file is independent and re-runnable — no checkpoint state to preserve.
- Self-host vs Yobibyte managed: self-hosting wins per-unit at sustained >40 percent GPU utilisation; Yobibyte's per-minute pricing wins for spiky or low-volume traffic where idle GPUs dominate cost. The break-even at June 2026 list pricing is roughly 18,000 minutes of audio per month per H100.
- FOCUS 1.1 billing exports from Yobitel NeoCloud and Omniscient Compute tag Whisper inference by `model`, `runtime` and `tenant`, so $/hour-transcribed can be sliced by application or customer in any FinOps tool that consumes FOCUS.
Security and compliance#
Whisper itself is a model — security and compliance posture is set entirely by the deployment surface around it. Three considerations matter in practice. First, audio data is often more sensitive than text data: voice carries biometric identifiers and incidental personal information that text transcripts do not. Treat raw audio storage as personal data under GDPR regardless of whether you store the transcript.
Second, model weights are public — the supply-chain risk is the runtime, not the weights. Pin runtime container digests, verify Hugging Face checksum on weight pulls, and run on a registry mirror inside the sovereign perimeter. Yobitel NeoCloud's private model registry mirrors Whisper weights into UK and EU regions for customers who require provenance-controlled inference.
Third, sovereign deployments — UK NCSC OFFICIAL and OFFICIAL-SENSITIVE, EU GDPR with data-residency commitments, US HIPAA for clinical audio — require that the audio never leaves the regulated region. Yobibyte's managed Whisper endpoints pin model serving and audio buffering to the workspace's stated sovereignty region (UK, EU, US). MediQuery's clinical dictation deployment runs Whisper in a Yobibyte UK Sovereign tenancy aligned to NCSC Cloud Security Principles, with audio held in volatile GPU memory only — no on-disk persistence beyond the transcript output that the customer chooses to retain.
Whisper transcripts can leak via three channels that ASR-naive teams miss: (1) prompt-injection through audio that triggers the model to follow spoken instructions (rare but documented); (2) initial_prompt= bias that exposes a previous tenant's vocabulary if the prompt is reused across tenants; (3) prefix-style caching at the gateway layer. Production multi-tenant deployments must enforce per-tenant request isolation and per-tenant initial_prompt scoping — Yobibyte does both by default.
Migration and alternatives#
Most production migrations to Whisper come from one of four origins: a commercial cloud ASR (AWS Transcribe, Google Cloud Speech-to-Text, Azure Speech), an earlier open ASR (Kaldi, DeepSpeech, wav2vec 2.0 fine-tuned), a managed SaaS (AssemblyAI, Deepgram, Speechmatics), or — in 2026 — from older Whisper deployments still on Large-v2. Counter-migrations away from Whisper exist too, primarily to streaming-native architectures (Parakeet, Canary) when latency becomes the binding constraint.
- vs AWS Transcribe / GCP Speech / Azure Speech: Whisper is competitive-to-better on accent and noise, dramatically better on low-resource languages, materially cheaper per hour when self-hosted, and unconstrained by hyperscaler data-residency assumptions. The break-even is roughly 1,000 hours of audio per month on a single L4.
- vs Deepgram / AssemblyAI / Speechmatics: managed-SaaS competitors lead on streaming latency, diarisation polish, and SDK ergonomics; Whisper leads on multilingual breadth, raw weight access for fine-tuning, and sovereignty.
- vs Conformer-based streaming (Parakeet, Canary, Google USM): Conformer wins for streaming latency budgets under 500 ms; Whisper wins for offline / chunked workloads where the 30-second window is acceptable and translation is needed.
- Yobibyte managed alternative: a one-line endpoint URL change swaps self-hosted Whisper for a Yobibyte-managed endpoint that handles model rotation (Large-v2 → v3 → Turbo), autoscaling, regional residency pinning and FOCUS billing — useful when the operational overhead of self-hosting outweighs the per-unit cost gain.
| From | Effort | Quality change | Cost change | Operational notes |
|---|---|---|---|---|
| AWS Transcribe (standard) | Low — replace SDK call | Comparable on EN; Whisper often wins on accent/noise | ~30-70 percent lower | Sovereignty and data-residency control gained. |
| Google Cloud Speech-to-Text | Low | Comparable on EN; Whisper wins on low-resource languages | ~30-60 percent lower | Lose Google's word-level confidence; gain MIT licence. |
| Azure Speech | Low | Comparable | ~30-50 percent lower | Same as GCP migration shape. |
| AssemblyAI / Deepgram / Speechmatics | Low — same API surface | Variable; vendors lead on speaker diarisation polish | ~50-80 percent lower (self-host) | Lose vendor diarisation; gain via WhisperX + pyannote. |
| Kaldi / DeepSpeech / older OSS | Medium | Substantial uplift on noisy/accented | Generally lower (modern GPU runtimes) | Lose Kaldi's WFST / LM rescoring options; gain robustness. |
| Whisper Large-v2 (in-place upgrade) | Trivial — change model id | 1-3 WER points improvement on average | Equal | Re-validate domain-specific accuracy. |
| Whisper Large-v3 → Large-v3-turbo | Trivial | Within 1-2 WER points on most benchmarks | ~6-8x cheaper | Loses translation head; transcription only. |
| Whisper → Parakeet / Canary (streaming) | Medium | Comparable WER, much lower latency | Often lower (smaller models) | When sub-300 ms streaming latency is the requirement. |
| Self-hosted Whisper → Yobibyte managed | Low — change endpoint URL | Equal (same weights) | Equal at high utilisation; cheaper at low utilisation | Gain regional residency, autoscaling, FOCUS billing. |
Troubleshooting#
Six failure modes account for the vast majority of Whisper production incidents. The runbook below covers them.
| Symptom | Likely cause | First check | Fix |
|---|---|---|---|
| Transcript contains invented repeated phrases | Hallucination on silence / low-info audio | compression_ratio per segment | Enable VAD; condition_on_previous_text=False; reject ratio >2.4. |
| Transcript drifts to a different topic mid-file | Previous-text conditioning poisoned context | condition_on_previous_text setting | Set to False; re-run. |
| Wrong language detected | Auto-detect ran on 30 s of silence or music | info.language_probability | Force --language; or detect on a VAD-filtered sample. |
| Word-level timestamps drift | Decoder timestamp tokens are segment-level only | Runtime used for timestamps | Switch to WhisperX (wav2vec forced alignment) or faster-whisper word_timestamps=True. |
| GPU memory OOM with batch >1 | KV cache + activations exceed VRAM at batch | GPU memory at peak | Lower batch_size; use INT8_FLOAT16 on L4; or move to L40S/H100. |
| RTF much higher than planning anchor | Python overhead, missing CUDA graphs, or CPU-bound VAD | DCGM GPU utilisation | Use faster-whisper not reference; ensure CUDA build; offload VAD to dedicated CPU. |
| Domain-specific terms transcribed wrong | Vocabulary not in training distribution | Distribution of misrecognised terms | Use initial_prompt= to seed vocabulary; or domain fine-tune. |
| Latency spikes on long files | Sliding-window restarts and re-prefill | Per-window latency log | Pre-segment with VAD into <30 s windows; batch. |
| Translation output is poor | Used large-v3-turbo (no translation head) | Model id | Switch to large-v3 for translation; turbo is transcribe-only. |
| Punctuation / casing inconsistent across segments | BPE decoding + weak training signal | Per-segment punctuation distribution | Post-process with a dedicated punctuation/casing model. |
When debugging an in-production Whisper deployment, always grab `avg_logprob`, `compression_ratio`, `no_speech_prob` and `language_probability` from the faster-whisper segment output and log them alongside the transcript. Without these four numbers it is essentially impossible to distinguish 'the audio was bad' from 'the model misbehaved' after the fact. Yobibyte's managed endpoints log all four by default in the audit trail.
Where Whisper fits in the Yobitel stack#
Whisper is the workhorse speech recognition model across the Yobitel platform. It appears in four distinct places. First, MediQuery — Yobitel's clinical dictation and ambient-scribing AI Application — uses Whisper Large-v3 for the clinical narrative path on Yobitel NeoCloud UK, with the deployment held to NCSC Cloud Security Principles and clinician-validated transcripts wired into the EHR connector. Second, Yobibyte exposes Whisper Large-v3 and Large-v3-turbo as managed transcription endpoints under the OpenAI-compatible audio API; customers point existing OpenAI-SDK clients at their Yobibyte workspace URL and get sovereign-region inference with per-minute billing and FOCUS-conformant cost export.
Third, Yobitel Edge AI ships faster-whisper builds and whisper.cpp Q5_K builds on Jetson AGX Orin and customer-owned edge boxes for on-device transcription in air-gapped settings (defence, regulated healthcare, broadcast field kit). Fourth, Omniscient Compute and InferenceBench publish ongoing Whisper benchmark scores — RTF, WER on FLEURS / LibriSpeech / domain corpora, cost per hour transcribed — across NVIDIA H100, H200, L40S, L4 and Blackwell SKUs so customers planning capacity have current measured anchors rather than vendor marketing numbers.
If you are choosing between self-hosting Whisper on Yobitel NeoCloud GPU instances and consuming it as a Yobibyte managed endpoint, the deciding factor is usually sustained utilisation: self-hosting wins per-unit above roughly 40 percent GPU utilisation; Yobibyte's per-minute pricing wins below that. Either way, the model, weights and behaviour are identical — Yobibyte is a thinner consumption surface, not a different product.
- MediQuery clinical dictation — Whisper Large-v3 on Yobibyte UK Sovereign, NCSC OFFICIAL alignment.
- Yobibyte managed Whisper endpoints — Large-v3 and Large-v3-turbo, OpenAI-compatible audio API, UK / EU / US regions.
- Yobitel Edge AI — faster-whisper / whisper.cpp on Jetson AGX Orin and edge CPUs.
- InferenceBench — published RTF, WER and $/hour-transcribed across NeoCloud SKUs.
- Omniscient Compute — Whisper inference billed under FOCUS 1.1 with `model`, `runtime` and `tenant` tags.
References
- Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2022) · arXiv
- openai/whisper (reference implementation) · GitHub
- Whisper Large-v3 model card · Hugging Face
- Whisper Large-v3-turbo model card · Hugging Face
- SYSTRAN/faster-whisper (production runtime) · GitHub
- ggerganov/whisper.cpp (edge runtime) · GitHub
- WhisperX (forced alignment + diarisation) · GitHub
- Distil-Whisper (Hugging Face distillation) · GitHub
- NVIDIA NIM for Whisper · NVIDIA