NVIDIA L40S GPU

TL;DR

Ada Lovelace data centre GPU (AD102, TSMC 4N, 76 billion transistors) launched at SIGGRAPH August 2023 as the AI-rebalanced sibling of L40 and the workhorse of the 'inference + media' tier — the default cloud SKU below H100 for 7B-13B serving, SDXL image generation, video transcoding and small-model fine-tuning through 2026.
48 GB GDDR6 ECC at 864 GB/s, with no HBM and no NVLink. Fourth-generation Tensor Cores deliver 733 TFLOPS BF16/FP16 with 2:4 sparsity (366 TFLOPS dense) and 1,466 TFLOPS FP8 with sparsity (733 TFLOPS dense); third-generation RT cores at 212 TFLOPS for ray tracing; NVENC/NVDEC engines for AV1/H.264/H.265 transcoding.
350 W TDP in a dual-slot FHFL PCIe Gen4 card that fits standard 2U/4U inference servers without exotic cooling. No NVLink means tensor parallelism cannot scale across cards over a high-bandwidth link; a model that needs more than 48 GB must be split across cards over PCIe (pipeline-parallel) or moved to a larger-memory GPU.
Sweet spot: 7B-13B BF16 serving, 34B AWQ INT4 inference, SDXL/Flux image generation pipelines, batch transcoding workloads. Not the right card for 70B+ models, 128K+ contexts, or any workload that depends on multi-card NVLink collectives.
Pricing in 2026: on-demand $0.95-$1.20 / GPU-hr at hyperscalers, $0.70-$0.90 1-year reserved, $0.55-$0.70 3-year, $0.35-$0.45 spot. Roughly 40-55 % cheaper than H100 on $/GPU-hr while delivering 35-50 % of H100 throughput on compute-bound workloads — usually wins on $/token for sub-30B inference.

Overview

The L40S is the Ada Lovelace data centre GPU built specifically for AI inference and accelerated graphics. Announced at SIGGRAPH 2023 as an AI-rebalanced refresh of the visualisation-focused L40, it shares the AD102 die with the RTX 6000 Ada and L40 but ships with higher sustained Tensor Core clocks, AI-tuned driver paths and an FP8-first software story. Where Hopper went to HBM and NVLink to chase frontier training, Ada/L40S stayed on GDDR6 and PCIe to chase inference economics — the trade-off that defines its position in the rack.

Through 2024-2026, L40S has become the default 'cheap-but-capable' inference SKU on the major hyperscalers (AWS g6e; Azure NCads variants, noting that NCads H100 v5 itself is an H100 instance), on most Tier-1 neoclouds (Lambda, CoreWeave, Crusoe), and in the on-prem AI-factory reference designs from Dell, Supermicro, HPE and Lenovo. The reason is simple: at $0.95-$1.20 / GPU-hour on-demand, with 48 GB of VRAM and first-class FP8 support, it serves 7B-13B BF16 traffic at roughly 40-60 % of H100 throughput while costing 40-55 % less. For workloads that fit, L40S wins on $/token whenever its share of H100 throughput is larger than its share of H100 price.

This entry is the 2026 reference for teams sizing L40S fleets: the AD102 silicon, the full per-SKU spec sheet, where L40S sits relative to L4/L40 and to H100/H200, the workloads it dominates and the workloads it fails on, current cost ranges in USD, and the migration matrix in and out. Yobitel NeoCloud offers L40S capacity broadly across UK and EU regions with NCSC OFFICIAL alignment, and L40S is the default landing zone Yobibyte schedules sub-34B serving workloads onto with FP8 calibration baked into the model-onboarding pipeline. This entry helps you decide when L40S is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.

How it works: AD102 and the L40S binning

AD102 is the largest Ada Lovelace die: 76 billion transistors on TSMC's custom 4N process, 144 SMs and 96 MB of L2 cache. L40S enables 142 of those SMs, which gives 18,176 CUDA cores and 568 fourth-generation Tensor Cores. The same silicon ships in three personalities: the RTX 6000 Ada (workstation, 48 GB), L40 (visualisation, 48 GB, lower clocks), and L40S (AI inference, 48 GB, higher sustained clocks). L40S is the AI binning: same die, same memory, but boost-clocked for sustained tensor throughput where L40 was tuned for ray-tracing and viewport latency.

The fourth-generation Tensor Core is the same generation as Hopper's, including native FP8 (E4M3 forward, E5M2 gradient) support at twice the FP16 throughput. What L40S does not have is the Hopper Transformer Engine's runtime amax tracking — FP8 calibration on Ada is a build-time step in TensorRT-LLM or a one-shot calibration pass in vLLM, not a per-layer runtime decision. Calibrated correctly, FP8 inference on L40S reaches 1,466 TFLOPS sparse / 733 TFLOPS dense, putting it within a factor of 2.5-3x of H100 FP8 throughput on compute-bound workloads.
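To make "build-time calibration" concrete, the following is a minimal sketch of static per-tensor FP8 scaling. It is not the TensorRT-LLM or vLLM implementation; the function names and sample values are hypothetical, and the rounding step of real FP8 conversion is deliberately left out. The only external fact it relies on is that the largest finite E4M3 value is 448.

```python
# Sketch of static (build-time) FP8 E4M3 scaling. Illustrative only; this is
# not the code path of any real inference library.
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def calibrate_scale(calibration_batches):
    """Return one fixed per-tensor scale from the largest absolute value seen."""
    amax = max(abs(x) for batch in calibration_batches for x in batch)
    return amax / E4M3_MAX


def fake_quantize(values, scale):
    """Scale into the E4M3 range, clamp, and scale back.

    Real FP8 conversion also rounds to the nearest representable value; that
    step is omitted, so only the clamping behaviour is modelled here.
    """
    out = []
    for x in values:
        q = max(-E4M3_MAX, min(E4M3_MAX, x / scale))
        out.append(q * scale)
    return out


# Hypothetical activation samples collected during a calibration pass.
batches = [[0.5, -2.0, 1.25], [3.5, -0.75, 0.1]]
scale = calibrate_scale(batches)

# A later input outside the calibrated range is clipped to the calibrated amax,
# because the scale is frozen at build time and never updated at runtime.
clipped = fake_quantize([7.0, -1.0], scale)
print(scale, clipped)
```

Because the scale is fixed once, inputs that drift beyond the calibration range saturate rather than adapt, which is the practical difference from runtime amax tracking.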

The single architectural choice that defines L40S is GDDR6 over HBM. 48 GB is a lot of capacity: 20 % more than an A100 40 GB and 60 % of an A100 80 GB. The bandwidth ceiling of 864 GB/s, however, is roughly 43 % of A100 HBM2e (2.0 TB/s), 26 % of H100 HBM3 (3.35 TB/s), and 18 % of H200 HBM3e (4.8 TB/s). For decode-bound LLM inference (long contexts, large KV caches), this is the binding constraint. For compute-bound workloads (image generation denoising, prefill, batched 7B BF16), the gap closes substantially. Pick L40S where compute dominates; pick HBM-class cards where memory bandwidth dominates.
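The bandwidth percentages above, and what they imply for decode, can be reproduced with a few lines of arithmetic. This is a rough sketch under a simplifying assumption that is not from the source: single-sequence decode is treated as purely bandwidth-bound, with every generated token streaming all weights from memory once. The helper names are illustrative.

```python
# Bandwidth figures in GB/s, as quoted in the paragraph above.
L40S_BW_GB_S = 864.0
PEERS = {"A100 HBM2e": 2000.0, "H100 HBM3": 3350.0, "H200 HBM3e": 4800.0}


def bandwidth_share(peer_bw_gb_s):
    """Fraction of a peer's memory bandwidth that L40S provides."""
    return L40S_BW_GB_S / peer_bw_gb_s


def decode_ceiling_tokens_per_s(weight_gb, bandwidth_gb_s=L40S_BW_GB_S):
    """Bandwidth-only upper bound for one sequence: one full weight read per token."""
    return bandwidth_gb_s / weight_gb


shares = {name: round(100 * bandwidth_share(bw)) for name, bw in PEERS.items()}

# Assumption: 8B parameters at 1 byte each (FP8) is about 8 GB of weights.
ceiling_8b_fp8 = decode_ceiling_tokens_per_s(8.0)
print(shares, ceiling_8b_fp8)
```

Batching amortises each weight read across many sequences, so aggregate throughput at moderate concurrency is far higher than this per-sequence ceiling; the bound is mainly useful for comparing cards on latency-sensitive decode.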

The other architectural choices follow the inference brief. No NVLink — multi-card collectives go over PCIe Gen4 (64 GB/s) only, which makes tensor parallelism across cards roughly 8-12x slower than on NVLink and limits L40S to single-card replicas or pipeline-parallel splits. No MIG — multi-tenant isolation relies on vGPU licensing (NVIDIA AI Enterprise) or container-level scheduling, neither of which is hardware-isolated. Third-generation RT cores at 212 TFLOPS for ray tracing remain useful for 3D rendering and Omniverse workloads, but most AI deployments leave them idle.

AD102 die: 142 active SMs on L40S (144 physical, harvested), 568 fourth-generation Tensor Cores, 96 MB L2 cache, 128 KB L1/SMEM per SM.
Compute capability sm_89 (Ada Lovelace), distinct from sm_80 (Ampere) and sm_90 (Hopper); kernels compiled for sm_89 run on L4, L40, L40S, RTX 6000 Ada and the RTX 40-series consumer cards.
Memory: 24 GDDR6 chips with ECC, 384-bit bus, 18 Gbps per pin, 864 GB/s aggregate. No HBM. There is no non-ECC variant: all L40S cards ship with ECC enabled.
Third-generation RT cores: 212 TFLOPS for BVH traversal — relevant for Omniverse, ray-traced denoising in image pipelines.
NVENC/NVDEC: third-generation; 3 NVENC + 3 NVDEC engines per card; AV1 encode/decode, H.265 10-bit; sustains roughly 6x simultaneous 4K60 streams for transcoding.
No FP4 (Blackwell-only), no Transformer Engine runtime amax (Hopper-only), no NVLink (PCIe-only), no MIG (limited to data centre parts such as A100, A30 and H100), no Confidential Compute (introduced with Hopper).
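Several of the figures in this list follow directly from one another, which makes them easy to sanity-check. The sketch below derives core counts and memory bandwidth from the SM count, bus width and pin rate quoted above; the per-SM counts (128 CUDA cores, 4 Tensor Cores) are the Ada Lovelace SM layout, and the boost clock is not taken from the source but inferred from the 91.6 TFLOPS FP32 figure in the spec sheet.

```python
ACTIVE_SMS = 142
CUDA_CORES_PER_SM = 128    # Ada Lovelace SM layout
TENSOR_CORES_PER_SM = 4
BUS_WIDTH_BITS = 384
PIN_RATE_GBIT_S = 18       # per pin
FP32_TFLOPS = 91.6         # from the spec sheet

cuda_cores = ACTIVE_SMS * CUDA_CORES_PER_SM
tensor_cores = ACTIVE_SMS * TENSOR_CORES_PER_SM

# Bandwidth = bus width x per-pin rate, converted from bits to bytes.
bandwidth_gb_s = BUS_WIDTH_BITS * PIN_RATE_GBIT_S / 8

# FP32 peak = cores x 2 FLOPs per cycle (fused multiply-add) x clock,
# so the quoted peak implies a boost clock in GHz.
implied_boost_ghz = FP32_TFLOPS * 1e3 / (cuda_cores * 2)

print(cuda_cores, tensor_cores, bandwidth_gb_s, round(implied_boost_ghz, 2))
```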

Reference: full specification sheet

Authoritative per-card figures. Sparse Tensor figures assume 2:4 structured sparsity; dense throughput is half the sparse figure. L40S ships in a single SKU — there is no SXM L40S, no 24 GB variant, no NVLink bridge. The closely related cards (L40, L4, RTX 6000 Ada) are shown alongside for sizing context.

Metric	L40S	L40 (compare)	L4 (compare)	RTX 6000 Ada (compare)
Architecture	Ada Lovelace AD102	Ada Lovelace AD102	Ada Lovelace AD104	Ada Lovelace AD102
Process	TSMC 4N	TSMC 4N	TSMC 4N	TSMC 4N
Transistors	76 billion	76 billion	35.8 billion	76 billion
Active SMs	142	142	58	142
CUDA cores	18,176	18,176	7,424	18,176
Tensor cores	568	568	232	568
RT cores	142 (gen 3)	142 (gen 3)	58 (gen 3)	142 (gen 3)
Compute capability	sm_89	sm_89	sm_89	sm_89
FP32	91.6 TFLOPS	90.5 TFLOPS	30.3 TFLOPS	91.1 TFLOPS
TF32 (Tensor, sparse)	366 TFLOPS	181 TFLOPS	120 TFLOPS	364 TFLOPS
BF16 / FP16 (Tensor, sparse)	733 TFLOPS	362 TFLOPS	242 TFLOPS	728 TFLOPS
BF16 / FP16 (Tensor, dense)	366 TFLOPS	181 TFLOPS	121 TFLOPS	364 TFLOPS
FP8 (Tensor, sparse)	1,466 TFLOPS	724 TFLOPS	485 TFLOPS	1,457 TFLOPS
INT8 (Tensor, sparse)	1,466 TOPS	724 TOPS	485 TOPS	1,457 TOPS
RT-core throughput	212 TFLOPS	210 TFLOPS	73 TFLOPS	210 TFLOPS
Memory	48 GB GDDR6 ECC	48 GB GDDR6 ECC	24 GB GDDR6	48 GB GDDR6 ECC
Memory bandwidth	864 GB/s	864 GB/s	300 GB/s	960 GB/s
L2 cache	96 MB	96 MB	48 MB	96 MB
NVENC / NVDEC	3 / 3 (gen 3)	3 / 3 (gen 3)	2 / 4 (gen 3)	3 / 3 (gen 3)
NVLink	Not supported	Not supported	Not supported	Not supported
MIG	Not supported	Not supported	Not supported	Not supported
PCIe	Gen4 x16 (64 GB/s)	Gen4 x16 (64 GB/s)	Gen4 x16 (64 GB/s)	Gen4 x16 (64 GB/s)
TDP	350 W	300 W	72 W	300 W
Form factor	FHFL dual-slot	FHFL dual-slot	LP single-slot	FHFL dual-slot
Cooling	Passive (server)	Passive (server)	Passive (server)	Active (workstation)
Confidential Compute	Not supported	Not supported	Not supported	Not supported
Minimum driver	R535	R525	R525	R525
Recommended driver (2026)	R570 stable	R570	R570	R570
Minimum CUDA	12.0	11.8	11.8	11.8

Warning: L40S is PCIe-only with no NVLink bridge. Multi-card workloads (tensor-parallel above 48 GB) communicate over PCIe Gen4 at 64 GB/s, roughly 14x less bandwidth than NVLink 4.0 at 900 GB/s, and tensor-parallel inference at TP=2 typically loses 35-55 % throughput versus the same model on a single H100. If a workload needs more than 48 GB, the choice is usually 'move to H100/H200' rather than 'use 2x L40S'.
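The size of the interconnect gap can be estimated with a bandwidth-only model of a ring all-reduce, the collective that tensor parallelism issues after each sharded layer. This is a sketch under stated assumptions: the 900 GB/s NVLink 4.0 figure is the H100 SXM aggregate and is not from the spec sheet above, latency and protocol overhead are ignored, and the message size is a hypothetical example.

```python
PCIE_GEN4_X16_GB_S = 64.0   # bidirectional, per the spec sheet
NVLINK4_GB_S = 900.0        # assumption: H100 SXM NVLink aggregate bandwidth


def ring_allreduce_seconds(message_bytes, n_gpus, link_gb_s):
    """Bandwidth-only ring all-reduce: each GPU moves 2*(n-1)/n of the message."""
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_bytes / (link_gb_s * 1e9)


# Hypothetical activation tensor: 32 tokens x hidden size 8192 x 2 bytes (BF16).
message = 32 * 8192 * 2

t_pcie = ring_allreduce_seconds(message, n_gpus=2, link_gb_s=PCIE_GEN4_X16_GB_S)
t_nvlink = ring_allreduce_seconds(message, n_gpus=2, link_gb_s=NVLINK4_GB_S)
slowdown = t_pcie / t_nvlink
print(f"PCIe {t_pcie * 1e6:.1f} us, NVLink {t_nvlink * 1e6:.2f} us, ratio {slowdown:.1f}x")
```

Real collectives on PCIe are usually worse than this ratio suggests, because small messages are latency-bound and traffic may transit host memory when peer-to-peer is unavailable.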

Interconnect and form factor: PCIe-only, dual-slot, 350 W passive

L40S exposes only PCIe Gen4 x16 — 64 GB/s bidirectional — to the host. There is no NVLink, no NVLink bridge, no SXM variant. Multi-card workloads (tensor-parallel, pipeline-parallel, all-reduce-heavy collectives) communicate over PCIe and through host memory. NCCL on PCIe-only L40S clusters falls back to ring or tree algorithms over PCIe peer-to-peer where the BIOS allows it; on hosts without P2P (most cloud instances), traffic transits host RAM with another order of magnitude of latency.

The form factor is a full-height full-length dual-slot PCIe card with passive cooling, designed for server airflow at 350 W. This fits cleanly in standard 2U/4U inference chassis from every major OEM (Dell PowerEdge R760xa/XE9680, HPE ProLiant DL380a Gen11, Supermicro AS-4125GS, Lenovo ThinkSystem SR675 V3); most reference designs slot 4 or 8 L40S cards per chassis. The 350 W envelope is meaningful: older inference cards (Turing T4 at 70 W, Ampere A10 at 150 W) often went into denser 1U servers, whereas L40S requires 2U or 4U airflow at minimum and cannot be deployed in fanless edge enclosures.

Card density is bounded by host PCIe lanes. Giving each of 8 cards a full Gen4 x16 link in a dual-CPU host requires PCIe switches (typically Broadcom PEX 88096); without switches, most dual-socket hosts cap at 4 cards with x16 lanes each. This matters for sizing: an 8x L40S server is invariably a PCIe-switch design with potential cross-switch hops between cards.

PCIe Gen4 x16: 64 GB/s bidirectional per card — the only inter-card path.
Form factor: FHFL dual-slot passive — server cooling required, no fanless or LP options.
Dense chassis target: 4-card (no switches needed) or 8-card (Broadcom PEX 88096 PCIe switch required).
Power: 1x EPS12V 16-pin (CEM5) connector per card on 2024+ chassis; legacy CEM4 8+8-pin variants exist on early production.
Cooling envelope: 350 W passive, requires sustained 400+ LFM airflow at <35 C inlet — verify in chassis qualification.

Sizing and capacity planning

Sizing tables we use internally to scope L40S inference fleets. All figures assume L40S in a server with PCIe Gen4 P2P enabled, vLLM 0.6+ with paged KV cache and prefix caching, FP8 weights where supported (calibration step required), and AWQ INT4 for memory-pressured workloads. Output tokens per second is per replica at moderate concurrency (16-32 sessions); compute-bound workloads (SDXL, transcoding) are listed in their own units.

Single-card ceiling: weights + KV cache + activations + cuBLAS scratch under ~46 GB; above that, OOMs even with paged KV.
70B BF16 needs about 140 GB for weights alone, which exceeds one L40S (48 GB) and also two (96 GB combined), and any wider split runs over PCIe and is slow. Use AWQ INT4 for single-card 70B inference, or move to H100/H200 if BF16 is required.
FP8 calibration step on L40S (TensorRT-LLM quantize.py or vLLM's nm-vllm toolchain) takes 30-90 minutes for 7B-13B models; budget the time in the deployment pipeline.
Long-context decode (32K+) regresses faster on L40S than on HBM cards — KV cache bandwidth is the binding constraint. Cap deployed context at 16K-32K and treat above-32K traffic as a separate H200/B200 tier.
Multi-tenant isolation without MIG: use vGPU (NVIDIA AI Enterprise licence) or container-level isolation with --gpus device=N and per-pod resource limits; neither is hardware-enforced.
Cost-per-million-output-tokens on Llama 3.1 8B FP8 at $1.00/GPU-hr and 4,000 TPS sustained: roughly $0.07 per million ($1.00 buys 14.4 million tokens per hour), the lowest cost per token among mainstream NVIDIA SKUs for sub-13B serving.
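The single-card ceiling above can be checked before deployment with a back-of-envelope memory estimate. The sketch below is an approximation, not a vLLM calculation: it counts only weights and a fully populated KV cache, ignores activations and scratch space, and uses the Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dimension 128) as its worked example. Function names and the session count are illustrative.

```python
SINGLE_CARD_CEILING_GB = 46.0  # usable budget quoted in the list above


def weights_gb(params_billion, bytes_per_param):
    """Weight footprint in GB: parameters x bytes per parameter."""
    return params_billion * bytes_per_param


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """K and V tensors for one token across all layers (BF16 cache by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, sessions):
    """Worst case: every session holds a full-length context."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim)
    return per_token * context_tokens * sessions / 1e9


def fits_single_card(total_gb, ceiling_gb=SINGLE_CARD_CEILING_GB):
    return total_gb <= ceiling_gb


# Llama 3.1 8B at FP8 (1 byte/param), 8K context, 16 concurrent sessions.
total_8b = weights_gb(8, 1) + kv_cache_gb(32, 8, 128, 8192, 16)

# 70B at BF16 (2 bytes/param): weights alone already exceed the ceiling.
total_70b_bf16 = weights_gb(70, 2)

print(round(total_8b, 1), fits_single_card(total_8b))
print(total_70b_bf16, fits_single_card(total_70b_bf16))
```

Paged KV allocates blocks on demand, so real usage is normally below this worst case; the estimate is most useful as a quick rejection test for configurations that cannot fit at all.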

Workload	Precision	Context	Cards per replica	Approx throughput	VRAM headroom
Llama 3.1 8B serving	FP8	8K	1x L40S	3,400-4,500 TPS	30 GB free
Llama 3.1 8B serving	BF16	8K	1x L40S	2,200-2,800 TPS	20 GB free
Mistral 7B / Qwen 7B serving	FP8	16K	1x L40S	3,200-4,100 TPS	28 GB free
Llama 3 13B serving	FP8	8K	1x L40S	1,800-2,400 TPS	15 GB free
Codestral 22B / Yi 34B serving	AWQ INT4	8K	1x L40S	900-1,300 TPS	20 GB free
Codestral 22B serving	FP8	8K	1x L40S	650-900 TPS	8 GB free
Llama 3 70B serving	AWQ INT4	4K	1x L40S	180-260 TPS	5 GB free
Llama 3 70B serving	BF16	4K	2x L40S (PCIe TP)	120-180 TPS	PCIe-limited
SDXL 1.0 (1024x1024, 25 steps)	BF16 + torch.compile	n/a	1x L40S	0.9-1.2 images/s	30 GB free
Flux.1-dev (1024x1024)	BF16	n/a	1x L40S	0.45-0.65 images/s	15 GB free
7B QLoRA fine-tune	NF4 base + BF16 LoRA	4K	1x L40S	~3,200 tokens/s training	12 GB free
13B QLoRA fine-tune	NF4 base + BF16 LoRA	4K	1x L40S	~1,700 tokens/s training	8 GB free
Whisper Large v3 batch ASR	BF16	30s clip	1x L40S	~38 RTF (real-time-factor)	40 GB free
Video transcoding (H.264 -> AV1 4K60)	n/a (NVENC)	n/a	1x L40S	5-6 streams concurrent	n/a
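A per-replica throughput range from the table converts into a fleet size with one division. The following sketch assumes a target utilisation of 60 % to leave headroom for bursts; that figure, the function name and the 20,000 TPS target are illustrative assumptions rather than values prescribed by the table.

```python
import math


def replicas_needed(target_tps, per_replica_tps, utilisation=0.6):
    """Replicas required to serve target_tps while running each at `utilisation`."""
    return math.ceil(target_tps / (per_replica_tps * utilisation))


# Llama 3.1 8B FP8 row: 3,400-4,500 TPS per L40S replica.
target = 20_000  # hypothetical aggregate output tokens per second
pessimistic = replicas_needed(target, 3_400)
optimistic = replicas_needed(target, 4_500)
print(pessimistic, optimistic)
```

Sizing to the low end of the range and then trimming after measurement is usually safer than the reverse, since each replica here is a whole card.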

Cost and TCO

L40S pricing settled in a tight band through 2025-2026 as supply caught up with demand. The card is genuinely the cheapest-per-token NVIDIA SKU for inference of models that fit in 48 GB at FP8 or AWQ INT4, and the on-prem capex story is equally favourable — a 4x L40S inference server lands at roughly $40,000-$55,000 capital plus power, versus $180,000-$240,000 for an equivalent 4x H100 PCIe server.

Cost-per-million-output-tokens on Llama 3.1 8B FP8 at $1.00/GPU-hr and 4,000 TPS: roughly $0.07 per million tokens — typically 30-50 % cheaper than H100 on the same workload.
Cost-per-million-output-tokens on Codestral 22B AWQ INT4 at $1.00/GPU-hr and 1,100 TPS: roughly $0.25 per million — competitive with H100 70B FP8 once H100 is amortised over higher-throughput sessions.
SDXL image generation: at 1.1 images/s and $1.00/GPU-hr, cost per image lands at roughly $0.00025 — the floor across NVIDIA's inference fleet.
3-year reservation typically cuts effective $/GPU-hr by 40-50 % versus on-demand; commit only when steady-state utilisation exceeds 60 %.
On-prem TCO break-even versus 3-year reserved cloud: roughly 18-24 months on 4x L40S inference servers at typical UK/EU power prices ($0.12-$0.20/kWh).
Egress: less of a factor than on H100 fleets because L40S workloads are typically lower-throughput per session; still budget 6-10 % of total bill at hyperscalers.
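The per-token and per-image figures in this list come from the same formula: hourly price divided by units produced per hour. The sketch below reproduces them and adds the break-even condition against a faster, more expensive card; the function names and the example ratios passed to the last function are illustrative.

```python
def usd_per_million_tokens(gpu_hour_usd, tokens_per_s):
    """Hourly price divided by tokens per hour, scaled to one million tokens."""
    return gpu_hour_usd / (tokens_per_s * 3600) * 1e6


def usd_per_image(gpu_hour_usd, images_per_s):
    """Hourly price divided by images per hour."""
    return gpu_hour_usd / (images_per_s * 3600)


def wins_on_cost_per_unit(price_ratio, throughput_ratio):
    """True when the cheaper card's price share is below its throughput share.

    Both ratios are cheaper-card / reference-card, e.g. L40S / H100.
    """
    return price_ratio < throughput_ratio


llama_8b = usd_per_million_tokens(1.00, 4_000)     # Llama 3.1 8B FP8
codestral = usd_per_million_tokens(1.00, 1_100)    # Codestral 22B AWQ INT4
sdxl = usd_per_image(1.00, 1.1)                    # SDXL 1.0

print(round(llama_8b, 3), round(codestral, 3), round(sdxl, 6))

# Illustrative: at half the price, 60 % of the throughput wins; 40 % does not.
print(wins_on_cost_per_unit(0.5, 0.6), wins_on_cost_per_unit(0.5, 0.4))
```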

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Spot
Hyperscaler (AWS g6e, Azure NCadsv5)	L40S	$1.05-$1.25	$0.75-$0.95	$0.60-$0.75	$0.40-$0.55
Tier-1 neocloud (CoreWeave, Lambda)	L40S	$0.90-$1.10	$0.65-$0.85	$0.50-$0.65	$0.35-$0.45
Tier-2 neocloud	L40S	$0.70-$0.95	$0.55-$0.75	$0.45-$0.60	$0.25-$0.40
On-prem 4x L40S server (capex amortised)	L40S	$0.55-$0.75 amortised	n/a	n/a	n/a
Yobitel NeoCloud (UK + EU)	L40S	$0.80-$1.00	$0.60-$0.80	$0.48-$0.62	n/a
Yobitel Omniscient Compute	L40S multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	n/a

Tip: L40S is the most over-recommended H100 alternative — and also the most genuinely correct one when the workload actually fits. The discipline is: verify FP8 calibration accuracy, verify that the model + KV cache fits in 48 GB at production context length, verify that the workload does not need NVLink. Three for three -> ship on L40S; any one miss -> H100/H200.

Software ecosystem

L40S is a first-class target across the modern AI inference stack. TensorRT-LLM treats sm_89 as a fully supported architecture with FP8 engine builds (--gemm_plugin fp8 --gpt_attention_plugin fp8 --workers 1), and the engine cache is portable across L40, L40S and RTX 6000 Ada at the same compute capability. vLLM supports FP8 weight quantisation on L40S via the nm-vllm toolchain (one-shot calibration) and AWQ/GPTQ INT4 via the standard --quantization awq or --quantization gptq paths. SGLang and TGI both ship L40S-tuned kernels for 7B-34B serving.
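As an illustration of the AWQ path mentioned above, the sketch below assembles a `vllm serve` command line. The model identifier is a hypothetical placeholder, and the context length and memory fraction are example values rather than recommendations; the flags themselves (`--quantization`, `--max-model-len`, `--gpu-memory-utilization`) are standard vLLM options, though defaults and accepted values vary between releases.

```python
import shlex


def vllm_serve_command(model, quantization=None, max_model_len=8192,
                       gpu_memory_utilization=0.90):
    """Build the argv list for a single-card `vllm serve` launch."""
    args = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if quantization:
        args += ["--quantization", quantization]
    return args


# "example-org/example-22b-awq" is a made-up model id used only for illustration.
cmd = vllm_serve_command("example-org/example-22b-awq", quantization="awq")
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` without shell quoting issues.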

Image generation has the deepest L40S story: diffusers with torch.compile reaches advertised throughput on SDXL/Flux without custom work, and the TensorRT 10 UNet engine path adds another 1.5-2x on top. NVIDIA Omniverse and Isaac Sim run natively, taking advantage of the third-generation RT cores. Video pipelines use the NVENC/NVDEC engines via NVIDIA's Video Codec SDK 12+ for AV1 encode/decode at 4K60.

Triton Inference Server hosts L40S without special configuration; the standard tensorrt_llm_backend and vllm_backend both work. NVIDIA AI Enterprise treats L40S as the recommended L-class platform for production inference, and the NVIDIA NIM (NVIDIA Inference Microservices) catalogue ships pre-built containers for popular open-weight models targeting L40S as the cost-optimised tier.

The notable absences: no Confidential Compute (Hopper-only — sovereign-attested workloads must target H100/H200), no Transformer Engine runtime amax tracking (FP8 calibration is build-time only, not adaptive), no MIG (multi-tenant isolation via vGPU licensing only), no FP4 (Blackwell-only).

Migration and alternatives

When L40S is the right choice and when it isn't. The two heuristics: (1) the workload fits in 48 GB at FP8 or INT4 and is compute-bound or moderately memory-bound -> L40S wins on cost; (2) the workload exceeds 48 GB, requires NVLink, or has tail-latency sensitivity to KV-cache bandwidth -> move up to H100/H200. Below the L40S tier, L4 is the right answer for low-power inference under 24 GB.

From / to	When it pays	Migration effort	Key incompatibility
L4 -> L40S	Need more memory (24 GB -> 48 GB) or Tensor throughput	Low (same sm_89 software)	350 W TDP — chassis cooling redesign
L40 -> L40S	AI inference rather than visualisation primary load	Trivial (same drivers, same kernels)	None — clock binning only
A100 -> L40S	Sub-30B inference, NVLink not needed, $/token priority	Medium (BF16 -> FP8 calibration step)	No MIG; GDDR6 long-context tail latency
A10G -> L40S	Need FP8 or 48 GB memory	Low (sm_86 -> sm_89 kernel recompile)	350 W vs 150 W chassis envelope
L40S -> H100	Workload exceeds 48 GB or needs NVLink	Medium (FP8 calibration step on Hopper TE)	Cost roughly doubles per GPU-hour
L40S -> H200	Long-context (32K+) or 70B BF16 in-place	Medium (same software as L40S -> H100)	HBM3e cost premium
L40S -> RTX 6000 Ada	Workstation single-card use case (not server)	Trivial (same silicon)	Active cooling — not server-friendly
L40S -> MI300X	Want HBM3 capacity without H100 price	High (CUDA -> ROCm rewrite)	CUDA kernels not portable; vLLM ROCm path lags
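The two heuristics that open this section, plus the L4 rule for the tier below, can be written down as a small placement function. This is a sketch of the decision logic only; the function name, parameter names and return labels are hypothetical and do not correspond to any scheduler API.

```python
def pick_tier(footprint_gb, needs_nvlink=False, kv_bandwidth_sensitive=False,
              low_power=False):
    """Map a workload onto a GPU tier using the heuristics in this section.

    footprint_gb is weights plus KV cache at the deployed precision and
    context length.
    """
    # Heuristic 2: exceeds 48 GB, needs NVLink, or tail latency depends on
    # KV-cache bandwidth -> move up.
    if footprint_gb > 48 or needs_nvlink or kv_bandwidth_sensitive:
        return "H100/H200"
    # Below the L40S tier: low-power inference under 24 GB.
    if low_power and footprint_gb < 24:
        return "L4"
    # Heuristic 1: fits in 48 GB and is compute-bound or moderately memory-bound.
    return "L40S"


print(pick_tier(26))                               # L40S
print(pick_tier(26, kv_bandwidth_sensitive=True))  # H100/H200
print(pick_tier(12, low_power=True))               # L4
```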

Pitfalls and operational notes

GDDR6 is the binding constraint on long-context inference — 864 GB/s vs A100's 2.0 TB/s vs H100's 3.35 TB/s. Benchmark KV-cache-heavy decodes (32K+ context) on L40S explicitly before promising production SLAs.
No NVLink: TP=2 inference over PCIe typically loses 35-55 % throughput versus the same model on a single H100. If the model does not fit in 48 GB at FP8/INT4, plan to move up rather than splitting across L40S cards.
No MIG — multi-tenant isolation requires vGPU licensing (NVIDIA AI Enterprise) or container-level scheduling. Hardware isolation is not available.
FP8 calibration is build-time only — no Transformer Engine runtime amax on Ada. Recalibrate after any model update or significant change in prompt distribution.
350 W passive in a dual-slot card needs ~400 LFM sustained airflow at <35 C inlet; some chassis (especially older 2U designs) throttle L40S under sustained load. Verify in chassis qualification, not on the datasheet.
PCIe Gen4 (64 GB/s) is half of Gen5 — on hosts with Gen5 NICs (CX-7 NDR 400 Gb/s), the L40S PCIe link becomes the bottleneck for streaming inference inputs at multi-replica scale.
CEM5 16-pin power connector on 2024+ production — earlier cards use CEM4 8+8-pin adapters. Inventory mixing causes power-rail surprises; tag both in your asset DB.
No Confidential Compute — sovereign-attested deployments requiring SPDM attestation must target H100/H200, not L40S.
Driver R570 is the 2026 recommended baseline; R535/R545 builds are missing important NCCL-over-PCIe and DCGM fixes for L40S clusters.
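The driver baseline in the last item is easy to audit across a fleet. The sketch below parses the CSV output of `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` and flags cards below the R570 branch. The helper names are hypothetical, and the sample text is made up for illustration rather than captured from a real host.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_lines(text):
    """Turn 'name, driver_version' CSV lines into (name, driver) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, driver = [field.strip() for field in line.split(",")]
        rows.append((name, driver))
    return rows


def driver_meets_baseline(driver_version, baseline_major=570):
    """Compare the driver branch (major version) with the recommended baseline."""
    return int(driver_version.split(".")[0]) >= baseline_major


def check_host():
    """Run nvidia-smi on the local host; requires an NVIDIA driver to be installed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [(n, d, driver_meets_baseline(d)) for n, d in parse_gpu_lines(out)]


# Illustrative sample output, not captured from a real machine.
sample = "NVIDIA L40S, 535.129.03\nNVIDIA L40S, 570.86.15\n"
report = [(n, d, driver_meets_baseline(d)) for n, d in parse_gpu_lines(sample)]
print(report)
```

`check_host()` is defined but not called here so that the example runs on machines without a GPU; on a real node it returns the same structure as `report`.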

Where this fits in the Yobitel stack

L40S is the cost-optimised inference SKU across the Yobitel stack in 2026. Yobibyte — our AI-native platform — schedules sub-34B serving workloads onto L40S pools by default, with FP8 calibration baked into the model-onboarding pipeline and automatic fallback to H100/H200 for workloads that exceed L40S's 48 GB / 864 GB/s envelope. The platform's placement layer tags every L40S replica with the chassis it landed on (4-card no-switch, 8-card PEX 88096) so multi-card pipeline-parallel jobs can prefer same-switch placements.

Omniscient Compute — our cross-cloud capacity broker — indexes L40S across AWS g6e, Azure NCads-v5 variants, CoreWeave, Lambda, Crusoe, Civo, and a long tail of regional neoclouds, and normalises pricing onto the FinOps Foundation FOCUS spec. Because L40S supply expanded faster than H100 through 2025, the broker frequently surfaces L40S as the cost-leading SKU for 7B-13B serving — sometimes 50-60 % below H100 at parity throughput on workloads that fit.

InferenceBench — our public, reproducible benchmarking harness — publishes L40S throughput, latency and cost-per-token numbers for every major open-weight model under 34B on vLLM, TensorRT-LLM, SGLang and TGI, including SDXL/Flux image generation throughput and Whisper batch ASR figures. The L40S sizing tables in this entry are anchored on InferenceBench runs; if you are sizing a 2026 L40S footprint, start with InferenceBench, lift the platform configuration into a Yobibyte manifest, and let Omniscient Compute pick the region.

References

NVIDIA L40S Datasheet · NVIDIA
Ada Lovelace Architecture Whitepaper · NVIDIA
TensorRT-LLM Ada FP8 engine builds · NVIDIA
vLLM quantisation (FP8, AWQ, GPTQ) · vLLM
NVIDIA Video Codec SDK (NVENC/NVDEC) · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation

