TL;DR
- Ada Lovelace data centre GPU (AD102, TSMC 4N, 76 billion transistors) launched at SIGGRAPH August 2023 as the AI-rebalanced sibling of L40 and the workhorse of the 'inference + media' tier — the default cloud SKU below H100 for 7B-13B serving, SDXL image generation, video transcoding and small-model fine-tuning through 2026.
- 48 GB GDDR6 ECC at 864 GB/s, with no HBM and no NVLink. Fourth-generation Tensor Cores deliver 733 TFLOPS BF16/FP16 sparse (366 TFLOPS dense) and 1,466 TFLOPS FP8 sparse (733 TFLOPS dense); third-generation RT cores deliver 212 TFLOPS for ray tracing; three NVENC and three NVDEC engines handle AV1/H.264/H.265 transcoding.
- 350 W TDP in a dual-slot FHFL PCIe Gen4 card — fits standard 2U/4U inference servers without exotic cooling. No NVLink means tensor parallelism cannot scale across cards over a high-bandwidth link; replicas above 48 GB are single-card or pipeline-parallel over PCIe.
- Sweet spot: 7B-13B BF16 serving, 34B AWQ INT4 inference, SDXL/Flux image generation pipelines, batch transcoding workloads. Not the right card for 70B+ models, 128K+ contexts, or any workload that depends on multi-card NVLink collectives.
- Pricing in 2026: on-demand $0.95-$1.20 / GPU-hr at hyperscalers, $0.70-$0.90 1-year reserved, $0.55-$0.70 3-year, $0.35-$0.45 spot. Roughly 40-55 % cheaper than H100 on $/GPU-hr while delivering 35-50 % of H100 throughput on compute-bound workloads — usually wins on $/token for sub-30B inference.
Overview
The L40S is the Ada Lovelace data centre GPU built specifically for AI inference and accelerated graphics. Announced at SIGGRAPH 2023 as an AI-rebalanced refresh of the visualisation-focused L40, it shares the AD102 die with the RTX 6000 Ada and L40 but ships with higher sustained Tensor Core clocks, AI-tuned driver paths and an FP8-first software story. Where Hopper went to HBM and NVLink to chase frontier training, Ada/L40S stayed on GDDR6 and PCIe to chase inference economics — the trade-off that defines its position in the rack.
Through 2024-2026, L40S has become the default 'cheap-but-capable' inference SKU on the major hyperscalers (AWS g6e; on Azure, NCadsH100v5 carries H100 while L40S sits in other NCads variants), on Lambda, CoreWeave and Crusoe, on most Tier-1 neoclouds, and in the on-prem AI-factory reference designs from Dell, Supermicro, HPE and Lenovo. The reason is simple: at $0.95-$1.20 / GPU-hour on-demand, with 48 GB of VRAM and first-class FP8 support, it serves 7B-13B BF16 traffic at roughly 40-60 % of H100 throughput while costing 40-55 % less. For workloads that fit, L40S wins on $/token by a clear margin.
This entry is the 2026 reference for teams sizing L40S fleets: the AD102 silicon, the full per-SKU spec sheet, where L40S sits relative to L4/L40 and to H100/H200, the workloads it dominates and the workloads it fails on, current cost ranges in USD, and the migration matrix in and out. Yobitel NeoCloud offers L40S capacity broadly across UK and EU regions with NCSC OFFICIAL alignment, and L40S is the default landing zone Yobibyte schedules sub-34B serving workloads onto with FP8 calibration baked into the model-onboarding pipeline. This entry helps you decide when L40S is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.
How it works: AD102 and the L40S binning
AD102 is the largest Ada Lovelace die — 76 billion transistors on TSMC's custom 4N process, 144 SMs (142 active on L40S), 18,176 CUDA cores, 568 fourth-generation Tensor Cores, and 96 MB L2 cache. The same silicon ships in three personalities: the RTX 6000 Ada (workstation, 48 GB), L40 (visualisation, 48 GB, lower clocks), and L40S (AI inference, 48 GB, higher sustained clocks). L40S is the AI binning — same die, same memory, but boost-clocked for sustained tensor throughput where L40 was tuned for ray-tracing and viewport latency.
The fourth-generation Tensor Core is the same generation as Hopper's, including native FP8 (E4M3 forward, E5M2 gradient) support at twice the FP16 throughput. What L40S does not have is the Hopper Transformer Engine's runtime amax tracking — FP8 calibration on Ada is a build-time step in TensorRT-LLM or a one-shot calibration pass in vLLM, not a per-layer runtime decision. Calibrated correctly, FP8 inference on L40S reaches 1,466 TFLOPS sparse / 733 TFLOPS dense, putting it within a factor of 2.5-3x of H100 FP8 throughput on compute-bound workloads.
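To make the build-time versus runtime distinction concrete, the following minimal sketch fixes a per-tensor scale from calibration data and then measures how many values saturate once the activation distribution drifts. The sample values and function names are hypothetical, and the sketch only scales and clamps to the E4M3 range (maximum finite value 448); it does not model mantissa rounding or any TensorRT-LLM or vLLM internals.

```python
# Illustrative sketch only: static (build-time) FP8 scaling, not a real quantiser.
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def calibrate_scale(calibration_batches):
    """Return a fixed per-tensor scale from the largest |x| seen during calibration."""
    amax = max(abs(x) for batch in calibration_batches for x in batch)
    return amax / E4M3_MAX


def to_fp8_range(values, scale):
    """Scale values into the E4M3 range and clamp; mantissa rounding is omitted."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]


def clipped_fraction(values, scale):
    """Fraction of values that saturate under the fixed scale."""
    return sum(abs(v) > scale * E4M3_MAX for v in values) / len(values)


calib = [[0.5, -1.2, 3.0], [2.5, -0.1, 1.9]]  # hypothetical calibration activations
scale = calibrate_scale(calib)                 # fixed once, at build time
drifted = [0.4, 6.0, -7.5, 1.0]                # hypothetical post-deployment activations
print(f"scale={scale:.6f}, clipped={clipped_fraction(drifted, scale):.0%}")
```

A runtime amax scheme would widen the scale as the drifted values arrive; with a fixed scale the out-of-range values simply saturate, which is why recalibration after a distribution shift matters on Ada.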
The single architectural choice that defines L40S is GDDR6 over HBM. 48 GB at 864 GB/s is genuinely a lot of memory (20 % more than an A100 40 GB and 60 % of an A100 80 GB), but the bandwidth ceiling is roughly 43 % of A100 HBM2e (2.0 TB/s), 26 % of H100 HBM3 (3.35 TB/s), and 18 % of H200 HBM3e (4.8 TB/s). For decode-bound LLM inference (long contexts, large KV caches), this is the binding constraint. For compute-bound workloads (image generation denoising, prefill, batched 7B BF16), the gap closes substantially. Pick L40S where compute dominates; pick HBM-class cards where memory bandwidth dominates.
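The bandwidth ratios above, and what they imply for decode, can be checked with a rough roofline sketch. It assumes that single-stream decode reads every weight once per generated token and ignores KV-cache reads, kernel overheads and batching, so the result is an upper bound per sequence rather than a benchmark. The function name and the 8B FP8 example are illustrative choices.

```python
# Rough roofline sketch: per-sequence decode rate is bounded by
# memory bandwidth divided by the bytes of weights read per token.
BANDWIDTH_GB_S = {  # figures quoted in the text above
    "L40S GDDR6": 864,
    "A100 HBM2e": 2000,
    "H100 HBM3": 3350,
    "H200 HBM3e": 4800,
}


def decode_ceiling_tokens_per_s(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on per-sequence decode rate, ignoring KV-cache reads and overheads."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


for name, bw in BANDWIDTH_GB_S.items():
    share = BANDWIDTH_GB_S["L40S GDDR6"] / bw
    ceiling = decode_ceiling_tokens_per_s(8, 1.0, bw)  # 8B model, FP8 weights
    print(f"{name}: L40S has {share:.0%} of this bandwidth; "
          f"8B FP8 ceiling ~{ceiling:.0f} tok/s per sequence")
```

Batching amortises each weight read across many sequences, which is why aggregate serving throughput can sit far above this per-sequence ceiling while long-context decode, dominated by KV-cache reads, cannot.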
The other architectural choices follow the inference brief. No NVLink — multi-card collectives go over PCIe Gen4 (64 GB/s) only, which makes tensor parallelism across cards roughly 8-12x slower than on NVLink and limits L40S to single-card replicas or pipeline-parallel splits. No MIG — multi-tenant isolation relies on vGPU licensing (NVIDIA AI Enterprise) or container-level scheduling, neither of which is hardware-isolated. Third-generation RT cores at 212 TFLOPS for ray tracing remain useful for 3D rendering and Omniverse workloads, but most AI deployments leave them idle.
- AD102 die: 142 active SMs on L40S (144 physical, harvested), 568 fourth-generation Tensor Cores, 96 MB L2 cache, 128 KB L1/SMEM per SM.
- Compute capability sm_89 (Ada Lovelace), distinct from sm_80 (Ampere) and sm_90 (Hopper); kernels compiled for sm_89 run on L4, L40, L40S, RTX 6000 Ada and the RTX 40-series consumer cards.
- Memory: 24 GDDR6 ECC chips, 384-bit bus, 18 Gbps per pin, 864 GB/s aggregate. No HBM. No non-ECC variant: all L40S cards ship with ECC enabled.
- Third-generation RT cores: 212 TFLOPS for BVH traversal — relevant for Omniverse, ray-traced denoising in image pipelines.
- NVENC/NVDEC: 3 NVENC + 3 NVDEC engines per card (Ada generation: eighth-generation NVENC, fifth-generation NVDEC); AV1 encode/decode, H.265 10-bit; sustains roughly 6x simultaneous 4K60 streams for transcoding.
- No FP4 (Blackwell-only), no Transformer Engine runtime amax (Hopper-only), no NVLink (PCIe-only), no MIG (Hopper/Ampere-only), no Confidential Compute (Hopper-only).
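The headline counts in the list above follow from per-SM and per-pin figures, which a short check makes explicit. The per-SM constants (128 CUDA cores and 4 Tensor Cores per Ada SM) are not stated in this entry and are supplied here as known Ada Lovelace layout values.

```python
# Sanity-check the headline L40S figures from per-SM and per-pin numbers.
ACTIVE_SMS = 142
CUDA_CORES_PER_SM = 128    # Ada Lovelace SM layout (not stated in the text above)
TENSOR_CORES_PER_SM = 4    # Ada Lovelace SM layout (not stated in the text above)
BUS_WIDTH_BITS = 384
GBPS_PER_PIN = 18

cuda_cores = ACTIVE_SMS * CUDA_CORES_PER_SM
tensor_cores = ACTIVE_SMS * TENSOR_CORES_PER_SM
bandwidth_gb_s = BUS_WIDTH_BITS * GBPS_PER_PIN / 8  # bits per second -> bytes per second

print(cuda_cores, tensor_cores, bandwidth_gb_s)
```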
Reference: full specification sheet
Authoritative per-card figures. Sparse Tensor figures assume 2:4 structured sparsity; dense throughput is half the sparse figure. L40S ships in a single SKU — there is no SXM L40S, no 24 GB variant, no NVLink bridge. The closely related cards (L40, L4, RTX 6000 Ada) are shown alongside for sizing context.
| Metric | L40S | L40 (compare) | L4 (compare) | RTX 6000 Ada (compare) |
|---|---|---|---|---|
| Architecture | Ada Lovelace AD102 | Ada Lovelace AD102 | Ada Lovelace AD104 | Ada Lovelace AD102 |
| Process | TSMC 4N | TSMC 4N | TSMC 4N | TSMC 4N |
| Transistors | 76 billion | 76 billion | 35.8 billion | 76 billion |
| Active SMs | 142 | 142 | 60 | 142 |
| CUDA cores | 18,176 | 18,176 | 7,680 | 18,176 |
| Tensor cores | 568 | 568 | 240 | 568 |
| RT cores | 142 (gen 3) | 142 (gen 3) | 60 (gen 3) | 142 (gen 3) |
| Compute capability | sm_89 | sm_89 | sm_89 | sm_89 |
| FP32 | 91.6 TFLOPS | 90.5 TFLOPS | 30.3 TFLOPS | 91.1 TFLOPS |
| TF32 (Tensor, sparse) | 366 TFLOPS | 181 TFLOPS | 120 TFLOPS | 364 TFLOPS |
| BF16 / FP16 (Tensor, sparse) | 733 TFLOPS | 362 TFLOPS | 242 TFLOPS | 728 TFLOPS |
| BF16 / FP16 (Tensor, dense) | 366 TFLOPS | 181 TFLOPS | 121 TFLOPS | 364 TFLOPS |
| FP8 (Tensor, sparse) | 1,466 TFLOPS | 724 TFLOPS | 485 TFLOPS | 1,457 TFLOPS |
| INT8 (Tensor, sparse) | 1,466 TOPS | 724 TOPS | 485 TOPS | 1,457 TOPS |
| RT-core throughput | 212 TFLOPS | 210 TFLOPS | 73 TFLOPS | 210 TFLOPS |
| Memory | 48 GB GDDR6 ECC | 48 GB GDDR6 ECC | 24 GB GDDR6 | 48 GB GDDR6 ECC |
| Memory bandwidth | 864 GB/s | 864 GB/s | 300 GB/s | 960 GB/s |
| L2 cache | 96 MB | 96 MB | 48 MB | 96 MB |
| NVENC / NVDEC | 3 / 3 | 3 / 3 | 2 / 4 | 3 / 3 |
| NVLink | Not supported | Not supported | Not supported | Not supported |
| MIG | Not supported | Not supported | Not supported | Not supported |
| PCIe | Gen4 x16 (64 GB/s) | Gen4 x16 (64 GB/s) | Gen4 x16 (64 GB/s) | Gen4 x16 (64 GB/s) |
| TDP | 350 W | 300 W | 72 W | 300 W |
| Form factor | FHFL dual-slot | FHFL dual-slot | LP single-slot | FHFL dual-slot |
| Cooling | Passive (server) | Passive (server) | Passive (server) | Active (workstation) |
| Confidential Compute | Not supported | Not supported | Not supported | Not supported |
| Minimum driver | R535 | R525 | R525 | R525 |
| Recommended driver (2026) | R570 stable | R570 | R570 | R570 |
| Minimum CUDA | 12.0 | 11.8 | 11.8 | 11.8 |
L40S is PCIe-only with no NVLink bridge. Multi-card workloads (tensor-parallel above 48 GB) communicate over PCIe Gen4 at 64 GB/s, roughly 14x less bandwidth than NVLink 4.0 at 900 GB/s, and tensor-parallel inference at TP=2 typically loses 35-55 % throughput versus the same model on a single H100. If a workload needs more than 48 GB, the choice is usually 'move to H100/H200' rather than 'use 2x L40S'.
Interconnect and form factor: PCIe-only, dual-slot, 350 W passive
L40S exposes only PCIe Gen4 x16 — 64 GB/s bidirectional — to the host. There is no NVLink, no NVLink bridge, no SXM variant. Multi-card workloads (tensor-parallel, pipeline-parallel, all-reduce-heavy collectives) communicate over PCIe and through host memory. NCCL on PCIe-only L40S clusters falls back to ring or tree algorithms over PCIe peer-to-peer where the BIOS allows it; on hosts without P2P (most cloud instances), traffic transits host RAM with another order of magnitude of latency.
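As a rough illustration of what PCIe-only collectives cost, the sketch below applies the standard bandwidth-only model for a ring all-reduce, in which each GPU sends 2(N-1)/N times the message size. It assumes per-direction link rates of 32 GB/s for PCIe Gen4 x16 (half of the 64 GB/s bidirectional figure) and 450 GB/s for NVLink 4.0 on H100 SXM; the 64 MiB message size is hypothetical, and latency, host-memory staging and switch hops are ignored.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_gb_per_s):
    """Bandwidth-only lower bound for a ring all-reduce; latency is ignored."""
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes  # bytes sent per GPU
    return traffic / (link_gb_per_s * 1e9)


PCIE_GEN4_X16 = 32  # GB/s per direction (64 GB/s bidirectional)
NVLINK4 = 450       # GB/s per direction (900 GB/s bidirectional, H100 SXM)

msg = 64 * 1024 * 1024  # hypothetical 64 MiB all-reduce payload
pcie = ring_allreduce_seconds(msg, 2, PCIE_GEN4_X16)
nvl = ring_allreduce_seconds(msg, 2, NVLINK4)
print(f"PCIe: {pcie * 1e3:.2f} ms, NVLink: {nvl * 1e3:.2f} ms, ratio {pcie / nvl:.1f}x")
```

On hosts without peer-to-peer access the real gap is wider than this model suggests, because traffic is staged through host RAM.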
The form factor is a full-height full-length dual-slot PCIe card with passive cooling, designed for server airflow at 350 W. This fits cleanly in standard 2U/4U inference chassis from every major OEM (Dell PowerEdge R760xa/XE9680, HPE ProLiant DL380a Gen11, Supermicro AS-4125GS, Lenovo ThinkSystem SR675 V3); most reference designs slot 4 or 8 L40S cards per chassis. The 350 W envelope is meaningful: older inference cards (Turing T4 at 70 W, Ampere A10 at 150 W) often went into denser 1U servers, whereas L40S requires 2U or 4U airflow at minimum and cannot be deployed in fanless edge enclosures.
PCIe bifurcation is supported, but it splits existing lanes rather than adding new ones. Eight cards per dual-CPU host with PCIe Gen4 x16 to each card requires PCIe switches (typically Broadcom PEX 88096); without switches, most dual-socket hosts cap at 4 cards with x16 lanes each. This matters for sizing: an 8x L40S server is invariably a PCIe-switch design with potential cross-switch hops between cards.
- PCIe Gen4 x16: 64 GB/s bidirectional per card — the only inter-card path.
- Form factor: FHFL dual-slot passive — server cooling required, no fanless or LP options.
- Dense chassis target: 4-card (no switches needed) or 8-card (Broadcom PEX 88096 PCIe switch required).
- Power: 1x EPS12V 16-pin (CEM5) connector per card on 2024+ chassis; legacy CEM4 8+8-pin variants exist on early production.
- Cooling envelope: 350 W passive, requires sustained 400+ LFM airflow at <35 C inlet — verify in chassis qualification.
Sizing and capacity planning
Sizing tables we use internally to scope L40S inference fleets. All figures assume L40S in a server with PCIe Gen4 P2P enabled, vLLM 0.6+ with paged KV cache and prefix caching, FP8 weights where supported (calibration step required), and AWQ INT4 for memory-pressured workloads. Output tokens per second is per replica at moderate concurrency (16-32 sessions); compute-bound workloads (SDXL, transcoding) are listed in their own units.
- Single-card ceiling: weights + KV cache + activations + cuBLAS scratch under ~46 GB; above that, OOMs even with paged KV.
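- 70B BF16 is infeasible on a single L40S (140 GB of weights) and does not fit on 2x L40S either (96 GB combined); splitting 70B across two cards over PCIe requires FP8 weights (about 70 GB) and is slow. Use AWQ INT4 for single-card 70B inference, or move to H100/H200 if BF16 is required.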
- FP8 calibration step on L40S (TensorRT-LLM `quantize.py` or vLLM's `nm-vllm` toolchain) takes 30-90 minutes for 7B-13B models; budget the time in the deployment pipeline.
- Long-context decode (32K+) regresses faster on L40S than on HBM cards — KV cache bandwidth is the binding constraint. Cap deployed context at 16K-32K and treat above-32K traffic as a separate H200/B200 tier.
- Multi-tenant isolation without MIG: use vGPU (NVIDIA AI Enterprise licence) or container-level isolation with `--gpus device=N` and per-pod resource limits; neither is hardware-enforced.
- Cost-per-million-output-tokens on Llama 3.1 8B FP8 at $1.00/GPU-hr and 4,000 TPS sustained: roughly $0.07 per million, the lowest cost among mainstream NVIDIA SKUs for sub-13B serving.
| Workload | Precision | Context | Cards per replica | Approx throughput | VRAM headroom |
|---|---|---|---|---|---|
| Llama 3.1 8B serving | FP8 | 8K | 1x L40S | 3,400-4,500 TPS | 30 GB free |
| Llama 3.1 8B serving | BF16 | 8K | 1x L40S | 2,200-2,800 TPS | 20 GB free |
| Mistral 7B / Qwen 7B serving | FP8 | 16K | 1x L40S | 3,200-4,100 TPS | 28 GB free |
| Llama 3 13B serving | FP8 | 8K | 1x L40S | 1,800-2,400 TPS | 15 GB free |
| Codestral 22B / Yi 34B serving | AWQ INT4 | 8K | 1x L40S | 900-1,300 TPS | 20 GB free |
| Codestral 22B serving | FP8 | 8K | 1x L40S | 650-900 TPS | 8 GB free |
| Llama 3 70B serving | AWQ INT4 | 4K | 1x L40S | 180-260 TPS | 5 GB free |
| Llama 3 70B serving | FP8 | 4K | 2x L40S (PCIe TP) | 120-180 TPS | PCIe-limited |
| SDXL 1.0 (1024x1024, 25 steps) | BF16 + torch.compile | n/a | 1x L40S | 0.9-1.2 images/s | 30 GB free |
| Flux.1-dev (1024x1024) | BF16 | n/a | 1x L40S | 0.45-0.65 images/s | 15 GB free |
| 7B QLoRA fine-tune | NF4 base + BF16 LoRA | 4K | 1x L40S | ~3,200 tokens/s training | 12 GB free |
| 13B QLoRA fine-tune | NF4 base + BF16 LoRA | 4K | 1x L40S | ~1,700 tokens/s training | 8 GB free |
| Whisper Large v3 batch ASR | BF16 | 30s clip | 1x L40S | ~38x real time | 40 GB free |
| Video transcoding (H.264 -> AV1 4K60) | n/a (NVENC) | n/a | 1x L40S | 5-6 streams concurrent | n/a |
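The single-card ceiling in the notes above can be turned into a quick fit check before any benchmarking. The sketch below is a minimal estimate under stated assumptions: a BF16 KV cache, every session holding its full context (a worst case, since paged KV allocates on demand), and a hypothetical 3 GB allowance for activations and cuBLAS scratch. The Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) is the published model configuration; the function names are illustrative.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, sessions, bytes_per_elem=2):
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens * sessions / 1e9


def fits_single_l40s(weights_gb, kv_gb, overhead_gb=3.0, ceiling_gb=46.0):
    """True if weights + KV cache + an assumed scratch allowance stay under the ceiling."""
    return weights_gb + kv_gb + overhead_gb <= ceiling_gb


# Llama 3.1 8B with FP8 weights (~8 GB), 16 sessions at 8K context.
kv_8b = kv_cache_gb(32, 8, 128, context_tokens=8192, sessions=16)
print(f"8B KV cache: {kv_8b:.1f} GB, fits: {fits_single_l40s(8.0, kv_8b)}")

# Llama 3 70B in BF16: ~140 GB of weights cannot fit regardless of KV cache.
print(f"70B BF16 fits: {fits_single_l40s(140.0, 0.0)}")
```

The estimate is deliberately pessimistic on KV cache and optimistic on overhead, so treat a marginal pass as a prompt to benchmark rather than as a guarantee.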
Cost and TCO
L40S pricing settled in a tight band through 2025-2026 as supply caught up with demand. The card is genuinely the cheapest-per-token NVIDIA SKU for inference of models that fit in 48 GB at FP8 or AWQ INT4, and the on-prem capex story is equally favourable — a 4x L40S inference server lands at roughly $40,000-$55,000 capital plus power, versus $180,000-$240,000 for an equivalent 4x H100 PCIe server.
- Cost-per-million-output-tokens on Llama 3.1 8B FP8 at $1.00/GPU-hr and 4,000 TPS: roughly $0.07 per million tokens — typically 30-50 % cheaper than H100 on the same workload.
- Cost-per-million-output-tokens on Codestral 22B AWQ INT4 at $1.00/GPU-hr and 1,100 TPS: roughly $0.25 per million — competitive with H100 70B FP8 once H100 is amortised over higher-throughput sessions.
- SDXL image generation: at 1.1 images/s and $1.00/GPU-hr, cost per image lands at roughly $0.00025 — the floor across NVIDIA's inference fleet.
- 3-year reservation typically cuts effective $/GPU-hr by 40-50 % versus on-demand; commit only when steady-state utilisation exceeds 60 %.
- On-prem TCO break-even versus 3-year reserved cloud: roughly 18-24 months on 4x L40S inference servers at typical UK/EU power prices ($0.12-$0.20/kWh).
- Egress: less of a factor than on H100 fleets because L40S workloads are typically lower-throughput per session; still budget 6-10 % of total bill at hyperscalers.
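The per-token, per-image and reservation figures in the list above are simple arithmetic, reproduced in the sketch below so they can be rerun with your own prices and throughputs. The function names are illustrative, and the reservation model assumes a reserved GPU is billed for every hour whether or not it is used.

```python
def usd_per_million_tokens(usd_per_gpu_hr, tokens_per_s):
    """Cost of one million output tokens at a sustained throughput."""
    tokens_per_hr = tokens_per_s * 3600
    return usd_per_gpu_hr / tokens_per_hr * 1e6


def usd_per_image(usd_per_gpu_hr, images_per_s):
    """Cost of one generated image at a sustained rate."""
    return usd_per_gpu_hr / (images_per_s * 3600)


def breakeven_utilisation(reserved_usd_hr, on_demand_usd_hr):
    """Utilisation above which an always-billed reservation beats on-demand."""
    return reserved_usd_hr / on_demand_usd_hr


print(f"8B FP8:   ${usd_per_million_tokens(1.00, 4000):.3f} per million tokens")
print(f"22B INT4: ${usd_per_million_tokens(1.00, 1100):.3f} per million tokens")
print(f"SDXL:     ${usd_per_image(1.00, 1.1):.5f} per image")
print(f"Break-even at a 40 % discount: {breakeven_utilisation(0.60, 1.00):.0%}")
```

The last line shows where the 60 % utilisation rule comes from: a reservation priced at 60 % of on-demand only pays off once the card is busy more than 60 % of the time.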
| Provider class | SKU | On-demand $/GPU-hr | 1y reserved | 3y reserved | Spot |
|---|---|---|---|---|---|
| Hyperscaler (AWS g6e, Azure NCadsv5) | L40S | $1.05-$1.25 | $0.75-$0.95 | $0.60-$0.75 | $0.40-$0.55 |
| Tier-1 neocloud (CoreWeave, Lambda) | L40S | $0.90-$1.10 | $0.65-$0.85 | $0.50-$0.65 | $0.35-$0.45 |
| Tier-2 neocloud | L40S | $0.70-$0.95 | $0.55-$0.75 | $0.45-$0.60 | $0.25-$0.40 |
| On-prem 4x L40S server (capex amortised) | L40S | $0.55-$0.75 amortised | n/a | n/a | n/a |
| Yobitel NeoCloud (UK + EU) | L40S | $0.80-$1.00 | $0.60-$0.80 | $0.48-$0.62 | n/a |
| Yobitel Omniscient Compute | L40S multi-cloud | Market-clearing | Commit-discounted | Commit-discounted | n/a |
L40S is the most over-recommended H100 alternative — and also the most genuinely correct one when the workload actually fits. The discipline is: verify FP8 calibration accuracy, verify that the model + KV cache fits in 48 GB at production context length, verify that the workload does not need NVLink. Three for three -> ship on L40S; any one miss -> H100/H200.
Software ecosystem
L40S is a first-class target across the modern AI inference stack. TensorRT-LLM treats sm_89 as a fully supported architecture with FP8 engine builds (`--gemm_plugin fp8 --gpt_attention_plugin fp8 --workers 1`), and the engine cache is portable across L40, L40S and RTX 6000 Ada at the same compute capability. vLLM supports FP8 weight quantisation on L40S via the `nm-vllm` toolchain (one-shot calibration) and AWQ/GPTQ INT4 via the standard `--quantization awq` or `--quantization gptq` paths. SGLang and TGI both ship L40S-tuned kernels for 7B-34B serving.
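For the vLLM quantisation paths mentioned above, a launch command can be assembled programmatically so that context length and memory limits stay in one place. The sketch below only builds the argument list and executes nothing; the model name is a placeholder, the default values are assumptions to tune per workload, and the helper function is hypothetical. The flags themselves are standard `vllm serve` options.

```python
import shlex


def vllm_serve_command(model, quantization=None, max_model_len=8192,
                       gpu_memory_utilization=0.90, prefix_caching=True):
    """Build a `vllm serve` command line as an argument list (nothing is executed)."""
    cmd = ["vllm", "serve", model,
           "--max-model-len", str(max_model_len),
           "--gpu-memory-utilization", str(gpu_memory_utilization)]
    if quantization:  # e.g. "awq", "gptq" or "fp8"
        cmd += ["--quantization", quantization]
    if prefix_caching:
        cmd.append("--enable-prefix-caching")
    return cmd


# Placeholder model name; substitute a real AWQ checkpoint.
print(shlex.join(vllm_serve_command("your-org/your-34b-awq", quantization="awq")))
```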
Image generation has the deepest L40S story: `diffusers` with `torch.compile` reaches advertised throughput on SDXL/Flux without custom work, and the TensorRT 10 UNet engine path adds another 1.5-2x on top. NVIDIA Omniverse and Isaac Sim run natively, taking advantage of the third-generation RT cores. Video pipelines use the NVENC/NVDEC engines via NVIDIA's Video Codec SDK 12+ for AV1 encode/decode at 4K60.
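One common way to reach the NVENC/NVDEC engines without writing against the Video Codec SDK directly is FFmpeg built with NVENC support. The sketch below assembles such a command for an AV1 transcode; using FFmpeg at all is an assumption rather than something this entry prescribes, the file names and bitrate are placeholders, and the helper function is hypothetical. It requires an FFmpeg build that includes the `av1_nvenc` encoder.

```python
import shlex


def av1_transcode_command(src, dst, bitrate="8M", gpu_index=0):
    """Build an FFmpeg command that decodes on NVDEC and encodes AV1 on NVENC."""
    return [
        "ffmpeg",
        "-hwaccel", "cuda",                # decode on the GPU
        "-hwaccel_output_format", "cuda",  # keep frames in GPU memory
        "-hwaccel_device", str(gpu_index),
        "-i", src,
        "-c:v", "av1_nvenc",               # hardware AV1 encoder
        "-b:v", bitrate,
        "-c:a", "copy",                    # pass audio through untouched
        dst,
    ]


print(shlex.join(av1_transcode_command("input_h264.mp4", "output_av1.mkv")))
```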
Triton Inference Server hosts L40S without special configuration; the standard `tensorrt_llm_backend` and `vllm_backend` both work. NVIDIA AI Enterprise treats L40S as the recommended L-class platform for production inference, and the NVIDIA NIM (NVIDIA Inference Microservices) catalogue ships pre-built containers for popular open-weight models targeting L40S as the cost-optimised tier.
The notable absences: no Confidential Compute (Hopper-only — sovereign-attested workloads must target H100/H200), no Transformer Engine runtime amax tracking (FP8 calibration is build-time only, not adaptive), no MIG (multi-tenant isolation via vGPU licensing only), no FP4 (Blackwell-only).
Migration and alternatives
When L40S is the right choice and when it isn't. The two heuristics: (1) the workload fits in 48 GB at FP8 or INT4 and is compute-bound or moderately memory-bound -> L40S wins on cost; (2) the workload exceeds 48 GB, requires NVLink, or has tail-latency sensitivity to KV-cache bandwidth -> move up to H100/H200. Below the L40S tier, L4 is the right answer for low-power inference under 24 GB.
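The two heuristics above can be encoded as a first-pass placement rule. The sketch below is a simplification under stated assumptions: the function and parameter names are hypothetical, the thresholds are the 48 GB and 24 GB figures from the text, and it ignores price, availability and every consideration beyond those named in the paragraph.

```python
def pick_tier(footprint_gb, needs_nvlink=False, kv_bandwidth_sensitive=False,
              low_power=False):
    """First-pass GPU tier from memory footprint and interconnect needs.

    footprint_gb is weights plus KV cache at production context length,
    measured at the precision you intend to deploy (FP8 or INT4).
    """
    if needs_nvlink or kv_bandwidth_sensitive or footprint_gb > 48:
        return "H100/H200"
    if low_power and footprint_gb <= 24:
        return "L4"
    return "L40S"


print(pick_tier(30))                               # fits, compute-bound
print(pick_tier(30, kv_bandwidth_sensitive=True))  # tail latency tied to KV bandwidth
print(pick_tier(60))                               # exceeds 48 GB
print(pick_tier(12, low_power=True))               # small, power-constrained
```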
| From / to | When it pays | Migration effort | Key incompatibility |
|---|---|---|---|
| L4 -> L40S | Need more memory (24 GB -> 48 GB) or Tensor throughput | Low (same sm_89 software) | 350 W TDP — chassis cooling redesign |
| L40 -> L40S | AI inference rather than visualisation primary load | Trivial (same drivers, same kernels) | None — clock binning only |
| A100 -> L40S | Sub-30B inference, NVLink not needed, $/token priority | Medium (BF16 -> FP8 calibration step) | No MIG; GDDR6 long-context tail latency |
| A10G -> L40S | Need FP8 or 48 GB memory | Low (sm_86 -> sm_89 kernel recompile) | 350 W vs 250 W chassis envelope |
| L40S -> H100 | Workload exceeds 48 GB or needs NVLink | Medium (FP8 calibration step on Hopper TE) | Cost roughly doubles per GPU-hour |
| L40S -> H200 | Long-context (32K+) or 70B BF16 in-place | Medium (same software as L40S -> H100) | HBM3e cost premium |
| L40S -> RTX 6000 Ada | Workstation single-card use case (not server) | Trivial (same silicon) | Active cooling — not server-friendly |
| L40S -> MI300X | Want HBM3 capacity without H100 price | High (CUDA -> ROCm rewrite) | CUDA kernels not portable; vLLM ROCm path lags |
Pitfalls and operational notes
- GDDR6 is the binding constraint on long-context inference — 864 GB/s vs A100's 2.0 TB/s vs H100's 3.35 TB/s. Benchmark KV-cache-heavy decodes (32K+ context) on L40S explicitly before promising production SLAs.
- No NVLink: TP=2 inference at 70B loses 35-55 % throughput vs single-card H100, and 70B BF16 weights (140 GB) do not fit in 2x 48 GB at all. If the model does not fit in 48 GB at FP8/INT4, plan to move up rather than splitting across L40S cards.
- No MIG — multi-tenant isolation requires vGPU licensing (NVIDIA AI Enterprise) or container-level scheduling. Hardware isolation is not available.
- FP8 calibration is build-time only — no Transformer Engine runtime amax on Ada. Recalibrate after any model update or significant change in prompt distribution.
- 350 W passive in a dual-slot card needs ~400 LFM sustained airflow at <35 C inlet; some chassis (especially older 2U designs) throttle L40S under sustained load. Verify in chassis qualification, not on the datasheet.
- PCIe Gen4 (64 GB/s) is half of Gen5 — on hosts with Gen5 NICs (CX-7 NDR 400 Gb/s), the L40S PCIe link becomes the bottleneck for streaming inference inputs at multi-replica scale.
- CEM5 16-pin power connector on 2024+ production — earlier cards use CEM4 8+8-pin adapters. Inventory mixing causes power-rail surprises; tag both in your asset DB.
- No Confidential Compute — sovereign-attested deployments requiring SPDM attestation must target H100/H200, not L40S.
- Driver R570 is the 2026 recommended baseline; R535/R545 builds are missing important NCCL-over-PCIe and DCGM fixes for L40S clusters.
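Several of the notes above (driver baseline, sm_89 targeting) can be checked across a fleet with `nvidia-smi`. The sketch below parses the CSV output of a standard `--query-gpu` call and flags cards that are not compute capability 8.9 or that run a driver older than the R570 baseline. The sample output and helper names are hypothetical, and `query_local_gpus` needs an installed NVIDIA driver, so the demonstration runs on the sample text instead.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,driver_version,compute_cap",
         "--format=csv,noheader"]


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, driver, cap = [field.strip() for field in line.split(",")]
        rows.append({"name": name,
                     "driver_major": int(driver.split(".")[0]),
                     "compute_cap": cap})
    return rows


def fleet_warnings(rows, min_driver_major=570):
    """Return human-readable warnings for cards outside the expected baseline."""
    warnings = []
    for index, row in enumerate(rows):
        if row["compute_cap"] != "8.9":
            warnings.append(f"GPU {index} ({row['name']}): not sm_89")
        if row["driver_major"] < min_driver_major:
            warnings.append(f"GPU {index} ({row['name']}): driver R{row['driver_major']} "
                            f"is older than R{min_driver_major}")
    return warnings


def query_local_gpus():
    """Run nvidia-smi on this host; requires an NVIDIA driver to be installed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)


# Hypothetical sample output from a two-card host.
sample = "NVIDIA L40S, 535.129.03, 8.9\nNVIDIA L40S, 570.86.15, 8.9\n"
for warning in fleet_warnings(parse_gpu_rows(sample)):
    print(warning)
```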
Where this fits in the Yobitel stack
L40S is the cost-optimised inference SKU across the Yobitel stack in 2026. Yobibyte — our AI-native platform — schedules sub-34B serving workloads onto L40S pools by default, with FP8 calibration baked into the model-onboarding pipeline and automatic fallback to H100/H200 for workloads that exceed L40S's 48 GB / 864 GB/s envelope. The platform's placement layer tags every L40S replica with the chassis it landed on (4-card no-switch, 8-card PEX 88096) so multi-card pipeline-parallel jobs can prefer same-switch placements.
Omniscient Compute — our cross-cloud capacity broker — indexes L40S across AWS g6e, Azure NCads-v5 variants, CoreWeave, Lambda, Crusoe, Civo, and a long tail of regional neoclouds, and normalises pricing onto the FinOps Foundation FOCUS spec. Because L40S supply expanded faster than H100 through 2025, the broker frequently surfaces L40S as the cost-leading SKU for 7B-13B serving — sometimes 50-60 % below H100 at parity throughput on workloads that fit.
InferenceBench — our public, reproducible benchmarking harness — publishes L40S throughput, latency and cost-per-token numbers for every major open-weight model under 34B on vLLM, TensorRT-LLM, SGLang and TGI, including SDXL/Flux image generation throughput and Whisper batch ASR figures. The L40S sizing tables in this entry are anchored on InferenceBench runs; if you are sizing a 2026 L40S footprint, start with InferenceBench, lift the platform configuration into a Yobibyte manifest, and let Omniscient Compute pick the region.
References
- NVIDIA L40S Datasheet · NVIDIA
- Ada Lovelace Architecture Whitepaper · NVIDIA
- TensorRT-LLM Ada FP8 engine builds · NVIDIA
- vLLM quantisation (FP8, AWQ, GPTQ) · vLLM
- NVIDIA Video Codec SDK (NVENC/NVDEC) · NVIDIA
- FinOps Foundation FOCUS billing specification · FinOps Foundation