Nsight Systems

TL;DR

Free, proprietary profiler from NVIDIA. Captures a timeline of CPU threads, CUDA API calls, kernel launches, GPU activity, NCCL communication, memory copies, NVTX ranges, and OS scheduling events.
Designed for system-level performance work: answering 'why is the GPU idle?' rather than 'why is this kernel slow?'. For kernel-level analysis use Nsight Compute; for operator-level framework profiling use PyTorch Profiler alongside it.
Low overhead (typically < 5 % on common workloads), supports multi-GPU and multi-node traces, and writes `.nsys-rep` reports that open in the desktop GUI (`nsys-ui`) or can be exported to SQLite with `nsys export --type sqlite` for querying.
Standard tool for analysing training step time, communication-vs-compute overlap on multi-GPU jobs, inference latency breakdowns, and tracking down CPU-side bottlenecks (dataloader stalls, Python GIL, host-to-device transfers).

When to Use Nsight Systems

On a healthy GPU workload, the GPU is busy executing kernels back-to-back. On almost every real workload, the GPU is idle some of the time — waiting for the dataloader, blocked on host-to-device memcpy, waiting for an NCCL AllReduce to complete, sitting through a sync point that should have been removed. Nsight Systems is the tool that makes that idle time visible at the system level.

It is the right starting point when training throughput is below expectation, inference tail latency has an unexplained component, or a multi-GPU job scales worse than projected. It is the wrong tool for tuning the SASS of a single kernel, which is Nsight Compute's job, or for operator-level profiling, which PyTorch Profiler does more ergonomically.

Capturing a Trace

The standard pattern is to wrap a representative workload — a few training steps, a handful of inference requests — under `nsys profile`. The CLI writes a `.nsys-rep` file; the GUI opens it as a timeline. For long-running workloads, the `--capture-range cudaProfilerApi` flag plus `torch.cuda.profiler.start()` / `stop()` in code limits capture to the interesting window.

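bash

# Profile a few training iterations; collection starts at cudaProfilerStart().
# NCCL calls show up through the NVTX ranges NCCL emits, so `nvtx` covers them.
nsys profile \
    -o train_profile \
    --trace=cuda,nvtx,osrt,cudnn,cublas \
    --capture-range=cudaProfilerApi \
    --cuda-memory-usage=true \
    python train.py

python

# Programmatic capture window in PyTorch (this goes in train.py)
import torch.cuda.profiler as cuda_profiler
import torch.autograd.profiler as autograd_profiler

# `loader` and `train_step` are defined elsewhere in the training script
with autograd_profiler.emit_nvtx():
    for step, batch in enumerate(loader):
        if step == 5:
            cuda_profiler.start()   # opens the nsys capture range
        train_step(batch)
        if step == 10:
            cuda_profiler.stop()    # closes it; steps 5-10 are recorded
            break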

Reading the Timeline

The Nsight timeline shows multiple synchronised rows: CPU threads at the top (Python, dataloader workers, CUDA driver threads), the CUDA API row below (call sequences from `cudaMalloc`, `cudaMemcpyAsync`, `cudaLaunchKernel`), and a row per GPU showing kernel execution and memory copies. NVTX annotations from the framework — PyTorch emits one per operator under `emit_nvtx()` — appear as named ranges, making the timeline navigable instead of an undifferentiated sea of CUDA calls.
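Beyond the per-operator ranges, custom NVTX ranges can mark coarser phases such as data loading or the optimizer step. The following is a minimal sketch, assuming PyTorch's `torch.cuda.nvtx.range_push` / `range_pop`; the `nvtx_range` helper and the range names are hypothetical, and the helper falls back to a no-op when PyTorch or CUDA is unavailable so the same script runs unprofiled.

```python
import contextlib

try:
    import torch
    # Only use NVTX when a CUDA-enabled PyTorch build is present.
    _nvtx = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _nvtx = None


@contextlib.contextmanager
def nvtx_range(name):
    """Mark a named range on the timeline (hypothetical helper; no-op without CUDA)."""
    if _nvtx is not None:
        _nvtx.range_push(name)
    try:
        yield
    finally:
        # Always close the range, even if the wrapped code raises.
        if _nvtx is not None:
            _nvtx.range_pop()


# Usage: wrap the phases you want to see as named rows in the timeline.
phases_run = []
with nvtx_range("data_loading"):
    phases_run.append("data_loading")
with nvtx_range("optimizer_step"):
    phases_run.append("optimizer_step")
```

Ranges may be nested, and each `range_push` needs a matching `range_pop`, which is why the helper closes the range in a `finally` block.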

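Three patterns recur. Gaps in the GPU row mean the GPU is waiting, so look at what the CPU threads are doing over the same interval. NCCL AllReduce kernels (launched by `ncclAllReduce`) that dominate step time without compute kernels running alongside them mean communication is not overlapped with compute. Long host-to-device `cudaMemcpyAsync` ranges immediately before a kernel mean input transfer is on the critical path, often because the source tensors are not in pinned memory.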
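The first pattern, gaps in the GPU row, can also be found programmatically from a report exported with `nsys export --type sqlite`. The sketch below assumes the export contains a `CUPTI_ACTIVITY_KIND_KERNEL` table with `start`, `end` (nanoseconds) and `deviceId` columns; verify the schema against your Nsight Systems version. The function name `gpu_idle_gaps` and the threshold are hypothetical.

```python
import sqlite3


def gpu_idle_gaps(conn, device_id=0, min_gap_ns=100_000):
    """Return (gap_start, gap_end) pairs during which no kernel ran on one GPU.

    Assumes the nsys SQLite export schema: table CUPTI_ACTIVITY_KIND_KERNEL
    with columns start, end (ns) and deviceId.
    """
    rows = conn.execute(
        'SELECT start, "end" FROM CUPTI_ACTIVITY_KIND_KERNEL '
        "WHERE deviceId = ? ORDER BY start",
        (device_id,),
    ).fetchall()

    gaps = []
    busy_until = None  # latest kernel end time seen so far
    for start, end in rows:
        if busy_until is not None and start - busy_until >= min_gap_ns:
            gaps.append((busy_until, start))
        # Kernels on different streams can overlap, so track the max end.
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


# Self-contained demonstration on a tiny in-memory table with made-up timings.
demo = sqlite3.connect(":memory:")
demo.execute(
    'CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INTEGER, "end" INTEGER, deviceId INTEGER)'
)
demo.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?, ?)",
    [(0, 100, 0), (150, 300, 0), (1000, 1200, 0)],
)
demo_gaps = gpu_idle_gaps(demo, device_id=0, min_gap_ns=100)
```

On a real report, replace the in-memory connection with `sqlite3.connect("train_profile.sqlite")` and map each gap back to the CPU rows in the GUI to see what the host was doing.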

Keep NVTX tracing enabled on every multi-GPU capture. Framework NVTX ranges give you human-readable scope, and NCCL's own NVTX ranges give you the communication primitives. Together they answer most multi-GPU performance questions.

Common Findings

Symptom on timeline	Likely cause	Fix
GPU idle between steps	Dataloader stall	Increase num_workers, enable pin_memory, raise prefetch_factor
Large memcpy H2D before each step	CPU tensors not pinned	Use pinned memory + non_blocking=True
Long NCCL AllReduce, no overlap	No gradient bucketing	Tune DDP bucket size or use ZeRO
GPU 1 idle while GPU 0 busy	Imbalanced pipeline stage	Rebalance layers across pipeline stages
Many short kernels in a row	Operator launch overhead	torch.compile, CUDA Graphs
GIL-blocked Python thread	Python overhead in hot loop	Move work to compiled kernels
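To put a number on the "Long NCCL AllReduce, no overlap" row, one option is to measure what fraction of communication time is hidden behind compute kernels. The sketch below is plain interval arithmetic on `(start, end)` pairs; how the intervals are extracted from a report, and the function names `merge_intervals` and `overlap_fraction`, are assumptions for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_fraction(comm, compute):
    """Fraction of communication time during which a compute kernel also ran."""
    comm_m = merge_intervals(comm)
    comp_m = merge_intervals(compute)
    total = sum(end - start for start, end in comm_m)
    if total == 0:
        return 0.0

    hidden = 0
    i = j = 0
    # Two-pointer sweep over both disjoint, sorted interval lists.
    while i < len(comm_m) and j < len(comp_m):
        lo = max(comm_m[i][0], comp_m[j][0])
        hi = min(comm_m[i][1], comp_m[j][1])
        if lo < hi:
            hidden += hi - lo
        # Advance whichever interval finishes first.
        if comm_m[i][1] < comp_m[j][1]:
            i += 1
        else:
            j += 1
    return hidden / total


# Made-up timings: one AllReduce half-covered by a compute kernel.
half_hidden = overlap_fraction(comm=[(0, 100)], compute=[(50, 150)])
```

A value near 1.0 means communication is almost fully hidden; a value near 0.0 matches the "no overlap" symptom in the table.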

Multi-Node Profiling

Nsight Systems supports multi-node traces by running `nsys` independently on each rank and merging the reports for viewing. The recommended pattern on large jobs is to profile a small subset of steps (5-20) on a small subset of ranks (rank 0 plus one rank from each pipeline stage), keep the report files under a few hundred MB each, and open them side-by-side. Full-cluster traces are technically possible but unwieldy.
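The rank-subset pattern can be sketched as a small launcher helper. This is an illustration under stated assumptions: ranks are assigned to pipeline stages in contiguous, equally sized blocks (real frameworks may lay ranks out differently), and the names `ranks_to_profile`, `command_for_rank`, and the `profile_rank{N}` output naming are hypothetical.

```python
def ranks_to_profile(world_size, num_stages):
    """Pick rank 0 plus the first rank of each pipeline stage.

    Assumes contiguous, equally sized rank blocks per stage.
    """
    if num_stages <= 0 or world_size % num_stages != 0:
        raise ValueError("world_size must be a positive multiple of num_stages")
    ranks_per_stage = world_size // num_stages
    chosen = {0}
    for stage in range(num_stages):
        chosen.add(stage * ranks_per_stage)
    return sorted(chosen)


def command_for_rank(rank, profiled_ranks, script_args):
    """Build the argv for one rank, wrapping it in nsys only if selected."""
    base = ["python", *script_args]
    if rank not in profiled_ranks:
        return base
    return [
        "nsys", "profile",
        "-o", f"profile_rank{rank}",          # one report file per profiled rank
        "--trace=cuda,nvtx,osrt",
        "--capture-range=cudaProfilerApi",    # keep reports small: capture a window
        *base,
    ]


# Example: 16 ranks split over 4 pipeline stages.
selected = ranks_to_profile(world_size=16, num_stages=4)
profiled_cmd = command_for_rank(4, selected, ["train.py"])
plain_cmd = command_for_rank(5, selected, ["train.py"])
```

Each rank's launcher would pass the resulting list to something like `subprocess.run`, with the rank taken from the job scheduler's environment.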

Relationship to Other Tools

PyTorch Profiler collects GPU activity through Kineto and CUPTI rather than through Nsight Systems, and attaches operator-level metadata to it; PyTorch's separate `emit_nvtx()` context is what makes operator ranges visible in an Nsight timeline. The two are complementary: PyTorch Profiler answers 'which operator is slow'; Nsight Systems answers 'why is the GPU waiting'. For per-kernel SASS analysis (instruction mix, memory throughput, occupancy bottlenecks) use Nsight Compute. For continuous telemetry in production, use DCGM Exporter and Prometheus; Nsight is a development tool, not a monitoring system.

