TL;DR
- Free, proprietary profiler from NVIDIA. Captures a timeline of CPU threads, CUDA API calls, kernel launches, GPU activity, NCCL communication, memory copies, NVTX ranges, and OS scheduling events.
- Designed for system-level performance work: answering 'why is the GPU idle?' rather than 'why is this kernel slow?'. For kernel-level analysis use Nsight Compute; for operator-level framework profiling use PyTorch Profiler alongside Nsight Systems.
- Low overhead (typically < 5 %), supports multi-node and multi-GPU traces, and writes `.nsys-rep` reports that can be viewed in the desktop GUI (`nsys-ui`) or exported to SQLite with `nsys export` for querying.
- Standard tool for analysing training step time, communication-vs-compute overlap on multi-GPU jobs, inference latency breakdowns, and tracking down CPU-side bottlenecks (dataloader stalls, Python GIL, host-to-device transfers).
When to Use Nsight Systems
On a healthy GPU workload, the GPU is busy executing kernels back-to-back. On almost every real workload, the GPU is idle some of the time: waiting for the dataloader, blocked on a host-to-device memcpy, waiting for an NCCL AllReduce to complete, or sitting through a sync point that should have been removed. Nsight Systems is the tool that makes that idle time visible at the system level.
It is the right starting point when training throughput is below expectation, inference tail latency has an unexplained component, or a multi-GPU job scales worse than projected. It is the wrong tool for tuning the SASS of a single kernel (that is Nsight Compute's job) or for operator-level profiling, which PyTorch Profiler does more ergonomically.
Capturing a Trace
The standard pattern is to wrap a representative workload — a few training steps, a handful of inference requests — under `nsys profile`. The CLI writes a `.nsys-rep` file; the GUI opens it as a timeline. For long-running workloads, the `--capture-range cudaProfilerApi` flag plus `torch.cuda.profiler.start()` / `stop()` in code limits capture to the interesting window.
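```shell
# Profile a few training iterations.
# NCCL collectives show up through the NVTX ranges that NCCL itself emits,
# so --trace needs no separate NCCL entry.
nsys profile \
  -o train_profile \
  --trace=cuda,nvtx,osrt,cudnn,cublas \
  --capture-range=cudaProfilerApi \
  --cuda-memory-usage=true \
  python train.py
```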
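```python
# Programmatic capture window in PyTorch.
# `loader` and `train_step` are assumed to be defined elsewhere in train.py.
import torch.cuda.profiler as cuda_profiler
import torch.autograd.profiler as autograd_profiler

with autograd_profiler.emit_nvtx():  # emit one NVTX range per operator
    for step, batch in enumerate(loader):
        if step == 5:
            cuda_profiler.start()  # capture begins after a few warm-up steps
        train_step(batch)
        if step == 10:
            cuda_profiler.stop()  # capture ends; steps 5 to 10 are recorded
            break
```

Reading the Timeline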
The Nsight timeline shows multiple synchronised rows: CPU threads at the top (Python, dataloader workers, CUDA driver threads), the CUDA API row below (calls such as `cudaMalloc`, `cudaMemcpyAsync`, `cudaLaunchKernel`), and a row per GPU showing kernel execution and memory copies. NVTX annotations from the framework (PyTorch emits one per operator under `emit_nvtx()`) appear as named ranges, making the timeline navigable instead of an undifferentiated sea of CUDA calls.
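Three patterns recur. Gaps in the GPU row mean the GPU is waiting; find the corresponding CPU activity. NCCL AllReduce kernels (launched by `ncclAllReduce`) that dominate step time with no compute kernels running alongside them mean communication is not overlapped with compute. Long host-to-device `cudaMemcpyAsync` ranges immediately before a kernel mean the input transfer is on the critical path, typically because the source tensors are not in pinned memory or the copy is not issued asynchronously.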
Always enable `nvtx` tracing on multi-GPU jobs. Framework NVTX ranges give you human-readable scope, and NCCL annotates its collectives with NVTX ranges as well, so the same trace exposes the communication primitives. Together they answer most multi-GPU performance questions.
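The first pattern, gaps in the GPU row, can also be measured offline once a report has been exported with `nsys export --type=sqlite`. The sketch below assumes the export contains a `CUPTI_ACTIVITY_KIND_KERNEL` table with `start`, `end` (nanosecond timestamps) and `deviceId` columns; the schema can differ between Nsight Systems versions, so check it against your own export. The function name `gpu_idle_gaps` and the 1 ms threshold are illustrative choices, not part of Nsight Systems.

```python
import sqlite3


def gpu_idle_gaps(conn, device_id=0, min_gap_ns=1_000_000):
    """Return (gap_start, gap_end) pairs where no kernel ran on the device.

    Assumes a CUPTI_ACTIVITY_KIND_KERNEL table with start/end/deviceId
    columns (an assumption about the export schema; verify before use).
    """
    rows = conn.execute(
        'SELECT start, "end" FROM CUPTI_ACTIVITY_KIND_KERNEL '
        "WHERE deviceId = ? ORDER BY start",
        (device_id,),
    ).fetchall()

    gaps = []
    busy_until = None  # latest kernel end seen so far; streams may overlap
    for start, end in rows:
        if busy_until is not None and start - busy_until >= min_gap_ns:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps
```

Called as `gpu_idle_gaps(sqlite3.connect("train_profile.sqlite"))`, it returns the idle windows on GPU 0 that last at least 1 ms. Each window is a timestamp range to inspect in the GUI for the CPU activity that caused it.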
Common Findings
| Symptom on timeline | Likely cause | Fix |
|---|---|---|
| GPU idle between steps | Dataloader stall | Increase `num_workers`, enable `pin_memory`, raise `prefetch_factor` |
| Large memcpy H2D before each step | CPU tensors not pinned | Use pinned memory + `non_blocking=True` |
| Long NCCL AllReduce, no overlap | Gradient sync not overlapped with backward pass | Tune DDP bucket size or use ZeRO |
| GPU 1 idle while GPU 0 busy | Imbalanced pipeline stage | Rebalance work across pipeline stages |
| Many short kernels in a row | Operator launch overhead | torch.compile, CUDA Graphs |
| GIL-blocked Python thread | Python overhead in hot loop | Move work to compiled kernels |
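The 'Long NCCL AllReduce, no overlap' row can be put into numbers by measuring how much communication time is not hidden behind compute. The following is a minimal sketch, assuming kernel records are available as `(start, end, name)` tuples (for example, extracted from a SQLite export) and that NCCL kernel names contain the substring `nccl`. Both are assumptions, and the helper names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_length(a, b):
    """Total length of the intersection of two sorted, disjoint interval lists."""
    total, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def exposed_communication(kernels):
    """Return (total_comm_time, exposed_comm_time) in the input's time unit.

    Kernels whose name contains 'nccl' count as communication (an assumption
    about naming); everything else counts as compute.
    """
    comm = merge([(s, e) for s, e, name in kernels if "nccl" in name.lower()])
    compute = merge([(s, e) for s, e, name in kernels if "nccl" not in name.lower()])
    comm_total = sum(e - s for s, e in comm)
    return comm_total, comm_total - overlap_length(comm, compute)
```

An exposed time close to the total suggests the collectives run serially with compute; a value near zero suggests they are already well hidden and bucket tuning is unlikely to help.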
Multi-Node Profiling
Nsight Systems supports multi-node traces by running `nsys` independently on each rank and merging the reports for viewing. The recommended pattern on large jobs is to profile a small subset of steps (5-20) on a small subset of ranks (rank 0 plus one rank from each pipeline stage), keep the report files under a few hundred MB each, and open them side-by-side. Full-cluster traces are technically possible but unwieldy.
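One way to profile only a subset of ranks is a small launcher wrapper that prefixes the training command with `nsys profile` on selected ranks and runs it unmodified elsewhere. The sketch below is illustrative: `build_command`, the `profiles` output directory, and the chosen rank set are hypothetical, and reading the rank from the `RANK` environment variable assumes a launcher such as `torchrun` that sets it.

```python
import os
import shlex


def build_command(train_cmd, rank, profiled_ranks, out_dir="profiles"):
    """Return the argv for this rank, wrapped in nsys only if the rank is selected."""
    if rank not in profiled_ranks:
        return list(train_cmd)
    return [
        "nsys", "profile",
        "-o", os.path.join(out_dir, f"rank{rank}"),  # one report file per rank
        "--trace=cuda,nvtx,osrt",
        "--capture-range=cudaProfilerApi",  # capture only the window marked in code
        *train_cmd,
    ]


if __name__ == "__main__":
    # RANK is assumed to be set by the launcher; defaults to 0 otherwise.
    rank = int(os.environ.get("RANK", "0"))
    cmd = build_command(["python", "train.py"], rank, profiled_ranks={0, 8, 16})
    # Printed rather than executed here; a real wrapper might call os.execvp.
    print(shlex.join(cmd))
```

Writing one report per rank keeps each file small and lets the selected ranks be opened side by side.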
Relationship to Other Tools
PyTorch Profiler collects operator-level traces through its own Kineto/CUPTI backend rather than through Nsight, while `torch.autograd.profiler.emit_nvtx()` emits NVTX ranges that Nsight Systems displays alongside its system-level capture. The two are complementary: PyTorch Profiler answers 'which operator is slow'; Nsight Systems answers 'why is the GPU waiting'. For per-kernel SASS analysis (instruction mix, memory throughput, occupancy bottlenecks), use Nsight Compute. For continuous telemetry in production, use DCGM Exporter and Prometheus; Nsight is a development tool, not a monitoring system.