NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)

TL;DR

SHARP is an NVIDIA-developed in-network compute capability that offloads AllReduce, Reduce, Barrier, and other collective operations into InfiniBand switch ASICs.
Trees are built across switches at job start; reduction data flows up the tree, is combined (sum, min, or max) in switch hardware, and the result is multicast back down, roughly halving the data each endpoint puts on the wire compared with ring AllReduce.
Generations track switch ASICs: SHARPv2 on Quantum (HDR), SHARPv3 on Quantum-2 (NDR), SHARPv4 on Quantum-3 (XDR) with BF16/FP8 and deeper trees.
Integrated transparently into NCCL, MPI, and UCC, and toggled via env vars like `NCCL_COLLNET_ENABLE=1`. Speed-ups are largest for AllReduce-bound workloads at thousand-GPU scale.

Overview

SHARP, the Scalable Hierarchical Aggregation and Reduction Protocol, is the in-network compute capability that distinguishes NVIDIA InfiniBand fabrics from generic lossless transports. The idea is straightforward: instead of endpoints exchanging gradient slices, summing them on the host, and sending the results back out, the switches themselves perform the addition as packets flow through them.

For data-parallel training, AllReduce of model gradients is the single most expensive collective. Compared with ring AllReduce, SHARP roughly halves the bytes each endpoint has to push through the fabric and eliminates the receive-side reduction cost on endpoints. The result is shorter, more deterministic AllReduce times at scale.
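The "roughly halves" figure can be checked against the standard bandwidth accounting for ring AllReduce. The symbols below are introduced here for illustration and are not SHARP terminology: S is the message size per rank, N is the number of ranks, and B is the number of bytes each endpoint sends.

```latex
\begin{align*}
B_{\text{ring}}  &= 2\,\frac{N-1}{N}\,S
  && \text{(reduce-scatter phase plus all-gather phase)}\\
B_{\text{SHARP}} &= S
  && \text{(each endpoint sends its buffer up the tree once)}\\
\frac{B_{\text{SHARP}}}{B_{\text{ring}}} &= \frac{N}{2(N-1)} \;\longrightarrow\; \frac{1}{2}
  && (N \to \infty)
\end{align*}
```

For N = 256 the ratio is 256/510, or about 0.502, so the halving is already nearly exact at that scale. This counts bytes only; it says nothing about latency or switch-side limits.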

How a SHARP Tree Is Built

When a job begins, the SHARP daemon on each node coordinates with the Aggregation Manager (a service typically run on a head node or UFM appliance) to allocate a reduction tree across the switches participating in the job. Each switch ASIC contains a fixed number of Aggregation Nodes (ANs) that perform the actual arithmetic; the tree is sized so every job gets a non-overlapping slice of ANs.

Each leaf of the tree corresponds to an endpoint; each internal node corresponds to a switch AN. When the application calls AllReduce, data is sent from leaves up the tree; at each AN, the values from child branches are summed and the result is forwarded to the parent. At the root, the final sum is multicast back down the same tree. Endpoints receive the final value without a second host-side reduction.
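The up-then-down data flow described above can be illustrated with a toy model. This is only a sketch in plain bash arithmetic, assuming a two-level tree (8 endpoints, 2 leaf switches, 1 root); the variable names are made up for the example and nothing here calls the SHARP library.

```shell
#!/usr/bin/env bash
# Toy model of a two-level reduction tree. Illustrative only: not the SHARP API.
set -euo pipefail

# One value per endpoint (think: one gradient element per rank).
endpoint_values=(1 2 3 4 5 6 7 8)
endpoints_per_switch=4

# Step 1: each leaf-switch aggregation node sums the values of its children.
leaf_sums=()
for ((start = 0; start < ${#endpoint_values[@]}; start += endpoints_per_switch)); do
  partial=0
  for value in "${endpoint_values[@]:start:endpoints_per_switch}"; do
    partial=$((partial + value))
  done
  leaf_sums+=("$partial")
done

# Step 2: the root aggregation node sums the partial results from the leaf switches.
root_sum=0
for partial in "${leaf_sums[@]}"; do
  root_sum=$((root_sum + partial))
done

# Step 3: the root result travels back down; every endpoint ends with the same value.
results=()
for ((i = 0; i < ${#endpoint_values[@]}; i++)); do
  results+=("$root_sum")
done

echo "leaf sums: ${leaf_sums[*]}"
echo "endpoint results: ${results[*]}"
```

Each endpoint contributes its value once and receives the final sum once, which is the property that removes the second host-side reduction.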

Generations

Generation	Switch ASIC	Data Types	Notes
SHARPv1	Switch-IB 2	FP32, INT32	Original streaming reduction
SHARPv2	Quantum (HDR)	FP32, INT32, FP16	Larger trees, more parallel streams
SHARPv3	Quantum-2 (NDR)	FP32, FP16, BF16	Multiple concurrent jobs, AllToAll support
SHARPv4	Quantum-3 (XDR)	FP32, BF16, FP8	Deeper trees, FP8 sums for very large clusters

Enabling SHARP in NCCL

bash

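# NCCL picks SHARP via the CollNet plugin (libnccl-net.so + libsharp.so).
export NCCL_COLLNET_ENABLE=1
# Allow the CollNet algorithms; Ring stays in the list as an alternative.
export NCCL_ALGO=CollnetChain,CollnetDirect,Ring
export SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1
# INFO-level logging is needed for the CollNet line below to appear.
export NCCL_DEBUG=INFO

# Verify SHARP is active in the job log:
#   NCCL INFO Connected ... using CollNet
# Then benchmark with nccl-tests AllReduce at the message sizes you care about
# (here: 256 ranks, one GPU per rank, 64 MB to 8 GB, doubling each step).
mpirun -np 256 ./build/all_reduce_perf -b 64M -e 8G -f 2 -g 1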
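The log check can be scripted. The helper below is a hypothetical sketch: `collnet_status` is a name invented for this example, it assumes the job ran with `NCCL_DEBUG=INFO` and that its output was saved to a file, and it only looks for the literal string `CollNet` taken from the log line quoted above. A substring match is coarse, so treat a hit as a hint rather than proof.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a saved NCCL log mentions CollNet.
set -euo pipefail

collnet_status() {
  local log_file="$1"
  # -q: no output, exit status only; -F: match the literal string, not a regex.
  if grep -qF "CollNet" "$log_file"; then
    echo "collnet-mentioned"
  else
    echo "collnet-absent"
  fi
}

# Demo on a made-up log; the first line mirrors the message quoted above.
demo_log="$(mktemp)"
printf '%s\n' \
  'NCCL INFO Connected ... using CollNet' \
  'NCCL INFO placeholder line' > "$demo_log"
demo_status="$(collnet_status "$demo_log")"
echo "$demo_status"
rm -f "$demo_log"
```

In practice the function would be pointed at the captured job output, for example `collnet_status job.log`, where `job.log` is whatever file the scheduler or `mpirun` redirect produced.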

When SHARP Helps Most

Data-parallel training of dense models where AllReduce is the dominant collective.
Message sizes above roughly 16 MB; below that, fixed per-operation overhead dominates and host-side ring AllReduce can be faster.
Multi-tenant clusters where multiple jobs run concurrent collectives, because SHARP's per-job tree isolation helps avoid head-of-line blocking between jobs.
Clusters above 256 GPUs, where the 2(N-1) sequential steps of a host-side ring AllReduce become significant.
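The scaling argument can be made concrete with a little arithmetic. A ring AllReduce over N ranks takes 2(N-1) sequential steps (N-1 for reduce-scatter, N-1 for all-gather), whereas the depth of a reduction tree grows only logarithmically. The sketch below assumes an idealised tree in which every switch aggregates `radix` children; that model and the radix of 32 are assumptions for illustration, not properties of any particular fabric.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sequential steps in a ring AllReduce: (N-1) reduce-scatter + (N-1) all-gather.
ring_steps() {
  local n="$1"
  echo $((2 * (n - 1)))
}

# Levels in an idealised tree where every switch aggregates `radix` children.
# Hypothetical model: real trees are constrained by topology and AN availability.
tree_levels() {
  local n="$1" radix="$2" levels=0 reach=1
  while ((reach < n)); do
    reach=$((reach * radix))
    levels=$((levels + 1))
  done
  echo "$levels"
}

for n in 64 256 1024; do
  echo "N=$n ring_steps=$(ring_steps "$n") tree_levels=$(tree_levels "$n" 32)"
done
```

Under these assumptions the ring step count grows from 126 to 2046 between 64 and 1024 ranks while the tree stays at two levels, which is why the gap widens with cluster size.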

Pitfalls

Aggregation Manager misconfiguration silently disables SHARP — jobs fall back to ring AllReduce and you only notice through the throughput regression.
AN count per switch is finite. Very large jobs may exhaust ANs and partially fall back; verify in NCCL logs.
SHARP requires consistent firmware across all switches in the tree path; mixed-firmware fabrics cause sporadic CollNet initialisation failures.
Not all NCCL collectives benefit equally: depending on the NCCL version, Reduce-Scatter and AllGather may still use endpoint algorithms.

Always run an `nccl-tests` baseline with and without `NCCL_COLLNET_ENABLE` before declaring SHARP active in production — the env var only enables the attempt; failures are silent.
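A with/without comparison can be automated by parsing the `Avg bus bandwidth` summary line that nccl-tests prints at the end of a run. The sketch below is a hypothetical helper, not part of nccl-tests: the function names, the log file names, and the 10% threshold are all assumptions, and the bandwidth figures in the demo are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Extract the "Avg bus bandwidth" figure (GB/s) from saved nccl-tests output.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }' "$1"
}

# Succeeds when the CollNet run beats the baseline by at least min_gain_pct percent.
collnet_gain_ok() {
  local baseline_log="$1" collnet_log="$2" min_gain_pct="$3"
  awk -v base="$(avg_busbw "$baseline_log")" \
      -v coll="$(avg_busbw "$collnet_log")" \
      -v pct="$min_gain_pct" \
      'BEGIN { exit !(coll >= base * (1 + pct / 100)) }'
}

# Demo with made-up numbers standing in for two real benchmark logs.
demo_dir="$(mktemp -d)"
echo "# Avg bus bandwidth    : 92.5"  > "$demo_dir/baseline.log"
echo "# Avg bus bandwidth    : 171.3" > "$demo_dir/collnet.log"

if collnet_gain_ok "$demo_dir/baseline.log" "$demo_dir/collnet.log" 10; then
  gain_verdict="gain"
  echo "CollNet run is at least 10% faster than the baseline"
else
  gain_verdict="no-gain"
  echo "No meaningful gain: SHARP may have silently fallen back"
fi
rm -rf "$demo_dir"
```

For a real check, the two logs would come from the same `all_reduce_perf` command run once with `NCCL_COLLNET_ENABLE=0` and once with `NCCL_COLLNET_ENABLE=1`, making sure the variable actually reaches every rank (with Open MPI, for instance, via `mpirun -x NCCL_COLLNET_ENABLE`).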

