mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
NDmv4 Performance
All results were measured on NDmv4 with NCCL version 2.17.1+cuda11.8. The numbers reported are the in-place results.
nccl-tests command example:
mpirun --bind-to numa -hostfile /mnt/hostfile --tag-output --allow-run-as-root -map-by ppr:8:node -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 -x PATH -x LD_PRELOAD=/mnt/nccl/build/lib/libnccl.so -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_SOCKET_IFNAME=eth0 -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_NET_GDR_LEVEL=5 -x NCCL_TOPO_FILE=/mnt/ndv4-topo.xml -x NCCL_DEBUG=WARN ./build/all_gather_perf -b 1K -e 1K -g 1 -c 1 -w 10 -n 10 -G 1
mscclpp-tests command example:
mpirun -allow-run-as-root -map-by ppr:8:node -hostfile /mnt/hostfile ./build/test/mscclpp-test/allgather_test_perf -b 1K -e 1K -w 10 -n 10 -G 10 -k 0
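The MSCCL++ AllGather columns below report three variants (K0/K1/K2). A minimal sketch of generating one command line per variant is shown here. It assumes that the `-k` flag in the command above selects the variant and that `-b`/`-e` set the message size; the helper name `mscclpp_allgather_cmd` and its parameters are hypothetical and not part of mscclpp-test.

```python
import shlex


def mscclpp_allgather_cmd(size="1K", kernel=0, gpus_per_node=8,
                          hostfile="/mnt/hostfile"):
    """Build the argv list for one allgather_test_perf run.

    Assumption: `-k` picks the K0/K1/K2 variant listed in the tables.
    All other flags are copied from the example command unchanged.
    """
    return [
        "mpirun", "-allow-run-as-root",
        "-map-by", f"ppr:{gpus_per_node}:node",
        "-hostfile", hostfile,
        "./build/test/mscclpp-test/allgather_test_perf",
        "-b", size, "-e", size,      # begin and end size: a single message size
        "-w", "10", "-n", "10", "-G", "10",
        "-k", str(kernel),
    ]


if __name__ == "__main__":
    # Print one command per variant and per message size used in the tables.
    for size in ("1K", "1G"):
        for kernel in (0, 1, 2):
            print(shlex.join(mscclpp_allgather_cmd(size, kernel)))
```

The script only prints the commands; launching them (for example with `subprocess.run`) is left to the reader, since the hostfile and build paths are specific to the benchmark machines.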
NOTE: NCCL AllGather uses a ring algorithm rather than an all-pairs-like algorithm, which greatly reduces inter-node traffic and results in significantly higher performance. MSCCL++ should adopt a similar approach in the future.
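To make the difference in inter-node traffic concrete, here is a rough counting sketch. It is a simplified model, not a description of either library's implementation: it assumes ranks are numbered consecutively within each node, that the ring follows rank order (so exactly one ring link leaves each node), and that every rank contributes one equally sized chunk. The function names are made up for this illustration.

```python
def all_pairs_inter_node_chunks(nodes: int, gpus_per_node: int) -> int:
    """Chunks crossing the network when every rank sends its chunk
    directly to every rank on the other nodes."""
    ranks = nodes * gpus_per_node
    remote_peers = (nodes - 1) * gpus_per_node
    return ranks * remote_peers


def ring_inter_node_chunks(nodes: int, gpus_per_node: int) -> int:
    """Chunks crossing the network in a single ring AllGather.

    Each ring link forwards (ranks - 1) chunks in total, and with
    consecutive rank placement only `nodes` links cross node boundaries.
    """
    if nodes < 2:
        return 0
    ranks = nodes * gpus_per_node
    return nodes * (ranks - 1)


if __name__ == "__main__":
    for nodes, gpus in [(1, 8), (2, 1), (2, 8)]:
        print(f"{nodes} node(s) x {gpus} GPU(s): "
              f"all-pairs={all_pairs_inter_node_chunks(nodes, gpus)} chunks, "
              f"ring={ring_inter_node_chunks(nodes, gpus)} chunks")
```

Under this model the two schemes move the same number of chunks between nodes for 2 nodes with 1 GPU each (2 vs 2), while for 2 nodes with 8 GPUs each the all-pairs pattern moves 128 chunks across the network against 30 for the ring.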
1 node, 8 gpus/node
Latency (us)
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1K | 12.53 | 16.96 | 9.34 | 7.76 / 21.06 / 28.50 | 157.91 / 143.21 / 447.0 | 326.4 |
BusBW (GB/s)
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1G | 253.59 | 231.45 | 254.69 | 217.05 / 216.98 / 217.15 | 125.06 / 255.64 / 124.89 | 22.55 |
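For readers unfamiliar with the BusBW metric, the following sketch shows how bus bandwidth is derived from a measured time under the nccl-tests convention: algorithm bandwidth is message size divided by time, scaled by a per-collective factor. Whether mscclpp-test applies exactly the same factors is an assumption here, and the function name and the timing in the usage example are hypothetical, not measurements from this page.

```python
# Per-collective correction factors as documented by nccl-tests,
# where n is the number of ranks.
BUS_FACTORS = {
    "allgather": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "allreduce": lambda n: 2 * (n - 1) / n,
}


def bus_bandwidth_gbps(collective: str, size_bytes: float,
                       time_us: float, ranks: int) -> float:
    """Return bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    alg_bw = size_bytes / (time_us * 1e-6) / 1e9  # algorithm bandwidth, GB/s
    return alg_bw * BUS_FACTORS[collective](ranks)


if __name__ == "__main__":
    # Hypothetical timing: a 1 GiB AllGather across 8 ranks taking 4000 us.
    print(bus_bandwidth_gbps("allgather", 2**30, 4000.0, 8))
```

The factor normalizes for the fraction of data that actually has to cross the interconnect, which is why BusBW values can be compared across collectives and rank counts more directly than raw algorithm bandwidth.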
2 nodes, 1 gpu/node
Latency (us)
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1K | 16.08 | 21.27 | 29.84 | 14.67 / 29.12 / 35.43 | 15.32 / 13.84 / 26.08 | - |
BusBW (GB/s)
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1G | 15.84 | 18.65 | 15.48 | 13.94 / 13.83 / 14.10 | 23.30 / 23.29 / 21.60 | - |
2 nodes, 8 gpus/node
Latency (us)
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1K | 33.74 | 35.85 | 49.75 | 22.55 / 39.33 / 56.93 | 159.14 / 230.52 / 462.7 | - |
BusBW (GB/s)
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1G | 177.05 | 183.82 | 37.80 | 40.17 / 40.18 / 40.23 | 44.19 / 9.31 / 209.33 | - |
| 4G | 186.01 | 188.18 | 37.81 | - / - / - | 44.60 / - / 234.08 | - |