
MSCCL++

GPU-driven computation & communication stack.

Quick Start

Preliminaries

Compile Library

Run make in the top directory. To use MPI for test code, pass MPI_HOME (/usr/local/mpi by default). For example:

$ MPI_HOME=/usr/local/mpi make -j

If you do not want to use MPI, pass USE_MPI_FOR_TESTS=0.

# Do not use MPI
$ USE_MPI_FOR_TESTS=0 make -j

make will create a header file build/include/mscclpp.h and a shared library build/lib/libmscclpp.so.
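As a quick sanity check after building, a small script can confirm that both artifacts exist. The following is a minimal sketch: the helper name `missing_build_outputs` is hypothetical and not part of MSCCL++, and it only checks the two paths named above.

```python
from pathlib import Path

# Paths that `make` is documented to produce, relative to the repository root.
EXPECTED_OUTPUTS = ("build/include/mscclpp.h", "build/lib/libmscclpp.so")


def missing_build_outputs(repo_root="."):
    """Return the expected build outputs that do not exist under repo_root.

    Hypothetical helper for illustration; it is not shipped with MSCCL++.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list from `missing_build_outputs()` run in the top directory would suggest the build completed; any returned path points at an output that was not created.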

(Optional) Tests

For verification, you can run the provided sample programs bootstrap_test and p2p_test. First, add the MSCCL++ library path to LD_LIBRARY_PATH.

$ export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH

Run tests using MPI:

$ mpirun -np 8 ./build/bin/tests/bootstrap_test 127.0.0.1:50000
$ mpirun -np 8 ./build/bin/tests/p2p_test 127.0.0.1:50000

If the tests were compiled without MPI, pass the rank and the total number of ranks as extra arguments, as in the following example. p2p_test takes the same arguments as bootstrap_test.

# Terminal 1: Rank 0, #Ranks 2
$ ./build/bin/tests/bootstrap_test 127.0.0.1:50000 0 2
# Terminal 2: Rank 1, #Ranks 2
$ ./build/bin/tests/bootstrap_test 127.0.0.1:50000 1 2
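Opening one terminal per rank becomes tedious beyond a few ranks. The sketch below starts all ranks from a single script, using only the argument order shown above (address, rank, number of ranks). The helper names `rank_commands` and `launch_local_ranks` are hypothetical and not part of MSCCL++; the sketch assumes all ranks run on the local machine and that the tests were built without MPI.

```python
import subprocess


def rank_commands(binary, ip_port, num_ranks):
    """Build one command line per rank: <binary> <ip:port> <rank> <num_ranks>."""
    return [[binary, ip_port, str(rank), str(num_ranks)] for rank in range(num_ranks)]


def launch_local_ranks(binary, ip_port, num_ranks):
    """Start every rank as a separate local process and return their exit codes.

    All processes are started before any is waited on, because the ranks
    need to be running concurrently to reach each other.
    """
    procs = [subprocess.Popen(cmd) for cmd in rank_commands(binary, ip_port, num_ranks)]
    return [proc.wait() for proc in procs]
```

For instance, `launch_local_ranks("./build/bin/tests/bootstrap_test", "127.0.0.1:50000", 2)` would start the same two processes as the two terminals in the example; a list of zeros would indicate that every rank exited normally.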

Performance

All results were measured on NDv4 nodes with NCCL version 2.17.1+cuda11.8. In-place numbers are reported.
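To help interpret the BusBW columns, the sketch below shows how nccl-tests derives bus bandwidth from the measured time: it computes an algorithm bandwidth (bytes divided by time) and scales it by a per-collective factor. Whether the MSCCL++ tests use exactly the same convention is an assumption here, and the function name `bus_bandwidth_gbps` is hypothetical.

```python
# Factors nccl-tests applies to algorithm bandwidth to obtain bus bandwidth,
# as a function of the number of ranks n.
BUS_FACTORS = {
    "all_gather": lambda n: (n - 1) / n,
    "all_to_all": lambda n: (n - 1) / n,
    "all_reduce": lambda n: 2 * (n - 1) / n,
}


def bus_bandwidth_gbps(collective, total_bytes, seconds, num_ranks):
    """Return bus bandwidth in GB/s (1 GB = 1e9 bytes).

    total_bytes is the message size the benchmark reports; for AllGather
    that is the gathered size (per-rank size times the number of ranks).
    """
    alg_bw = total_bytes / seconds / 1e9
    return alg_bw * BUS_FACTORS[collective](num_ranks)
```

Because the factor depends on the collective, BusBW values are meant to be comparable against hardware link bandwidth across collectives, whereas raw algorithm bandwidth is not.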

nccl-tests command example:

mpirun --bind-to numa -hostfile /mnt/hostfile --tag-output --allow-run-as-root -map-by ppr:8:node --bind-to numa -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 -x PATH -x LD_PRELOAD=/mnt/nccl/build/lib/libnccl.so -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_SOCKET_IFNAME=eth0 -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_NET_GDR_LEVEL=5 -x NCCL_TOPO_FILE=/mnt/ndv4-topo.xml -x NCCL_DEBUG=WARN ./build/all_gather_perf -b 1K -e 1K -g 1 -c 1 -w 10 -n 10 -G 1

mscclpp-tests command example:

mpirun -allow-run-as-root -map-by ppr:8:node -hostfile /mnt/hostfile -x LD_LIBRARY_PATH=/mnt/mscclpp/build/lib:$LD_LIBRARY_PATH ./build/bin/tests/allgather_test_perf -b 1K -e 1K -w 10 -n 10 -G 1 -k 0

NOTE: NCCL AllGather uses a Ring algorithm rather than an all-pairs-like algorithm, which greatly reduces inter-node transmission and yields significantly higher performance. MSCCL++ should do something similar in the future.
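The difference in inter-node traffic can be illustrated with a simplified counting model. This is only a sketch, not the actual NCCL or MSCCL++ implementation: it assumes ranks are numbered contiguously within each node, that the ring follows rank order, and that every rank contributes one equally sized chunk. The function names are hypothetical.

```python
def node_of(rank, gpus_per_node):
    """Node index of a rank, assuming contiguous rank numbering per node."""
    return rank // gpus_per_node


def all_pairs_inter_node_chunks(num_nodes, gpus_per_node):
    """Count chunk transfers that cross nodes when every rank sends its
    chunk directly to every other rank."""
    n = num_nodes * gpus_per_node
    count = 0
    for src in range(n):
        for dst in range(n):
            if src != dst and node_of(src, gpus_per_node) != node_of(dst, gpus_per_node):
                count += 1
    return count


def ring_inter_node_chunks(num_nodes, gpus_per_node):
    """Count chunk transfers that cross nodes in a single-ring AllGather:
    in each of n - 1 steps, every rank forwards one chunk to its successor."""
    n = num_nodes * gpus_per_node
    count = 0
    for _step in range(n - 1):
        for rank in range(n):
            successor = (rank + 1) % n
            if node_of(rank, gpus_per_node) != node_of(successor, gpus_per_node):
                count += 1
    return count
```

Under these assumptions, 2 nodes with 8 GPUs each give 128 inter-node chunk transfers for all-pairs and 30 for the ring. In the ring, only the hops between the last rank of one node and the first rank of the next cross the network; every other hop stays inside a node.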

1 node, 8 GPUs/node

Latency (us)

| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1K | 12.53 | 16.96 | 9.34 | 7.76 / 21.06 / 28.50 | 157.91 / 143.21 / 447.0 | 326.4 |

BusBW (GB/s)

| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1G | 253.59 | 132.31 | 254.69 | 217.05 / 216.98 / 217.15 | 125.06 / 255.64 / 124.89 | 22.55 |

2 nodes, 1 GPU/node

Latency (us)

| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1K | 16.08 | 21.27 | 29.84 | 14.67 / 29.12 / 35.43 | 15.32 / 13.84 / 26.08 | - |

BusBW (GB/s)

| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1G | 15.84 | 18.65 | 15.48 | 13.94 / 13.83 / 14.10 | 23.30 / 23.29 / 21.60 | - |

2 nodes, 8 GPUs/node

Latency (us)

| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1K | 33.74 | 35.85 | 49.75 | 22.55 / 39.33 / 56.93 | 159.14 / 230.52 / 462.7 | - |

BusBW (GB/s)

| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|---|---|---|---|---|---|---|
| 1G | 177.05 | 183.82 | 37.80 | 40.17 / 40.18 / 40.23 | 44.19 / 9.31 / 209.33 | - |
| 4G | 186.01 | 188.18 | 37.81 | - / - / - | 44.60 / - / 234.08 | - |

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
