mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 01:10:22 +00:00

Saeed Maleki c9ac615b20 Merge pull request #74 from microsoft/saemal/offloading

offloading allgather to CPU entirely

2023-05-15 16:27:00 -07:00

.github/workflows

update

2023-05-11 08:55:51 +00:00

cmake

Make clang-format style file explicit

2023-05-05 19:15:38 +00:00

include

Add headers to install and set default install dir

2023-05-12 21:23:01 +00:00

python

lint + typo fix

2023-04-17 19:06:58 +00:00

src

Merge remote-tracking branch 'origin/api-extension' into saemal/offloading

2023-05-12 22:43:22 +00:00

test

fully working with double buffering

2023-05-12 22:42:22 +00:00

tools/npkit

NPKit: add DMA events and fix bandwidth calculation (#33 )

2023-03-28 09:58:32 +08:00

.clang-format

Add clang-format to CMake

2023-05-05 18:05:55 +00:00

.gitignore

[python] switch to setup.py to build package

2023-04-12 12:29:17 -07:00

CMakeLists.txt

Change install dir

2023-05-12 21:25:29 +00:00

CODE_OF_CONDUCT.md

CODE_OF_CONDUCT.md committed

2023-02-01 16:28:53 -08:00

LICENSE

LICENSE committed

2023-02-01 16:28:54 -08:00

README.md

Fix perf numbers in README.md

2023-04-24 18:46:34 +08:00

SECURITY.md

SECURITY.md committed

2023-02-01 16:28:56 -08:00

SUPPORT.md

SUPPORT.md committed

2023-02-01 16:28:57 -08:00

TODO.md

TODOs

2023-04-27 00:26:00 +00:00

README.md

MSCCL++

GPU-driven computation & communication stack.

Quick Start

Preliminaries

OS: tested on Ubuntu 18.04 and 20.04
Libraries: CUDA >= 11.1.1, libnuma
GPUs: A100 (TBU: H100)
Azure SKUs: ND_A100_v4, NDm_A100_v4 (TBD: NC_A100_v4)

Compile Library

Run make in the top-level directory. To build the test code with MPI, set MPI_HOME (default: /usr/local/mpi). For example:

$ MPI_HOME=/usr/local/mpi make -j

If you do not want to use MPI, pass USE_MPI_FOR_TESTS=0.

# Do not use MPI
$ USE_MPI_FOR_TESTS=0 make -j

make will create a header file build/include/mscclpp.h and a shared library build/lib/libmscclpp.so.
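As a quick sanity check after building, the two outputs named above can be looked up from a script. The following is a minimal sketch; the helper name missing_build_outputs is hypothetical and is not part of MSCCL++.

```python
from pathlib import Path

# Output paths as stated in this README, relative to the repository root.
EXPECTED_OUTPUTS = ("build/include/mscclpp.h", "build/lib/libmscclpp.so")


def missing_build_outputs(repo_root="."):
    """Return the expected build outputs that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_build_outputs()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("build outputs found")
```

An empty list means both files exist; this says nothing about whether the library itself works.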

(Optional) Tests

To verify the build, try the provided sample programs bootstrap_test or p2p_test. First add the MSCCL++ library path to LD_LIBRARY_PATH.

$ export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH

Run tests using MPI:

$ mpirun -np 8 ./build/bin/tests/bootstrap_test 127.0.0.1:50000
$ mpirun -np 8 ./build/bin/tests/p2p_test 127.0.0.1:50000

If the tests are compiled without MPI, pass a rank and the number of ranks, as in the following example. p2p_test takes the same arguments as bootstrap_test.

# Terminal 1: Rank 0, #Ranks 2
$ ./build/bin/tests/bootstrap_test 127.0.0.1:50000 0 2
# Terminal 2: Rank 1, #Ranks 2
$ ./build/bin/tests/bootstrap_test 127.0.0.1:50000 1 2
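For more than two ranks, typing one command per terminal becomes tedious. The sketch below generates the per-rank command lines in the argument order shown above (address, rank, number of ranks). The helper name rank_commands is hypothetical, not something shipped with the tests.

```python
import shlex


def rank_commands(binary, ip_port, nranks):
    """Build one argv list per rank: <binary> <ip:port> <rank> <nranks>."""
    if nranks < 1:
        raise ValueError("nranks must be at least 1")
    return [[binary, ip_port, str(rank), str(nranks)] for rank in range(nranks)]


if __name__ == "__main__":
    # Print the commands for a two-rank run, matching the example above.
    for argv in rank_commands("./build/bin/tests/bootstrap_test", "127.0.0.1:50000", 2):
        print(shlex.join(argv))
```

Each argv list can be passed to subprocess.Popen to start all ranks from a single script, provided LD_LIBRARY_PATH is set as described earlier.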

Performance

All results are from NDv4. The NCCL version is 2.17.1+cuda11.8, and the reported numbers are for in-place operations.

nccl-tests command example:

mpirun --bind-to numa -hostfile /mnt/hostfile --tag-output --allow-run-as-root -map-by ppr:8:node --bind-to numa -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 -x PATH -x LD_PRELOAD=/mnt/nccl/build/lib/libnccl.so -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_SOCKET_IFNAME=eth0 -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_NET_GDR_LEVEL=5 -x NCCL_TOPO_FILE=/mnt/ndv4-topo.xml -x NCCL_DEBUG=WARN ./build/all_gather_perf -b 1K -e 1K -g 1 -c 1 -w 10 -n 10 -G 1

mscclpp-tests command example:

mpirun -allow-run-as-root -map-by ppr:8:node -hostfile /mnt/hostfile -x LD_LIBRARY_PATH=/mnt/mscclpp/build/lib:$LD_LIBRARY_PATH ./build/bin/tests/allgather_test_perf -b 1K -e 1K -w 10 -n 10 -G 1 -k 0

NOTE: NCCL AllGather uses a ring algorithm rather than an all-pairs-like algorithm, which greatly reduces inter-node traffic and yields significantly higher performance. MSCCL++ should do something similar in the future.
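To illustrate why the algorithm choice matters across nodes, the sketch below counts how many chunks cross a node boundary in an AllGather under a deliberately simplified model. The assumptions are: every rank contributes one chunk, all-pairs sends each chunk directly to every other rank, and the ring is a single ring laid out node by node so that only one link per node leaves the node. This is not how NCCL or MSCCL++ are implemented, and both function names are hypothetical.

```python
def allpairs_internode_chunks(nodes, gpus_per_node):
    """Chunks crossing node boundaries when every rank sends directly to every other rank."""
    ranks = nodes * gpus_per_node
    remote_peers = ranks - gpus_per_node  # peers located on other nodes
    return ranks * remote_peers


def ring_internode_chunks(nodes, gpus_per_node):
    """Chunks crossing node boundaries on a single node-contiguous ring."""
    if nodes == 1:
        return 0
    ranks = nodes * gpus_per_node
    # A ring AllGather takes ranks - 1 steps, and every link carries one chunk per step.
    # With a node-contiguous layout, exactly `nodes` links cross a node boundary.
    return nodes * (ranks - 1)


if __name__ == "__main__":
    for nodes, gpus in [(2, 1), (2, 8)]:
        print(nodes, gpus,
              allpairs_internode_chunks(nodes, gpus),
              ring_internode_chunks(nodes, gpus))
```

Under this model, 2 nodes with 8 GPUs each move 128 chunks between nodes with all-pairs but 30 with the ring, while with 1 GPU per node the two counts are equal.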

1 node, 8 gpus/node

Latency (us)

Message Size	NCCL AllGather	NCCL AllReduce	NCCL AllToAll	MSCCL AllToAll LL/LL128/Simple	MSCCL++ AllGather K0/K1/K2	MSCCL++ AllReduce
1K	12.53	16.96	9.34	7.76 / 21.06 / 28.50	157.91 / 143.21 / 447.0	326.4

BusBW (GB/s)

Message Size	NCCL AllGather	NCCL AllReduce	NCCL AllToAll	MSCCL AllToAll LL/LL128/Simple	MSCCL++ AllGather K0/K1/K2	MSCCL++ AllReduce
1G	253.59	132.31	254.69	217.05 / 216.98 / 217.15	125.06 / 255.64 / 124.89	22.55

2 nodes, 1 gpu/node

Latency (us)

Message Size	NCCL AllGather	NCCL AllReduce	NCCL AllToAll	MSCCL AllToAll LL/LL128/Simple	MSCCL++ AllGather K0/K1/K2	MSCCL++ AllReduce
1K	16.08	21.27	29.84	14.67 / 29.12 / 35.43	15.32 / 13.84 / 26.08	-

BusBW (GB/s)

Message Size	NCCL AllGather	NCCL AllReduce	NCCL AllToAll	MSCCL AllToAll LL/LL128/Simple	MSCCL++ AllGather K0/K1/K2	MSCCL++ AllReduce
1G	15.84	18.65	15.48	13.94 / 13.83 / 14.10	23.30 / 23.29 / 21.60	-

2 nodes, 8 gpus/node

Latency (us)

Message Size	NCCL AllGather	NCCL AllReduce	NCCL AllToAll	MSCCL AllToAll LL/LL128/Simple	MSCCL++ AllGather K0/K1/K2	MSCCL++ AllReduce
1K	33.74	35.85	49.75	22.55 / 39.33 / 56.93	159.14 / 230.52 / 462.7	-

BusBW (GB/s)

Message Size	NCCL AllGather	NCCL AllReduce	NCCL AllToAll	MSCCL AllToAll LL/LL128/Simple	MSCCL++ AllGather K0/K1/K2	MSCCL++ AllReduce
1G	177.05	183.82	37.80	40.17 / 40.18 / 40.23	44.19 / 9.31 / 209.33	-
4G	186.01	188.18	37.81	- / - / -	44.60 / - / 234.08	-
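For readers unfamiliar with the BusBW columns: nccl-tests derives bus bandwidth from the measured algorithm bandwidth (bytes divided by time) using a per-collective correction factor. The sketch below follows the factors documented by nccl-tests. Whether the MSCCL++ test programs apply exactly the same convention is an assumption here, and the function name is hypothetical.

```python
# Correction factors as documented by nccl-tests, for n participating ranks.
BUS_FACTORS = {
    "allgather": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "allreduce": lambda n: 2 * (n - 1) / n,
}


def bus_bandwidth_gbps(size_bytes, seconds, nranks, collective):
    """Convert a measured size and time into bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    alg_bw = size_bytes / seconds / 1e9  # algorithm bandwidth in GB/s
    return alg_bw * BUS_FACTORS[collective](nranks)


if __name__ == "__main__":
    # Made-up illustrative input: 8e9 bytes in 1 second across 8 ranks.
    print(bus_bandwidth_gbps(8e9, 1.0, 8, "allgather"))
```

The input values in the example are invented for illustration and are unrelated to the measurements in the tables.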

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Languages

C++ 46.1%

Python 27.4%

Cuda 22.3%

CMake 1.5%

C 1.2%

Other 1.5%