Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo Provide interface for pytorch integration with native API and DSL --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2.1 KiB
C++ Examples
Tutorials
Check out our tutorials for step-by-step guides on using MSCCL++.
(mscclpp-test)=
mscclpp-test
NOTE: mscclpp-test is NOT a performance benchmark. If you want to get the latest performance numbers, please use the Python benchmark or the NCCL APIs instead.
mscclpp-test is a set of C++ implementation examples. It requires MPI on the system, and the path should be provided via MPI_HOME environment variable to the CMake build system.
$ MPI_HOME=/path/to/mpi cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j allgather_test_perf allreduce_test_perf
For example, the following command runs the allreduce5 algorithm with 8 GPUs starting from 3MB to 48MB messages, by doubling the message size in between. You can try different algorithms by changing the -k 5 option to another value (e.g., -k 3 runs allreduce3). Check all algorithms from the code: allreduce_test.cu and allgather_test.cu.
$ mpirun --bind-to numa -np 8 ./bin/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5
NOTE: a few algorithms set a condition on the total data size, such as to be a multiple of 3. If the condition is unmet, the command will throw a regarding error.
Check the help message for more details.
$ ./bin/allreduce_test_perf --help
USAGE: allreduce_test_perf
[-b,--minbytes <min size in bytes>]
[-e,--maxbytes <max size in bytes>]
[-i,--stepbytes <increment size>]
[-f,--stepfactor <increment factor>]
[-n,--iters <iteration count>]
[-w,--warmup_iters <warmup iteration count>]
[-c,--check <0/1>]
[-T,--timeout <time in seconds>]
[-G,--cudagraph <num graph launches>]
[-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
[-k,--kernel_num <kernel number of commnication primitive>]
[-o, --output_file <output file name>]
[-h,--help]