Update documentation (#217)

Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Changho Hwang, committed via GitHub, 2023-11-23 04:58:04 +08:00
parent 7bd66a938c
commit 15f6dcca49
22 changed files with 281 additions and 312 deletions

CITATION.cff Normal file

@@ -0,0 +1,47 @@
cff-version: 1.2.0
title: "MSCCL++: A GPU-driven communication stack for scalable AI applications"
version: 0.4.0
message: >-
If you use this project in your research, please cite it as below.
authors:
- given-names: Peng
family-names: Cheng
affiliation: Microsoft Research
- given-names: Changho
family-names: Hwang
affiliation: Microsoft Research
- given-names: Abhinav
family-names: Jangda
affiliation: Microsoft Research
- given-names: Suriya
family-names: Kalivardhan
affiliation: Microsoft Azure
- given-names: Binyang
family-names: Li
affiliation: Microsoft Azure
- given-names: Shuguang
family-names: Liu
affiliation: Microsoft Azure
- given-names: Saeed
family-names: Maleki
affiliation: Microsoft Research
- given-names: Madan
family-names: Musuvathi
affiliation: Microsoft Research
- given-names: Olli
family-names: Saarikivi
affiliation: Microsoft Research
- given-names: Wei
family-names: Tsui
affiliation: Microsoft Research
- given-names: Ziyue
family-names: Yang
affiliation: Microsoft Research
repository-code: 'https://github.com/microsoft/mscclpp'
abstract: >-
MSCCL++ redefines the interface for inter-GPU communication, thereby
delivering a highly efficient and customizable communication stack
tailored for distributed GPU applications.
license: MIT
license-url: https://github.com/microsoft/mscclpp/blob/main/LICENSE


@@ -2,7 +2,7 @@
# Licensed under the MIT license.
set(MSCCLPP_MAJOR "0")
-set(MSCCLPP_MINOR "3")
+set(MSCCLPP_MINOR "4")
set(MSCCLPP_PATCH "0")
set(MSCCLPP_SOVERSION ${MSCCLPP_MAJOR})


@@ -1,32 +1,54 @@
# MSCCL++
GPU-driven computation & communication stack.
[![Latest Release](https://img.shields.io/github/release/microsoft/mscclpp.svg)](https://github.com/microsoft/mscclpp/releases/latest)
[![License](https://img.shields.io/github/license/microsoft/mscclpp.svg)](LICENSE)
[![CodeQL](https://github.com/microsoft/mscclpp/actions/workflows/codeql-analysis.yml/badge.svg?branch=main)](https://github.com/microsoft/mscclpp/actions/workflows/codeql-analysis.yml)
| Pipelines | Build Status |
|--------------------------|-------------------|
| Unit Tests (CUDA) | [![Build Status](https://dev.azure.com/binyli/HPC/_apis/build/status%2Fmscclpp-ut?branchName=main)](https://dev.azure.com/binyli/HPC/_build/latest?definitionId=4&branchName=main) |
| Integration Tests (CUDA) | [![Build Status](https://dev.azure.com/binyli/HPC/_apis/build/status%2Fmscclpp-test?branchName=main)](https://dev.azure.com/binyli/HPC/_build/latest?definitionId=3&branchName=main) |
A GPU-driven communication stack for scalable AI applications.
See [Quick Start](docs/quickstart.md) to get started.
See the latest performance evaluation on Azure [NDmv4](docs/performance-ndmv4.md).
Build our Doxygen documentation by running `doxygen` in the [`docs/`](docs/) directory. Run `python3 -m http.server <PORT>` in the `docs/doxygen/html/` directory to serve the generated HTML files.
## Overview
MSCCL++ is a development kit for implementing highly optimized distributed GPU applications, in terms of both inter-GPU communication and GPU computation. MSCCL++ is specially designed for developers who want to fine-tune inter-GPU communication of their applications at the GPU kernel level, without needing detailed knowledge of the underlying communication mechanisms. The key underlying concept of MSCCL++ is GPU-driven execution, where both communication and computation tasks are initiated by the GPU, not by the CPU. That is, the communication and computation interfaces of MSCCL++ are provided as device-side APIs (called inside a GPU kernel), while the host-side APIs of MSCCL++ are for bootstrapping, initial connection setup, or background host threads for inter-GPU DMA and RDMA (called proxies). By using MSCCL++, we expect:
MSCCL++ redefines inter-GPU communication interfaces, thereby delivering a highly efficient and customizable communication stack for distributed GPU applications. Its design is specifically tailored to accommodate diverse performance optimization scenarios often encountered in state-of-the-art AI applications. The figure below provides a high-level overview of MSCCL++ abstractions in CUDA, C, and Python.
* **Holistic Optimization for High GPU Utilization.** As both communication and computation are scheduled inside a GPU kernel at the same time, we can optimize end-to-end performance of distributed GPU applications from a global view. For example, we can minimize the GPU resource contention between communication and computation, which is known to often substantially degrade throughput of distributed deep learning applications.
| <center>MSCCL++ Abstractions Overview |
|-------------------------------|
| <img src="./docs/figs/abstractions.png" alt="MSCCL++ Abstractions" style="width: 800px;"/> |
* **Fully Pipelined System to Reduce Overhead from the Control Plane.** We can eliminate control overhead from the CPU by allowing the GPU to autonomously schedule both communication and computation. This significantly reduces GPU scheduling overhead and CPU-GPU synchronization overhead. For example, this allows us to implement highly fine-grained system pipelining (i.e., hiding communication delays by overlapping them with computation), which has been difficult for CPU-controlled applications due to their large control/scheduling overhead.
The following highlights the key features of MSCCL++.
* **Runtime Performance Optimization for Dynamic Workloads.** As we can easily implement flexible communication logic, we can optimize communication performance even at runtime. For example, we can implement the system to automatically choose different communication paths or different collective communication algorithms depending on the dynamic workload at runtime.
* **Lightweight and multi-layer abstractions.** MSCCL++ provides communication abstractions at the lowest level, close to the hardware, and at the highest level, close to the application API. The lowest level of abstraction is ultra-lightweight, enabling a user to implement the data-movement logic of a collective operation such as AllReduce inside a GPU kernel extremely efficiently, without worrying about the memory ordering of different operations. The modularity of MSCCL++ enables a user to construct MSCCL++ building blocks through a high-level abstraction in Python and feed them to a CUDA kernel, improving the user's productivity.
## Key Features (v0.3)
* **1-sided 0-copy synchronous and asynchronous abstractions.** MSCCL++ provides fine-grained synchronous and asynchronous 0-copy 1-sided abstractions for communication primitives such as `put()`, `get()`, `signal()`, `flush()`, and `wait()`. The 1-sided abstractions allow a user to asynchronously `put()` their data on the remote GPU as soon as it is ready, without requiring the remote side to issue any receive instruction. This enables users to easily implement flexible communication logic, such as overlapping communication with computation or implementing customized collective communication algorithms, without worrying about potential deadlocks. Additionally, the 0-copy capability enables MSCCL++ to transfer data directly between users' buffers without intermediate internal buffers, which saves GPU bandwidth and memory capacity.
MSCCL++ v0.3 supports the following features.
* **Unified abstractions for different interconnect hardware.** MSCCL++ provides consistent abstractions regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink/xGMI or InfiniBand). This simplifies the code for inter-GPU communication, which is often complex due to the memory ordering of GPU/CPU reads and writes and is therefore error-prone.
### In-Kernel Communication Interfaces
## Performance
MSCCL++ provides inter-GPU communication interfaces to be called by a GPU thread. For example, the `put()` method in the following example copies 1KB of data from the local GPU to a remote GPU. `channel` is a peer-to-peer communication channel between two GPUs, which consists of information on send/receive buffers. `channel` is initialized from the host side before the kernel execution.
While the power of MSCCL++ is fully realized with application-specific optimization, it still delivers performance benefits even for collective communication operations. The following figures provide a comparison of the AllReduce throughput of MSCCL++ against NCCL 2.19.3. This benchmark was tested over two [Azure NDmv4 SKUs](https://learn.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series) (8 A100-80G GPUs per node).
The key motivation behind these results is scaling LLM inference using tensor parallelism. LLM requests are usually executed in two phases: prompt processing and token sampling. Prompt processing uses a large batch size, usually equal to the request context length, so the corresponding AllReduce size is `len_context*dim_hidden*sizeof(fp16)`. For a context length of 2048 with a hidden dimension of 12288 (GPT-3 size), the AllReduce size is 48MB. Token sampling uses a smaller batch size, corresponding to the number of concurrent user requests in the system, so the AllReduce size is `batch_size*dim_hidden*sizeof(fp16)`. For a concurrency of 16 users, the AllReduce size is 384KB. As the figures below demonstrate, MSCCL++ provides a significant speedup over NCCL, which is crucial for serving LLMs efficiently at large scale.
| <center>Single-node AllReduce | <center>Two-node AllReduce |
|-------------------------------|----------------------------|
| <img src="./docs/figs/mscclpp_vs_nccl_comparison_num_nodes_1.jpeg" alt="MSCCL++ vs NCCL AllReduce (Single-node)" style="width: 400px;"/> | <img src="./docs/figs/mscclpp_vs_nccl_comparison_num_nodes_2.jpeg" alt="MSCCL++ vs NCCL AllReduce (Two-node)" style="width: 400px;"/> |
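As a sanity check on the AllReduce sizes quoted above, they can be computed directly; a minimal sketch (the helper name is illustrative, not part of MSCCL++):

```python
def allreduce_bytes(batch_or_context: int, dim_hidden: int, bytes_per_elem: int = 2) -> int:
    # AllReduce message size for tensor-parallel LLM inference (fp16 = 2 bytes/elem).
    return batch_or_context * dim_hidden * bytes_per_elem

# Prompt processing: context length 2048, hidden dimension 12288 (GPT-3 size) -> 48 MB
prompt_bytes = allreduce_bytes(2048, 12288)
# Token sampling: 16 concurrent user requests -> 384 KB
token_bytes = allreduce_bytes(16, 12288)
print(prompt_bytes // (1024 * 1024), "MB")  # 48 MB
print(token_bytes // 1024, "KB")  # 384 KB
```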
## Key Concepts
The following highlights key concepts of MSCCL++.
### On-GPU Communication Interfaces: Channels
MSCCL++ provides peer-to-peer communication methods between GPUs. A peer-to-peer connection between two GPUs is called a *Channel*. Channels are constructed by MSCCL++ host-side interfaces and copied to GPUs during initialization. Channels provide *GPU-side interfaces*, which means that all communication methods are defined as device functions to be called from GPU kernel code. For example, the `put()` method in the following example copies 1KB of data from the local GPU to a remote GPU.
```cpp
// `ProxyChannel` will be explained in the following section.
__device__ mscclpp::DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__global__ void gpuKernel() {
...
@@ -53,11 +75,17 @@ __device__ void barrier() {
}
```
-MSCCL++ provides consistent in-kernel interfaces, i.e., the above interfaces are used regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink or InfiniBand).
+MSCCL++ provides consistent interfaces, i.e., the above interfaces are used regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink or InfiniBand).
### ProxyChannel and SmChannel
MSCCL++ delivers two types of channels, **ProxyChannel** and **SmChannel**. `ProxyChannel` provides (R)DMA-based data copy and synchronization methods. When called, these methods send/receive a signal to/from a host-side proxy (hence the name `ProxyChannel`), which triggers (R)DMA (such as `cudaMemcpy*` or `ibv_post_send`) or issues synchronization methods (such as `cudaStreamSynchronize` or `ibv_poll_cq`). Since the key functionality is run by the proxy, `ProxyChannel` requires only a single GPU thread to call its methods. See all `ProxyChannel` methods [here](./include/mscclpp/proxy_channel_device.hpp).
On the other hand, `SmChannel` provides memory-mapping-based copy and synchronization methods. When called, these methods directly use GPU threads to read from and write to the remote GPU's memory space. Compared with `ProxyChannel`, `SmChannel` is especially performant in low-latency scenarios, though it may need many GPU threads calling its copy methods simultaneously to achieve high copy bandwidth. See all `SmChannel` methods [here](./include/mscclpp/sm_channel_device.hpp).
### Host-Side Communication Proxy
Some in-kernel communication interfaces of MSCCL++ send requests (called triggers) to a GPU-external helper that conducts key functionalities such as DMA or RDMA. This helper is called a proxy service or a proxy in short. MSCCL++ provides a default implementation of a proxy, which is a background host thread that busy polls triggers from GPUs and conducts functionalities accordingly. For example, the following is a typical host-side code for MSCCL++.
MSCCL++ provides a default implementation of a host-side proxy for ProxyChannels, which is a background host thread that busy-polls triggers from GPUs and executes the corresponding functionality. For example, the following is typical host-side code for MSCCL++.
```cpp
// Bootstrap: initialize control-plane connections between all ranks
@@ -120,19 +148,9 @@ public:
Customized proxies can be used to conduct a series of pre-defined data transfers with only a single trigger from the GPU at runtime. This is more efficient than sending a separate trigger for each data transfer.
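To illustrate the busy-polling proxy pattern described above, here is a toy Python sketch. The trigger format and all names are illustrative stand-ins, not the mscclpp API; a real proxy polls a GPU-visible FIFO and issues DMA/RDMA operations.

```python
import queue
import threading

# Hypothetical trigger queue standing in for the GPU-visible FIFO that a
# real proxy busy-polls. Names here are illustrative, not the mscclpp API.
triggers = queue.Queue()
completed = []

def proxy_loop():
    # Busy-poll for triggers and execute the requested transfers.
    while True:
        try:
            trigger = triggers.get_nowait()
        except queue.Empty:
            continue  # nothing pending; keep polling
        if trigger is None:  # shutdown sentinel
            break
        # A customized proxy can expand one trigger into a series of
        # pre-defined transfers, instead of one trigger per transfer.
        completed.extend(trigger["transfers"])

proxy = threading.Thread(target=proxy_loop)
proxy.start()
triggers.put({"transfers": ["write_chunk_0", "write_chunk_1", "flush"]})
triggers.put(None)
proxy.join()
print(completed)  # ['write_chunk_0', 'write_chunk_1', 'flush']
```

A single trigger above carries three transfers, mirroring how a customized proxy amortizes trigger overhead.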
### Flexible Customization
### Python Interfaces
Most of the key components of MSCCL++ are designed to be easily customized. This enables MSCCL++ to readily adopt new software/hardware technologies and lets users implement algorithms optimized for their own use cases.
### New in MSCCL++ v0.3 (Latest Release)
* Updated interfaces
* Add Python bindings and interfaces
* Add Python unit tests
* Add more configurable parameters
* Add a new single-node AllReduce kernel
* Fix bugs
See details from https://github.com/microsoft/mscclpp/issues/89.
MSCCL++ provides Python bindings and interfaces, which simplify integration with Python applications.
## Contributing


@@ -26,11 +26,11 @@ find_program(BLACK black)
if (BLACK)
message(STATUS "Found black: ${BLACK}")
add_custom_target(check-format-py
-COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml --check ${PROJECT_SOURCE_DIR}/python ${PROJECT_SOURCE_DIR}/test
+COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml --check ${PROJECT_SOURCE_DIR}
)
add_dependencies(check-format check-format-py)
add_custom_target(format-py
-COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml ${PROJECT_SOURCE_DIR}/python ${PROJECT_SOURCE_DIR}/test
+COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml ${PROJECT_SOURCE_DIR}
)
add_dependencies(format format-py)
else()

docs/.gitignore vendored

@@ -1 +1,3 @@
doxygen/
_build/
sphinx/


@@ -2043,7 +2043,7 @@ MAN_LINKS = NO
# captures the structure of the code including all documentation.
# The default value is: NO.
-GENERATE_XML = NO
+GENERATE_XML = YES
# The XML_OUTPUT tag is used to specify where the XML pages will be put. If a
# relative path is entered the value of OUTPUT_DIRECTORY will be put in front of

docs/Makefile Normal file

@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/README.md Normal file

@@ -0,0 +1,27 @@
## How to build docs
1. Install `doxygen`.
```bash
$ sudo apt-get install doxygen
```
2. Install the Python packages below. If you install them into the user's local prefix, you need to add `~/.local/bin` to `$PATH` (to use `sphinx-build`).
```bash
$ sudo python3 -m pip install sphinx sphinx_rtd_theme breathe
```
3. Create Doxygen documents.
```bash
$ doxygen
```
4. Create Sphinx documents.
```bash
$ sphinx-build -b html -Dbreathe_projects.mscclpp=$PWD/doxygen/xml $PWD $PWD/sphinx
```
5. Done. The HTML files will be in the `sphinx/` directory.

docs/conf.py Normal file

@@ -0,0 +1,29 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "mscclpp"
copyright = "2023, MSCCL++ Team"
author = "MSCCL++ Team"
release = "v0.4.0"
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = ["breathe"]
templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# Breathe configuration
breathe_default_project = "mscclpp"
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]

docs/figs/abstractions.png Normal file (binary, 62 KiB)
Two additional binary image files added (65 KiB and 66 KiB).

docs/index.rst Normal file

@@ -0,0 +1,26 @@
.. MSCCL++ documentation master file, created by
sphinx-quickstart on Tue Sep 5 13:03:46 2023.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to MSCCL++'s documentation!
===================================
.. toctree::
:maxdepth: 2
:caption: Contents:
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Docs
====
.. doxygennamespace:: mscclpp
:members:

docs/make.bat Normal file

@@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd


@@ -1,50 +1,3 @@
# NDmv4 Performance
All results are from NDmv4. NCCL version 2.17.1+cuda11.8; in-place numbers reported.
nccl-tests command example:
```bash
mpirun --bind-to numa -hostfile /mnt/hostfile --tag-output --allow-run-as-root -map-by ppr:8:node --bind-to numa -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 -x PATH -x LD_PRELOAD=/mnt/nccl/build/lib/libnccl.so -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_SOCKET_IFNAME=eth0 -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_NET_GDR_LEVEL=5 -x NCCL_TOPO_FILE=/mnt/ndv4-topo.xml -x NCCL_DEBUG=WARN ./build/all_gather_perf -b 1K -e 1K -g 1 -c 1 -w 10 -n 10 -G 1
```
mscclpp-tests command example:
```bash
mpirun -allow-run-as-root -map-by ppr:8:node -hostfile /mnt/hostfile ./build/test/mscclpp-test/allgather_test_perf -b 1K -e 1K -w 10 -n 10 -G 10 -k 0
```
**NOTE:** NCCL AllGather leverages a Ring algorithm instead of an all-pairs-like algorithm, which greatly reduces inter-node transmission and yields significantly higher performance. MSCCL++ should do something similar in the future.
### 1 node, 8 gpus/node
**Latency (us)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1K | 12.53 | **16.96** | 9.34 | **7.76** / 21.06 / 28.50 | 157.91 / 143.21 / 447.0 | 326.4 |
**BusBW (GB/s)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:----------------------------:|:-----------------:|
| 1G | 253.59 | **231.45** | 254.69 | 217.05 / 216.98 / 217.15 | 125.06 / **255.64** / 124.89 | 22.55 |
### 2 nodes, 1 gpu/node
**Latency (us)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:--------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1K | 16.08 | **21.27** | 29.84 | 14.67 / 29.12 / 35.43 | 15.32 / **13.84** / 26.08 | - |
**BusBW (GB/s)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1G | 15.84 | **18.65** | 15.48 | 13.94 / 13.83 / 14.10 | **23.30** / 23.29 / 21.60 | - |
### 2 nodes, 8 gpus/node
**Latency (us)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1K | 33.74 | **35.85** | 49.75 | **22.55** / 39.33 / 56.93 | 159.14 / 230.52 / 462.7 | - |
**BusBW (GB/s)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1G | 177.05 | **183.82** | 37.80 | 40.17 / 40.18 / 40.23 | 44.19 / 9.31 / **209.33** | - |
| 4G | 186.01 | **188.18** | 37.81 | - / - / - | 44.60 / - / **234.08** | - |
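For reading the BusBW columns above, bus bandwidth is conventionally derived from algorithm bandwidth; a small sketch assuming the nccl-tests convention applies here (function names are illustrative):

```python
def allgather_busbw(algbw_gbps: float, n_ranks: int) -> float:
    # nccl-tests convention: AllGather busbw = algbw * (n - 1) / n,
    # since each rank moves (n-1)/n of the total data over the bus.
    return algbw_gbps * (n_ranks - 1) / n_ranks

def allreduce_busbw(algbw_gbps: float, n_ranks: int) -> float:
    # nccl-tests convention: AllReduce busbw = 2 * algbw * (n - 1) / n,
    # reflecting the reduce-scatter + all-gather data movement.
    return 2 * algbw_gbps * (n_ranks - 1) / n_ranks

# Example with 8 GPUs per node:
print(allgather_busbw(8.0, 8))   # 7.0
print(allreduce_busbw(4.0, 8))   # 7.0
```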
TBU


@@ -8,8 +8,9 @@
* ND_H100_v5
* [NC_A100_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series) (TBD)
* Non-Azure Systems
-* NVIDIA A100 GPUs + CUDA >= 11.1.1
-* NVIDIA H100 GPUs + CUDA >= 12.0.0
+* NVIDIA A100 GPUs + CUDA >= 11.8
+* NVIDIA H100 GPUs + CUDA >= 12.0
+* AMD support is underway.
* OS: tested on Ubuntu 18.04 and 20.04
* Libraries: [libnuma](https://github.com/numactl/numactl), MPI (optional)
* Others
@@ -54,6 +55,8 @@ Our base image installs all prerequisites for MSCCL++.
$ docker pull ghcr.io/microsoft/mscclpp/mscclpp:base-cuda12.1
```
See all available images [here](https://github.com/microsoft/mscclpp/pkgs/container/mscclpp%2Fmscclpp).
## Unit Tests
`unit_tests` requires one GPU on the system. It tests only the operation of basic components.
@@ -76,37 +79,53 @@ To run `mp_unit_tests` with more than two nodes, you need to specify the `-ip_po
$ mpirun -np 16 -npernode 8 -hostfile hostfile ./test/mp_unit_tests -ip_port 10.0.0.5:50000
```
-## mscclpp-test
+## Performance Benchmark
mscclpp-test is a set of performance benchmarks for MSCCL++. It requires MPI to be installed on the system, and the path should be provided via `MPI_HOME` environment variable to the CMake build system.
### Python Benchmark
[Install the MSCCL++ Python package](https://github.com/microsoft/mscclpp/blob/chhwang/docs/docs/quickstart.md#install-from-source-python-module) and run our Python AllReduce benchmark as follows. It requires MPI on the system.
```bash
# Choose either `requirements_cu11.txt` or `requirements_cu12.txt` according to your CUDA version.
$ python3 -m pip install -r ./python/requirements_cu12.txt
$ mpirun -tag-output -np 8 python3 ./python/benchmark/allreduce_bench.py
```
### C++ Benchmark (mscclpp-test)
*NOTE: mscclpp-test will be retired soon and will be maintained only as an example C++ implementation. For the latest performance numbers, please use the Python benchmark instead.*
mscclpp-test is a set of C++ performance benchmarks. It requires MPI on the system, and the MPI path should be provided to the CMake build system via the `MPI_HOME` environment variable.
```bash
$ MPI_HOME=/path/to/mpi cmake -DCMAKE_BUILD_TYPE=Release ..
-$ make -j sendrecv_test_perf allgather_test_perf allreduce_test_perf alltoall_test_perf
+$ make -j allgather_test_perf allreduce_test_perf
```
-For example, the following command runs the AllReduce benchmark with 8 GPUs, with message sizes from 3MB to 48MB, doubling the size each step.
+For example, the following command runs the `allreduce5` algorithm with 8 GPUs, with message sizes from 3MB to 48MB, doubling the size each step. You can try different algorithms by changing the `-k 5` option to another value (e.g., `-k 3` runs `allreduce3`). See all algorithms in the code: [allreduce_test.cu](https://github.com/microsoft/mscclpp/blob/main/test/mscclpp-test/allreduce_test.cu) and [allgather_test.cu](https://github.com/microsoft/mscclpp/blob/main/test/mscclpp-test/allgather_test.cu).
```bash
-$ mpirun -np 8 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 4
+$ mpirun --bind-to numa -np 8 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5
```
*NOTE: a few algorithms place a condition on the total data size, such as requiring it to be a multiple of 3. If the condition is unmet, the command will fail with a corresponding error.*
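The `-b 3m -e 48m -f 2` flags sweep message sizes geometrically; a quick sketch of the sizes this produces (the helper name is illustrative, not part of mscclpp-test):

```python
def size_sweep(min_bytes: int, max_bytes: int, step_factor: int) -> list:
    # Mirror the -b/-e/-f semantics: start at min_bytes and multiply by
    # step_factor until max_bytes is exceeded.
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= step_factor
    return sizes

MB = 1024 * 1024
print([s // MB for s in size_sweep(3 * MB, 48 * MB, 2)])  # [3, 6, 12, 24, 48]
# Every size in this sweep is a multiple of 3, satisfying e.g. a
# "total size must be a multiple of 3" condition.
```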
Check the help message for more details.
```bash
$ ./test/mscclpp-test/allreduce_test_perf --help
USAGE: allreduce_test_perf
[-b,--minbytes <min size in bytes>]
[-e,--maxbytes <max size in bytes>]
[-i,--stepbytes <increment size>]
[-f,--stepfactor <increment factor>]
[-n,--iters <iteration count>]
[-w,--warmup_iters <warmup iteration count>]
[-c,--check <0/1>]
[-T,--timeout <time in seconds>]
[-G,--cudagraph <num graph launches>]
[-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
[-k,--kernel_num <kernel number of communication primitive>]
[-o, --output_file <output file name>]
USAGE: allreduce_test_perf
[-b,--minbytes <min size in bytes>]
[-e,--maxbytes <max size in bytes>]
[-i,--stepbytes <increment size>]
[-f,--stepfactor <increment factor>]
[-n,--iters <iteration count>]
[-w,--warmup_iters <warmup iteration count>]
[-c,--check <0/1>]
[-T,--timeout <time in seconds>]
[-G,--cudagraph <num graph launches>]
[-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
[-k,--kernel_num <kernel number of communication primitive>]
[-o, --output_file <output file name>]
[-h,--help]
```


@@ -5,7 +5,7 @@
#define MSCCLPP_CORE_HPP_
#define MSCCLPP_MAJOR 0
-#define MSCCLPP_MINOR 3
+#define MSCCLPP_MINOR 4
#define MSCCLPP_PATCH 0
#define MSCCLPP_VERSION (MSCCLPP_MAJOR * 10000 + MSCCLPP_MINOR * 100 + MSCCLPP_PATCH)
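The macro packs the version into a single integer; the same encoding expressed in Python for illustration (the function name is ours, not part of the library):

```python
def mscclpp_version(major: int, minor: int, patch: int) -> int:
    # Mirrors: MSCCLPP_VERSION = MAJOR * 10000 + MINOR * 100 + PATCH
    return major * 10000 + minor * 100 + patch

print(mscclpp_version(0, 4, 0))  # 400 (v0.4.0)
```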


@@ -7,7 +7,7 @@ build-backend = "scikit_build_core.build"
[project]
name = "mscclpp"
-version = "0.3.0"
+version = "0.4.0"
[tool.scikit-build]
cmake.minimum-version = "3.25.0"


@@ -1,109 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
import argparse
import logging
import multiprocessing as mp
import sys
import mscclpp
import torch
IB_TRANSPORTS = [
mscclpp.Transport.IB0,
mscclpp.Transport.IB1,
mscclpp.Transport.IB2,
mscclpp.Transport.IB3,
mscclpp.Transport.IB4,
mscclpp.Transport.IB5,
mscclpp.Transport.IB6,
mscclpp.Transport.IB7,
]
# Use to hold the sm channels so they don't get garbage collected
sm_channels = []
def setup_connections(comm, rank, world_size, element_size, proxy_service):
simple_proxy_channels = []
sm_semaphores = []
connections = []
remote_memories = []
memory = torch.zeros(element_size, dtype=torch.int32)
memory = memory.to("cuda")
transport_flag = mscclpp.TransportFlags(IB_TRANSPORTS[rank]) | mscclpp.Transport.CudaIpc
ptr = memory.data_ptr()
size = memory.numel() * memory.element_size()
reg_mem = comm.register_memory(ptr, size, transport_flag)
for r in range(world_size):
if r == rank:
continue
conn = comm.connect_on_setup(r, 0, mscclpp.Transport.CudaIpc)
connections.append(conn)
comm.send_memory_on_setup(reg_mem, r, 0)
remote_mem = comm.recv_memory_on_setup(r, 0)
remote_memories.append(remote_mem)
comm.setup()
connections = [conn.get() for conn in connections]
# Create simple proxy channels
for i, conn in enumerate(connections):
proxy_channel = mscclpp.SimpleProxyChannel(
proxy_service.proxy_channel(proxy_service.build_and_add_semaphore(comm, conn)),
proxy_service.add_memory(remote_memories[i].get()),
proxy_service.add_memory(reg_mem),
)
simple_proxy_channels.append(proxy_channel.device_handle())
comm.setup()
# Create sm channels
for i, conn in enumerate(connections):
sm_chan = mscclpp.SmDevice2DeviceSemaphore(comm, conn)
sm_semaphores.append(sm_chan)
comm.setup()
for i, conn in enumerate(sm_semaphores):
sm_chan = mscclpp.SmChannel(sm_semaphores[i], remote_memories[i].get(), ptr)
sm_channels.append(sm_chan)
return simple_proxy_channels, [sm_chan.device_handle() for sm_chan in sm_channels]
def run(rank, args):
world_size = args.gpu_number
torch.cuda.set_device(rank)
boot = mscclpp.TcpBootstrap.create(rank, world_size)
boot.initialize(args.if_ip_port_trio)
comm = mscclpp.Communicator(boot)
proxy_service = mscclpp.ProxyService()
logging.info("Rank: %d, setting up connections", rank)
setup_connections(comm, rank, world_size, args.num_elements, proxy_service)
logging.info("Rank: %d, starting proxy service", rank)
proxy_service.start_proxy()
def main():
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
parser = argparse.ArgumentParser()
parser.add_argument("if_ip_port_trio", type=str)
parser.add_argument("-n", "--num-elements", type=int, default=10)
parser.add_argument("-g", "--gpu_number", type=int, default=2)
args = parser.parse_args()
processes = []
for rank in range(args.gpu_number):
p = mp.Process(target=run, args=(rank, args))
p.start()
processes.append(p)
for p in processes:
p.join()
if __name__ == "__main__":
main()


@@ -1,82 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
import argparse
import time
import mscclpp
def main(args):
if args.root:
rank = 0
else:
rank = 1
boot = mscclpp.TcpBootstrap.create(rank, 2)
boot.initialize(args.if_ip_port_trio)
comm = mscclpp.Communicator(boot)
if args.gpu:
import torch
print("Allocating GPU memory")
memory = torch.zeros(args.num_elements, dtype=torch.int32)
memory = memory.to("cuda")
ptr = memory.data_ptr()
size = memory.numel() * memory.element_size()
else:
from array import array
print("Allocating host memory")
memory = array("i", [0] * args.num_elements)
ptr, elements = memory.buffer_info()
size = elements * memory.itemsize
my_reg_mem = comm.register_memory(ptr, size, mscclpp.Transport.IB0)
conn = comm.connect_on_setup((rank + 1) % 2, 0, mscclpp.Transport.IB0)
other_reg_mem = None
if rank == 0:
other_reg_mem = comm.recv_memory_on_setup((rank + 1) % 2, 0)
else:
comm.send_memory_on_setup(my_reg_mem, (rank + 1) % 2, 0)
comm.setup()
if rank == 0:
other_reg_mem = other_reg_mem.get()
if rank == 0:
for i in range(args.num_elements):
memory[i] = i + 1
conn.write(other_reg_mem, 0, my_reg_mem, 0, size)
print("Done sending")
else:
print("Checking for correctness")
# polling
for _ in range(args.polling_num):
all_correct = True
for i in range(args.num_elements):
if memory[i] != i + 1:
all_correct = False
print(f"Error: Mismatch at index {i}: expected {i + 1}, got {memory[i]}")
break
if all_correct:
print("All data matched expected values")
break
else:
time.sleep(0.1)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("if_ip_port_trio", type=str)
parser.add_argument("-r", "--root", action="store_true")
parser.add_argument("-n", "--num-elements", type=int, default=10)
parser.add_argument("--gpu", action="store_true")
parser.add_argument("--polling_num", type=int, default=100)
args = parser.parse_args()
main(args)


@@ -1,17 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
import time
import mscclpp
def main():
timer = mscclpp.Timer()
timer.reset()
time.sleep(2)
assert timer.elapsed() >= 2000000
if __name__ == "__main__":
main()


@@ -56,11 +56,12 @@ def plot_graph(sizes, mscclpp_algbw, nccl_algbw, speed_ups):
ax1.legend(lines, labels, loc="upper left")
# Setting title and grid
-ax1.set_title("MSCCLPP vs NCCL -- " + str(MPI.COMM_WORLD.size // N_GPUS_PER_NODE) + " Nodes")
+num_nodes = MPI.COMM_WORLD.size // N_GPUS_PER_NODE
+ax1.set_title(f"MSCCLPP vs NCCL -- {num_nodes} Nodes")
ax2.grid(True, which="both", ls="--")
# Saving the plot
-plt.savefig("mscclpp_vs_nccl_comparison.pdf", format="pdf")
+plt.savefig(f"mscclpp_vs_nccl_comparison_num_nodes_{num_nodes}.jpeg", format="jpeg")
def human_readable_size(size, decimal_places=1):
@@ -246,7 +247,7 @@ if __name__ == "__main__":
mscclpp_algbw = []
nccl_algbw = []
speed_ups = []
-for i in range(10, 30):
+for i in range(10, 29):
if MPI.COMM_WORLD.size // N_GPUS_PER_NODE == 1:
nelems = 2**i
elif MPI.COMM_WORLD.size // N_GPUS_PER_NODE == 2: