Update documentation (#217)

Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Changho Hwang, committed via GitHub, 2023-11-23 04:58:04 +08:00
parent 7bd66a938c
commit 15f6dcca49
22 changed files with 281 additions and 312 deletions

CITATION.cff Normal file

@@ -0,0 +1,47 @@
cff-version: 1.2.0
title: "MSCCL++: A GPU-driven communication stack for scalable AI applications"
version: 0.4.0
message: >-
If you use this project in your research, please cite it as below.
authors:
- given-names: Peng
family-names: Cheng
affiliation: Microsoft Research
- given-names: Changho
family-names: Hwang
affiliation: Microsoft Research
- given-names: Abhinav
family-names: Jangda
affiliation: Microsoft Research
- given-names: Suriya
family-names: Kalivardhan
affiliation: Microsoft Azure
- given-names: Binyang
family-names: Li
affiliation: Microsoft Azure
- given-names: Shuguang
family-names: Liu
affiliation: Microsoft Azure
- given-names: Saeed
family-names: Maleki
affiliation: Microsoft Research
- given-names: Madan
family-names: Musuvathi
affiliation: Microsoft Research
- given-names: Olli
family-names: Saarikivi
affiliation: Microsoft Research
- given-names: Wei
family-names: Tsui
affiliation: Microsoft Research
- given-names: Ziyue
family-names: Yang
affiliation: Microsoft Research
repository-code: 'https://github.com/microsoft/mscclpp'
abstract: >-
MSCCL++ redefines the interface for inter-GPU communication, thereby
delivering a highly efficient and customizable communication stack
tailored for distributed GPU applications.
license: MIT
license-url: https://github.com/microsoft/mscclpp/blob/main/LICENSE


@@ -2,7 +2,7 @@
# Licensed under the MIT license.
set(MSCCLPP_MAJOR "0")
-set(MSCCLPP_MINOR "3")
+set(MSCCLPP_MINOR "4")
set(MSCCLPP_PATCH "0")
set(MSCCLPP_SOVERSION ${MSCCLPP_MAJOR})


@@ -1,32 +1,54 @@
# MSCCL++
GPU-driven computation & communication stack.
[![Latest Release](https://img.shields.io/github/release/microsoft/mscclpp.svg)](https://github.com/microsoft/mscclpp/releases/latest)
[![License](https://img.shields.io/github/license/microsoft/mscclpp.svg)](LICENSE)
[![CodeQL](https://github.com/microsoft/mscclpp/actions/workflows/codeql-analysis.yml/badge.svg?branch=main)](https://github.com/microsoft/mscclpp/actions/workflows/codeql-analysis.yml)
| Pipelines | Build Status |
|--------------------------|-------------------|
| Unit Tests (CUDA) | [![Build Status](https://dev.azure.com/binyli/HPC/_apis/build/status%2Fmscclpp-ut?branchName=main)](https://dev.azure.com/binyli/HPC/_build/latest?definitionId=4&branchName=main) |
| Integration Tests (CUDA) | [![Build Status](https://dev.azure.com/binyli/HPC/_apis/build/status%2Fmscclpp-test?branchName=main)](https://dev.azure.com/binyli/HPC/_build/latest?definitionId=3&branchName=main) |
A GPU-driven communication stack for scalable AI applications.
See [Quick Start](docs/quickstart.md) to get started.
See the latest performance evaluation on Azure [NDmv4](docs/performance-ndmv4.md).
Build our Doxygen documentation by running `doxygen` in the [`docs/`](docs/) directory. Run `python3 -m http.server <PORT>` in the `docs/doxygen/html/` directory to serve the generated HTML files.
## Overview
MSCCL++ is a development kit for implementing highly optimized distributed GPU applications, in terms of both inter-GPU communication and GPU computation. MSCCL++ is specially designed for developers who want to fine-tune inter-GPU communication of their applications at the GPU kernel level, without needing detailed knowledge of the underlying communication mechanisms. The key underlying concept of MSCCL++ is GPU-driven execution, where both communication and computation tasks are initiated by the GPU, not by the CPU. That is, the communication and computation interfaces of MSCCL++ are provided as device-side APIs (called inside a GPU kernel), while the host-side APIs of MSCCL++ are for bootstrapping, initial connection setup, or background host threads for inter-GPU DMA and RDMA (called proxies). By using MSCCL++, we expect:
MSCCL++ redefines inter-GPU communication interfaces, thereby delivering a highly efficient and customizable communication stack for distributed GPU applications. Its design is specifically tailored to accommodate diverse performance optimization scenarios often encountered in state-of-the-art AI applications. The figure below provides a high-level overview of MSCCL++ abstractions in CUDA, C, and Python.
* **Holistic Optimization for High GPU Utilization.** As both communication and computation are scheduled inside a GPU kernel at the same time, we can optimize end-to-end performance of distributed GPU applications from a global view. For example, we can minimize the GPU resource contention between communication and computation, which is known to often substantially degrade throughput of distributed deep learning applications.
| <center>MSCCL++ Abstractions Overview |
|-------------------------------|
| <img src="./docs/figs/abstractions.png" alt="MSCCL++ Abstractions" style="width: 800px;"/> |
* **Fully Pipelined System to Reduce Overhead from the Control Plane.** We can eliminate control overhead from the CPU by allowing the GPU to autonomously schedule both communication and computation. This significantly reduces GPU scheduling overhead and CPU-GPU synchronization overhead. For example, this allows us to implement highly fine-grained system pipelining (i.e., hiding communication delays by overlapping them with computation), which has been difficult for CPU-controlled applications due to their large control/scheduling overhead.
The following highlights the key features of MSCCL++.
* **Runtime Performance Optimization for Dynamic Workloads.** As we can easily implement flexible communication logic, we can optimize communication performance even at runtime. For example, we can implement the system to automatically choose different communication paths or different collective communication algorithms depending on the dynamic workload at runtime.
* **Lightweight and multi-layer abstractions.** MSCCL++ provides communication abstractions at the lowest level, close to the hardware, and at the highest level, close to the application API. The lowest level of abstraction is ultra-lightweight, enabling a user to implement the data-movement logic of a collective operation such as AllReduce inside a GPU kernel extremely efficiently, without worrying about the memory ordering of different operations. The modularity of MSCCL++ enables a user to construct MSCCL++ building blocks through a high-level abstraction in Python and feed them to a CUDA kernel, improving the user's productivity.
## Key Features (v0.3)
* **1-sided 0-copy synchronous and asynchronous abstractions.** MSCCL++ provides fine-grained synchronous and asynchronous 0-copy 1-sided abstractions for communication primitives such as `put()`, `get()`, `signal()`, `flush()`, and `wait()`. The 1-sided abstractions allow a user to asynchronously `put()` their data on the remote GPU as soon as it is ready, without requiring the remote side to issue any receive instruction. This enables users to easily implement flexible communication logic, such as overlapping communication with computation or implementing customized collective communication algorithms, without worrying about potential deadlocks. Additionally, the 0-copy capability enables MSCCL++ to transfer data directly between users' buffers without intermediate internal buffers, which saves GPU bandwidth and memory capacity.
MSCCL++ v0.3 supports the following features.
* **Unified abstractions for different interconnect hardware.** MSCCL++ provides consistent abstractions regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink/xGMI or InfiniBand). This simplifies the code for inter-GPU communication, which is often complex due to the memory ordering of GPU/CPU reads and writes and is therefore error-prone.
### In-Kernel Communication Interfaces
## Performance
MSCCL++ provides inter-GPU communication interfaces to be called by a GPU thread. For example, the `put()` method in the following example copies 1KB of data from the local GPU to a remote GPU. `channel` is a peer-to-peer communication channel between two GPUs, which consists of information on send/receive buffers. `channel` is initialized from the host side before the kernel execution.
While the power of MSCCL++ is fully realized with application-specific optimization, it still delivers performance benefits even for collective communication operations. The following figures provide a comparison of the AllReduce throughput of MSCCL++ against NCCL 2.19.3. This benchmark was tested over two [Azure NDmv4 SKUs](https://learn.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series) (8 A100-80G GPUs per node).
The key motivation behind these results is scaling LLM inference using tensor parallelism. LLM requests are usually executed in two phases: prompt processing and token sampling. Prompt processing uses a large batch size, usually equal to the request context length, so the corresponding AllReduce size is `len_context*dim_hidden*sizeof(fp16)`. For a context length of 2048 with a hidden dimension of 12288 (GPT-3 size), the AllReduce size is 48MB. Token sampling uses a smaller batch size, corresponding to the number of concurrent user requests in the system, so the AllReduce size is `batch_size*dim_hidden*sizeof(fp16)`. For a concurrency of 16 users, the AllReduce size is 384KB. As the figures below demonstrate, MSCCL++ provides a significant speedup over NCCL, which is crucial for serving LLMs efficiently at large scale.
| <center>Single-node AllReduce | <center>Two-node AllReduce |
|-------------------------------|----------------------------|
| <img src="./docs/figs/mscclpp_vs_nccl_comparison_num_nodes_1.jpeg" alt="MSCCL++ vs NCCL AllReduce (Single-node)" style="width: 400px;"/> | <img src="./docs/figs/mscclpp_vs_nccl_comparison_num_nodes_2.jpeg" alt="MSCCL++ vs NCCL AllReduce (Two-node)" style="width: 400px;"/> |
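As a sanity check on the AllReduce sizes quoted above, they can be computed directly; a minimal sketch (the helper name is illustrative, not part of MSCCL++):

```python
def allreduce_bytes(batch_or_context: int, dim_hidden: int, bytes_per_elem: int = 2) -> int:
    # AllReduce message size for tensor-parallel LLM inference (fp16 = 2 bytes/elem).
    return batch_or_context * dim_hidden * bytes_per_elem

# Prompt processing: context length 2048, hidden dimension 12288 (GPT-3 size) -> 48 MB
prompt_bytes = allreduce_bytes(2048, 12288)
# Token sampling: 16 concurrent user requests -> 384 KB
token_bytes = allreduce_bytes(16, 12288)
print(prompt_bytes // (1024 * 1024), "MB")  # 48 MB
print(token_bytes // 1024, "KB")  # 384 KB
```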
## Key Concepts
The following highlights key concepts of MSCCL++.
### On-GPU Communication Interfaces: Channels
MSCCL++ provides peer-to-peer communication methods between GPUs. A peer-to-peer connection between two GPUs is called a *Channel*. Channels are constructed by MSCCL++ host-side interfaces and copied to GPUs during initialization. Channels provide *GPU-side interfaces*, which means that all communication methods are defined as device functions to be called from GPU kernel code. For example, the `put()` method in the following example copies 1KB of data from the local GPU to a remote GPU.
```cpp
// `ProxyChannel` will be explained in the following section.
__device__ mscclpp::DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__global__ void gpuKernel() {
...
@@ -53,11 +75,17 @@ __device__ void barrier() {
}
```
-MSCCL++ provides consistent in-kernel interfaces, i.e., the above interfaces are used regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink or InfiniBand).
+MSCCL++ provides consistent interfaces, i.e., the above interfaces are used regardless of the location of the remote GPU (either on the local node or on a remote node) or the underlying link (either NVLink or InfiniBand).
### ProxyChannel and SmChannel
MSCCL++ delivers two types of channels, **ProxyChannel** and **SmChannel**. `ProxyChannel` provides (R)DMA-based data copy and synchronization methods. When called, these methods send/receive a signal to/from a host-side proxy (hence the name `ProxyChannel`), which triggers (R)DMA (such as `cudaMemcpy*` or `ibv_post_send`) or issues synchronization methods (such as `cudaStreamSynchronize` or `ibv_poll_cq`). Since the key functionality is run by the proxy, `ProxyChannel` requires only a single GPU thread to call its methods. See all `ProxyChannel` methods [here](./include/mscclpp/proxy_channel_device.hpp).
On the other hand, `SmChannel` provides memory-mapping-based copy and synchronization methods. When called, these methods directly use GPU threads to read from and write to the remote GPU's memory space. Compared with `ProxyChannel`, `SmChannel` is especially performant in low-latency scenarios, though it may need many GPU threads calling its copy methods simultaneously to achieve high copy bandwidth. See all `SmChannel` methods [here](./include/mscclpp/sm_channel_device.hpp).
### Host-Side Communication Proxy
Some in-kernel communication interfaces of MSCCL++ send requests (called triggers) to a GPU-external helper that conducts key functionalities such as DMA or RDMA. This helper is called a proxy service or a proxy in short. MSCCL++ provides a default implementation of a proxy, which is a background host thread that busy polls triggers from GPUs and conducts functionalities accordingly. For example, the following is a typical host-side code for MSCCL++.
MSCCL++ provides a default implementation of a host-side proxy for ProxyChannels, which is a background host thread that busy-polls triggers from GPUs and executes the corresponding functionality. For example, the following is typical host-side code for MSCCL++.
```cpp
// Bootstrap: initialize control-plane connections between all ranks
@@ -120,19 +148,9 @@ public:
Customized proxies can be used to conduct a series of pre-defined data transfers with only a single trigger from the GPU at runtime. This is more efficient than sending a separate trigger for each data transfer.
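To illustrate the busy-polling proxy pattern described above, here is a toy Python sketch. The trigger format and all names are illustrative stand-ins, not the mscclpp API; a real proxy polls a GPU-visible FIFO and issues DMA/RDMA operations.

```python
import queue
import threading

# Hypothetical trigger queue standing in for the GPU-visible FIFO that a
# real proxy busy-polls. Names here are illustrative, not the mscclpp API.
triggers = queue.Queue()
completed = []

def proxy_loop():
    # Busy-poll for triggers and execute the requested transfers.
    while True:
        try:
            trigger = triggers.get_nowait()
        except queue.Empty:
            continue  # nothing pending; keep polling
        if trigger is None:  # shutdown sentinel
            break
        # A customized proxy can expand one trigger into a series of
        # pre-defined transfers, instead of one trigger per transfer.
        completed.extend(trigger["transfers"])

proxy = threading.Thread(target=proxy_loop)
proxy.start()
triggers.put({"transfers": ["write_chunk_0", "write_chunk_1", "flush"]})
triggers.put(None)
proxy.join()
print(completed)  # ['write_chunk_0', 'write_chunk_1', 'flush']
```

A single trigger above carries three transfers, mirroring how a customized proxy amortizes trigger overhead.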
### Flexible Customization
### Python Interfaces
Most of the key components of MSCCL++ are designed to be easily customized. This enables MSCCL++ to readily adopt new software/hardware technologies and lets users implement algorithms optimized for their own use cases.
### New in MSCCL++ v0.3 (Latest Release)
* Updated interfaces
* Add Python bindings and interfaces
* Add Python unit tests
* Add more configurable parameters
* Add a new single-node AllReduce kernel
* Fix bugs
See details from https://github.com/microsoft/mscclpp/issues/89.
MSCCL++ provides Python bindings and interfaces, which simplify integration with Python applications.
## Contributing


@@ -26,11 +26,11 @@ find_program(BLACK black)
if (BLACK)
message(STATUS "Found black: ${BLACK}")
add_custom_target(check-format-py
-COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml --check ${PROJECT_SOURCE_DIR}/python ${PROJECT_SOURCE_DIR}/test
+COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml --check ${PROJECT_SOURCE_DIR}
)
add_dependencies(check-format check-format-py)
add_custom_target(format-py
-COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml ${PROJECT_SOURCE_DIR}/python ${PROJECT_SOURCE_DIR}/test
+COMMAND ${BLACK} --config ${PROJECT_SOURCE_DIR}/pyproject.toml ${PROJECT_SOURCE_DIR}
)
add_dependencies(format format-py)
else()

docs/.gitignore vendored

@@ -1 +1,3 @@
doxygen/
_build/
sphinx/


@@ -2043,7 +2043,7 @@ MAN_LINKS = NO
# captures the structure of the code including all documentation.
# The default value is: NO.
-GENERATE_XML = NO
+GENERATE_XML = YES
# The XML_OUTPUT tag is used to specify where the XML pages will be put. If a
# relative path is entered the value of OUTPUT_DIRECTORY will be put in front of

docs/Makefile Normal file

@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/README.md Normal file

@@ -0,0 +1,27 @@
## How to build docs
1. Install `doxygen`.
```bash
$ sudo apt-get install doxygen
```
2. Install the Python packages below. If you install them into the user's local prefix, you need to add `~/.local/bin` to `$PATH` (to use `sphinx-build`).
```bash
$ sudo python3 -m pip install sphinx sphinx_rtd_theme breathe
```
3. Create Doxygen documents.
```bash
$ doxygen
```
4. Create Sphinx documents.
```bash
$ sphinx-build -b html -Dbreathe_projects.mscclpp=$PWD/doxygen/xml $PWD $PWD/sphinx
```
5. Done. The HTML files will be in the `sphinx/` directory.

docs/conf.py Normal file

@@ -0,0 +1,29 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "mscclpp"
copyright = "2023, MSCCL++ Team"
author = "MSCCL++ Team"
release = "v0.4.0"
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = ["breathe"]
templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# Breathe configuration
breathe_default_project = "mscclpp"
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]

docs/figs/abstractions.png Normal file (binary, 62 KiB)
Two additional binary image files added (65 KiB and 66 KiB).

docs/index.rst Normal file

@@ -0,0 +1,26 @@
.. MSCCL++ documentation master file, created by
sphinx-quickstart on Tue Sep 5 13:03:46 2023.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to MSCCL++'s documentation!
===================================
.. toctree::
:maxdepth: 2
:caption: Contents:
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Docs
====
.. doxygennamespace:: mscclpp
:members:

docs/make.bat Normal file

@@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd


@@ -1,50 +1,3 @@
# NDmv4 Performance
All results are from NDmv4. NCCL version 2.17.1+cuda11.8; in-place numbers reported.
nccl-tests command example:
```bash
mpirun --bind-to numa -hostfile /mnt/hostfile --tag-output --allow-run-as-root -map-by ppr:8:node --bind-to numa -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 -x PATH -x LD_PRELOAD=/mnt/nccl/build/lib/libnccl.so -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_SOCKET_IFNAME=eth0 -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_NET_GDR_LEVEL=5 -x NCCL_TOPO_FILE=/mnt/ndv4-topo.xml -x NCCL_DEBUG=WARN ./build/all_gather_perf -b 1K -e 1K -g 1 -c 1 -w 10 -n 10 -G 1
```
mscclpp-tests command example:
```bash
mpirun -allow-run-as-root -map-by ppr:8:node -hostfile /mnt/hostfile ./build/test/mscclpp-test/allgather_test_perf -b 1K -e 1K -w 10 -n 10 -G 10 -k 0
```
**NOTE:** NCCL AllGather leverages a Ring algorithm instead of an all-pairs-like algorithm, which greatly reduces inter-node transmission and yields significantly higher performance. MSCCL++ should do something similar in the future.
### 1 node, 8 gpus/node
**Latency (us)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1K | 12.53 | **16.96** | 9.34 | **7.76** / 21.06 / 28.50 | 157.91 / 143.21 / 447.0 | 326.4 |
**BusBW (GB/s)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:----------------------------:|:-----------------:|
| 1G | 253.59 | **231.45** | 254.69 | 217.05 / 216.98 / 217.15 | 125.06 / **255.64** / 124.89 | 22.55 |
### 2 nodes, 1 gpu/node
**Latency (us)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:--------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1K | 16.08 | **21.27** | 29.84 | 14.67 / 29.12 / 35.43 | 15.32 / **13.84** / 26.08 | - |
**BusBW (GB/s)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1G | 15.84 | **18.65** | 15.48 | 13.94 / 13.83 / 14.10 | **23.30** / 23.29 / 21.60 | - |
### 2 nodes, 8 gpus/node
**Latency (us)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1K | 33.74 | **35.85** | 49.75 | **22.55** / 39.33 / 56.93 | 159.14 / 230.52 / 462.7 | - |
**BusBW (GB/s)**
| Message Size | NCCL AllGather | NCCL AllReduce | NCCL AllToAll | MSCCL AllToAll LL/LL128/Simple | MSCCL++ AllGather K0/K1/K2 | MSCCL++ AllReduce |
|:------------:|:--------------:|:--------------:|:-------------:|:------------------------------:|:--------------------------:|:-----------------:|
| 1G | 177.05 | **183.82** | 37.80 | 40.17 / 40.18 / 40.23 | 44.19 / 9.31 / **209.33** | - |
| 4G | 186.01 | **188.18** | 37.81 | - / - / - | 44.60 / - / **234.08** | - |
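For reading the BusBW columns above, bus bandwidth is conventionally derived from algorithm bandwidth; a small sketch assuming the nccl-tests convention applies here (function names are illustrative):

```python
def allgather_busbw(algbw_gbps: float, n_ranks: int) -> float:
    # nccl-tests convention: AllGather busbw = algbw * (n - 1) / n,
    # since each rank moves (n-1)/n of the total data over the bus.
    return algbw_gbps * (n_ranks - 1) / n_ranks

def allreduce_busbw(algbw_gbps: float, n_ranks: int) -> float:
    # nccl-tests convention: AllReduce busbw = 2 * algbw * (n - 1) / n,
    # reflecting the reduce-scatter + all-gather data movement.
    return 2 * algbw_gbps * (n_ranks - 1) / n_ranks

# Example with 8 GPUs per node:
print(allgather_busbw(8.0, 8))   # 7.0
print(allreduce_busbw(4.0, 8))   # 7.0
```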
TBU


@@ -8,8 +8,9 @@
* ND_H100_v5
* [NC_A100_v4](https://learn.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series) (TBD)
* Non-Azure Systems
-* NVIDIA A100 GPUs + CUDA >= 11.1.1
-* NVIDIA H100 GPUs + CUDA >= 12.0.0
+* NVIDIA A100 GPUs + CUDA >= 11.8
+* NVIDIA H100 GPUs + CUDA >= 12.0
+* AMD support is underway.
* OS: tested on Ubuntu 18.04 and 20.04
* Libraries: [libnuma](https://github.com/numactl/numactl), MPI (optional)
* Others
@@ -54,6 +55,8 @@ Our base image installs all prerequisites for MSCCL++.
$ docker pull ghcr.io/microsoft/mscclpp/mscclpp:base-cuda12.1
```
See all available images [here](https://github.com/microsoft/mscclpp/pkgs/container/mscclpp%2Fmscclpp).
## Unit Tests
`unit_tests` requires one GPU on the system. It tests only the operation of basic components.
@@ -76,37 +79,53 @@ To run `mp_unit_tests` with more than two nodes, you need to specify the `-ip_po
$ mpirun -np 16 -npernode 8 -hostfile hostfile ./test/mp_unit_tests -ip_port 10.0.0.5:50000
```
-## mscclpp-test
+## Performance Benchmark
mscclpp-test is a set of performance benchmarks for MSCCL++. It requires MPI to be installed on the system, and the path should be provided via `MPI_HOME` environment variable to the CMake build system.
### Python Benchmark
[Install the MSCCL++ Python package](https://github.com/microsoft/mscclpp/blob/chhwang/docs/docs/quickstart.md#install-from-source-python-module) and run our Python AllReduce benchmark as follows. It requires MPI on the system.
```bash
# Choose either `requirements_cu11.txt` or `requirements_cu12.txt` according to your CUDA version.
$ python3 -m pip install -r ./python/requirements_cu12.txt
$ mpirun -tag-output -np 8 python3 ./python/benchmark/allreduce_bench.py
```
### C++ Benchmark (mscclpp-test)
*NOTE: mscclpp-test will be retired soon and will be maintained only as an example C++ implementation. For the latest performance numbers, please use the Python benchmark instead.*
mscclpp-test is a set of C++ performance benchmarks. It requires MPI on the system, and the MPI path should be provided to the CMake build system via the `MPI_HOME` environment variable.
```bash
$ MPI_HOME=/path/to/mpi cmake -DCMAKE_BUILD_TYPE=Release ..
-$ make -j sendrecv_test_perf allgather_test_perf allreduce_test_perf alltoall_test_perf
+$ make -j allgather_test_perf allreduce_test_perf
```
-For example, the following command runs the AllReduce benchmark with 8 GPUs, with message sizes from 3MB to 48MB, doubling the size each step.
+For example, the following command runs the `allreduce5` algorithm with 8 GPUs, with message sizes from 3MB to 48MB, doubling the size each step. You can try different algorithms by changing the `-k 5` option to another value (e.g., `-k 3` runs `allreduce3`). See all algorithms in the code: [allreduce_test.cu](https://github.com/microsoft/mscclpp/blob/main/test/mscclpp-test/allreduce_test.cu) and [allgather_test.cu](https://github.com/microsoft/mscclpp/blob/main/test/mscclpp-test/allgather_test.cu).
```bash
-$ mpirun -np 8 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 4
+$ mpirun --bind-to numa -np 8 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5
```
*NOTE: a few algorithms place a condition on the total data size, such as requiring it to be a multiple of 3. If the condition is unmet, the command will fail with a corresponding error.*
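The `-b 3m -e 48m -f 2` flags sweep message sizes geometrically; a quick sketch of the sizes this produces (the helper name is illustrative, not part of mscclpp-test):

```python
def size_sweep(min_bytes: int, max_bytes: int, step_factor: int) -> list:
    # Mirror the -b/-e/-f semantics: start at min_bytes and multiply by
    # step_factor until max_bytes is exceeded.
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= step_factor
    return sizes

MB = 1024 * 1024
print([s // MB for s in size_sweep(3 * MB, 48 * MB, 2)])  # [3, 6, 12, 24, 48]
# Every size in this sweep is a multiple of 3, satisfying e.g. a
# "total size must be a multiple of 3" condition.
```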
Check the help message for more details.
```bash
$ ./test/mscclpp-test/allreduce_test_perf --help
USAGE: allreduce_test_perf
[-b,--minbytes <min size in bytes>]
[-e,--maxbytes <max size in bytes>]
[-i,--stepbytes <increment size>]
[-f,--stepfactor <increment factor>]
[-n,--iters <iteration count>]
[-w,--warmup_iters <warmup iteration count>]
[-c,--check <0/1>]
[-T,--timeout <time in seconds>]
[-G,--cudagraph <num graph launches>]
[-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
[-k,--kernel_num <kernel number of communication primitive>]
[-o, --output_file <output file name>]
USAGE: allreduce_test_perf
[-b,--minbytes <min size in bytes>]
[-e,--maxbytes <max size in bytes>]
[-i,--stepbytes <increment size>]
[-f,--stepfactor <increment factor>]
[-n,--iters <iteration count>]
[-w,--warmup_iters <warmup iteration count>]
[-c,--check <0/1>]
[-T,--timeout <time in seconds>]
[-G,--cudagraph <num graph launches>]
[-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
[-k,--kernel_num <kernel number of communication primitive>]
[-o, --output_file <output file name>]
[-h,--help]
```


@@ -5,7 +5,7 @@
#define MSCCLPP_CORE_HPP_
#define MSCCLPP_MAJOR 0
-#define MSCCLPP_MINOR 3
+#define MSCCLPP_MINOR 4
#define MSCCLPP_PATCH 0
#define MSCCLPP_VERSION (MSCCLPP_MAJOR * 10000 + MSCCLPP_MINOR * 100 + MSCCLPP_PATCH)
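The macro packs the version into a single integer; the same encoding expressed in Python for illustration (the function name is ours, not part of the library):

```python
def mscclpp_version(major: int, minor: int, patch: int) -> int:
    # Mirrors: MSCCLPP_VERSION = MAJOR * 10000 + MINOR * 100 + PATCH
    return major * 10000 + minor * 100 + patch

print(mscclpp_version(0, 4, 0))  # 400 (v0.4.0)
```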


@@ -7,7 +7,7 @@ build-backend = "scikit_build_core.build"
[project]
name = "mscclpp"
-version = "0.3.0"
+version = "0.4.0"
[tool.scikit-build]
cmake.minimum-version = "3.25.0"


@@ -1,109 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
import argparse
import logging
import multiprocessing as mp
import sys
import mscclpp
import torch
IB_TRANSPORTS = [
mscclpp.Transport.IB0,
mscclpp.Transport.IB1,
mscclpp.Transport.IB2,
mscclpp.Transport.IB3,
mscclpp.Transport.IB4,
mscclpp.Transport.IB5,
mscclpp.Transport.IB6,
mscclpp.Transport.IB7,
]
# Use to hold the sm channels so they don't get garbage collected
sm_channels = []
def setup_connections(comm, rank, world_size, element_size, proxy_service):
simple_proxy_channels = []
sm_semaphores = []
connections = []
remote_memories = []
memory = torch.zeros(element_size, dtype=torch.int32)
memory = memory.to("cuda")
transport_flag = mscclpp.TransportFlags(IB_TRANSPORTS[rank]) | mscclpp.Transport.CudaIpc
ptr = memory.data_ptr()
size = memory.numel() * memory.element_size()
reg_mem = comm.register_memory(ptr, size, transport_flag)
for r in range(world_size):
if r == rank:
continue
conn = comm.connect_on_setup(r, 0, mscclpp.Transport.CudaIpc)
connections.append(conn)
comm.send_memory_on_setup(reg_mem, r, 0)
remote_mem = comm.recv_memory_on_setup(r, 0)
remote_memories.append(remote_mem)
comm.setup()
connections = [conn.get() for conn in connections]
# Create simple proxy channels
for i, conn in enumerate(connections):
proxy_channel = mscclpp.SimpleProxyChannel(
proxy_service.proxy_channel(proxy_service.build_and_add_semaphore(comm, conn)),
proxy_service.add_memory(remote_memories[i].get()),
proxy_service.add_memory(reg_mem),
)
simple_proxy_channels.append(proxy_channel.device_handle())
comm.setup()
# Create sm channels
for i, conn in enumerate(connections):
sm_chan = mscclpp.SmDevice2DeviceSemaphore(comm, conn)
sm_semaphores.append(sm_chan)
comm.setup()
for i, conn in enumerate(sm_semaphores):
sm_chan = mscclpp.SmChannel(sm_semaphores[i], remote_memories[i].get(), ptr)
sm_channels.append(sm_chan)
return simple_proxy_channels, [sm_chan.device_handle() for sm_chan in sm_channels]
def run(rank, args):
world_size = args.gpu_number
torch.cuda.set_device(rank)
boot = mscclpp.TcpBootstrap.create(rank, world_size)
boot.initialize(args.if_ip_port_trio)
comm = mscclpp.Communicator(boot)
proxy_service = mscclpp.ProxyService()
logging.info("Rank: %d, setting up connections", rank)
setup_connections(comm, rank, world_size, args.num_elements, proxy_service)
logging.info("Rank: %d, starting proxy service", rank)
proxy_service.start_proxy()
def main():
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
parser = argparse.ArgumentParser()
parser.add_argument("if_ip_port_trio", type=str)
parser.add_argument("-n", "--num-elements", type=int, default=10)
parser.add_argument("-g", "--gpu_number", type=int, default=2)
args = parser.parse_args()
processes = []
for rank in range(args.gpu_number):
p = mp.Process(target=run, args=(rank, args))
p.start()
processes.append(p)
for p in processes:
p.join()
if __name__ == "__main__":
main()


@@ -1,82 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
import argparse
import time
import mscclpp
def main(args):
if args.root:
rank = 0
else:
rank = 1
boot = mscclpp.TcpBootstrap.create(rank, 2)
boot.initialize(args.if_ip_port_trio)
comm = mscclpp.Communicator(boot)
if args.gpu:
import torch
print("Allocating GPU memory")
memory = torch.zeros(args.num_elements, dtype=torch.int32)
memory = memory.to("cuda")
ptr = memory.data_ptr()
size = memory.numel() * memory.element_size()
else:
from array import array
print("Allocating host memory")
memory = array("i", [0] * args.num_elements)
ptr, elements = memory.buffer_info()
size = elements * memory.itemsize
my_reg_mem = comm.register_memory(ptr, size, mscclpp.Transport.IB0)
conn = comm.connect_on_setup((rank + 1) % 2, 0, mscclpp.Transport.IB0)
other_reg_mem = None
if rank == 0:
other_reg_mem = comm.recv_memory_on_setup((rank + 1) % 2, 0)
else:
comm.send_memory_on_setup(my_reg_mem, (rank + 1) % 2, 0)
comm.setup()
if rank == 0:
other_reg_mem = other_reg_mem.get()
if rank == 0:
for i in range(args.num_elements):
memory[i] = i + 1
conn.write(other_reg_mem, 0, my_reg_mem, 0, size)
print("Done sending")
else:
print("Checking for correctness")
# polling
for _ in range(args.polling_num):
all_correct = True
for i in range(args.num_elements):
if memory[i] != i + 1:
all_correct = False
print(f"Error: Mismatch at index {i}: expected {i + 1}, got {memory[i]}")
break
if all_correct:
print("All data matched expected values")
break
else:
time.sleep(0.1)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("if_ip_port_trio", type=str)
parser.add_argument("-r", "--root", action="store_true")
parser.add_argument("-n", "--num-elements", type=int, default=10)
parser.add_argument("--gpu", action="store_true")
parser.add_argument("--polling_num", type=int, default=100)
args = parser.parse_args()
main(args)


@@ -1,17 +0,0 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
import time
import mscclpp
def main():
timer = mscclpp.Timer()
timer.reset()
time.sleep(2)
assert timer.elapsed() >= 2000000
if __name__ == "__main__":
main()


@@ -56,11 +56,12 @@ def plot_graph(sizes, mscclpp_algbw, nccl_algbw, speed_ups):
ax1.legend(lines, labels, loc="upper left")
# Setting title and grid
-ax1.set_title("MSCCLPP vs NCCL -- " + str(MPI.COMM_WORLD.size // N_GPUS_PER_NODE) + " Nodes")
+num_nodes = MPI.COMM_WORLD.size // N_GPUS_PER_NODE
+ax1.set_title(f"MSCCLPP vs NCCL -- {num_nodes} Nodes")
ax2.grid(True, which="both", ls="--")
# Saving the plot
-plt.savefig("mscclpp_vs_nccl_comparison.pdf", format="pdf")
+plt.savefig(f"mscclpp_vs_nccl_comparison_num_nodes_{num_nodes}.jpeg", format="jpeg")
def human_readable_size(size, decimal_places=1):
@@ -246,7 +247,7 @@ if __name__ == "__main__":
mscclpp_algbw = []
nccl_algbw = []
speed_ups = []
-for i in range(10, 30):
+for i in range(10, 29):
if MPI.COMM_WORLD.size // N_GPUS_PER_NODE == 1:
nelems = 2**i
elif MPI.COMM_WORLD.size // N_GPUS_PER_NODE == 2: