# Results

This page presents performance benchmarks for collective communication algorithms implemented with the MSCCL++ DSL (Domain-Specific Language).

## Available Algorithms

The following reference implementations are provided:

### Single-Node AllReduce on H100 (NVLS)

We evaluate a single-node AllReduce algorithm designed for NVIDIA H100 GPUs that leverages NVLink Switch (NVLS) technology. The algorithm targets high performance for intra-node collective operations.

**Source Code Location:**

The algorithm implementation can be found at:
```
mscclpp/python/mscclpp/language/tests
```

**Running the Benchmark:**

To reproduce the results, generate the corresponding JSON execution plan by following the steps described in the Quick Start section. Once the JSON file is generated, run it with the `executor_test.py` tool to measure performance.

**Performance Results:**

The following figures show the achieved bandwidth for message sizes ranging from 1KB to 1GB:

```{figure} ./figs/single_node_allreduce_results_1K_to_1M.png
:name: single-node-allreduce-small
:alt: Single-node AllReduce performance (1KB to 1MB)
:align: center

Single-node AllReduce performance on H100 with NVLS (1KB to 1MB message sizes)
```

```{figure} ./figs/single_node_allreduce_results_1M_to_1G.png
:name: single-node-allreduce-large
:alt: Single-node AllReduce performance (1MB to 1GB)
:align: center

Single-node AllReduce performance on H100 with NVLS (1MB to 1GB message sizes)
```
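
For readers who want to reproduce a sweep like the one plotted above, the following is a minimal sketch of how message sizes and bandwidth figures are commonly derived. It is not taken from the MSCCL++ benchmark code. The function names are hypothetical, the power-of-two spacing of sizes is an assumption, and the bus-bandwidth factor follows the convention used by nccl-tests for AllReduce; this page does not state which bandwidth metric the figures report.

```python
def message_sizes(start=1 << 10, stop=1 << 30):
    """Yield power-of-two message sizes in bytes, from start to stop inclusive.

    Assumption: the sweep doubles the size at each step (1 KiB, 2 KiB, ...).
    """
    size = start
    while size <= stop:
        yield size
        size *= 2


def algorithm_bandwidth_gbps(size_bytes, elapsed_seconds):
    """Algorithm bandwidth in GB/s: message bytes divided by elapsed time."""
    return size_bytes / elapsed_seconds / 1e9


def allreduce_bus_bandwidth_gbps(size_bytes, elapsed_seconds, num_ranks):
    """Bus bandwidth in GB/s, scaling by the nccl-tests AllReduce factor 2*(n-1)/n."""
    factor = 2 * (num_ranks - 1) / num_ranks
    return algorithm_bandwidth_gbps(size_bytes, elapsed_seconds) * factor


sizes = list(message_sizes())
# 2**10 through 2**30 inclusive is 21 distinct sizes.
print(len(sizes), sizes[0], sizes[-1])  # prints: 21 1024 1073741824
```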
### Two-Node AllReduce on H100 (Small Message Sizes)

We also provide a two-node AllReduce algorithm for H100 GPUs, specifically optimized for small message sizes. This algorithm uses a non-zero-copy communication path to minimize latency for small data transfers.

**Installation:**

This algorithm is installed by default when running:
```bash
python3 -m mscclpp --install
```

**Execution Plan Location:**

After installation, the generated JSON execution plan can be found at:
```
~/.cache/mscclpp_default/
```
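
To check what the installation step produced, the directory above can be inspected with a few lines of standard-library Python. This is a hedged sketch: the helper names `find_execution_plans` and `load_plan` are hypothetical, and it assumes the plans are stored as `*.json` files directly inside that directory. It makes no assumption about the plan schema and only reports each file name with its top-level JSON type.

```python
import json
from pathlib import Path


def find_execution_plans(plan_dir="~/.cache/mscclpp_default"):
    """Return the sorted *.json files in plan_dir, or an empty list if it is missing."""
    root = Path(plan_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(root.glob("*.json"))


def load_plan(path):
    """Parse one plan file as JSON; no particular schema is assumed."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


for plan_path in find_execution_plans():
    plan = load_plan(plan_path)
    # Report only the file name and the top-level JSON type.
    print(plan_path.name, type(plan).__name__)
```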

**Performance Results:**

The figure below shows the performance characteristics for small message sizes in a two-node configuration:

```{figure} ./figs/2node_all_reduce_results.png
:name: two-node-allreduce-small
:alt: Two-node AllReduce performance for small message sizes
:align: center

Two-node AllReduce performance on H100 for small message sizes
```