mscclpp/docs/dsl/results.md
Caio Rocha 8d998820a3 Improve DSL Documentation (#707)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-19 15:17:08 -08:00

# Results
This page presents performance benchmarks for collective communication algorithms implemented with the MSCCL++ DSL (Domain-Specific Language).
## Available Algorithms
The following reference implementations are provided:
### Single-Node AllReduce on H100 (NVLS)
We evaluate a single-node AllReduce algorithm designed for NVIDIA H100 GPUs that leverages NVLink Switch (NVLS) technology. The algorithm targets intra-node collective operations; the figures in this section report the bandwidth it achieves.
**Source Code Location:**
The algorithm implementation is located in the following directory:
```
mscclpp/python/mscclpp/language/tests
```
**Running the Benchmark:**
To measure performance, first generate the corresponding JSON execution plan by following the steps in the Quick Start section, then run the generated JSON file with the `executor_test.py` tool.
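Before handing a generated plan to `executor_test.py`, it can help to confirm that the file parses as JSON. The following is a minimal sketch; `summarize_plan` is a hypothetical helper that is not part of MSCCL++, and it assumes nothing about the plan's schema beyond it being valid JSON.

```python
import json
from pathlib import Path


def summarize_plan(plan_path):
    """Load a JSON execution plan and report its size and top-level keys.

    Hypothetical helper (not part of MSCCL++). No schema is assumed: the
    function only checks that the file parses as JSON and lists whatever
    keys appear at the top level.
    """
    path = Path(plan_path)
    with path.open("r", encoding="utf-8") as f:
        # json.load raises json.JSONDecodeError if the file is malformed.
        plan = json.load(f)
    # Only a JSON object has keys; any other top-level type yields an empty list.
    keys = sorted(plan.keys()) if isinstance(plan, dict) else []
    return {
        "file": path.name,
        "bytes": path.stat().st_size,
        "top_level_keys": keys,
    }
```

Calling `summarize_plan` on the path of the generated plan returns a small dictionary that can be printed or logged next to the benchmark output.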
**Performance Results:**
The following figures show the achieved bandwidth for message sizes ranging from 1KB to 1GB:
```{figure} ./figs/single_node_allreduce_results_1K_to_1M.png
:name: single-node-allreduce-small
:alt: Single-node AllReduce performance (1KB to 1MB)
:align: center
Single-node AllReduce performance on H100 with NVLS (1KB to 1MB message sizes)
```
```{figure} ./figs/single_node_allreduce_results_1M_to_1G.png
:name: single-node-allreduce-large
:alt: Single-node AllReduce performance (1MB to 1GB)
:align: center
Single-node AllReduce performance on H100 with NVLS (1MB to 1GB message sizes)
```
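To relate a measured time to a bandwidth figure, the sketch below shows the two conventions commonly used for AllReduce benchmarks. This page does not state which convention the figures use, so treat the choice as an assumption. Both function names are hypothetical, and the bus-bandwidth factor `2 * (n - 1) / n` follows the convention used by the NCCL tests.

```python
def algorithm_bandwidth_gb_per_s(message_bytes, elapsed_seconds):
    """Algorithm bandwidth: message size divided by elapsed time, in GB/s."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return message_bytes / elapsed_seconds / 1e9


def allreduce_bus_bandwidth_gb_per_s(message_bytes, elapsed_seconds, num_ranks):
    """Bus bandwidth for AllReduce, using the factor 2 * (n - 1) / n.

    The factor normalizes for the data each rank has to send and receive,
    so results stay comparable across different rank counts.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    factor = 2 * (num_ranks - 1) / num_ranks
    return algorithm_bandwidth_gb_per_s(message_bytes, elapsed_seconds) * factor
```

For an 8-GPU node the factor is `2 * 7 / 8 = 1.75`, so the bus bandwidth is 1.75 times the algorithm bandwidth for the same measurement.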
### Two-Node AllReduce on H100 (Small Message Sizes)
We also provide a two-node AllReduce algorithm for H100 GPUs, specifically optimized for small message sizes. This algorithm uses a non-zero-copy communication path to minimize latency for small data transfers.
**Installation:**
This algorithm is installed by default when running:
```bash
python3 -m mscclpp --install
```
**Execution Plan Location:**
After installation, the generated JSON execution plan can be found at:
```
~/.cache/mscclpp_default/
```
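To check which plans were installed, the cache directory can be listed with a few lines of Python. This is a minimal sketch: `list_execution_plans` is a hypothetical helper, and it assumes the plans are stored as files with a `.json` extension somewhere under the directory named above.

```python
from pathlib import Path


def list_execution_plans(cache_dir="~/.cache/mscclpp_default"):
    """Return the paths of all .json files under cache_dir, sorted.

    Hypothetical helper (not part of MSCCL++). Returns an empty list if
    the directory does not exist, for example before installation.
    """
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    # rglob also searches subdirectories, in case plans are nested.
    return sorted(root.rglob("*.json"))


if __name__ == "__main__":
    for plan in list_execution_plans():
        print(plan)
```

An empty result after installation suggests that the plans were written to a different location or use a different file extension.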
**Performance Results:**
The figure below shows the performance characteristics for small message sizes in a two-node configuration:
```{figure} ./figs/2node_all_reduce_results.png
:name: two-node-allreduce-small
:alt: Two-node AllReduce performance for small message sizes
:align: center
Two-node AllReduce performance on H100 for small message sizes
```