# Results

This page presents performance benchmarks for collective communication algorithms implemented with the MSCCL++ domain-specific language (DSL).

## Available Algorithms

The following reference implementations are provided:

### Single-Node AllReduce on H100 (NVLS)

We evaluate a single-node AllReduce algorithm designed for NVIDIA H100 GPUs that leverages NVLink SHARP (NVLS). This algorithm targets high performance for intra-node collective operations.

**Source Code Location:**

The algorithm implementation can be found at:
```
mscclpp/python/mscclpp/language/tests
```

**Running the Benchmark:**

Generate the corresponding JSON execution plan by following the steps in the Quick Start section. Once the JSON file exists, run it with the `executor_test.py` tool to measure performance.
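
Before handing a generated plan to `executor_test.py`, it can help to confirm that the file parses as JSON. The sketch below is a generic check and not part of MSCCL++: `summarize_plan` is a hypothetical helper, and it assumes only that the plan is a JSON object at the top level, without relying on any particular schema.

```python
import json
from pathlib import Path


def summarize_plan(plan_path):
    """Parse a plan file as JSON and return its top-level keys, sorted.

    Assumption: the plan is a JSON object at the top level. No MSCCL++
    specific field names are assumed.
    """
    with Path(plan_path).expanduser().open() as f:
        plan = json.load(f)  # raises json.JSONDecodeError on malformed input
    if not isinstance(plan, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(plan)
```

Calling `summarize_plan("path/to/plan.json")` on a malformed file raises an exception immediately, which is cheaper to diagnose than a failure inside a multi-GPU run.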

**Performance Results:**

The following figures show the achieved bandwidth for message sizes ranging from 1KB to 1GB:

```{figure} ./figs/single_node_allreduce_results_1K_to_1M.png
:name: single-node-allreduce-small
:alt: Single-node AllReduce performance (1KB to 1MB)
:align: center

Single-node AllReduce performance on H100 with NVLS (1KB to 1MB message sizes)
```

```{figure} ./figs/single_node_allreduce_results_1M_to_1G.png
:name: single-node-allreduce-large
:alt: Single-node AllReduce performance (1MB to 1GB)
:align: center

Single-node AllReduce performance on H100 with NVLS (1MB to 1GB message sizes)
```
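
When comparing these plots against your own measurements, it helps to fix how bandwidth is computed. The sketch below assumes the convention used by common collective benchmarks: algorithm bandwidth is message size divided by elapsed time, and AllReduce bus bandwidth scales that by 2(n-1)/n for n ranks. This page does not state which metric the figures use, so treat the formulas as an assumption. All function names are illustrative.

```python
def message_sizes(start=1 << 10, stop=1 << 30):
    """Return power-of-two message sizes in bytes from start to stop, inclusive."""
    sizes = []
    size = start
    while size <= stop:
        sizes.append(size)
        size *= 2
    return sizes


def algorithm_bandwidth_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: bytes per second divided by 1e9."""
    return size_bytes / (time_us * 1e-6) / 1e9


def bus_bandwidth_gbps(size_bytes, time_us, num_ranks):
    """AllReduce bus bandwidth: algorithm bandwidth scaled by 2 * (n - 1) / n."""
    scale = 2 * (num_ranks - 1) / num_ranks
    return algorithm_bandwidth_gbps(size_bytes, time_us) * scale


if __name__ == "__main__":
    # Hypothetical timing, for illustration only; not a measured result.
    example_time_us = 100.0
    size = message_sizes()[-1]
    print(f"{size} bytes in {example_time_us} us ->",
          f"{bus_bandwidth_gbps(size, example_time_us, num_ranks=8):.1f} GB/s (bus)")
```

The default sweep covers 1 KB to 1 GB in powers of two, matching the range of the two figures.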

### Two-Node AllReduce on H100 (Small Message Sizes)

We also provide a two-node AllReduce algorithm for H100 GPUs, optimized for small message sizes. It uses a non-zero-copy communication path to minimize latency for small transfers.

**Installation:**

This algorithm is installed by default when running:
```bash
python3 -m mscclpp --install
```

**Execution Plan Location:**

After installation, the generated JSON execution plan can be found at:
```
~/.cache/mscclpp_default/
```
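
A minimal sketch for locating the installed plans, assuming they are stored as `.json` files somewhere under the cache directory shown above. The helper name `list_execution_plans` is hypothetical and not part of MSCCL++; the recursive search is a precaution in case plans sit in subdirectories.

```python
from pathlib import Path


def list_execution_plans(cache_dir="~/.cache/mscclpp_default"):
    """Return sorted paths of all .json files under cache_dir (recursive).

    Assumption: execution plans use the .json extension.
    Returns an empty list if the directory does not exist.
    """
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.json") if p.is_file())


if __name__ == "__main__":
    for plan in list_execution_plans():
        print(plan)
```

An empty result usually means the install step has not been run for the current user, since the path is resolved relative to the home directory.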

**Performance Results:**

The figure below shows the performance characteristics for small message sizes in a two-node configuration:

```{figure} ./figs/2node_all_reduce_results.png
:name: two-node-allreduce-small
:alt: Two-node AllReduce performance for small message sizes
:align: center

Two-node AllReduce performance on H100 for small message sizes
```