Improve DSL Documentation (#707)

Author: Caio Rocha (committed by GitHub)
Date: 2025-12-19 15:17:08 -08:00
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Commit: 8d998820a3 (parent 9e076da3d4)
8 changed files with 86 additions and 28 deletions


@@ -1,13 +1,13 @@
MSCCL++ DSL
-----------------
This section provides advanced topics and best practices for using MSCCL++ DSL.
.. toctree::
   :maxdepth: 1
   :caption: MSCCL++ DSL
   :hidden:
   dsl/quick_start
   dsl/results
   dsl/concepts
   dsl/integration


@@ -1,4 +1,4 @@
# Concepts
## Introduction
The MSCCL++ Domain-Specific Language (DSL) provides a Python-native API for defining and executing GPU-based communication collectives. With a few high-level calls, users can construct complex data movement and synchronization workflows without dealing with low-level CUDA code.
@@ -11,7 +11,7 @@ Here are the highlights of the MSCCL++ DSL:
- **Flexible execution model**: The MSCCL++ DSL allows users to load different execution plans at runtime, enabling dynamic optimization based on the current workload and hardware configuration.
## Basic Concepts
### Collectives
@@ -79,8 +79,9 @@ The synchronization inside the thread-block can be inferred by MSCCL++ DSL autom
However, synchronization across multiple thread blocks and across ranks must be inserted manually.
## Post Processing Steps
### Operation Fusion (Instruction Fusion)
MSCCL++ DSL performs operation fusion by analyzing all operations scheduled within the same thread block. For each thread block, the DSL builds a directed acyclic graph (DAG) of chunk-level operations and tracks data dependencies and usage patterns. When two or more operations meet the fusion criteria (such as contiguous chunk access, no intervening dependencies, and compatible resource requirements), the DSL merges them into a single fused operation. This fusion strategy reduces memory traffic and avoids unnecessary synchronization, resulting in more efficient execution.
For example:
@@ -93,7 +94,7 @@ channel.put(remote_chunk, dst_chunk, tb=0)
When the DSL detects that a reduce operation is immediately followed by a put operation using the same data chunk, it automatically fuses them into a single operation internally, eliminating intermediate memory writes and improving performance.
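As a standalone illustration (plain Python, not the MSCCL++ API; the `Op` type and the `reduce_put` name are invented here), a minimal fusion pass over a per-thread-block operation list might look like:

```python
from dataclasses import dataclass

@dataclass
class Op:
    kind: str   # e.g. "reduce", "put", "copy" (illustrative names)
    chunk: int  # chunk id the op reads/writes
    tb: int     # thread block the op is scheduled on

def fuse(ops):
    """Merge a reduce immediately followed by a put on the same chunk
    within the same thread block into one fused op (sketch only)."""
    fused = []
    i = 0
    while i < len(ops):
        cur = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (nxt is not None and cur.kind == "reduce" and nxt.kind == "put"
                and cur.chunk == nxt.chunk and cur.tb == nxt.tb):
            fused.append(Op("reduce_put", cur.chunk, cur.tb))
            i += 2  # both ops consumed by the fused instruction
        else:
            fused.append(cur)
            i += 1
    return fused

ops = [Op("reduce", 0, 0), Op("put", 0, 0), Op("copy", 1, 0)]
print([o.kind for o in fuse(ops)])  # ['reduce_put', 'copy']
```

The real DSL fuses over a dependency DAG rather than a flat list, but the criterion shown (same chunk, same thread block, no intervening work) is the same idea.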
### Data Dependency Analysis
The MSCCL++ DSL automatically tracks data dependencies at the chunk level within each thread block by maintaining the last writer and active readers for each memory slot. When operations have data dependencies, the DSL automatically inserts necessary synchronization points to ensure correct execution order. Additionally, the system analyzes the dependency graph to remove redundant synchronization operations (such as unnecessary barriers) when the execution order already guarantees correctness, optimizing performance while maintaining safety.
@@ -113,8 +114,6 @@ rank.nop(tb=0) # Inserted for intra-block synchronization, nop is an internal o
channel.put(remote_chunk, dst_chunk, tb=0)
```
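The last-writer/active-readers bookkeeping described above can be sketched in plain Python (an illustration of the analysis, not MSCCL++ internals; the tuple encoding and the `sync` marker are assumptions made for this sketch):

```python
def plan_syncs(ops):
    """Track the last writer and active readers per chunk, and emit a
    sync before any op that would race with work issued from a
    different thread block. Each op is a tuple (name, chunk, tb, writes)."""
    last_writer = {}  # chunk -> tb that last wrote it
    readers = {}      # chunk -> set of tbs reading since the last write
    schedule = []
    for name, chunk, tb, writes in ops:
        if writes:
            # write-after-read or write-after-write from another block
            hazards = set(readers.get(chunk, set()))
            if chunk in last_writer:
                hazards.add(last_writer[chunk])
            if hazards - {tb}:
                schedule.append(("sync", chunk))
            last_writer[chunk] = tb
            readers[chunk] = set()
        else:
            # read-after-write from another block
            if last_writer.get(chunk, tb) != tb:
                schedule.append(("sync", chunk))
            readers.setdefault(chunk, set()).add(tb)
        schedule.append((name, chunk))
    return schedule

# reduce writes chunk 0 on tb 0; put on tb 1 reads it -> sync inserted
print(plan_syncs([("reduce", 0, 0, True), ("put", 0, 1, False)]))
# [('reduce', 0), ('sync', 0), ('put', 0)]
```

When the reader runs on the same thread block as the writer, no sync is emitted, which mirrors the redundant-barrier elimination described above.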
## Pipeline Loop
Pipelining enables overlapping operations across thread blocks. Using semaphores for cross-block synchronization, it overlaps stages, such as copying data from the input buffer to a scratch buffer, with subsequent peer transfers. A pipelined loop orchestrates these stages to run concurrently, maximizing overall throughput.
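The staged overlap can be imitated with ordinary Python threads standing in for thread blocks (an illustration of the idea, not the MSCCL++ API; the function name and buffers are invented for this sketch):

```python
import threading

def pipelined_copy_and_send(input_buf):
    """Two pipeline stages run as threads: stage 1 copies input ->
    scratch, stage 2 'sends' scratch -> output. A semaphore lets the
    send of chunk i overlap the copy of chunk i+1."""
    n = len(input_buf)
    scratch = [None] * n
    output = [None] * n
    ready = threading.Semaphore(0)  # counts chunks staged in scratch

    def copy_block():
        for i in range(n):
            scratch[i] = input_buf[i]  # stage 1: input -> scratch
            ready.release()            # signal: chunk i staged

    def send_block():
        for i in range(n):
            ready.acquire()            # wait until chunk i is staged
            output[i] = scratch[i]     # stage 2: simulated peer transfer

    threads = [threading.Thread(target=copy_block),
               threading.Thread(target=send_block)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output

print(pipelined_copy_and_send([10, 20, 30]))  # [10, 20, 30]
```

On the GPU the stages are thread blocks and the semaphore is a device-side MSCCL++ semaphore, but the producer/consumer structure is the same.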

(Three binary image files added: 114 KiB, 27 KiB, and 34 KiB; these are the figures referenced in docs/dsl/results.md.)


@@ -1,23 +1,8 @@
# Integration
MSCCL++ DSL (domain-specific language) enables concise expression of collective algorithms as Python functions.
MSCCL++ offers pythonic utilities to author, JIT-compile, register, and select execution plans. This guide walks through two integration paths: a customized MSCCL++ communicator and NCCL interposition that accelerates existing PyTorch `backend="nccl"` workloads.
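For the NCCL interposition path, a typical launch preloads the MSCCL++ NCCL shim ahead of the real NCCL library so an unmodified `backend="nccl"` script picks it up. The library path and script name below are hypothetical and depend on your build layout:

```shell
# Hypothetical paths: adjust to where your build placed the shim library.
export LD_PRELOAD=/path/to/mscclpp/build/apps/nccl/libmscclpp_nccl.so

# Existing PyTorch script using backend="nccl" runs unchanged.
torchrun --nproc_per_node=8 train.py
```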
## Initial Setup
Run the following from the repository root after completing the basic project setup:
1. Install Python dependencies.
```bash
pip install -r ./python/<requirements_file>
```
Replace `<requirements_file>` with the file that matches your environment (e.g., `requirements_cuda11.txt`, `requirements_cuda12.txt`, or `requirements_rocm6.txt`).
2. Install the module and generate default algorithm plans.
```bash
pip install . && python3 -m mscclpp --install
```
## Integration Options
MSCCL++ DSL integrates into your training or inference workload in two ways:


@@ -1,4 +1,4 @@
# Quick Start
The MSCCL++ DSL (Domain Specific Language) provides a high-level Python API for defining custom collective communication algorithms. This guide will help you get started with writing and testing your own communication patterns.

docs/dsl/results.md (new file, 74 lines)

@@ -0,0 +1,74 @@
# Results
This page presents performance benchmarks for collective communication algorithms implemented using the MSCCL++ DSL (Domain Specific Language).
## Available Algorithms
The following reference implementations are provided:
### Single-Node AllReduce on H100 (NVLS)
We evaluate a single-node AllReduce algorithm designed for NVIDIA H100 GPUs that leverages NVLink Switch (NVLS) technology, targeting high bandwidth for intra-node collective operations.
**Source Code Location:**
The algorithm implementation can be found at:
```
mscclpp/python/mscclpp/language/tests
```
**Running the Benchmark:**
Users can generate the corresponding JSON execution plan by following the steps described in the Quick Start section. Once the JSON file is generated, it can be executed using the `executor_test.py` tool to measure performance.
**Performance Results:**
The following figures show the achieved bandwidth for message sizes ranging from 1KB to 1GB:
```{figure} ./figs/single_node_allreduce_results_1K_to_1M.png
:name: single-node-allreduce-small
:alt: Single-node AllReduce performance (1KB to 1MB)
:align: center
Single-node AllReduce performance on H100 with NVLS (1KB to 1MB message sizes)
```
```{figure} ./figs/single_node_allreduce_results_1M_to_1G.png
:name: single-node-allreduce-large
:alt: Single-node AllReduce performance (1MB to 1GB)
:align: center
Single-node AllReduce performance on H100 with NVLS (1MB to 1GB message sizes)
```
### Two-Node AllReduce on H100 (Small Message Sizes)
We also provide a two-node AllReduce algorithm for H100 GPUs, specifically optimized for small message sizes. This algorithm uses a non-zero-copy communication path to minimize latency for small data transfers.
**Installation:**
This algorithm is installed by default when running:
```bash
python3 -m mscclpp --install
```
**Execution Plan Location:**
After installation, the generated JSON execution plan can be found at:
```
~/.cache/mscclpp_default/
```
**Performance Results:**
The figure below shows the performance characteristics for small message sizes in a two-node configuration:
```{figure} ./figs/2node_all_reduce_results.png
:name: two-node-allreduce-small
:alt: Two-node AllReduce performance for small message sizes
:align: center
Two-node AllReduce performance on H100 for small message sizes
```