# Customized Collective Algorithm with NCCL API
```{note}
This tutorial demonstrates how to plug a **custom collective algorithm** (an AllGather variant) into the MSCCL++ NCCL interposition / algorithm registration path and invoke it transparently via the standard NCCL API (`ncclAllGather`).
```
## Overview
The example shows how to:
1. Define a device kernel (`allgather`) that uses `PortChannel` device handles to exchange data.
2. Wrap that kernel inside an algorithm class (`AllgatherAlgoBuilder`) responsible for:
   - Connection discovery / proxy setup.
   - Context key generation (so contexts can be reused / cached).
   - Launch function binding (the kernel wrapper executed when the NCCL all-gather is called).
3. Register the algorithm builder with the global `AlgorithmCollectionBuilder` and install a **selector** deciding which implementation to return for a given collective request.
4. Run a multi-process (multi-rank) test using standard NCCL calls. The user program remains unchanged apart from initialization / registration code.
5. (Optionally) Capture the sequence of `ncclAllGather` calls into a CUDA Graph for efficient replay.
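
Step 5 can be sketched with the standard CUDA Graph capture APIs. This is an illustrative fragment, not the example's exact code: `comm`, `sendbuff`, `recvbuff`, `count`, `stream`, and the iteration counts are assumed to have been set up as in `customized_allgather.cu`.

```cuda
// Hedged sketch: capture a sequence of ncclAllGather calls into a CUDA Graph,
// then replay it. All variable names here are illustrative assumptions.
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Record the collective calls on `stream` instead of executing them.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int i = 0; i < nItersPerGraph; ++i) {
  ncclAllGather(sendbuff, recvbuff, count, ncclFloat, comm, stream);
}
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Each replay launches the whole captured sequence with a single call,
// amortizing per-call launch overhead.
for (int iter = 0; iter < nIters; ++iter) {
  cudaGraphLaunch(graphExec, stream);
}
cudaStreamSynchronize(stream);
```

Because the interposed `ncclAllGather` dispatches to the registered custom algorithm, the captured graph replays the custom kernel launches as well.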
## Location
Example source directory:
```
examples/customized-collective-algorithm/
```
Key file: `customized_allgather.cu`.
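
The registration flow (steps 2–3 above) follows a pattern along these lines. This is a hedged sketch only: the method names and signatures below are assumptions based on the class names mentioned in this guide, and `customized_allgather.cu` is the authoritative version.

```cuda
// Hedged sketch of algorithm registration; exact MSCCL++ signatures may differ.
// `AllgatherAlgoBuilder` is the builder class described in this guide.
auto builder = std::make_shared<AllgatherAlgoBuilder>();

// Register the builder with the global collection (name assumed).
auto collection = mscclpp::AlgorithmCollectionBuilder::getInstance();
collection->addAlgorithmBuilder(builder);

// Install a selector that decides which registered implementation to
// return for a given collective request (hypothetical callback shape).
collection->setAlgorithmSelector(
    [](/* collective type, message size, ... */) {
      return /* the custom allgather algorithm for matching requests */;
    });
```

After registration, the application keeps calling `ncclAllGather` as usual; the selector routes matching requests to the custom implementation.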
## Build and Run
From the repository root:
```bash
cd examples/customized-collective-algorithm
make
```
Run (inside a container you may need root privileges, depending on GPU access):
```bash
LD_PRELOAD=<MSCCLPP_INSTALL_DIR>/lib/libmscclpp_nccl.so ./customized_allgather
```
Expected (abbreviated) output on success:
```
GPU 0: bytes 268435456, elapsed 7.35012 ms/iter, BW 109.564 GB/s
Succeed!
```