Commit Graph

976 Commits

Author SHA1 Message Date
Empyreus
93f96e97cd Merge branch 'main' into rjsouza/nvls-allgather-pr 2026-06-23 23:27:24 +00:00
Caio Rocha
88e1a44858 wip 2026-06-23 22:41:27 +00:00
Caio Rocha
2091a337be wip 2026-06-23 09:04:05 +00:00
Caio Rocha
f74814142e wip 2026-06-23 08:32:14 +00:00
Binyang Li
dc0b8d75f3 GB200 support: SendRecv DSL collective and per-channel executor connections (#810)
## Summary
 
GB200 support work: introduces point-to-point send/receive in the MSCCL++ DSL
and extends the executor for split-NVL-domain topologies where some ranks are
NVL-connected within a node and other ranks must communicate across the
network.
 
 ### DSL
 - New `SendRecv` collective with separate input/output buffers
   (`python/mscclpp/language/collectives.py`).
 - New multi-node sendrecv DSL example
   (`python/mscclpp/language/tests/multi_node/send_recv.py`) with
   `--split_mask` (group size − 1) and `--instances` CLI options. Documents the
   channel-ordering trick that keeps signal tags cross-matched between paired
   peers when `prev == next`.
 - `BaseBuffer.__getitem__` now accepts slices with `None` start/stop
   (e.g., `buf[:]`).
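
A minimal sketch of how a `__getitem__` can accept slices whose start or stop is `None`. `ToyBuffer` and its `(offset, length)` return value are hypothetical stand-ins for illustration, not the MSCCL++ `BaseBuffer` implementation:

```python
class ToyBuffer:
    """Hypothetical stand-in for a DSL buffer; not the MSCCL++ class."""

    def __init__(self, size):
        self.size = size

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("expected a slice")
        if key.step not in (None, 1):
            raise ValueError("only contiguous slices are supported")
        # slice.indices() replaces a None start/stop with 0 / self.size
        # and clamps out-of-range bounds to the buffer size.
        start, stop, _ = key.indices(self.size)
        return (start, max(stop - start, 0))  # (offset, length)


buf = ToyBuffer(8)
whole = buf[:]   # None start and None stop
tail = buf[4:]   # None stop
head = buf[:3]   # None start
```

Delegating to `slice.indices` keeps the `None` handling in one place instead of special-casing each bound.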
 
 ### Executor
 - One connection (unique QP) per channel entry instead of one per peer.
   Required for HostNoAtomic IB mode where each QP can forward signals to a
   single semaphore. Uses per-peer tag counters so paired ranks agree on tag
   ordering regardless of the order peers appear in each rank's `connected_to`
   list.
 - MEMORY channels now unconditionally use `Transport::CudaIpc`; only PORT
   channels can use IB. Matches the invariant already enforced by
   `getTransportFlags`.
 - `ExecutionContext::connections` is now a `vector<Connection>` indexed by
   channel order (was `unordered_map<int, Connection>` keyed by peer). Removes
   redundant semaphore fields from `ExecutionContext`.
 - TODO: explicit NVL-domain check in `useIB`
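
The per-peer tag counters mentioned in the Executor notes can be illustrated with a small simulation. This is a sketch under assumptions: `assign_tags` and `tags_for` are hypothetical helpers, and the real executor is C++ and may number tags differently.

```python
from collections import defaultdict


def assign_tags(connected_to):
    """Give the k-th channel to each peer the tag k (one counter per peer)."""
    next_tag = defaultdict(int)
    assigned = []
    for peer in connected_to:
        assigned.append((peer, next_tag[peer]))
        next_tag[peer] += 1
    return assigned


def tags_for(peer, assigned):
    """Tags a rank uses for its channels to one specific peer."""
    return [tag for p, tag in assigned if p == peer]


# Rank 0 and rank 1 list their peers in different orders,
# but each has two channels to the other.
rank0 = assign_tags([1, 2, 1])
rank1 = assign_tags([2, 0, 0])
```

Here `tags_for(1, rank0)` and `tags_for(0, rank1)` are both `[0, 1]`. With a single counter shared across all peers, the same lists would give `[0, 2]` on rank 0 and `[1, 2]` on rank 1, so the two sides would disagree.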

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2026-06-19 13:19:01 -07:00
Empyreus
e342039f88 add comments 2026-06-18 17:50:18 +00:00
Empyreus
7813f3b1b0 remove execution_kernel alignment check 2026-06-18 17:00:49 +00:00
Empyreus
9a02f3669f enforce scratch on broadcast packets 2026-06-18 16:17:56 +00:00
Empyreus
253bc05c7c add auto_sync false flag 2026-06-18 00:13:36 +00:00
Empyreus
9b4175412b combine groupstore and groupstorepacket 2026-06-18 00:07:38 +00:00
Empyreus
24187fcded use .to_dict() 2026-06-18 00:01:41 +00:00
Empyreus
e48c6da34b fix formatting 2026-06-17 00:11:06 +00:00
Empyreus
1d95244cf9 tune for ptk 2026-06-16 19:46:14 +00:00
Empyreus
1e17f20618 add allgather packet algorithm 2026-06-11 20:46:51 +00:00
Empyreus
02eb2cfc2e add support for allgather packet for small message sizes 2026-06-11 20:46:09 +00:00
Empyreus
5d7737437a handle non 16bit aligned 2026-06-09 18:54:38 +00:00
Empyreus
6f61c014c1 fix missing flag 2026-06-09 18:07:25 +00:00
Empyreus
3b263c3324 revert useIB change 2026-06-09 16:02:33 +00:00
Empyreus
5348c4a774 refactor function for thread usage 2026-06-08 21:53:46 +00:00
Empyreus
1e1187aa65 reuse useIB function 2026-06-08 21:53:16 +00:00
Empyreus
f2204ee569 improve variable names 2026-06-08 21:02:39 +00:00
Empyreus
b864746083 python formatting fixes 2026-06-08 20:28:34 +00:00
Empyreus
2b87985927 update to type agnostic 2026-06-08 20:04:33 +00:00
Empyreus
0d8efdb43d update algo 2026-06-08 20:04:33 +00:00
Empyreus
00668b4a41 add allgather gstore support 2026-06-08 20:04:33 +00:00
Empyreus
54bfb1d3b7 add flag to disable IB 2026-06-08 20:04:33 +00:00
Binyang Li
7c390fffd6 Expose NVLS multicast granularity option for GpuBuffer (#815)
Add a public Granularity enum (MultiCastMinimum, MultiCastRecommended)
and let GpuBuffer choose the NVLS multicast allocation granularity via a
constructor argument, defaulting to MultiCastMinimum to minimize memory
usage. Expose the same option through the C++ and Python (nanobind)
APIs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-04 13:16:18 -07:00
Binyang Li
c9f8be64bb Add collective benchmark and correctness check (#814)
- Add a unit test for the float8_e4m3b15 data type.
- Add a tuner and benchmark for the allreduce/allgather algorithms to check
correctness and performance.
2026-06-04 09:22:10 -07:00
RJ Souza
29d5beb348 Adding Support for SGLang CI Tests (#800)
Adds Azure DevOps pipelines, templates, and supporting scripts to run
SGLang end-to-end and benchmark tests against MSCCL++ on H100 GPU nodes,
plus the Docker image and small infrastructure tweaks needed to make
those pipelines runnable.

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-27 14:57:21 -07:00
Binyang Li
379d0e51e4 Fix for allgather_fullmesh algo (#813) 2026-05-26 12:34:13 -07:00
Binyang Li
08ee18be64 Add check to filter invalid nblock/nthread candidates (#811)
Add check for invalid nblock/nthread candidate
2026-05-22 09:18:41 -07:00
Binyang Li
9e177b388c remove useless sync (#809) 2026-05-20 16:49:49 -07:00
Binyang Li
72621e7221 add nBlocks check for allreduce_allpair_packet algo (#807)
- Fix the correctness issue in the allreduce_allpair_packet algo: make sure
the input buffer is not overwritten, and use the same tb for
send/reduce/write-back.
- Check that nBlocks/nthreads are valid for the packet algorithm.
- Add more logs.
- Modify the flag update logic so that it works when nthreadPerNBlock
< nflags.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-20 09:29:55 -07:00
Caio Rocha
c1071318c8 Include a static synchronization check in the DSL. (#806) 2026-05-19 13:06:53 -07:00
Binyang Li
60a6d7219f Clean up completed communicator receives (#804)
## Summary
- Release the reference after the last requests are ready.
- Keep ordered receive chaining for repeated rank/tag operations while
cleaning up completed receive bookkeeping.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-15 21:06:50 +00:00
Changho Hwang
252a422030 Handle PortChannel flush asynchronously from the host proxy (#802)
When a `PortChannel` requests `flush()`, the host-side proxy was being
blocked, which may cause head-of-line blocking of other parallel
`PortChannel`s' requests. Now the proxy handles `flush()` requests
asynchronously. This feature especially helps performance when we need
multiple IB QPs and need to flush QPs.
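
A toy model of the non-blocking behaviour described above, assuming a single proxy loop that polls outstanding flushes instead of waiting on them. `run_proxy` and the poll-count latency are hypothetical; the real proxy is C++ and polls IB completion queues.

```python
from collections import deque


def run_proxy(requests, flush_polls):
    """Serve (channel, op) requests; a flush completes after `flush_polls` polls."""
    queue = deque(requests)
    pending = {}      # channel -> polls left until its flush completes
    completed = []
    while queue or pending:
        # Poll outstanding flushes without blocking on any of them.
        for channel in list(pending):
            pending[channel] -= 1
            if pending[channel] <= 0:
                del pending[channel]
                completed.append((channel, "flush"))
        # Keep serving other requests while flushes are in flight.
        if queue:
            channel, op = queue.popleft()
            if op == "flush":
                pending[channel] = flush_polls
            else:
                completed.append((channel, op))
    return completed


order = run_proxy([(0, "flush"), (1, "put"), (1, "put")], flush_polls=3)
```

In this run both `put` requests on channel 1 complete before the flush on channel 0, which is the head-of-line blocking the change avoids. The model ignores ordering between a pending flush and later requests on the same channel.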
2026-05-15 11:50:43 -07:00
Changho Hwang
5d608feaa5 Enhance cross-node CudaIpc availability check (#803) 2026-05-14 14:06:12 -07:00
Caio Rocha
40295df4c4 Adding Support to bf16 Executor Tests (#801)
This pull request adds support for the `bfloat16` (bf16) data type to
the test executor, including both Python and CUDA components. The
changes ensure that `bfloat16` is handled consistently across argument
parsing, data type conversion, and test kernel implementations.
Additionally, the CUDA verification kernels are refactored to use
parameterized tolerances for improved numerical accuracy checks.

**Support for bfloat16 data type:**

* Added handling for `bfloat16`/`bf16` in the Python test executor's
argument parsing, data type conversion (`parse_dtype`,
`dtype_to_mscclpp_dtype`), and help text.
* Updated output to display the correct data type string for `bfloat16`.

**CUDA kernel and test improvements:**

* Included `bfloat16` headers and defined test data fill and gather
kernels for `bfloat16` on both CUDA and HIP platforms.
* Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to
use an explicit tolerance parameter (`Eps`) and added correct tolerances
for each data type, including `bfloat16`.

These changes ensure full support for `bfloat16` in the test executor
and improve the accuracy and maintainability of the CUDA test kernels.
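
The idea of a per-dtype tolerance can be sketched on the host side as follows. The `EPS` values and the `mismatches` helper are assumptions for illustration; the tolerances used by the CUDA kernels may differ.

```python
import math

# Hypothetical tolerances, chosen only to show the pattern.
EPS = {"float32": 1e-5, "float16": 1e-3, "bfloat16": 1e-2}


def mismatches(actual, expected, dtype):
    """Indices where actual and expected differ by more than the dtype's tolerance."""
    eps = EPS[dtype]
    return [
        i
        for i, (a, e) in enumerate(zip(actual, expected))
        if not math.isclose(a, e, rel_tol=eps, abs_tol=eps)
    ]


as_bf16 = mismatches([1.0, 2.004], [1.0, 2.0], "bfloat16")
as_fp32 = mismatches([1.0, 2.004], [1.0, 2.0], "float32")
```

The same pair of values passes under the looser `bfloat16` tolerance and fails under the `float32` one, which is why the tolerance is a parameter rather than a constant.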

---------

Co-authored-by: Caio Rocha <caiorocha@microsof.com>
2026-05-14 09:56:11 -07:00
Caio Rocha
0c9b9abfd5 Adding Support 4 Nodes AllReduce Small Message Size (#794)
Results on 4 Nodes H200:

| Size | NCCL  | MSCCL++ 57TB | MSCCL++ 29TB |
|------|-------|--------------|--------------|
| 8K   | 45.75 | 17.74        | 18.18        |
| 16K  | 47.08 | 18.9         | 18.42        |
| 32K  | 47.29 | 19.48        | 19.12        |
| 64K  | 50.34 | 20.51        | 19.29        |
| 128K | 59.65 | 21.37        | 20.25        |
| 256K | 87.46 | 23.87        | 23.51        |
| 512K | 106.55| 29.15        | 29.51        |
| 1M   | 115   | 40.64        | 41.83        |
| 2M   | 135.89| 63.73        | 70.45        |
| 4M   | 177.59| 121.76       | 128.79       |
| 8M   | 251.17| 228.5        | 251.36       |
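
A short sketch for reading the table as ratios against NCCL. It assumes the columns are timings where lower is better, which the table itself does not state; the rows are copied from the table and `speedup` is a hypothetical helper.

```python
# (size, NCCL, MSCCL++ 57TB, MSCCL++ 29TB), copied from the table above
rows = [
    ("8K", 45.75, 17.74, 18.18),
    ("1M", 115.0, 40.64, 41.83),
    ("8M", 251.17, 228.5, 251.36),
]


def speedup(baseline, candidate):
    """Ratio above 1.0 means the candidate took less time than the baseline."""
    return baseline / candidate


for size, nccl, tb57, tb29 in rows:
    print(f"{size}: {speedup(nccl, tb57):.2f}x (57TB), {speedup(nccl, tb29):.2f}x (29TB)")
```

Under that assumption the 57TB configuration is about 2.58x faster than NCCL at 8K, and the 29TB configuration is roughly at parity with NCCL at 8M.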

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsof.com>
sglang-v0.9.1
2026-05-12 13:45:55 -07:00
Mahdieh Ghazi
822fbb2351 Adding necessary macros for enabling mrc support (#797)
This PR adds necessary macros and instructions for enabling mrc support with no atomic.
2026-05-05 17:17:41 -04:00
Binyang Li
9ec26fa4d1 Reset GPU tokens before reuse (#795)
Fixes a token-reuse bug in `TokenPool` that's independent of MNNVL.

## Bug

`TokenPool` hands out 8-byte device-memory slots used as
device-semaphore counters. The deleter only cleared the bitmap — the
underlying GPU memory was left as-is. When a token was freed and later
re-allocated, the new semaphore inherited the previous counter value
instead of starting at 0, breaking subsequent `signal()/wait()` math.

## Fix

* Add a synchronous `gpuMemset` host helper (mirrors `gpuMemcpy` /
`gpuMemcpyAsync`).
* Zero the slot inside the `TokenPool` deleter so recycled tokens hand
out a clean counter. The very-first allocation is already zeroed by
`gpuCallocPhysical` (`src/core/gpu_utils.cc:227-228`), so first-time
tokens are also clean — the deleter only has to handle the recycle case.

## Notes

* Public wrapper is named `mscclpp::gpuMemset` (not `mscclpp::memset`)
for symmetry with `gpuMemcpy` and to avoid shadowing `std::memset` in
TUs that pull the namespace in.
* Zeroing happens on release rather than acquire so the cost is paid in
the typically less perf-sensitive teardown path.
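
The bug and the fix can be reproduced with a toy pool. `ToyTokenPool` is a hypothetical host-only model; the real `TokenPool` manages device memory and zeroes it with `gpuMemset`.

```python
class ToyTokenPool:
    """Toy model: a bitmap of slots plus the memory behind them."""

    def __init__(self, n_slots):
        self.in_use = [False] * n_slots
        self.memory = [0] * n_slots  # first-time slots start zeroed

    def acquire(self):
        slot = self.in_use.index(False)  # raises ValueError when the pool is full
        self.in_use[slot] = True
        return slot

    def release(self, slot, zero_on_release=True):
        if zero_on_release:
            self.memory[slot] = 0  # the fix: reset the counter before reuse
        self.in_use[slot] = False  # the old deleter did only this


pool = ToyTokenPool(2)
slot = pool.acquire()
pool.memory[slot] += 3                      # counter advanced by three signals
pool.release(slot, zero_on_release=False)   # old behaviour: bitmap only
stale = pool.memory[pool.acquire()]         # recycled slot keeps the old counter

pool.release(slot)                          # fixed behaviour
clean = pool.memory[pool.acquire()]
```

With the old release path the recycled slot still reads 3; with zeroing on release it reads 0.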

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-04 15:11:47 -07:00
Binyang Li
2c52937b26 Fix FP8 ROCm build/test issues and dtype naming (#792)
## Summary
- Fix ROCm FP8 build failure by using the actual FP8 `DataType` enum
constants in allreduce packet tuning.
- Fix FP8 E4M3FNUZ test encoding so small negative values do not produce
the FNUZ NaN byte (`0x80`).
- Align FP8 `DataType` enum constants and Python bindings with
torch-style names (`FLOAT8_E4M3FN`, `FLOAT8_E4M3FNUZ`, `FLOAT8_E5M2FNUZ`
/ `float8_e4m3fn`, `float8_e4m3fnuz`, `float8_e5m2fnuz`).
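
The encoding pitfall in the second bullet can be shown with a small helper. This is a sketch that assumes rounding to 7 magnitude bits has already happened; `pack_fnuz` is hypothetical and is not the test's actual encoder.

```python
FNUZ_NAN = 0x80  # sign bit set with an all-zero magnitude is the FNUZ NaN byte


def pack_fnuz(negative, magnitude):
    """Combine a sign flag with 7 already-rounded magnitude bits."""
    if not 0 <= magnitude <= 0x7F:
        raise ValueError("magnitude must fit in 7 bits")
    if magnitude == 0:
        # A small negative value that rounded to zero must drop its sign,
        # otherwise the result would be 0x80, which is NaN rather than -0.
        return 0x00
    return (0x80 if negative else 0x00) | magnitude


tiny_negative = pack_fnuz(True, 0)
negative_one_ulp = pack_fnuz(True, 1)
```

A naive `sign | magnitude` would return `FNUZ_NAN` for `tiny_negative`; the explicit zero check returns `0x00` instead.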

## Validation
- `./tools/lint.sh`
- `make -j` from `build/`
- `mpirun --allow-run-as-root -np 8 python3 -m pytest
python/test/test_fp8_accum.py -q` (`36 passed, 9 skipped`)
- `DTYPE=float8_e4m3fnuz ACCUM_DTYPE=float32 torchrun --nnodes=1
--nproc_per_node=8
examples/torch-integration/customized_comm_with_tuning.py`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 15:02:22 -07:00
Changho Hwang
c97be492d5 GDRCopy status message to string (#793) 2026-04-27 10:32:20 -07:00
Copilot
e874bf1666 fix: isCuMemMapAllocated crashes on non-NVLS systems even with MSCCLPP_FORCE_DISABLE_NVLS=true (#790)
- [x] Fix `isCuMemMapAllocated()` to just return `true/false` without
throwing when NVLS is not supported
- [x] Fix `isNvlsSupported()` caching bug where `result`/`isChecked`
were never updated
- [x] Restore `[[maybe_unused]]` on `result` and `isChecked` statics —
needed in HIP/ROCm env where `CUDA_NVLS_API_AVAILABLE` is not defined
and the variables would otherwise be unused
- [x] Run linter (`./tools/lint.sh`)
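
The caching bug in the second item follows a common pattern, sketched here in Python. `_probe`, `_cache` and `is_supported` are hypothetical names; the real code uses C++ function-local statics.

```python
_cache = {"checked": False, "result": False}
probe_calls = []  # records each time the expensive probe runs


def _probe():
    """Hypothetical stand-in for the expensive capability query."""
    probe_calls.append(1)
    return True


def is_supported():
    if not _cache["checked"]:
        # The bug class: computing the answer without storing it, so
        # "checked" stays False and every call probes again.
        _cache["result"] = _probe()
        _cache["checked"] = True
    return _cache["result"]


first = is_supported()
second = is_supported()
```

After the two calls the probe has run once, because both the result and the checked flag were stored on the first call.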

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-04-22 10:12:40 -07:00
Binyang Li
eeea00b298 Support python wheel build (#787)
## Support Python wheel build

This PR modernizes the Python packaging for MSCCL++ by defining
dependencies and optional extras in `pyproject.toml`, enabling proper
wheel builds with `pip install ".[cuda12]"`.

### Changes

**`pyproject.toml`**
- Add `dependencies` (numpy, blake3, pybind11, sortedcontainers)
- Add `optional-dependencies` for platform-specific CuPy (`cuda11`,
`cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras
- Bump minimum Python version from 3.8 to 3.10

**`test/deploy/setup.sh`**
- Use `pip install ".[<platform>,benchmark,test]"` instead of separate
`pip install -r requirements_*.txt` + `pip install .` steps
- Add missing CUDA 13 case

**`docs/quickstart.md`**
- Update install instructions to use extras (e.g., `pip install
".[cuda12]"`)
- Document all available extras and clarify that `rocm6` builds CuPy
from source
- Update Python version references to 3.10

**`python/csrc/CMakeLists.txt`**, **`python/test/CMakeLists.txt`**
- Update `find_package(Python)` from 3.8 to 3.10

### Notes
- The `requirements_*.txt` files are kept for Docker base image builds
where only dependencies (not the project itself) should be installed.
- CuPy is intentionally not in base dependencies — users must specify a
platform extra to get the correct pre-built wheel (or source build for
ROCm).

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 21:24:45 -07:00
Binyang Li
572028ea3d Fix nccl-test CI building for all GPU architectures (#786)
## Problem

`nccl-test.yml` was the only CI template calling `deploy.yml` without
passing `gpuArch`. Since the CI build machine has no GPU, CMake fell
back to building for **all** supported architectures (`80;90;100;120`),
unnecessarily slowing down CI builds.

## Fix

- Add `gpuArch` parameter to `nccl-test.yml` and forward it to
`deploy.yml`
- Pass `gpuArch: '80'` (A100) and `gpuArch: '90'` (H100) from
`nccl-api-test.yml`

All other templates were already passing `gpuArch` correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-15 12:55:40 -07:00
Binyang Li
ecd33722d4 Fix multi-node H100 CI: CUDA compat, deploy improvements (#781)
## Summary

- **Multi-node H100 CI setup**: Improve architecture detection and GPU
configuration
- **Remove hardcoded VMSS hostnames** from deploy files
- **Fix CUDA compat library issue**: Remove stale compat paths from
Docker image for CUDA 12+. Instead, `peer_access_test` now returns a
distinct exit code (2) for CUDA init failure, and `setup.sh`
conditionally adds compat libs only when needed. This fixes
`cudaErrorSystemNotReady` (error 803) when the host driver is newer than
the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with
tar+scp+untar to avoid per-file SSH overhead.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 21:51:29 -07:00
Caio Rocha
b6d0ca13ca Adding CI Test to DSL Executor (#782) 2026-04-13 13:55:45 -07:00
Caio Rocha
b59e6d7f00 Updating NpKit (#785) 2026-04-13 13:36:42 -07:00
Binyang Li
5380a4ac6e Add MSCCLPP_IB_GID_INDEX env (#780)
Use `MSCCLPP_IB_GID_INDEX` to control the IB GID index.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 09:59:42 -07:00