7 Commits

Author SHA1 Message Date
Binyang Li
dc0b8d75f3 GB200 support: SendRecv DSL collective and per-channel executor connections (#810)
## Summary
 
GB200 support work: introduces point-to-point send/receive in the
MSCCL++ DSL and extends the executor for split-NVL-domain topologies
where some ranks are NVL-connected within a node and other ranks must
communicate across the network.
 
 ### DSL
 - New `SendRecv` collective with separate input/output buffers
   (`python/mscclpp/language/collectives.py`).
 - New multi-node sendrecv DSL example
   (`python/mscclpp/language/tests/multi_node/send_recv.py`) with
   `--split_mask` (group size − 1) and `--instances` CLI options.
   Documents the channel-ordering trick that keeps signal tags
   cross-matched between paired peers when `prev == next`.
 - `BaseBuffer.__getitem__` now accepts slices with `None` start/stop
   (e.g., `buf[:]`).
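
A minimal sketch of how open-ended slices such as `buf[:]` can be
normalized into an offset and a size. `normalize_slice` is a hypothetical
helper written for illustration; it is not the actual
`BaseBuffer.__getitem__` implementation, and the bounds checks are
assumptions.

```python
def normalize_slice(key, size):
    """Map a slice over a buffer of `size` chunks to (offset, length).

    Hypothetical helper: a missing start defaults to 0 and a missing
    stop defaults to the buffer size, so buf[:] covers the whole buffer.
    """
    if not isinstance(key, slice):
        raise TypeError("buffer index must be a slice")
    if key.step not in (None, 1):
        raise ValueError("strided slices are not supported in this sketch")
    start = 0 if key.start is None else key.start
    stop = size if key.stop is None else key.stop
    if not 0 <= start <= stop <= size:
        raise IndexError("slice out of range")
    return start, stop - start
```

With a 4-chunk buffer, `normalize_slice(slice(None, None), 4)` yields the
full range, which is what `buf[:]` would request.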
 
 ### Executor
 - One connection (unique QP) per channel entry instead of one per peer.
   Required for HostNoAtomic IB mode where each QP can forward signals
   to a single semaphore. Uses per-peer tag counters so paired ranks
   agree on tag ordering regardless of the order peers appear in each
   rank's `connected_to` list.
 - MEMORY channels now unconditionally use `Transport::CudaIpc`; only
   PORT channels can use IB. Matches the invariant already enforced by
   `getTransportFlags`.
 - `ExecutionContext::connections` is now a `vector<Connection>` indexed
   by channel order (was `unordered_map<int, Connection>` keyed by
   peer). Removes redundant semaphore fields from `ExecutionContext`.
 - TODO: explicit NVL-domain check in `useIB`
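
The per-peer tag counter idea can be sketched in Python as follows. This
is an illustration only, not the executor's C++ code: `assign_tags` and
the example `connected_to` lists are hypothetical. The point is that a
counter kept per peer gives the k-th channel toward a given peer the tag
k on both sides, even when the two ranks list their peers in different
orders.

```python
from collections import defaultdict


def assign_tags(connected_to):
    """Return one (peer, tag) pair per channel entry, in channel order.

    Hypothetical sketch: the tag is the number of earlier channel
    entries that target the same peer, so it does not depend on where
    other peers appear in the list.
    """
    counters = defaultdict(int)
    tags = []
    for peer in connected_to:
        tags.append((peer, counters[peer]))
        counters[peer] += 1
    return tags


def tags_toward(connected_to, peer):
    """Tags assigned to the channels that target `peer`."""
    return [tag for p, tag in assign_tags(connected_to) if p == peer]


# Rank 0 and rank 1 each hold two channels to the other, but they
# interleave their channel to rank 2 at different positions.
rank0_connected_to = [1, 2, 1]
rank1_connected_to = [0, 0, 2]
```

Here both `tags_toward(rank0_connected_to, 1)` and
`tags_toward(rank1_connected_to, 0)` evaluate to `[0, 1]`, whereas a
single global counter would have produced different tags on the two
ranks.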

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2026-06-19 13:19:01 -07:00
Caio Rocha
40295df4c4 Adding Support to bf16 Executor Tests (#801)
This pull request adds support for the `bfloat16` (bf16) data type to
the test executor, including both Python and CUDA components. The
changes ensure that `bfloat16` is handled consistently across argument
parsing, data type conversion, and test kernel implementations.
Additionally, the CUDA verification kernels are refactored to use
parameterized tolerances for improved numerical accuracy checks.

**Support for bfloat16 data type:**

* Added handling for `bfloat16`/`bf16` in the Python test executor's
argument parsing, data type conversion (`parse_dtype`,
`dtype_to_mscclpp_dtype`), and help text.
* Updated output to display the correct data type string for `bfloat16`.
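
As a rough illustration of the alias handling described above, the
sketch below maps user-facing dtype names to one canonical string. The
helper name `canonical_dtype_name` and the alias table are assumptions
made for this example; the actual `parse_dtype` and
`dtype_to_mscclpp_dtype` functions may be structured differently.

```python
# Hypothetical alias table: short and long spellings map to one name.
_DTYPE_ALIASES = {
    "float16": "float16",
    "fp16": "float16",
    "float32": "float32",
    "fp32": "float32",
    "bfloat16": "bfloat16",
    "bf16": "bfloat16",
}


def canonical_dtype_name(name):
    """Resolve a CLI dtype string to its canonical name (sketch only)."""
    try:
        return _DTYPE_ALIASES[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported dtype: {name!r}") from None
```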

**CUDA kernel and test improvements:**

* Included `bfloat16` headers and defined test data fill and gather
kernels for `bfloat16` on both CUDA and HIP platforms.
* Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to
use an explicit tolerance parameter (`Eps`) and added correct tolerances
for each data type, including `bfloat16`.

These changes ensure full support for `bfloat16` in the test executor
and improve the accuracy and maintainability of the CUDA test kernels.
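
The tolerance-parameterized check can be pictured with a small host-side
Python sketch. Everything here is assumed for illustration: the function
name, the relative-error form of the comparison, and the tolerance
values are not taken from the CUDA kernels in this change, which pass
the tolerance as an `Eps` parameter.

```python
# Illustrative tolerances only; the values used by the real kernels
# are not stated in the commit message.
EXAMPLE_EPS = {"float32": 1e-5, "float16": 1e-3, "bfloat16": 1e-2}


def count_mismatches(result, expected, eps):
    """Count elements whose error exceeds eps, scaled by the magnitude
    of the expected value (assumed comparison rule)."""
    return sum(
        1
        for r, e in zip(result, expected)
        if abs(r - e) > eps * max(1.0, abs(e))
    )
```

A value that is acceptable under the looser `bfloat16` tolerance can
still be flagged under the `float32` one, which is why the tolerance is
selected per data type rather than hard-coded.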

---------

Co-authored-by: Caio Rocha <caiorocha@microsof.com>
2026-05-14 09:56:11 -07:00
Caio Rocha
9261b1d278 AlltoAll Test Support (#606)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-08-15 16:00:41 -07:00
Caio Rocha
55789bc551 Support ReduceScatter in the NCCL interface (#460)
Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net>
Co-authored-by: Caio Rocha <aiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-02-11 13:28:19 -08:00
Caio Rocha
ff18bb8d0b Providing reduce-scatter test support (#390) 2024-11-28 09:19:30 -08:00
Caio Rocha
b3dc74c020 Small Adjust in Test Data AllGather at Executor Test (#384) 2024-11-16 15:21:00 +08:00
Ziyue Yang
9526d76fc7 Add kernel-based verification for executor_test (#378)
Add kernels to fill and test data for correctness test in
executor_test.py.
2024-11-07 14:14:20 +08:00