mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 02:47:23 +00:00

Author	SHA1	Message	Date
Binyang Li	dc0b8d75f3	GB200 support: SendRecv DSL collective and per-channel executor connections (#810)	2026-06-19 13:19:01 -07:00

## Summary

GB200 support work: introduces point-to-point send/receive in the MSCCL++ DSL and extends the executor for split-NVL-domain topologies where some ranks are NVL-connected within a node and other ranks must communicate across the network.

### DSL

- New `SendRecv` collective with separate input/output buffers (`python/mscclpp/language/collectives.py`).
- New multi-node sendrecv DSL example (`python/mscclpp/language/tests/multi_node/send_recv.py`) with `--split_mask` (group size − 1) and `--instances` CLI options. Documents the channel-ordering trick that keeps signal tags cross-matched between paired peers when `prev == next`.
- `BaseBuffer.__getitem__` now accepts slices with `None` start/stop (e.g., `buf[:]`).

### Executor

- One connection (unique QP) per channel entry instead of one per peer. Required for HostNoAtomic IB mode where each QP can forward signals to a single semaphore. Uses per-peer tag counters so paired ranks agree on tag ordering regardless of the order peers appear in each rank's `connected_to` list.
- MEMORY channels now unconditionally use `Transport::CudaIpc`; only PORT channels can use IB. Matches the invariant already enforced by `getTransportFlags`.
- `ExecutionContext::connections` is now a `vector<Connection>` indexed by channel order (was `unordered_map<int, Connection>` keyed by peer). Removes redundant semaphore fields from `ExecutionContext`.
- TODO: explicit NVL-domain check in `useIB`

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>

Caio Rocha	40295df4c4	Adding Support to bf16 Executor Tests (#801)	2026-05-14 09:56:11 -07:00

This pull request adds support for the `bfloat16` (bf16) data type to the test executor, including both Python and CUDA components. The changes ensure that `bfloat16` is handled consistently across argument parsing, data type conversion, and test kernel implementations. Additionally, the CUDA verification kernels are refactored to use parameterized tolerances for improved numerical accuracy checks.

Support for bfloat16 data type:

* Added handling for `bfloat16`/`bf16` in the Python test executor's argument parsing, data type conversion (`parse_dtype`, `dtype_to_mscclpp_dtype`), and help text.
* Updated output to display the correct data type string for `bfloat16`.

CUDA kernel and test improvements:

* Included `bfloat16` headers and defined test data fill and gather kernels for `bfloat16` on both CUDA and HIP platforms.
* Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to use an explicit tolerance parameter (`Eps`) and added correct tolerances for each data type, including `bfloat16`.

These changes ensure full support for `bfloat16` in the test executor and improve the accuracy and maintainability of the CUDA test kernels.

---------

Co-authored-by: Caio Rocha <caiorocha@microsof.com>

Caio Rocha	9261b1d278	AlltoAll Test Support (#606) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-15 16:00:41 -07:00
Caio Rocha	55789bc551	Support ReduceScatter in the NCCL interface (#460) Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net> Co-authored-by: Caio Rocha <aiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-02-11 13:28:19 -08:00
Caio Rocha	ff18bb8d0b	Providing reduce-scatter test support (#390)	2024-11-28 09:19:30 -08:00
Caio Rocha	b3dc74c020	Small Adjust in Test Data AllGather at Executor Test (#384)	2024-11-16 15:21:00 +08:00
Ziyue Yang	9526d76fc7	Add kernel-based verification for executor_test (#378) Add kernels to fill and test data for correctness test in executor_test.py.	2024-11-07 14:14:20 +08:00

7 Commits
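
The executor change in dc0b8d75f3 mentions per-peer tag counters that let paired ranks agree on tag ordering even when peers appear in a different order in each rank's `connected_to` list. The sketch below illustrates that general idea only; it is not taken from the MSCCL++ sources. The function names `assign_tags_per_peer`, `assign_tags_global`, and `tags_for`, and the example peer lists, are hypothetical.

```python
from collections import defaultdict


def assign_tags_per_peer(connected_to):
    """Return one (peer, tag) pair per channel entry, in channel order.

    Hypothetical scheme: each peer gets its own counter, so the n-th
    channel to a given peer always receives tag n, wherever that
    channel sits in the overall list.
    """
    next_tag = defaultdict(int)  # one counter per peer
    assignments = []
    for peer in connected_to:
        assignments.append((peer, next_tag[peer]))
        next_tag[peer] += 1
    return assignments


def assign_tags_global(connected_to):
    """Contrast case: a single counter shared by all peers."""
    return [(peer, tag) for tag, peer in enumerate(connected_to)]


def tags_for(assignments, peer):
    """Collect the tags assigned to channels that target `peer`."""
    return [tag for p, tag in assignments if p == peer]


# Assumed example: rank 0 and rank 1 each have two channels to the other,
# but list their peers in different orders.
rank0_peers = [1, 2, 1]
rank1_peers = [2, 0, 0]

per_peer_rank0 = assign_tags_per_peer(rank0_peers)
per_peer_rank1 = assign_tags_per_peer(rank1_peers)
global_rank0 = assign_tags_global(rank0_peers)
global_rank1 = assign_tags_global(rank1_peers)

# Per-peer counters: both sides of the 0 <-> 1 pair see the same tags.
print(tags_for(per_peer_rank0, 1), tags_for(per_peer_rank1, 0))
# A shared counter: the two sides disagree.
print(tags_for(global_rank0, 1), tags_for(global_rank1, 0))
```

With a counter per peer, the tag depends only on how many channels to that peer came earlier, so list order across different peers does not matter. A single shared counter ties each tag to the channel's absolute position, which differs between the two ranks in this example.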