mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Caio Rocha	0a26f9d5de	wip	2026-04-11 21:07:09 +00:00
Binyang Li	bf946ea51e	Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759)	2026-03-09 10:22:45 -07:00

## Summary

This PR addresses a multicast resource leak, fixes `cuMemMap` offset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test for `SwitchChannel`.

### Bug Fixes

#### 1. Fix multicast allocation handle leak in `createMulticast()` (`gpu_ipc_mem.cc`)

`GpuIpcMemHandle::createMulticast()` called `cuMulticastCreate(&allocHandle, ...)` but never released the local `allocHandle` after exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak: the multicast object was never freed even after all mappings and imported handles were released.

Per the [CUDA Driver API docs for `cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html):

> "The memory allocation will be freed when all outstanding mappings to the memory are unmapped and when all outstanding references to the handle (including its shareable counterparts) are also released."

The fix adds `cuMemRelease(allocHandle)` after export, matching the existing pattern used for regular allocations in `GpuIpcMemHandle::create()`.

Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100.

#### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`)

`cuMemMap` requires `offset=0` for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the same `NvlsConnection`. The fix maps the entire range `[0, mcOffset + bufferSize)` and returns the pointer offset by `mcOffset`. This only consumes extra virtual address space; no additional physical memory is used.

### Refactoring

#### 3. Rename NVLS allreduce algorithm classes

Renamed for clarity:

- `AllreduceNvls` → `AllreduceNvlsZeroCopy`
- `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline`
- `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline`

Updated all references in builder, selector, docs, and examples.

#### 4. Move `nvlsConnections` setup to `initialize()`

Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this member) to individual algorithm class members, initialized in their `initialize()` methods.

### Tests

#### 5. Add `TwoChannelsSameConnection` test

New unit test that creates two `SwitchChannel` instances from the same `NvlsConnection`, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered the `cuMemMap` offset fix.

### Files Changed

- `src/core/gpu_ipc_mem.cc`: multicast handle leak fix + cuMemMap offset fix
- `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_packet.cu`: nvlsConnections fix
- `src/ext/collectives/include/allreduce/*.hpp`: renamed headers
- `src/ext/collectives/algorithm_collection_builder.cc`: updated references
- `src/ext/nccl/algorithm_selector.cc`: updated algorithm names
- `test/mp_unit/switch_channel_tests.cu`: new test
- `docs/guide/mscclpp-torch-integration.md`: updated names
- `examples/torch-integration/customized_comm_with_default_algo.py`: updated names
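
The `cuMemMap` offset fix described in item 2 of the commit above comes down to simple address arithmetic. The sketch below models it in plain Python, with integers standing in for device virtual addresses. `map_multicast_buffer`, `MappedBuffer`, `va_base`, and the sample addresses are hypothetical names and values chosen for illustration; no CUDA call is made and this is not the mscclpp implementation.

```python
from typing import NamedTuple


class MappedBuffer(NamedTuple):
    map_offset: int   # offset passed to the mapping call (always 0 in this model)
    mapped_size: int  # bytes of virtual address space covered by the mapping
    user_ptr: int     # address handed back to the caller


def map_multicast_buffer(va_base: int, mc_offset: int, buffer_size: int) -> MappedBuffer:
    """Model of binding a buffer located at mc_offset inside a multicast object."""
    if mc_offset < 0 or buffer_size <= 0:
        raise ValueError("mc_offset must be >= 0 and buffer_size must be > 0")
    # Multicast handles must be mapped at offset 0, so the mapping covers
    # the whole range [0, mc_offset + buffer_size) ...
    mapped_size = mc_offset + buffer_size
    # ... and the caller receives a pointer shifted to its own slice.
    return MappedBuffer(map_offset=0, mapped_size=mapped_size, user_ptr=va_base + mc_offset)


GiB = 1 << 30
# Two buffers bound to the same (hypothetical) multicast object, back to back.
first = map_multicast_buffer(va_base=0x7000_0000_0000, mc_offset=0, buffer_size=GiB)
second = map_multicast_buffer(va_base=0x7100_0000_0000, mc_offset=GiB, buffer_size=GiB)
print(first)
print(second)
```

In this model the second binding reserves 2 GiB of virtual address space for a 1 GiB buffer, which mirrors the commit's note that only extra virtual address space is consumed.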
Binyang Li	25435acf5d	Add new algos for GB200 (#747)	2026-02-24 16:43:23 -08:00

- Add new algos (allreduce_rsag, allreduce_rsag_pipeline and allreduce_rsag_zero_copy) for GB200.
- Add IB stub for non-IB env
- Provides example for algorithm tuning with different nblocks/nthreads

Perf for allreduce_rsag

```
#                                                      out-of-place                      in-place
#       size       count    type  redop  root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                          (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576      262144   float    sum    -1    25.16   41.67   62.51      0    23.73   44.18   66.27      0
     2097152      524288   float    sum    -1    26.06   80.47  120.71      0    25.31   82.86  124.29      0
     4194304     1048576   float    sum    -1    31.09  134.93  202.39      0    30.75  136.39  204.58      0
     8388608     2097152   float    sum    -1    45.52  184.29  276.43      0    45.13  185.87  278.80      0
    16777216     4194304   float    sum    -1    75.73  221.53  332.30      0    75.51  222.18  333.27      0
    33554432     8388608   float    sum    -1   137.25  244.48  366.72      0   137.22  244.54  366.81      0
    67108864    16777216   float    sum    -1   271.34  247.32  370.99      0   270.86  247.76  371.65      0
   134217728    33554432   float    sum    -1   534.25  251.22  376.84      0   534.43  251.14  376.71      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 264.454
#
# Collective test concluded: all_reduce_perf
```

Perf for allreduce_rsag_pipeline

```
#                                                      out-of-place                      in-place
#       size       count    type  redop  root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                          (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576      262144   float    sum    -1    61.57   17.03   25.55      0    61.51   17.05   25.57      0
     2097152      524288   float    sum    -1    61.31   34.20   51.31      0    61.23   34.25   51.38      0
     4194304     1048576   float    sum    -1    61.62   68.06  102.10      0    61.84   67.83  101.74      0
     8388608     2097152   float    sum    -1    61.97  135.37  203.06      0    61.89  135.53  203.30      0
    16777216     4194304   float    sum    -1    63.15  265.65  398.48      0    62.89  266.76  400.15      0
    33554432     8388608   float    sum    -1   100.63  333.46  500.19      0    99.76  336.34  504.51      0
    67108864    16777216   float    sum    -1   180.04  372.75  559.13      0   179.75  373.34  560.01      0
   134217728    33554432   float    sum    -1   339.60  395.23  592.84      0   338.16  396.91  595.36      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 304.665
#
# Collective test concluded: all_reduce_perf
```

Perf for allreduce_rsag_zero_copy

```
#                                                      out-of-place                      in-place
#       size       count    type  redop  root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                          (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576      262144   float    sum    -1    14.99   69.93  104.90      0    14.44   72.61  108.92      0
     2097152      524288   float    sum    -1    16.19  129.56  194.33      0    15.85  132.32  198.48      0
     4194304     1048576   float    sum    -1    21.19  197.98  296.97      0    20.64  203.20  304.81      0
     8388608     2097152   float    sum    -1    31.04  270.27  405.41      0    30.68  273.44  410.16      0
    16777216     4194304   float    sum    -1    50.34  333.26  499.89      0    50.15  334.51  501.77      0
    33554432     8388608   float    sum    -1    89.58  374.56  561.84      0    88.65  378.48  567.73      0
    67108864    16777216   float    sum    -1   165.69  405.03  607.54      0   163.64  410.10  615.16      0
   134217728    33554432   float    sum    -1   323.19  415.28  622.93      0   318.01  422.05  633.07      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 414.619
#
# Collective test concluded: all_reduce_perf
```

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
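
For readers checking the `algbw` and `busbw` columns in the perf output above, the following sketch recomputes them from size and time. It assumes the usual nccl-tests convention for allreduce, `busbw = algbw * 2 * (n - 1) / n`. The rank count is not stated in the commit; `n_ranks=4` is an assumption inferred from the constant busbw/algbw ratio of 1.5 in the tables, and `allreduce_bandwidth` is a hypothetical helper name.

```python
from typing import Tuple


def allreduce_bandwidth(size_bytes: int, time_us: float, n_ranks: int) -> Tuple[float, float]:
    """Return (algbw, busbw) in GB/s for an allreduce of size_bytes taking time_us."""
    if time_us <= 0 or n_ranks < 2:
        raise ValueError("time_us must be > 0 and n_ranks must be >= 2")
    # Algorithm bandwidth: payload size divided by elapsed time (1 GB = 1e9 bytes).
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth: algbw scaled by the allreduce traffic factor 2(n-1)/n.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# First out-of-place row of the allreduce_rsag table: 1048576 B in 25.16 us.
algbw, busbw = allreduce_bandwidth(1048576, 25.16, n_ranks=4)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The recomputed values agree with the reported 41.67 and 62.51 GB/s to within the rounding of the printed time.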
mahdiehghazim	2a6f1c1192	Mahdieh/switchchannel test clean (#751)	2026-02-20 22:46:32 -05:00

This PR adds example code for switch channel testing. It validates the switch channel in single-node and multi-node environments. A description of the algorithms and an explanation of the code still need to be added under doc.

Example outputs:

rank0:

    ./bidir_switch_channel 10.0.5.233:45571 0 0
    Rank 0 (GPU 0): Preparing for tests ...
    Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s
    Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s
    Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s
    Rank 0: Succeed!

rank1:

    ./bidir_switch_channel 10.0.5.233:45571 1 0
    Rank 1 (GPU 0): Preparing for tests ...
    Rank 1: Succeed!
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741)	2026-02-12 19:06:18 -08:00

This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables.

1. Algorithm Selection Refactoring: Use a separate class for algo selection. This makes it possible to introduce more complex selection logic based on message size, architecture, whether CUDA graph is enabled, and the memory allocation method.
2. Symmetric Memory Support: Introduced a symmetricMemory parameter in algorithm context key generation. Removed the disableChannelCache env because it is ambiguous.
3. Add new args for build_default_algorithms: Added flag_buffer and flag_buffer_size args for building the default algorithms. Different algorithms can then share a unified flag buffer, which avoids application hangs when switching algos for different message sizes.

---------

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Qinghua Zhou	620378b4fb	Fix cpplint error in main branch (#740) Fix the legacy cpplint error in main branch. --------- Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-05 09:25:12 -08:00
Changho Hwang	03b1936ddb	Support multi-node in `MemoryChannel` tutorial (#726) Co-authored-by: mahdiehghazim <mahdiehghazi@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-02 15:50:45 -08:00
Binyang Li	a707273701	Torch integration (#692) Reorganize the current native and DSL algorithm implementations. Provide a unified API for DSL and native algos, along with an interface to tune them. Provide an interface for PyTorch integration with the native API and the DSL. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Changho Hwang	da60eb7f46	Add an IB multi-node tutorial (#702)	2025-12-11 15:15:58 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680)	2025-11-15 11:40:40 -08:00

The key purpose is to handle the memory of all mscclpp objects internally by hiding shared pointers from user APIs.

* The `Connection` class is now a wrapper around the `BaseConnection` class, which is equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>`
* Removed the `connectOnSetup()` method
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620)	2025-10-29 15:39:00 -07:00

Provides two ways to integrate the MSCCL++ DSL:

1. Integrate with a customized communication group
2. Integrate with the NCCL API

Introduces new Python APIs to make it work:

```python
mscclpp.compile  # compile DSL to a JSON-based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan)  # register the compiled plan with the ExecutionPlanRegistry
mscclpp.ExecutionPlanRegistry.set_selector(selector)  # set the selector; it returns the best execution plan based on collective, message size, world size, ...
```

Fix #556

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
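
The commit above says the selector returns the best execution plan based on the collective, message size, and world size. A minimal sketch of what such a selector could look like follows. It is written in plain Python and does not call mscclpp; `PlanInfo`, `select_plan`, the byte ranges, and the "narrowest range wins" rule are all assumptions made for illustration, since the commit does not show the real selector signature or plan type.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PlanInfo:
    """Hypothetical description of a registered plan (not an mscclpp type)."""
    name: str
    collective: str
    world_size: int
    min_bytes: int
    max_bytes: int  # inclusive upper bound


def select_plan(plans: Iterable[PlanInfo], collective: str,
                message_bytes: int, world_size: int) -> Optional[PlanInfo]:
    """Return a matching plan, or None when nothing fits the request."""
    candidates = [
        p for p in plans
        if p.collective == collective
        and p.world_size == world_size
        and p.min_bytes <= message_bytes <= p.max_bytes
    ]
    # Assumed tie-break: prefer the plan tuned for the narrowest size range.
    return min(candidates, key=lambda p: p.max_bytes - p.min_bytes, default=None)


plans = [
    PlanInfo("allreduce_small", "allreduce", 8, 0, 1 << 20),
    PlanInfo("allreduce_large", "allreduce", 8, (1 << 20) + 1, 1 << 30),
    PlanInfo("allgather_any", "allgather", 8, 0, 1 << 30),
]
chosen = select_plan(plans, "allreduce", 4 << 20, 8)
print(chosen.name if chosen else "no plan")
```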
Qinghua Zhou	a38c2ee784	FP8 support for Allreduce (#646) Add FP8 support for Allreduce on both NVIDIA and AMD platforms. Add new data types: fp8_e4m3 and fp8_e5m2. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-10-27 14:51:48 -07:00
Binyang Li	70b8297c56	Revise NCCL API implementation (#617)	2025-09-26 10:08:12 -07:00

- Make the NCCL interface extensible. Users can register their own algos with the NCCL API and provide a customized algo selection function.
- Fall back to NCCL/RCCL if the algo selection function selects no algo.
- MSCCLPP interfaces now work at any scale.
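
The fallback behaviour described in the commit above (use a registered algo when the selection function picks one, otherwise defer to NCCL/RCCL) can be pictured as a small dispatcher. This is only a sketch under stated assumptions: `make_dispatcher`, the dictionary of algos, and the stand-in `fallback` callable are hypothetical, and the real registration interface is not shown in the commit.

```python
from typing import Callable, Dict, List, Optional, Tuple

Algo = Callable[[List[float]], float]


def make_dispatcher(custom_algos: Dict[str, Algo],
                    selector: Callable[[int], Optional[str]],
                    fallback: Algo) -> Callable[[int, List[float]], Tuple[str, float]]:
    """Build a dispatch function that tries user algos first, then the fallback."""
    def dispatch(message_bytes: int, data: List[float]) -> Tuple[str, float]:
        name = selector(message_bytes)
        algo = custom_algos.get(name) if name is not None else None
        if algo is None:
            # Nothing selected (or an unknown name): use the fallback library path.
            return "fallback", fallback(data)
        return name, algo(data)
    return dispatch


def size_based_selector(message_bytes: int) -> Optional[str]:
    # Assumed policy: the custom algo only handles messages up to 1 MiB.
    return "my_small_allreduce" if message_bytes <= (1 << 20) else None


dispatch = make_dispatcher({"my_small_allreduce": sum}, size_based_selector, fallback=sum)
print(dispatch(1024, [1.0, 2.0, 3.0]))
print(dispatch(1 << 30, [1.0, 2.0, 3.0]))
```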
Changho Hwang	547a9ae65c	Fixed cpp linter (#619)	2025-08-25 12:15:45 -07:00
Changho Hwang	9650e5c37e	Update documentation (#576) Documentation overhaul	2025-08-07 15:37:37 -07:00
Changho Hwang	c580e4c503	Support CudaIpc connection within a single process (#593)	2025-08-02 12:59:36 +08:00

* Allow CudaIpc connection between GPUs in a single process
* Added an example of connection in a single process
* Minor interface updates

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>

16 Commits