mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 09:17:06 +00:00

Author	SHA1	Message	Date
empyreus	99f2faced2	fixes	2026-03-26 22:32:09 +00:00
empyreus	01b7af9733	sanity check	2026-03-26 22:23:11 +00:00
empyreus	57eedc915a	Merge branch 'main' into rjsouza/sglang-tests	2026-03-26 22:21:42 +00:00
empyreus	7f016cb7f0	fix venv	2026-03-26 22:01:48 +00:00
empyreus	257735cb51	setup sglang python venv	2026-03-26 21:43:56 +00:00
empyreus	8b007892df	check hostname	2026-03-26 21:31:22 +00:00
empyreus	4cf9cd721b	change cuda version	2026-03-26 20:50:32 +00:00
empyreus	12d935ff81	fix docker name	2026-03-26 20:22:51 +00:00
empyreus	d33458db8a	modify to setup sglang	2026-03-26 19:23:24 +00:00
empyreus	c883994556	add git clone msccl	2026-03-26 18:21:44 +00:00
empyreus	dd3c3ed7cb	change image	2026-03-25 22:05:18 +00:00
empyreus	ee771ec4c0	fix dockerfile	2026-03-25 21:52:09 +00:00
empyreus	fa24653d8d	update docker image	2026-03-25 21:45:29 +00:00
Copilot	93f6eeaa6b	Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744 ) - Removes the GTest dependency, replacing it with a minimal custom framework (`test/framework.`) that covers only what the tests actually use — a unified `TEST()` macro with SFINAE-based fixture auto-detection, `EXPECT_`/`ASSERT_*` assertions, environments, and setup/teardown. - `--exclude-perf-tests` flag and substring-based negative filtering - `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to Codecov - Merges standalone `test/perf/` into main test targets - Refactors Azure pipelines to reduce redundancies & make more readable --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-03-24 23:34:38 -04:00
empyreus	b7adec0e60	create sglang docker image	2026-03-19 20:03:00 +00:00
Binyang Li	5d18835417	Fix use-after-free for fabric allocation handle in GpuIpcMemHandle (#764 ) ## Summary Fix a use-after-free where the CUDA allocation handle (`CUmemGenericAllocationHandle`) was released prematurely while the exported fabric handle still referenced it. ## Problem Unlike POSIX FD handles (where the kernel keeps the allocation alive via the open file descriptor), fabric handles do not hold their own reference to the underlying allocation. The original code called `cuMemRelease(allocHandle)` immediately after exporting the fabric handle, freeing the allocation. When a remote process later tries to `cuMemImportFromShareableHandle` using that fabric handle, it references a freed allocation — a use-after-free. This affected both code paths: 1. `GpuIpcMemHandle::create()`: The local `allocHandle` obtained via `cuMemRetainAllocationHandle` was released right after fabric export, leaving the fabric handle dangling. 2. `GpuIpcMemHandle::createMulticast()`: The `allocHandle` from `cuMulticastCreate` was unconditionally released, even when it was the only thing keeping the multicast object alive for the fabric handle. ## Fix - Added `allocHandle` field to the `fabric` struct in `GpuIpcMemHandle` to store the allocation handle and keep it alive for the lifetime of the `GpuIpcMemHandle`. - `create()`: Retain an additional reference via `cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when a fabric handle is successfully exported. - `createMulticast()`: Store the `allocHandle` directly in `fabric.allocHandle` instead of unconditionally releasing it. Only release if fabric export was not used. - `deleter()`: Release `fabric.allocHandle` via `cuMemRelease` when the handle type includes `Fabric`, ensuring proper cleanup. - `GpuIpcMem` constructor (importer side): Clear `fabric.allocHandle` after importing, since the importer gets its own handle via `cuMemImportFromShareableHandle` and should not release the exporter's allocation handle. 
## Files Changed - `src/core/include/gpu_ipc_mem.hpp` — Added `CUmemGenericAllocationHandle allocHandle` to fabric struct. - `src/core/gpu_ipc_mem.cc` — Retain/release allocation handle properly across create, createMulticast, deleter, and importer paths.	2026-03-19 11:52:09 -07:00
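The lifetime problem described in this commit can be illustrated with a toy reference-counting model. This is a plain-Python sketch, not the CUDA driver API: `Allocation`, `export_fabric`, `import_fabric`, and `create_handle` are hypothetical stand-ins for the allocation handle, the fabric export, `cuMemImportFromShareableHandle`, and `GpuIpcMemHandle::create()`. The premise that a fabric handle holds no reference of its own is taken from the commit message.

```python
class Allocation:
    """Toy stand-in for a CUDA allocation with a driver-side reference count."""

    def __init__(self):
        self.refs = 1          # the creator holds one reference
        self.freed = False

    def retain(self):
        self.refs += 1

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # memory goes away once nobody holds a reference


def export_fabric(alloc):
    # Assumption from the commit message: the fabric handle names the
    # allocation but does NOT take a reference on it.
    return {"target": alloc}


def import_fabric(handle):
    alloc = handle["target"]
    if alloc.freed:
        raise RuntimeError("use-after-free: allocation was already released")
    alloc.retain()             # the importer gets its own reference
    return alloc


def create_handle(keep_alloc_alive):
    alloc = Allocation()
    handle = export_fabric(alloc)
    if keep_alloc_alive:
        handle["alloc_handle"] = alloc   # fixed: stored, released later by the deleter
    else:
        alloc.release()                  # buggy: released right after export
    return handle


def try_import(keep_alloc_alive):
    try:
        import_fabric(create_handle(keep_alloc_alive))
        return "ok"
    except RuntimeError:
        return "use-after-free"


print(try_import(keep_alloc_alive=False))
print(try_import(keep_alloc_alive=True))
```

In this model the only difference between the two paths is whether the exporter keeps its reference for as long as the handle object lives, which mirrors the `fabric.allocHandle` field added by the fix.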
empyreus	c38c3517fd	attempting to fix az cli	2026-03-18 19:36:40 +00:00
empyreus	08092653b2	install pip systemwide	2026-03-18 19:10:56 +00:00
empyreus	b7ede93f13	move from apt-get to pip	2026-03-18 18:55:07 +00:00
empyreus	4742dfef39	fix sudo issue	2026-03-18 18:30:09 +00:00
empyreus	343c3671ef	fix sudo	2026-03-18 18:07:25 +00:00
empyreus	ffa120f6b1	rework template	2026-03-17 21:58:01 +00:00
Empyreus	8686d81de5	testing	2026-03-17 19:45:07 +00:00
Empyreus	371dfb3cc3	fix pip	2026-03-17 19:19:28 +00:00
Empyreus	431234f0a4	initial pipeline test	2026-03-17 18:45:42 +00:00
Empyreus	5f42426dc8	initial creation of test files	2026-03-16 17:47:48 +00:00
Binyang Li	bf946ea51e	Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759 ) ## Summary This PR addresses a multicast resource leak, fixes `cuMemMap` offset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test for `SwitchChannel`. ### Bug Fixes #### 1. Fix multicast allocation handle leak in `createMulticast()` (`gpu_ipc_mem.cc`) `GpuIpcMemHandle::createMulticast()` called `cuMulticastCreate(&allocHandle, ...)` but never released the local `allocHandle` after exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak — the multicast object was never freed even after all mappings and imported handles were released. Per the [CUDA Driver API docs for `cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html): > "The memory allocation will be freed when all outstanding mappings to the memory are unmapped and when all outstanding references to the handle (including its shareable counterparts) are also released." The fix adds `cuMemRelease(allocHandle)` after export, matching the existing pattern used for regular allocations in `GpuIpcMemHandle::create()`. Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100. #### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`) `cuMemMap` requires `offset=0` for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the same `NvlsConnection`. The fix maps the entire range `[0, mcOffset + bufferSize)` and returns the pointer offset by `mcOffset`. This only consumes extra virtual address space; no additional physical memory is used. ### Refactoring #### 3. 
Rename NVLS allreduce algorithm classes Renamed for clarity: - `AllreduceNvls` → `AllreduceNvlsZeroCopy` - `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline` - `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline` Updated all references in builder, selector, docs, and examples. #### 4. Move `nvlsConnections` setup to `initialize()` Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this member) to individual algorithm class members, initialized in their `initialize()` methods. ### Tests #### 5. Add `TwoChannelsSameConnection` test New unit test that creates two `SwitchChannel` instances from the same `NvlsConnection`, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered the `cuMemMap` offset fix. ### Files Changed - `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap offset fix - `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` — nvlsConnections fix - `src/ext/collectives/include/allreduce/*.hpp` — renamed headers - `src/ext/collectives/algorithm_collection_builder.cc` — updated references - `src/ext/nccl/algorithm_selector.cc` — updated algorithm names - `test/mp_unit/switch_channel_tests.cu` — new test - `docs/guide/mscclpp-torch-integration.md` — updated names - `examples/torch-integration/customized_comm_with_default_algo.py` — updated names	2026-03-09 10:22:45 -07:00
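The address arithmetic behind the `cuMemMap` offset fix can be sketched as follows. This is a minimal illustration of the rule stated in the commit message (map from offset 0, cover `[0, mcOffset + bufferSize)`, return the base pointer advanced by `mcOffset`). The function name `plan_multicast_mapping` and its tuple return value are hypothetical, and real mappings must also respect allocation granularity, which is left out here.

```python
def plan_multicast_mapping(va_base, mc_offset, buffer_size):
    """Return (map_offset, map_size, user_ptr) for a buffer bound at mc_offset.

    Models the approach in the commit message: always map from offset 0,
    cover [0, mc_offset + buffer_size), then hand back va_base + mc_offset.
    """
    if mc_offset < 0 or buffer_size <= 0:
        raise ValueError("mc_offset must be >= 0 and buffer_size > 0")
    map_offset = 0                       # multicast handles require offset 0
    map_size = mc_offset + buffer_size   # costs extra virtual address space only
    user_ptr = va_base + mc_offset       # pointer to the caller's buffer
    return map_offset, map_size, user_ptr


MiB = 1 << 20
# A second 2 MiB buffer bound directly after a first 2 MiB buffer.
print(plan_multicast_mapping(va_base=0x7F0000000000, mc_offset=2 * MiB, buffer_size=2 * MiB))
```

The second buffer maps 4 MiB of address space even though it only uses the upper 2 MiB, which is the virtual-address overhead the commit message mentions.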
Binyang Li	3751f0299b	Fix NCCL fallback comm destroy and use latest NCCL release in CI (#760 ) ## Summary Fix NCCL fallback communicator cleanup errors and update CI to use stable NCCL releases. ## Problem When using `LD_PRELOAD=libmscclpp_nccl.so` with NCCL fallback enabled, the following warnings appear at program exit: ``` NCCL WARN commReclaim: cleanup comm 0x55a0dcadaa90 rank 3 failed in destroy/abort, error 3 ``` This is caused by three bugs in the NCCL fallback communicator lifecycle management. ## Root Causes & Fixes ### 1. Symbol interposition during NCCL cleanup (`RTLD_DEEPBIND`) Root cause: When the fallback NCCL library is loaded via `dlopen`, its internal calls to its own public API functions (e.g., `ncclCommWindowDeregister`, `ncclMemFree`) during `commFree` cleanup are intercepted by our `LD_PRELOAD`'d stub functions, which return errors. This causes NCCL's `commReclaim` to report `error 3` (`ncclSystemError`). Fix: Add `RTLD_DEEPBIND` to the `dlopen` flags. This makes the dlopen'd NCCL library resolve its own symbols internally first, bypassing our interposition layer for internal calls. ### 2. Missing `ncclCommFinalize` forwarding Root cause: `CommFinalize` was not in the `mscclppNcclOps_t` struct and was never loaded via `dlsym`. So `ncclCommFinalize` never forwarded to the real NCCL's finalize, which is required before `ncclCommDestroy` in NCCL 2.29+. Fix: Add `CommFinalize` to the ops struct and load it via `dlsym`. Forward the call in `ncclCommFinalize`. ### 3. CI: Use latest NCCL release tag The CI pipeline was cloning the NCCL default branch (which may contain unreleased/unstable code). Updated to fetch the latest release tag via GitHub API and clone that specific tag. 
## Testing Verified with the exact CI command: ```bash mpirun -np 8 --bind-to numa --allow-run-as-root \ -x LD_PRELOAD=libmscclpp_nccl.so \ -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE \ -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce" \ -x MSCCLPP_NCCL_LIB_PATH=/root/nccl/build/lib/libnccl.so \ all_reduce_perf -b 1K -e 1G -f 2 -d half -G 1 -w 10 -n 100 ``` - Before: `commReclaim: error 3` warnings on all 8 ranks at exit - After: Clean exit, no warnings, correct results ## Files Changed - `src/ext/nccl/nccl.cc` — Fix comm destroy lifecycle (RTLD_DEEPBIND, CommFinalize forwarding, destroy order) - `.azure-pipelines/templates/nccl-test.yaml` — Use latest NCCL release tag in CI	2026-03-06 16:33:35 -08:00
Xingbo Wu	69565a2f32	Do threadInit/cudaSetDevice before other cuda calls (#757 ) I recently encountered a weird memory usage issue. After starting the proxy service on a CUDA device X > 0, I noticed an unexpected thread entity appearing on both GPU X and GPU 0, where GPU 0's share is about 500MB. Note that when the device is 0, there is no extra memory usage. The image clearly shows that when 8 ranks each use one GPU and start proxies, GPU 0 sees 7 extra threads, each consuming 500MB of extra memory. [image: Screenshot 2026-02-28 000153] After tracking down when it happens, I identified the root cause in Proxy thread initialization: `// never capture in a proxy thread` `auto mode = cudaStreamCaptureModeRelaxed;` `MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));` `pimpl_->threadInit();` The call to cudaThreadExchangeStreamCaptureMode() actually triggers some resource allocation on the "current device", which is still 0 for the starting thread. The later threadInit() comes too late to set the correct GPU number. The fix is simple: call threadInit() before the first CUDA call: `pimpl_->threadInit();` `// never capture in a proxy thread` `auto mode = cudaStreamCaptureModeRelaxed;` `MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));` This guarantees that the current device is properly set before calling any resource-allocating CUDA functions. This is the memory usage after the fix. The extra memory usage is gone. [image: Image (1)] --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-03-02 15:53:59 -08:00
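The ordering issue in this commit can be pictured with a toy model. This is not the CUDA runtime: `FakeRuntime` is a hypothetical class that assumes only what the commit message reports, namely that a new thread's current device defaults to 0 and that the first runtime call creates state on whatever device is current at that moment.

```python
import threading


class FakeRuntime:
    """Toy model: each thread has a current device that defaults to 0, and
    any runtime call creates per-thread state on the current device."""

    def __init__(self):
        self._tls = threading.local()
        self._lock = threading.Lock()
        self.touched = []                 # devices that received per-thread state

    def set_device(self, dev):
        self._tls.device = dev

    def any_runtime_call(self):
        dev = getattr(self._tls, "device", 0)
        with self._lock:
            if dev not in self.touched:
                self.touched.append(dev)


def proxy_thread(rt, device, init_first):
    if init_first:
        rt.set_device(device)     # fixed order: threadInit() first
        rt.any_runtime_call()     # stands in for cudaThreadExchangeStreamCaptureMode
    else:
        rt.any_runtime_call()     # buggy order: runs while the device is still 0
        rt.set_device(device)
    rt.any_runtime_call()         # later proxy work on the intended device


def devices_touched(device, init_first):
    rt = FakeRuntime()
    t = threading.Thread(target=proxy_thread, args=(rt, device, init_first))
    t.start()
    t.join()
    return sorted(rt.touched)


print(devices_touched(device=3, init_first=False))
print(devices_touched(device=3, init_first=True))
```

With the buggy order the model touches both device 0 and device 3. With the fixed order only device 3 is touched, matching the observation that the extra usage on GPU 0 disappears.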
Caio Rocha	4bc1999001	Adding Support to Setting Message Size Range in Native Algorithm API (#758 )	2026-02-27 17:50:43 -08:00
Binyang Li	ab49386839	Add doc for perf tuning (#756 )	2026-02-27 10:59:36 -08:00
Binyang Li	25435acf5d	Add new algos for GB200 (#747 ) - Add new algos (allreduce_rsag, allreduce_rsag_pipeline and allreduce_rsag_zero_copy) for GB200. - Add IB stub for non-IB env - Provides example for algorithm tunning with different nblocks/nthreads Perf for allreduce_rsag ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 262144 float sum -1 25.16 41.67 62.51 0 23.73 44.18 66.27 0 2097152 524288 float sum -1 26.06 80.47 120.71 0 25.31 82.86 124.29 0 4194304 1048576 float sum -1 31.09 134.93 202.39 0 30.75 136.39 204.58 0 8388608 2097152 float sum -1 45.52 184.29 276.43 0 45.13 185.87 278.80 0 16777216 4194304 float sum -1 75.73 221.53 332.30 0 75.51 222.18 333.27 0 33554432 8388608 float sum -1 137.25 244.48 366.72 0 137.22 244.54 366.81 0 67108864 16777216 float sum -1 271.34 247.32 370.99 0 270.86 247.76 371.65 0 134217728 33554432 float sum -1 534.25 251.22 376.84 0 534.43 251.14 376.71 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 264.454 # # Collective test concluded: all_reduce_perf ``` perf for allreduce_rsag_pipeline ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 262144 float sum -1 61.57 17.03 25.55 0 61.51 17.05 25.57 0 2097152 524288 float sum -1 61.31 34.20 51.31 0 61.23 34.25 51.38 0 4194304 1048576 float sum -1 61.62 68.06 102.10 0 61.84 67.83 101.74 0 8388608 2097152 float sum -1 61.97 135.37 203.06 0 61.89 135.53 203.30 0 16777216 4194304 float sum -1 63.15 265.65 398.48 0 62.89 266.76 400.15 0 33554432 8388608 float sum -1 100.63 333.46 500.19 0 99.76 336.34 504.51 0 67108864 16777216 float sum -1 180.04 372.75 559.13 0 179.75 373.34 560.01 0 134217728 33554432 float sum -1 339.60 395.23 592.84 0 338.16 396.91 595.36 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 304.665 # # Collective test 
concluded: all_reduce_perf ``` perf for allreduce_rsag_zero_copy ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 262144 float sum -1 14.99 69.93 104.90 0 14.44 72.61 108.92 0 2097152 524288 float sum -1 16.19 129.56 194.33 0 15.85 132.32 198.48 0 4194304 1048576 float sum -1 21.19 197.98 296.97 0 20.64 203.20 304.81 0 8388608 2097152 float sum -1 31.04 270.27 405.41 0 30.68 273.44 410.16 0 16777216 4194304 float sum -1 50.34 333.26 499.89 0 50.15 334.51 501.77 0 33554432 8388608 float sum -1 89.58 374.56 561.84 0 88.65 378.48 567.73 0 67108864 16777216 float sum -1 165.69 405.03 607.54 0 163.64 410.10 615.16 0 134217728 33554432 float sum -1 323.19 415.28 622.93 0 318.01 422.05 633.07 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 414.619 # # Collective test concluded: all_reduce_perf ``` --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-24 16:43:23 -08:00
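The `algbw` and `busbw` columns in the tables above follow the nccl-tests convention: algorithm bandwidth is message size divided by time, and for allreduce the bus bandwidth scales it by 2(n-1)/n for n ranks. The sketch below recomputes the first out-of-place row of the `allreduce_rsag` table. The rank count of 4 is an inference from the table's busbw/algbw ratio of 1.5, not something the commit message states, and `allreduce_bandwidth` is a hypothetical helper name.

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s using the nccl-tests convention."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9      # bytes/s -> GB/s
    busbw = algbw * 2 * (nranks - 1) / nranks        # allreduce correction factor
    return algbw, busbw


# First out-of-place row of the allreduce_rsag table: 1048576 B in 25.16 us.
algbw, busbw = allreduce_bandwidth(1048576, 25.16, nranks=4)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The recomputed values agree with the table (41.67 and 62.51 GB/s) to within the rounding of the printed time.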
Binyang Li	184dcbf9d7	Add CI pipeline for no-IB environment testing (#755 ) ## Summary Add CI pipeline support for testing in environments without InfiniBand (IB) hardware. ## Changes ### IB stubs for no-IB builds (`src/core/ib.cc`) - Added stub implementations for `IbMr` and `IbQp` classes in the `#else // !defined(USE_IBVERBS)` block so the library links successfully when built with `-DMSCCLPP_USE_IB=OFF`. ### Environment variable to disable IB tests (`MSCCLPP_DISABLE_IB_TESTS`) - Added `disableIbTests` field to the `Env` class (`include/mscclpp/env.hpp`, `src/core/env.cpp`), reading from `MSCCLPP_DISABLE_IB_TESTS` env var. - Exposed as `disable_ib_tests` in Python bindings (`python/csrc/env_py.cpp`). - Updated `python/test/test_mscclpp.py` to skip IB-dependent tests (`create_group_and_connection` with IB transport, `test_h2h_semaphores`, `test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is true. ### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`) The no-IB environment pipeline runs two phases: 1. No-IB build phase: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run unit tests, multi-process unit tests, and pytests (with `MSCCLPP_DISABLE_IB_TESTS=1`). 2. IB build phase: Rebuild with IB enabled (default), stop the existing container, redeploy, and run pytests (with `MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build works correctly in a non-IB environment when IB tests are skipped. Also increased the job timeout from 40 to 60 minutes to accommodate the two-phase pipeline.	2026-02-24 15:55:59 -08:00
Caio Rocha	7738603d63	Adjusting Communicator in Python API (#752 )	2026-02-23 16:33:52 -08:00
Caio Rocha	b5256032fe	Disabling Nanobind Memory Leak Warnings in Release Builds (#745 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-23 11:55:17 -08:00
mahdiehghazim	2a6f1c1192	Mahdieh/switchchannel test clean (#751 ) This PR adds an example code for switch channel testing. It validates switch channel on single node and multi node environments. We need to add the description of the algorithms and the explanation of the code under doc. example outputs: rank0: ./bidir_switch_channel 10.0.5.233:45571 0 0 Rank 0 (GPU 0): Preparing for tests ... Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s Rank 0: Succeed! rank1: ./bidir_switch_channel 10.0.5.233:45571 1 0 Rank 1 (GPU 0): Preparing for tests ... Rank 1: Succeed!	2026-02-20 22:46:32 -05:00
Binyang Li	3962574bcb	Address installation issue in some env (#750 ) This pull request updates the way the `nlohmann/json` library is fetched and upgrades it to a newer version in both the main build and test configuration files. Addressed installation issue in some env	2026-02-20 16:11:16 -08:00
Caio Rocha	e2acf7f1c8	Removing MPI Dependency (#743 )	2026-02-20 16:04:12 -08:00
Binyang Li	39865c218b	address flagBuffer ownership issue (#749 ) This pull request updates the handling of the default flag buffer in the C++ and Python bindings to ensure proper memory management when interfacing with Python. It makes sure the buffer is not deallocated when ownership is transferred from C++ to Python.	2026-02-20 13:42:29 -08:00
Binyang Li	4701ae3a95	Update dtype name (#748 ) - Change FP8_E4M3/FP8_E5M2 to FLOAT8_E4M3/FLOAT8_E5M2 - Add torch.uint8 to DataType.uint8 mapping	2026-02-18 10:35:44 -08:00
Binyang Li	d0d5a8c034	Add new CI pipeline for RCCL test (#746 ) Add RCCL allreduce/allgather tests to the CI pipeline. Fix a hang issue introduced by PR #741. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-13 10:50:10 -08:00
Qinghua Zhou	edc9c38751	Support uint8 data type for Allreduce (#736 ) Support uint8 data type for Allreduce. Current limitation: uint8 is not supported for NVLS. Performance results with RCCL-test with MSCCLPP on MI300X: \# out-of-place in-place \# size count type redop root time algbw busbw #wrong time algbw busbw #wrong \# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 \| 512 \| half \| sum \| -1 \| 5.39 \| 0.19 \| 0.33 \| 0 \| 5.45 \| 0.19 \| 0.33 \| 0 -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- 2048 \| 1024 \| half \| sum \| -1 \| 5.53 \| 0.37 \| 0.65 \| 0 \| 5.63 \| 0.36 \| 0.64 \| 0 4096 \| 2048 \| half \| sum \| -1 \| 5.55 \| 0.74 \| 1.29 \| 0 \| 5.56 \| 0.74 \| 1.29 \| 0 8192 \| 4096 \| half \| sum \| -1 \| 5.8 \| 1.41 \| 2.47 \| 0 \| 5.84 \| 1.4 \| 2.46 \| 0 16384 \| 8192 \| half \| sum \| -1 \| 6.57 \| 2.49 \| 4.36 \| 0 \| 6.56 \| 2.5 \| 4.37 \| 0 32768 \| 16384 \| half \| sum \| -1 \| 8.02 \| 4.09 \| 7.15 \| 0 \| 8.06 \| 4.07 \| 7.11 \| 0 65536 \| 32768 \| half \| sum \| -1 \| 8.77 \| 7.47 \| 13.07 \| 0 \| 8.82 \| 7.43 \| 13 \| 0 131072 \| 65536 \| half \| sum \| -1 \| 9.61 \| 13.64 \| 23.87 \| 0 \| 9.78 \| 13.4 \| 23.45 \| 0 262144 \| 131072 \| half \| sum \| -1 \| 11.68 \| 22.44 \| 39.27 \| 0 \| 12.1 \| 21.67 \| 37.93 \| 0 524288 \| 262144 \| half \| sum \| -1 \| 13.77 \| 38.08 \| 66.64 \| 0 \| 13.87 \| 37.79 \| 66.13 \| 0 1048576 \| 524288 \| half \| sum \| -1 \| 19.11 \| 54.87 \| 96.03 \| 0 \| 19.27 \| 54.42 \| 95.24 \| 0 2097152 \| 1048576 \| half \| sum \| -1 \| 24.1 \| 87 \| 152.26 \| 0 \| 24.24 \| 86.52 \| 151.41 \| 0 4194304 \| 2097152 \| half \| sum \| -1 \| 37.16 \| 112.87 \| 197.52 \| 0 \| 37.44 \| 112.03 \| 196.06 \| 0 8388608 \| 4194304 \| half \| sum \| -1 \| 61.53 \| 136.33 \| 238.58 \| 0 \| 61.68 \| 135.99 \| 237.99 \| 0 16777216 \| 8388608 \| half \| sum \| -1 \| 108.8 \| 154.22 \| 269.88 \| 0 \| 109.2 \| 153.6 \| 268.79 \| 0 33554432 \| 16777216 \| half \| sum \| -1 \| 197.8 \| 169.68 \| 296.94 
\| 0 \| 198.6 \| 168.92 \| 295.61 \| 0 67108864 \| 33554432 \| half \| sum \| -1 \| 384.6 \| 174.51 \| 305.39 \| 0 \| 385.1 \| 174.27 \| 304.98 \| 0 134217728 \| 67108864 \| half \| sum \| -1 \| 754.1 \| 177.99 \| 311.48 \| 0 \| 754.9 \| 177.78 \| 311.12 \| 0 268435456 \| 134217728 \| half \| sum \| -1 \| 1491.8 \| 179.94 \| 314.89 \| 0 \| 1493.2 \| 179.77 \| 314.6 \| 0 536870912 \| 268435456 \| half \| sum \| -1 \| 2979.6 \| 180.18 \| 315.31 \| 0 \| 2983.9 \| 179.92 \| 314.87 \| 0 \# out-of-place in-place \# size count type redop root time algbw busbw #wrong time algbw busbw #wrong \# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 \| 1024 \| fp8_e4m3 \| sum \| -1 \| 5.4 \| 0.19 \| 0.33 \| 0 \| 5.45 \| 0.19 \| 0.33 \| 0 -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- 2048 \| 2048 \| fp8_e4m3 \| sum \| -1 \| 5.5 \| 0.37 \| 0.65 \| 0 \| 5.6 \| 0.37 \| 0.64 \| 0 4096 \| 4096 \| fp8_e4m3 \| sum \| -1 \| 5.61 \| 0.73 \| 1.28 \| 0 \| 5.68 \| 0.72 \| 1.26 \| 0 8192 \| 8192 \| fp8_e4m3 \| sum \| -1 \| 5.96 \| 1.38 \| 2.41 \| 0 \| 5.98 \| 1.37 \| 2.4 \| 0 16384 \| 16384 \| fp8_e4m3 \| sum \| -1 \| 6.49 \| 2.52 \| 4.42 \| 0 \| 6.58 \| 2.49 \| 4.36 \| 0 32768 \| 32768 \| fp8_e4m3 \| sum \| -1 \| 8.09 \| 4.05 \| 7.09 \| 0 \| 8.15 \| 4.02 \| 7.03 \| 0 65536 \| 65536 \| fp8_e4m3 \| sum \| -1 \| 8.58 \| 7.64 \| 13.37 \| 0 \| 8.7 \| 7.53 \| 13.18 \| 0 131072 \| 131072 \| fp8_e4m3 \| sum \| -1 \| 9.44 \| 13.88 \| 24.29 \| 0 \| 9.62 \| 13.63 \| 23.85 \| 0 262144 \| 262144 \| fp8_e4m3 \| sum \| -1 \| 10.12 \| 25.9 \| 45.32 \| 0 \| 10.37 \| 25.27 \| 44.22 \| 0 524288 \| 524288 \| fp8_e4m3 \| sum \| -1 \| 13.73 \| 38.19 \| 66.82 \| 0 \| 13.89 \| 37.74 \| 66.04 \| 0 1048576 \| 1048576 \| fp8_e4m3 \| sum \| -1 \| 18.66 \| 56.2 \| 98.34 \| 0 \| 18.92 \| 55.41 \| 96.97 \| 0 2097152 \| 2097152 \| fp8_e4m3 \| sum \| -1 \| 24.54 \| 85.46 \| 149.56 \| 0 \| 24.63 \| 85.16 \| 149.03 \| 0 4194304 \| 4194304 \| fp8_e4m3 \| sum \| -1 \| 37.79 \| 110.98 \| 
194.21 \| 0 \| 38.05 \| 110.22 \| 192.88 \| 0 8388608 \| 8388608 \| fp8_e4m3 \| sum \| -1 \| 62.22 \| 134.82 \| 235.94 \| 0 \| 62.63 \| 133.94 \| 234.4 \| 0 16777216 \| 16777216 \| fp8_e4m3 \| sum \| -1 \| 109.9 \| 152.62 \| 267.09 \| 0 \| 110.4 \| 151.9 \| 265.83 \| 0 33554432 \| 33554432 \| fp8_e4m3 \| sum \| -1 \| 201.1 \| 166.82 \| 291.94 \| 0 \| 202.3 \| 165.84 \| 290.22 \| 0 67108864 \| 67108864 \| fp8_e4m3 \| sum \| -1 \| 390 \| 172.06 \| 301.11 \| 0 \| 390.2 \| 171.99 \| 300.99 \| 0 134217728 \| 134217728 \| fp8_e4m3 \| sum \| -1 \| 763.9 \| 175.7 \| 307.47 \| 0 \| 764.2 \| 175.62 \| 307.34 \| 0 268435456 \| 268435456 \| fp8_e4m3 \| sum \| -1 \| 1509.5 \| 177.83 \| 311.2 \| 0 \| 1510.1 \| 177.76 \| 311.08 \| 0 536870912 \| 536870912 \| fp8_e4m3 \| sum \| -1 \| 3010.2 \| 178.35 \| 312.11 \| 0 \| 3014.2 \| 178.11 \| 311.7 \| 0 \# out-of-place in-place \# size count type redop root time algbw busbw #wrong time algbw busbw #wrong \# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 \| 1024 \| fp8_e5m2 \| sum \| -1 \| 5.41 \| 0.19 \| 0.33 \| 0 \| 5.44 \| 0.19 \| 0.33 \| 0 -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- 2048 \| 2048 \| fp8_e5m2 \| sum \| -1 \| 5.5 \| 0.37 \| 0.65 \| 0 \| 5.67 \| 0.36 \| 0.63 \| 0 4096 \| 4096 \| fp8_e5m2 \| sum \| -1 \| 5.61 \| 0.73 \| 1.28 \| 0 \| 5.69 \| 0.72 \| 1.26 \| 0 8192 \| 8192 \| fp8_e5m2 \| sum \| -1 \| 5.96 \| 1.37 \| 2.4 \| 0 \| 6 \| 1.36 \| 2.39 \| 0 16384 \| 16384 \| fp8_e5m2 \| sum \| -1 \| 6.63 \| 2.47 \| 4.32 \| 0 \| 6.59 \| 2.49 \| 4.35 \| 0 32768 \| 32768 \| fp8_e5m2 \| sum \| -1 \| 8.07 \| 4.06 \| 7.1 \| 0 \| 8.16 \| 4.02 \| 7.03 \| 0 65536 \| 65536 \| fp8_e5m2 \| sum \| -1 \| 8.62 \| 7.61 \| 13.31 \| 0 \| 8.73 \| 7.51 \| 13.14 \| 0 131072 \| 131072 \| fp8_e5m2 \| sum \| -1 \| 9.43 \| 13.9 \| 24.33 \| 0 \| 9.6 \| 13.66 \| 23.9 \| 0 262144 \| 262144 \| fp8_e5m2 \| sum \| -1 \| 10.11 \| 25.94 \| 45.39 \| 0 \| 10.38 \| 25.26 \| 44.21 \| 0 524288 \| 524288 \| fp8_e5m2 \| sum \| 
-1 \| 13.73 \| 38.19 \| 66.84 \| 0 \| 13.87 \| 37.79 \| 66.13 \| 0 1048576 \| 1048576 \| fp8_e5m2 \| sum \| -1 \| 18.65 \| 56.22 \| 98.39 \| 0 \| 18.93 \| 55.38 \| 96.92 \| 0 2097152 \| 2097152 \| fp8_e5m2 \| sum \| -1 \| 24.54 \| 85.47 \| 149.57 \| 0 \| 24.63 \| 85.16 \| 149.03 \| 0 4194304 \| 4194304 \| fp8_e5m2 \| sum \| -1 \| 37.84 \| 110.83 \| 193.96 \| 0 \| 38.01 \| 110.36 \| 193.12 \| 0 8388608 \| 8388608 \| fp8_e5m2 \| sum \| -1 \| 62.32 \| 134.61 \| 235.58 \| 0 \| 62.55 \| 134.12 \| 234.71 \| 0 16777216 \| 16777216 \| fp8_e5m2 \| sum \| -1 \| 110 \| 152.58 \| 267.01 \| 0 \| 110.3 \| 152.12 \| 266.21 \| 0 33554432 \| 33554432 \| fp8_e5m2 \| sum \| -1 \| 201.1 \| 166.9 \| 292.07 \| 0 \| 201.8 \| 166.26 \| 290.96 \| 0 67108864 \| 67108864 \| fp8_e5m2 \| sum \| -1 \| 390 \| 172.07 \| 301.12 \| 0 \| 390.5 \| 171.87 \| 300.78 \| 0 134217728 \| 134217728 \| fp8_e5m2 \| sum \| -1 \| 763.9 \| 175.69 \| 307.46 \| 0 \| 764.5 \| 175.56 \| 307.23 \| 0 268435456 \| 268435456 \| fp8_e5m2 \| sum \| -1 \| 1509.4 \| 177.84 \| 311.22 \| 0 \| 1509.8 \| 177.8 \| 311.14 \| 0 536870912 \| 536870912 \| fp8_e5m2 \| sum \| -1 \| 3013 \| 178.18 \| 311.82 \| 0 \| 3018 \| 177.89 \| 311.31 \| 0 \# out-of-place in-place \# size count type redop root time algbw busbw #wrong time algbw busbw #wrong \# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 \| 1024 \| uint8 \| sum \| -1 \| 5.46 \| 0.19 \| 0.33 \| 0 \| 5.46 \| 0.19 \| 0.33 \| 0 -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| -- 2048 \| 2048 \| uint8 \| sum \| -1 \| 5.54 \| 0.37 \| 0.65 \| 0 \| 5.63 \| 0.36 \| 0.64 \| 0 4096 \| 4096 \| uint8 \| sum \| -1 \| 5.61 \| 0.73 \| 1.28 \| 0 \| 5.63 \| 0.73 \| 1.27 \| 0 8192 \| 8192 \| uint8 \| sum \| -1 \| 5.9 \| 1.39 \| 2.43 \| 0 \| 5.9 \| 1.39 \| 2.43 \| 0 16384 \| 16384 \| uint8 \| sum \| -1 \| 6.6 \| 2.48 \| 4.35 \| 0 \| 6.64 \| 2.47 \| 4.32 \| 0 32768 \| 32768 \| uint8 \| sum \| -1 \| 8.99 \| 3.65 \| 6.38 \| 0 \| 8.99 \| 3.64 \| 6.38 \| 0 65536 \| 
65536 \| uint8 \| sum \| -1 \| 9.44 \| 6.94 \| 12.15 \| 0 \| 9.58 \| 6.84 \| 11.98 \| 0 131072 \| 131072 \| uint8 \| sum \| -1 \| 11.72 \| 11.18 \| 19.57 \| 0 \| 11.83 \| 11.08 \| 19.4 \| 0 262144 \| 262144 \| uint8 \| sum \| -1 \| 12.29 \| 21.32 \| 37.31 \| 0 \| 12.45 \| 21.05 \| 36.84 \| 0 524288 \| 524288 \| uint8 \| sum \| -1 \| 13.87 \| 37.8 \| 66.15 \| 0 \| 13.93 \| 37.64 \| 65.88 \| 0 1048576 \| 1048576 \| uint8 \| sum \| -1 \| 19.11 \| 54.88 \| 96.04 \| 0 \| 19.3 \| 54.33 \| 95.08 \| 0 2097152 \| 2097152 \| uint8 \| sum \| -1 \| 24.38 \| 86.01 \| 150.51 \| 0 \| 24.52 \| 85.53 \| 149.67 \| 0 4194304 \| 4194304 \| uint8 \| sum \| -1 \| 37.52 \| 111.78 \| 195.61 \| 0 \| 37.76 \| 111.08 \| 194.39 \| 0 8388608 \| 8388608 \| uint8 \| sum \| -1 \| 62.4 \| 134.44 \| 235.26 \| 0 \| 62.56 \| 134.1 \| 234.67 \| 0 16777216 \| 16777216 \| uint8 \| sum \| -1 \| 110.2 \| 152.22 \| 266.39 \| 0 \| 110.3 \| 152.04 \| 266.08 \| 0 33554432 \| 33554432 \| uint8 \| sum \| -1 \| 199.8 \| 167.94 \| 293.9 \| 0 \| 197.5 \| 169.88 \| 297.29 \| 0 67108864 \| 67108864 \| uint8 \| sum \| -1 \| 386.3 \| 173.73 \| 304.03 \| 0 \| 378.4 \| 177.37 \| 310.39 \| 0 134217728 \| 134217728 \| uint8 \| sum \| -1 \| 758 \| 177.07 \| 309.87 \| 0 \| 741.1 \| 181.12 \| 316.95 \| 0 268435456 \| 268435456 \| uint8 \| sum \| -1 \| 1500.1 \| 178.95 \| 313.16 \| 0 \| 1466.2 \| 183.09 \| 320.4 \| 0 536870912 \| 536870912 \| uint8 \| sum \| -1 \| 2991.7 \| 179.45 \| 314.04 \| 0 \| 2924.8 \| 183.56 \| 321.23 \| 0 --------- Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>	2026-02-13 10:49:25 -08:00
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741 ) This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables. 1. Algorithm Selection Refactoring: Use a separate class for algorithm selection. This allows more complex selection logic based on message size, architecture, whether CUDA graph is enabled, and the memory allocation method. 2. Symmetric Memory Support: Introduce a symmetricMemory parameter in algorithm context key generation. Remove the disableChannelCache env because it is ambiguous. 3. New args for build_default_algorithms: Add flag_buffer and flag_buffer_size args for building the default algorithms. A unified flag buffer can then be shared by different algorithms, which avoids application hangs when switching algorithms for different message sizes. --------- Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-12 19:06:18 -08:00
Caio Rocha	dff3bc7bbb	Support Fusion for ReadPutPacket Operation at DSL (#742 ) Support is being added for fusing the ReadPutPacket operation in the DSL, which reduces the overhead caused by reading packet data multiple times from the scratch buffer. Fusion occurs when two rppkt operations are executed consecutively with the same src_buffer: rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1]) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-12 17:27:20 -08:00
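The fusion rule stated in this commit can be sketched as a small peephole pass. This is an illustration only and not the MSCCL++ DSL implementation: the `(name, src, dsts)` tuple representation and the function `fuse_rppkt` are hypothetical, chosen just to show how two consecutive `rppkt` operations with the same source collapse into one.

```python
def fuse_rppkt(ops):
    """Fuse consecutive rppkt ops that share a source buffer.

    Each op is a (name, src, dsts) tuple; dsts is a list of destinations.
    """
    fused = []
    for name, src, dsts in ops:
        prev = fused[-1] if fused else None
        if name == "rppkt" and prev and prev[0] == "rppkt" and prev[1] == src:
            prev[2].extend(dsts)          # append to the previous op's destinations
        else:
            fused.append((name, src, list(dsts)))
    return fused


ops = [("rppkt", "scratch0", ["dst0"]),
       ("rppkt", "scratch0", ["dst1"]),
       ("rppkt", "scratch1", ["dst2"])]
print(fuse_rppkt(ops))
```

Only adjacent operations are merged in this sketch, matching the "executed consecutively" condition. An operation with a different source, or any non-`rppkt` operation in between, starts a new group.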
Changho Hwang	42be3660e0	Add a new IB stack impl that doesn't use RDMA atomics (#728 ) * Added configurable InfiniBand (IB) signaling mode. The `EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`, `HostNoAtomic`). `Default` is equivalent to `Host` unless specified otherwise by the environment variable `MSCCLPP_IBV_MODE`. `Host` corresponds to the previous implementation using RDMA atomics for signaling, while `HostNoAtomic` uses write-with-immediate instead. * Updated the Python bindings and API accordingly.	2026-02-10 01:07:53 +00:00
Binyang Li	c12822a7af	create CI pipeline for rocm (#718 ) Create CI pipeline for AMD GPU.	2026-02-09 16:55:16 -08:00
Changho Hwang	d7925448f3	Update `copilot-instructions.md` (#722 )	2026-02-06 11:27:01 -08:00
Qinghua Zhou	620378b4fb	Fix cpplint error in main branch (#740 ) Fix the legacy cpplint error in main branch. --------- Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-05 09:25:12 -08:00
Binyang Li	dc747b1522	Refactor reduce kernel (#738 ) - Put the common reduce kernel to reduce_kernel.hpp - Implement operator overloading for the vector type - Clean up the duplicated code at `executor_ kernel.hpp` and `allreduce/common.hpp`	2026-02-05 09:23:43 -08:00
Binyang Li	e21513791a	Address comments for PR #692 (#733 ) Rename nanobind-exposed C++ types to Cpp* Replace MSCCLPP_EXECUTION_PLAN_DIR / MSCCLPP_NATIVE_CACHE_DIR with MSCCLPP_CACHE_DIR across C++ and Python.	2026-02-03 10:13:20 -08:00

937 Commits