## Summary
Fix NCCL fallback communicator cleanup errors and update CI to use
stable NCCL releases.
## Problem
When using `LD_PRELOAD=libmscclpp_nccl.so` with NCCL fallback enabled,
the following warnings appear at program exit:
```
NCCL WARN commReclaim: cleanup comm 0x55a0dcadaa90 rank 3 failed in destroy/abort, error 3
```
This is caused by two bugs in the NCCL fallback communicator lifecycle
management, compounded by CI building against unreleased NCCL code.
## Root Causes & Fixes
### 1. Symbol interposition during NCCL cleanup (`RTLD_DEEPBIND`)
**Root cause:** When the fallback NCCL library is loaded via `dlopen`,
its internal calls to its own public API functions (e.g.,
`ncclCommWindowDeregister`, `ncclMemFree`) during `commFree` cleanup are
intercepted by our `LD_PRELOAD`'d stub functions, which return errors.
This causes NCCL's `commReclaim` to report `error 3`
(`ncclSystemError`).
**Fix:** Add `RTLD_DEEPBIND` to the `dlopen` flags. This makes the
dlopen'd NCCL library resolve its own symbols internally first,
bypassing our interposition layer for internal calls.
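For illustration, a minimal sketch of the loading pattern, assuming the
library path comes from `MSCCLPP_NCCL_LIB_PATH` (the helper name is
hypothetical; the actual code in `src/ext/nccl/nccl.cc` is structured
differently):
```cpp
// g++ defines _GNU_SOURCE by default, which exposes RTLD_DEEPBIND in <dlfcn.h>.
#include <dlfcn.h>
#include <stdexcept>
#include <string>

// Hypothetical helper illustrating the flag change.
void* loadFallbackNccl(const std::string& path) {
  // RTLD_DEEPBIND makes the loaded library prefer its own definitions of
  // its public symbols (e.g. ncclMemFree), so internal cleanup calls in
  // commFree no longer route through our LD_PRELOAD'd stubs.
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (handle == nullptr) {
    throw std::runtime_error(std::string("dlopen failed: ") + dlerror());
  }
  return handle;
}
```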
### 2. Missing `ncclCommFinalize` forwarding
**Root cause:** `CommFinalize` was not in the `mscclppNcclOps_t` struct
and was never loaded via `dlsym`. So `ncclCommFinalize` never forwarded
to the real NCCL's finalize, which is required before `ncclCommDestroy`
in NCCL 2.29+.
**Fix:** Add `CommFinalize` to the ops struct and load it via `dlsym`.
Forward the call in `ncclCommFinalize`.
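A hedged sketch of the forwarding pattern (types come from `nccl.h`; the
real `mscclppNcclOps_t` layout, symbol loading, and fallback gating
differ):
```cpp
#include <dlfcn.h>
#include <nccl.h>  // ncclComm_t, ncclResult_t, ncclSuccess

struct mscclppNcclOps_t {
  // ... other forwarded NCCL entry points ...
  ncclResult_t (*CommFinalize)(ncclComm_t);
};

static mscclppNcclOps_t gNcclOps;  // hypothetical global, filled at library load

void loadFallbackSymbols(void* handle) {
  // ... existing dlsym lookups ...
  gNcclOps.CommFinalize = reinterpret_cast<ncclResult_t (*)(ncclComm_t)>(
      dlsym(handle, "ncclCommFinalize"));
}

// Our interposed entry point forwards to the real NCCL; since 2.29,
// finalize must run before ncclCommDestroy for a clean shutdown.
ncclResult_t ncclCommFinalize(ncclComm_t comm) {
  if (gNcclOps.CommFinalize != nullptr) return gNcclOps.CommFinalize(comm);
  return ncclSuccess;
}
```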
### 3. CI: Use latest NCCL release tag
The CI pipeline was cloning the NCCL default branch, which may contain
unreleased/unstable code. It now fetches the latest release tag via the
GitHub API and clones that specific tag.
## Testing
Verified with the exact CI command:
```bash
mpirun -np 8 --bind-to numa --allow-run-as-root \
-x LD_PRELOAD=libmscclpp_nccl.so \
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE \
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce" \
-x MSCCLPP_NCCL_LIB_PATH=/root/nccl/build/lib/libnccl.so \
all_reduce_perf -b 1K -e 1G -f 2 -d half -G 1 -w 10 -n 100
```
- **Before:** `commReclaim: error 3` warnings on all 8 ranks at exit
- **After:** Clean exit, no warnings, correct results
## Files Changed
- `src/ext/nccl/nccl.cc` — Fix comm destroy lifecycle (RTLD_DEEPBIND,
CommFinalize forwarding, destroy order)
- `.azure-pipelines/templates/nccl-test.yaml` — Use latest NCCL release
tag in CI
I recently encountered an odd memory usage issue.
After starting the proxy service on a CUDA device X > 0, an unexpected
thread entry appears on both GPU X and GPU 0, where GPU 0's share is
about 500 MB. Note that when the device is 0, there is no extra memory
usage.
The image below shows that with 8 ranks, each using one GPU and starting
a proxy, GPU 0 sees 7 extra threads, each consuming about 500 MB of
extra memory.
<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153"
src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4"
/>
After tracking down when it happens, I identified the root cause in the
proxy thread initialization:
```cpp
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

pimpl_->threadInit();
```
The call to `cudaThreadExchangeStreamCaptureMode()` triggers resource
allocation on the "current device", which is still device 0 for the
newly started thread. The later `threadInit()` comes too late to set the
correct GPU. The fix is simple: call `threadInit()` before the first
CUDA call:
```cpp
pimpl_->threadInit();

// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
```
This guarantees that the current device is properly set before any
resource-allocating CUDA function is called.
This is the memory usage after the fix; the extra memory usage is gone.
<img width="1242" height="459" alt="Image (1)"
src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9"
/>
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
## Summary
Add CI pipeline support for testing in environments without InfiniBand
(IB) hardware.
## Changes
### IB stubs for no-IB builds (`src/core/ib.cc`)
- Added stub implementations for `IbMr` and `IbQp` classes in the `#else
// !defined(USE_IBVERBS)` block so the library links successfully when
built with `-DMSCCLPP_USE_IB=OFF`.
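A minimal sketch of the stub pattern, with hypothetical member functions
(the real `IbMr`/`IbQp` interfaces are larger): each body throws instead
of calling into libibverbs, so a no-IB build links, and any accidental
IB use fails loudly.
```cpp
#if !defined(USE_IBVERBS)

#include <stdexcept>

namespace mscclpp {

class IbMr {
 public:
  // getBuff() is a hypothetical member shown for illustration.
  void* getBuff() const { throw std::runtime_error("built without ibverbs"); }
};

class IbQp {
 public:
  // postSend() is likewise hypothetical.
  void postSend() { throw std::runtime_error("built without ibverbs"); }
};

}  // namespace mscclpp

#endif  // !defined(USE_IBVERBS)
```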
### Environment variable to disable IB tests
(`MSCCLPP_DISABLE_IB_TESTS`)
- Added a `disableIbTests` field to the `Env` class
(`include/mscclpp/env.hpp`, `src/core/env.cpp`), read from the
`MSCCLPP_DISABLE_IB_TESTS` env var (a minimal sketch follows this list).
- Exposed as `disable_ib_tests` in Python bindings
(`python/csrc/env_py.cpp`).
- Updated `python/test/test_mscclpp.py` to skip IB-dependent tests
(`create_group_and_connection` with IB transport, `test_h2h_semaphores`,
`test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is
true.
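A minimal sketch of the env plumbing, assuming a simple boolean parser
(the actual helpers and `Env` layout in `src/core/env.cpp` may differ):
```cpp
#include <cstdlib>
#include <string>

namespace {
// Hypothetical parser: treats "1"/"true"/"TRUE" as enabled.
bool readBoolEnv(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr) return false;
  std::string s(value);
  return s == "1" || s == "true" || s == "TRUE";
}
}  // namespace

struct Env {
  // ... existing fields ...
  const bool disableIbTests;
  Env() : disableIbTests(readBoolEnv("MSCCLPP_DISABLE_IB_TESTS")) {}
};
```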
### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`)
The no-IB environment pipeline runs two phases:
1. **No-IB build phase**: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run
unit tests, multi-process unit tests, and pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`).
2. **IB build phase**: Rebuild with IB enabled (default), stop the
existing container, redeploy, and run pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build
works correctly in a non-IB environment when IB tests are skipped.
Also increased the job timeout from 40 to 60 minutes to accommodate the
two-phase pipeline.
This PR adds example code for switch channel testing. It validates the
switch channel in single-node and multi-node environments. The
description of the algorithms and an explanation of the code still need
to be added under the docs.
Example outputs:

rank 0:
```
./bidir_switch_channel 10.0.5.233:45571 0 0
Rank 0 (GPU 0): Preparing for tests ...
Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s
Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s
Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s
Rank 0: Succeed!
```

rank 1:
```
./bidir_switch_channel 10.0.5.233:45571 1 0
Rank 1 (GPU 0): Preparing for tests ...
Rank 1: Succeed!
```
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
This also addresses an installation issue seen in some environments.
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.
This makes sure the buffer is not deallocated when ownership is
transferred from C++ to Python.
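A minimal pybind11-style sketch of the ownership pattern (the project's
actual binding code and buffer type differ; `FlagBuffer` and
`defaultFlagArray` are hypothetical names): the owning `shared_ptr` is
stashed in a capsule and passed as the array's base object, so Python
keeps the C++ buffer alive for as long as any view of it exists.
```cpp
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>

#include <cstdint>
#include <memory>
#include <vector>

namespace py = pybind11;

using FlagBuffer = std::vector<uint64_t>;  // hypothetical buffer type

py::array_t<uint64_t> defaultFlagArray(std::shared_ptr<FlagBuffer> buf) {
  // Heap-allocate a copy of the shared_ptr; the capsule destructor
  // releases it only when Python drops its last reference, so the
  // underlying memory cannot be deallocated prematurely.
  auto* keepAlive = new std::shared_ptr<FlagBuffer>(buf);
  py::capsule owner(keepAlive, [](void* p) {
    delete static_cast<std::shared_ptr<FlagBuffer>*>(p);
  });
  return py::array_t<uint64_t>({buf->size()}, {sizeof(uint64_t)},
                               buf->data(), owner);
}
```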