mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 17:26:04 +00:00

Author	SHA1	Message	Date
Changho Hwang	3b56b08bcb	data direct	2026-03-04 23:36:39 +00:00
Changho Hwang	6b2f8199c6	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-02-26 12:41:19 -08:00
Changho Hwang	060982d253	updates	2026-02-26 12:40:58 -08:00
Changho Hwang	67d170674d	optimized recv loop	2026-02-25 19:59:19 -08:00
Changho Hwang	fd7358d9fb	License, lint	2026-02-24 20:30:37 -08:00
Changho Hwang	8effd97bad	License	2026-02-24 20:29:12 -08:00
Changho Hwang	72407af2c1	License	2026-02-24 20:28:32 -08:00
Changho Hwang	ac022c333c	a few updates	2026-02-24 20:25:25 -08:00
Binyang Li	25435acf5d	Add new algos for GB200 (#747 )
- Add new algos (allreduce_rsag, allreduce_rsag_pipeline and allreduce_rsag_zero_copy) for GB200.
- Add IB stub for non-IB env
- Provides example for algorithm tuning with different nblocks/nthreads

Perf for allreduce_rsag
```
#                                                    out-of-place                       in-place
#      size      count   type  redop  root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#       (B) (elements)                         (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    1048576     262144  float    sum    -1    25.16   41.67   62.51       0    23.73   44.18   66.27       0
    2097152     524288  float    sum    -1    26.06   80.47  120.71       0    25.31   82.86  124.29       0
    4194304    1048576  float    sum    -1    31.09  134.93  202.39       0    30.75  136.39  204.58       0
    8388608    2097152  float    sum    -1    45.52  184.29  276.43       0    45.13  185.87  278.80       0
   16777216    4194304  float    sum    -1    75.73  221.53  332.30       0    75.51  222.18  333.27       0
   33554432    8388608  float    sum    -1   137.25  244.48  366.72       0   137.22  244.54  366.81       0
   67108864   16777216  float    sum    -1   271.34  247.32  370.99       0   270.86  247.76  371.65       0
  134217728   33554432  float    sum    -1   534.25  251.22  376.84       0   534.43  251.14  376.71       0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 264.454
#
# Collective test concluded: all_reduce_perf
```
Perf for allreduce_rsag_pipeline
```
#                                                    out-of-place                       in-place
#      size      count   type  redop  root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#       (B) (elements)                         (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    1048576     262144  float    sum    -1    61.57   17.03   25.55       0    61.51   17.05   25.57       0
    2097152     524288  float    sum    -1    61.31   34.20   51.31       0    61.23   34.25   51.38       0
    4194304    1048576  float    sum    -1    61.62   68.06  102.10       0    61.84   67.83  101.74       0
    8388608    2097152  float    sum    -1    61.97  135.37  203.06       0    61.89  135.53  203.30       0
   16777216    4194304  float    sum    -1    63.15  265.65  398.48       0    62.89  266.76  400.15       0
   33554432    8388608  float    sum    -1   100.63  333.46  500.19       0    99.76  336.34  504.51       0
   67108864   16777216  float    sum    -1   180.04  372.75  559.13       0   179.75  373.34  560.01       0
  134217728   33554432  float    sum    -1   339.60  395.23  592.84       0   338.16  396.91  595.36       0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 304.665
#
# Collective test concluded: all_reduce_perf
```
Perf for allreduce_rsag_zero_copy
```
#                                                    out-of-place                       in-place
#      size      count   type  redop  root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#       (B) (elements)                         (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    1048576     262144  float    sum    -1    14.99   69.93  104.90       0    14.44   72.61  108.92       0
    2097152     524288  float    sum    -1    16.19  129.56  194.33       0    15.85  132.32  198.48       0
    4194304    1048576  float    sum    -1    21.19  197.98  296.97       0    20.64  203.20  304.81       0
    8388608    2097152  float    sum    -1    31.04  270.27  405.41       0    30.68  273.44  410.16       0
   16777216    4194304  float    sum    -1    50.34  333.26  499.89       0    50.15  334.51  501.77       0
   33554432    8388608  float    sum    -1    89.58  374.56  561.84       0    88.65  378.48  567.73       0
   67108864   16777216  float    sum    -1   165.69  405.03  607.54       0   163.64  410.10  615.16       0
  134217728   33554432  float    sum    -1   323.19  415.28  622.93       0   318.01  422.05  633.07       0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 414.619
#
# Collective test concluded: all_reduce_perf
```
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-24 16:43:23 -08:00
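The algbw and busbw columns in the perf output above appear to follow the nccl-tests convention for allreduce: algbw is message size divided by time, and busbw is algbw scaled by 2(n-1)/n for n ranks. The sketch below recomputes the first allreduce_rsag row under that convention. The rank count of 4 is an assumption inferred from the busbw/algbw ratio of about 1.5; the commit message does not state it, and `allreduce_bandwidth` is a hypothetical helper name.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s using the nccl-tests allreduce convention."""
    # Algorithm bandwidth: bytes moved per second, reported in units of 1e9 B/s.
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth: algbw scaled by the allreduce correction factor 2(n-1)/n.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# First out-of-place row of the allreduce_rsag table: 1048576 B in 25.16 us.
# n_ranks=4 is an assumption (see the note above), not a value from the commit.
algbw, busbw = allreduce_bandwidth(1048576, 25.16, n_ranks=4)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The recomputed values land within rounding distance of the reported 41.67 and 62.51 GB/s; small differences are expected because the table's time column is itself rounded.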
Binyang Li	184dcbf9d7	Add CI pipeline for no-IB environment testing (#755 )
## Summary
Add CI pipeline support for testing in environments without InfiniBand (IB) hardware.
## Changes
### IB stubs for no-IB builds (`src/core/ib.cc`)
- Added stub implementations for `IbMr` and `IbQp` classes in the `#else // !defined(USE_IBVERBS)` block so the library links successfully when built with `-DMSCCLPP_USE_IB=OFF`.
### Environment variable to disable IB tests (`MSCCLPP_DISABLE_IB_TESTS`)
- Added `disableIbTests` field to the `Env` class (`include/mscclpp/env.hpp`, `src/core/env.cpp`), reading from `MSCCLPP_DISABLE_IB_TESTS` env var.
- Exposed as `disable_ib_tests` in Python bindings (`python/csrc/env_py.cpp`).
- Updated `python/test/test_mscclpp.py` to skip IB-dependent tests (`create_group_and_connection` with IB transport, `test_h2h_semaphores`, `test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is true.
### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`)
The no-IB environment pipeline runs two phases:
1. No-IB build phase: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run unit tests, multi-process unit tests, and pytests (with `MSCCLPP_DISABLE_IB_TESTS=1`).
2. IB build phase: Rebuild with IB enabled (default), stop the existing container, redeploy, and run pytests (with `MSCCLPP_DISABLE_IB_TESTS=1`), verifying that the full IB-enabled build works correctly in a non-IB environment when IB tests are skipped.
Also increased the job timeout from 40 to 60 minutes to accommodate the two-phase pipeline.	2026-02-24 15:55:59 -08:00
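The test-skipping behaviour described in this commit can be illustrated with a small standalone sketch. It does not use the mscclpp Python bindings: `ib_tests_disabled` is a hypothetical stand-in for `env().disable_ib_tests`, and treating only the value "1" as enabled is an assumption about how the variable is parsed. The standard-library `unittest` module is used here rather than the project's own pytest setup.

```python
import os
import unittest


def ib_tests_disabled(environ=os.environ):
    # Hypothetical stand-in for env().disable_ib_tests.
    # Assumption: only the literal value "1" disables the IB tests.
    return environ.get("MSCCLPP_DISABLE_IB_TESTS", "0") == "1"


class H2HSemaphoreTest(unittest.TestCase):
    # The decorator is evaluated when the class is defined, so the variable
    # must be set before the test module is imported.
    @unittest.skipIf(ib_tests_disabled(), "IB tests disabled by environment")
    def test_h2h_semaphores(self):
        # Placeholder body; a real test would exercise an IB transport here.
        self.assertTrue(True)
```

Running the module with `MSCCLPP_DISABLE_IB_TESTS=1 python -m unittest` would report the test as skipped rather than failed, which is the property the no-IB pipeline relies on.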
Changho Hwang	ac4d713062	updates	2026-02-23 20:08:15 -08:00
Changho Hwang	75dfdd9e20	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-02-23 19:14:13 -08:00
Changho Hwang	25f31b499e	updates	2026-02-23 19:13:10 -08:00
Changho Hwang	22e5efb8dd	gdrcopy install in container	2026-02-23 18:15:38 -08:00
Changho Hwang	98b023adc6	rocm fixes	2026-02-23 18:13:57 -08:00
Caio Rocha	7738603d63	Adjusting Communicator in Python API (#752 )	2026-02-23 16:33:52 -08:00
Caio Rocha	b5256032fe	Disabling Nanobind Memory Leak Warnings in Release Builds (#745 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-23 11:55:17 -08:00
Changho Hwang	54e46ba8a6	rocm fix wip	2026-02-23 11:31:33 -08:00
Changho Hwang	febdbf9230	WIP; need amd fix	2026-02-21 00:02:03 -08:00
mahdiehghazim	2a6f1c1192	Mahdieh/switchchannel test clean (#751 )
This PR adds example code for switch channel testing. It validates the switch channel in single-node and multi-node environments. We need to add the description of the algorithms and the explanation of the code under doc.
Example outputs:
rank0: ./bidir_switch_channel 10.0.5.233:45571 0 0
Rank 0 (GPU 0): Preparing for tests ...
Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s
Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s
Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s
Rank 0: Succeed!
rank1: ./bidir_switch_channel 10.0.5.233:45571 1 0
Rank 1 (GPU 0): Preparing for tests ...
Rank 1: Succeed!	2026-02-20 22:46:32 -05:00
Binyang Li	3962574bcb	Address installation issue in some env (#750 ) This pull request updates the way the `nlohmann/json` library is fetched and upgrades it to a newer version in both the main build and test configuration files. Addressed installation issue in some env	2026-02-20 16:11:16 -08:00
Caio Rocha	e2acf7f1c8	Removing MPI Dependency (#743 )	2026-02-20 16:04:12 -08:00
Binyang Li	39865c218b	address flagBuffer ownership issue (#749 ) This pull request updates the handling of the default flag buffer in the C++ and Python bindings to ensure proper memory management when interfacing with Python. Make sure the buffer is not deallocated when ownership is transferred from C++ to Python.	2026-02-20 13:42:29 -08:00
Binyang Li	4701ae3a95	Update dtype name (#748 ) - Change FP8_E4M3/FP8_E5M2 to FLOAT8_E4M3/FLOAT8_E5M2 - Add torch.uint8 to DataType.uint8 mapping	2026-02-18 10:35:44 -08:00
Binyang Li	d0d5a8c034	Add new CI pipeline for RCCL test (#746 ) Add RCCL allreduce/allgather tests to the CI pipeline. Fix a hang issue that was introduced by PR #741. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-13 10:50:10 -08:00
Qinghua Zhou	edc9c38751	Support uint8 data type for Allreduce (#736 )
Support uint8 data type for Allreduce. Current limitation: uint8 is not supported for NVLS.
Performance results with RCCL-test with MSCCLPP on MI300X:

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 512 | half | sum | -1 | 5.39 | 0.19 | 0.33 | 0 | 5.45 | 0.19 | 0.33 | 0 |
| 2048 | 1024 | half | sum | -1 | 5.53 | 0.37 | 0.65 | 0 | 5.63 | 0.36 | 0.64 | 0 |
| 4096 | 2048 | half | sum | -1 | 5.55 | 0.74 | 1.29 | 0 | 5.56 | 0.74 | 1.29 | 0 |
| 8192 | 4096 | half | sum | -1 | 5.8 | 1.41 | 2.47 | 0 | 5.84 | 1.4 | 2.46 | 0 |
| 16384 | 8192 | half | sum | -1 | 6.57 | 2.49 | 4.36 | 0 | 6.56 | 2.5 | 4.37 | 0 |
| 32768 | 16384 | half | sum | -1 | 8.02 | 4.09 | 7.15 | 0 | 8.06 | 4.07 | 7.11 | 0 |
| 65536 | 32768 | half | sum | -1 | 8.77 | 7.47 | 13.07 | 0 | 8.82 | 7.43 | 13 | 0 |
| 131072 | 65536 | half | sum | -1 | 9.61 | 13.64 | 23.87 | 0 | 9.78 | 13.4 | 23.45 | 0 |
| 262144 | 131072 | half | sum | -1 | 11.68 | 22.44 | 39.27 | 0 | 12.1 | 21.67 | 37.93 | 0 |
| 524288 | 262144 | half | sum | -1 | 13.77 | 38.08 | 66.64 | 0 | 13.87 | 37.79 | 66.13 | 0 |
| 1048576 | 524288 | half | sum | -1 | 19.11 | 54.87 | 96.03 | 0 | 19.27 | 54.42 | 95.24 | 0 |
| 2097152 | 1048576 | half | sum | -1 | 24.1 | 87 | 152.26 | 0 | 24.24 | 86.52 | 151.41 | 0 |
| 4194304 | 2097152 | half | sum | -1 | 37.16 | 112.87 | 197.52 | 0 | 37.44 | 112.03 | 196.06 | 0 |
| 8388608 | 4194304 | half | sum | -1 | 61.53 | 136.33 | 238.58 | 0 | 61.68 | 135.99 | 237.99 | 0 |
| 16777216 | 8388608 | half | sum | -1 | 108.8 | 154.22 | 269.88 | 0 | 109.2 | 153.6 | 268.79 | 0 |
| 33554432 | 16777216 | half | sum | -1 | 197.8 | 169.68 | 296.94 | 0 | 198.6 | 168.92 | 295.61 | 0 |
| 67108864 | 33554432 | half | sum | -1 | 384.6 | 174.51 | 305.39 | 0 | 385.1 | 174.27 | 304.98 | 0 |
| 134217728 | 67108864 | half | sum | -1 | 754.1 | 177.99 | 311.48 | 0 | 754.9 | 177.78 | 311.12 | 0 |
| 268435456 | 134217728 | half | sum | -1 | 1491.8 | 179.94 | 314.89 | 0 | 1493.2 | 179.77 | 314.6 | 0 |
| 536870912 | 268435456 | half | sum | -1 | 2979.6 | 180.18 | 315.31 | 0 | 2983.9 | 179.92 | 314.87 | 0 |

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 1024 | fp8_e4m3 | sum | -1 | 5.4 | 0.19 | 0.33 | 0 | 5.45 | 0.19 | 0.33 | 0 |
| 2048 | 2048 | fp8_e4m3 | sum | -1 | 5.5 | 0.37 | 0.65 | 0 | 5.6 | 0.37 | 0.64 | 0 |
| 4096 | 4096 | fp8_e4m3 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.68 | 0.72 | 1.26 | 0 |
| 8192 | 8192 | fp8_e4m3 | sum | -1 | 5.96 | 1.38 | 2.41 | 0 | 5.98 | 1.37 | 2.4 | 0 |
| 16384 | 16384 | fp8_e4m3 | sum | -1 | 6.49 | 2.52 | 4.42 | 0 | 6.58 | 2.49 | 4.36 | 0 |
| 32768 | 32768 | fp8_e4m3 | sum | -1 | 8.09 | 4.05 | 7.09 | 0 | 8.15 | 4.02 | 7.03 | 0 |
| 65536 | 65536 | fp8_e4m3 | sum | -1 | 8.58 | 7.64 | 13.37 | 0 | 8.7 | 7.53 | 13.18 | 0 |
| 131072 | 131072 | fp8_e4m3 | sum | -1 | 9.44 | 13.88 | 24.29 | 0 | 9.62 | 13.63 | 23.85 | 0 |
| 262144 | 262144 | fp8_e4m3 | sum | -1 | 10.12 | 25.9 | 45.32 | 0 | 10.37 | 25.27 | 44.22 | 0 |
| 524288 | 524288 | fp8_e4m3 | sum | -1 | 13.73 | 38.19 | 66.82 | 0 | 13.89 | 37.74 | 66.04 | 0 |
| 1048576 | 1048576 | fp8_e4m3 | sum | -1 | 18.66 | 56.2 | 98.34 | 0 | 18.92 | 55.41 | 96.97 | 0 |
| 2097152 | 2097152 | fp8_e4m3 | sum | -1 | 24.54 | 85.46 | 149.56 | 0 | 24.63 | 85.16 | 149.03 | 0 |
| 4194304 | 4194304 | fp8_e4m3 | sum | -1 | 37.79 | 110.98 | 194.21 | 0 | 38.05 | 110.22 | 192.88 | 0 |
| 8388608 | 8388608 | fp8_e4m3 | sum | -1 | 62.22 | 134.82 | 235.94 | 0 | 62.63 | 133.94 | 234.4 | 0 |
| 16777216 | 16777216 | fp8_e4m3 | sum | -1 | 109.9 | 152.62 | 267.09 | 0 | 110.4 | 151.9 | 265.83 | 0 |
| 33554432 | 33554432 | fp8_e4m3 | sum | -1 | 201.1 | 166.82 | 291.94 | 0 | 202.3 | 165.84 | 290.22 | 0 |
| 67108864 | 67108864 | fp8_e4m3 | sum | -1 | 390 | 172.06 | 301.11 | 0 | 390.2 | 171.99 | 300.99 | 0 |
| 134217728 | 134217728 | fp8_e4m3 | sum | -1 | 763.9 | 175.7 | 307.47 | 0 | 764.2 | 175.62 | 307.34 | 0 |
| 268435456 | 268435456 | fp8_e4m3 | sum | -1 | 1509.5 | 177.83 | 311.2 | 0 | 1510.1 | 177.76 | 311.08 | 0 |
| 536870912 | 536870912 | fp8_e4m3 | sum | -1 | 3010.2 | 178.35 | 312.11 | 0 | 3014.2 | 178.11 | 311.7 | 0 |

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 1024 | fp8_e5m2 | sum | -1 | 5.41 | 0.19 | 0.33 | 0 | 5.44 | 0.19 | 0.33 | 0 |
| 2048 | 2048 | fp8_e5m2 | sum | -1 | 5.5 | 0.37 | 0.65 | 0 | 5.67 | 0.36 | 0.63 | 0 |
| 4096 | 4096 | fp8_e5m2 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.69 | 0.72 | 1.26 | 0 |
| 8192 | 8192 | fp8_e5m2 | sum | -1 | 5.96 | 1.37 | 2.4 | 0 | 6 | 1.36 | 2.39 | 0 |
| 16384 | 16384 | fp8_e5m2 | sum | -1 | 6.63 | 2.47 | 4.32 | 0 | 6.59 | 2.49 | 4.35 | 0 |
| 32768 | 32768 | fp8_e5m2 | sum | -1 | 8.07 | 4.06 | 7.1 | 0 | 8.16 | 4.02 | 7.03 | 0 |
| 65536 | 65536 | fp8_e5m2 | sum | -1 | 8.62 | 7.61 | 13.31 | 0 | 8.73 | 7.51 | 13.14 | 0 |
| 131072 | 131072 | fp8_e5m2 | sum | -1 | 9.43 | 13.9 | 24.33 | 0 | 9.6 | 13.66 | 23.9 | 0 |
| 262144 | 262144 | fp8_e5m2 | sum | -1 | 10.11 | 25.94 | 45.39 | 0 | 10.38 | 25.26 | 44.21 | 0 |
| 524288 | 524288 | fp8_e5m2 | sum | -1 | 13.73 | 38.19 | 66.84 | 0 | 13.87 | 37.79 | 66.13 | 0 |
| 1048576 | 1048576 | fp8_e5m2 | sum | -1 | 18.65 | 56.22 | 98.39 | 0 | 18.93 | 55.38 | 96.92 | 0 |
| 2097152 | 2097152 | fp8_e5m2 | sum | -1 | 24.54 | 85.47 | 149.57 | 0 | 24.63 | 85.16 | 149.03 | 0 |
| 4194304 | 4194304 | fp8_e5m2 | sum | -1 | 37.84 | 110.83 | 193.96 | 0 | 38.01 | 110.36 | 193.12 | 0 |
| 8388608 | 8388608 | fp8_e5m2 | sum | -1 | 62.32 | 134.61 | 235.58 | 0 | 62.55 | 134.12 | 234.71 | 0 |
| 16777216 | 16777216 | fp8_e5m2 | sum | -1 | 110 | 152.58 | 267.01 | 0 | 110.3 | 152.12 | 266.21 | 0 |
| 33554432 | 33554432 | fp8_e5m2 | sum | -1 | 201.1 | 166.9 | 292.07 | 0 | 201.8 | 166.26 | 290.96 | 0 |
| 67108864 | 67108864 | fp8_e5m2 | sum | -1 | 390 | 172.07 | 301.12 | 0 | 390.5 | 171.87 | 300.78 | 0 |
| 134217728 | 134217728 | fp8_e5m2 | sum | -1 | 763.9 | 175.69 | 307.46 | 0 | 764.5 | 175.56 | 307.23 | 0 |
| 268435456 | 268435456 | fp8_e5m2 | sum | -1 | 1509.4 | 177.84 | 311.22 | 0 | 1509.8 | 177.8 | 311.14 | 0 |
| 536870912 | 536870912 | fp8_e5m2 | sum | -1 | 3013 | 178.18 | 311.82 | 0 | 3018 | 177.89 | 311.31 | 0 |

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 1024 | uint8 | sum | -1 | 5.46 | 0.19 | 0.33 | 0 | 5.46 | 0.19 | 0.33 | 0 |
| 2048 | 2048 | uint8 | sum | -1 | 5.54 | 0.37 | 0.65 | 0 | 5.63 | 0.36 | 0.64 | 0 |
| 4096 | 4096 | uint8 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.63 | 0.73 | 1.27 | 0 |
| 8192 | 8192 | uint8 | sum | -1 | 5.9 | 1.39 | 2.43 | 0 | 5.9 | 1.39 | 2.43 | 0 |
| 16384 | 16384 | uint8 | sum | -1 | 6.6 | 2.48 | 4.35 | 0 | 6.64 | 2.47 | 4.32 | 0 |
| 32768 | 32768 | uint8 | sum | -1 | 8.99 | 3.65 | 6.38 | 0 | 8.99 | 3.64 | 6.38 | 0 |
| 65536 | 65536 | uint8 | sum | -1 | 9.44 | 6.94 | 12.15 | 0 | 9.58 | 6.84 | 11.98 | 0 |
| 131072 | 131072 | uint8 | sum | -1 | 11.72 | 11.18 | 19.57 | 0 | 11.83 | 11.08 | 19.4 | 0 |
| 262144 | 262144 | uint8 | sum | -1 | 12.29 | 21.32 | 37.31 | 0 | 12.45 | 21.05 | 36.84 | 0 |
| 524288 | 524288 | uint8 | sum | -1 | 13.87 | 37.8 | 66.15 | 0 | 13.93 | 37.64 | 65.88 | 0 |
| 1048576 | 1048576 | uint8 | sum | -1 | 19.11 | 54.88 | 96.04 | 0 | 19.3 | 54.33 | 95.08 | 0 |
| 2097152 | 2097152 | uint8 | sum | -1 | 24.38 | 86.01 | 150.51 | 0 | 24.52 | 85.53 | 149.67 | 0 |
| 4194304 | 4194304 | uint8 | sum | -1 | 37.52 | 111.78 | 195.61 | 0 | 37.76 | 111.08 | 194.39 | 0 |
| 8388608 | 8388608 | uint8 | sum | -1 | 62.4 | 134.44 | 235.26 | 0 | 62.56 | 134.1 | 234.67 | 0 |
| 16777216 | 16777216 | uint8 | sum | -1 | 110.2 | 152.22 | 266.39 | 0 | 110.3 | 152.04 | 266.08 | 0 |
| 33554432 | 33554432 | uint8 | sum | -1 | 199.8 | 167.94 | 293.9 | 0 | 197.5 | 169.88 | 297.29 | 0 |
| 67108864 | 67108864 | uint8 | sum | -1 | 386.3 | 173.73 | 304.03 | 0 | 378.4 | 177.37 | 310.39 | 0 |
| 134217728 | 134217728 | uint8 | sum | -1 | 758 | 177.07 | 309.87 | 0 | 741.1 | 181.12 | 316.95 | 0 |
| 268435456 | 268435456 | uint8 | sum | -1 | 1500.1 | 178.95 | 313.16 | 0 | 1466.2 | 183.09 | 320.4 | 0 |
| 536870912 | 536870912 | uint8 | sum | -1 | 2991.7 | 179.45 | 314.04 | 0 | 2924.8 | 183.56 | 321.23 | 0 |

---------
Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>	2026-02-13 10:49:25 -08:00
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741 )
This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables.
1. Algorithm Selection Refactoring: Use a separate class for algo selection. This makes room for more complex selection logic based on message size, arch, whether CUDA graph is enabled, and the memory allocation method.
2. Symmetric Memory Support: Introduced the symmetricMemory parameter in algorithm context key generation. Removed the disableChannelCache env because it is ambiguous.
3. Add new args for build_default_algorithms: Add flag_buffer and flag_buffer_size args for building the default algorithms. Different algorithms can then share a unified flag buffer, which avoids the application hanging when the algo switches for different message sizes.
---------
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-12 19:06:18 -08:00
Caio Rocha	dff3bc7bbb	Support Fusion for ReadPutPacket Operation at DSL (#742 ) Add support for fusing ReadPutPacket operations in the DSL, which reduces the overhead of reading the same packet data from the scratch buffer multiple times. Fusion occurs when two rppkt operations with the same src_buffer are executed consecutively: rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1]) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-12 17:27:20 -08:00
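The fusion rule in this commit, rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1]), can be sketched as a single pass over an operation list. This is only an illustration of the rule as stated in the message, not the DSL's actual implementation; the `ReadPutPacket` dataclass and `fuse_rppkt` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadPutPacket:
    # Hypothetical stand-in for a DSL rppkt operation.
    src: str
    dsts: List[str]


def fuse_rppkt(ops):
    """Merge consecutive rppkt operations that read the same source buffer."""
    fused = []
    for op in ops:
        if fused and fused[-1].src == op.src:
            # Same source as the previous op: read once, put to more targets.
            fused[-1].dsts.extend(op.dsts)
        else:
            # Copy the destination list so the input ops are left unmodified.
            fused.append(ReadPutPacket(op.src, list(op.dsts)))
    return fused


ops = [ReadPutPacket("src", ["dst0"]), ReadPutPacket("src", ["dst1"])]
print(fuse_rppkt(ops))
```

Only adjacent operations are merged, which matches the "executed consecutively" condition in the message; operations on the same source that are separated by a different source stay separate.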
Changho Hwang	42be3660e0	Add a new IB stack impl that doesn't use RDMA atomics (#728 ) * Added a configurable InfiniBand (IB) signaling mode. The `EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`, `HostNoAtomic`). `Default` is equivalent to `Host` unless specified differently by the environment variable `MSCCLPP_IBV_MODE`. `Host` corresponds to the previous implementation, which uses RDMA atomics for signaling, while `HostNoAtomic` uses write-with-immediate instead. * Corresponding updates in Python bindings and API.	2026-02-10 01:07:53 +00:00
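The mode resolution described here (an explicit mode wins, while `Default` defers to `MSCCLPP_IBV_MODE` and otherwise means `Host`) can be sketched as follows. This is a Python illustration of the stated rule, not the C++ implementation. The string values accepted by `MSCCLPP_IBV_MODE` are not given in the commit message, so the "host" and "host-no-atomic" spellings below are assumptions, as are the `IbMode` and `resolve_mode` names.

```python
import os
from enum import Enum


class IbMode(Enum):
    # Mirrors the three modes named in the commit message.
    DEFAULT = "default"
    HOST = "host"
    HOST_NO_ATOMIC = "host-no-atomic"


# Assumed spellings for the environment variable; the real ones may differ.
_ENV_VALUES = {"host": IbMode.HOST, "host-no-atomic": IbMode.HOST_NO_ATOMIC}


def resolve_mode(requested, environ=os.environ):
    """Resolve DEFAULT to a concrete mode, consulting MSCCLPP_IBV_MODE."""
    if requested is not IbMode.DEFAULT:
        # An explicitly configured mode is used as-is.
        return requested
    value = environ.get("MSCCLPP_IBV_MODE")
    # Unset or unrecognized values fall back to HOST in this sketch.
    return _ENV_VALUES.get(value, IbMode.HOST)


print(resolve_mode(IbMode.DEFAULT, {"MSCCLPP_IBV_MODE": "host-no-atomic"}))
```

Falling back to `Host` on an unrecognized value is a design choice of this sketch; a real implementation might instead raise an error.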
Binyang Li	c12822a7af	create CI pipeline for rocm (#718 ) Create CI pipeline for AMD GPU.	2026-02-09 16:55:16 -08:00
Changho Hwang	d7925448f3	Update `copilot-instructions.md` (#722 )	2026-02-06 11:27:01 -08:00
Qinghua Zhou	620378b4fb	Fix cpplint error in main branch (#740 ) Fix the legacy cpplint error in main branch. --------- Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-05 09:25:12 -08:00
Binyang Li	dc747b1522	Refactor reduce kernel (#738 ) - Put the common reduce kernel to reduce_kernel.hpp - Implement operator overloading for the vector type - Clean up the duplicated code at `executor_ kernel.hpp` and `allreduce/common.hpp`	2026-02-05 09:23:43 -08:00
Binyang Li	e21513791a	Address comments for PR #692 (#733 ) Rename nanobind-exposed C++ types to Cpp* Replace MSCCLPP_EXECUTION_PLAN_DIR / MSCCLPP_NATIVE_CACHE_DIR with MSCCLPP_CACHE_DIR across C++ and Python.	2026-02-03 10:13:20 -08:00
Changho Hwang	03b1936ddb	Support multi-node in `MemoryChannel` tutorial (#726 ) Co-authored-by: mahdiehghazim <mahdiehghazi@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-02 15:50:45 -08:00
Qinghua Zhou	41bf96abc2	Fix the relative path extraction on github page (#739 ) Fix missing 'mscclpp' base directory during version switching on GitHub Pages. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-02 13:16:11 -08:00
Qinghua Zhou	f0441ee4ea	Update document versioning for PR #724 (#735 ) This PR fixes the docs generation issue that arises when https://github.com/microsoft/mscclpp/pull/724 is merged into the main branch. Build docs for the main branch separately. Use a HEAD request instead of GET to check whether a page exists. Filter out versions before v0.4.0 in generate_versions.py. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-01 19:52:01 -08:00
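Two of the changes in this commit lend themselves to a short illustration: checking for a page with a HEAD request (which avoids downloading the body) and dropping version tags older than v0.4.0. The sketch below uses only the Python standard library. It is not the content of generate_versions.py, and the function names and the sample tag list are hypothetical.

```python
import urllib.error
import urllib.request


def page_exists(url, timeout=5.0):
    """Return True if a HEAD request to url succeeds, without fetching the body."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except urllib.error.URLError:
        # HTTPError (e.g. 404) is a subclass of URLError, so both land here.
        return False


def parse_version(tag):
    # "v0.4.0" -> (0, 4, 0); assumes plain numeric vMAJOR.MINOR.PATCH tags.
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def filter_versions(tags, minimum="v0.4.0"):
    """Keep only tags at or above the minimum version."""
    floor = parse_version(minimum)
    return [tag for tag in tags if parse_version(tag) >= floor]


# Hypothetical tag list for illustration.
print(filter_versions(["v0.3.0", "v0.4.0", "v0.10.1"]))
```

Comparing parsed integer tuples rather than raw strings matters here: a string comparison would sort "v0.10.1" before "v0.4.0" and filter it out incorrectly.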
mahdiehghazim	08589bf332	Use native GPU architecture when NVIDIA GPU is detected; otherwise fall back to multi-arch build. (#732 ) This change makes MSCCL++ automatically select CUDA architectures based on the build environment. If an NVIDIA GPU is detected, the build targets the native GPU architecture for optimal performance; otherwise, it falls back to building for multiple architectures for portability. When building for the native architecture, FP8 support is automatically enabled for “a-series” GPUs (e.g., sm_100a), allowing the appropriate optimized code paths to be picked up.	2026-01-26 15:53:36 -05:00
Qinghua Zhou	cc797abc87	Revert "Support versioning for mscclpp document (#724 )" (#734 ) This PR reverts commit 69d3b7 to avoid the github page issue.	2026-01-23 16:42:54 -08:00
Qinghua Zhou	69d3b79ecd	Support versioning for mscclpp document (#724 ) Show all the versions of mscclpp document on the webpage https://microsoft.github.io/mscclpp/ Add sphinx-multiversion to generate documents for different versions. Add version selector on document webpage. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-23 09:45:41 -08:00
mahdiehghazim	071dc92d38	fp8 nvls support (e5m2 and e4m3) (#730 ) This PR adds FP8 support to the nvls code. For compilation, we need to add this flag to the cmake command: -DMSCCLPP_GPU_ARCHS=100a	2026-01-23 10:38:38 -05:00
Binyang Li	a707273701	Torch integration (#692 ) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo Provide interface for pytorch integration with native API and DSL --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	78ce9fac8d	Fix ci pipeline failure (#729 )	2026-01-21 13:28:14 -05:00
Binyang Li	abbdb7f630	Fix ci issue (#727 ) Solve the CI failure that occurs when the CUDA version is newer than the driver version	2026-01-15 22:21:02 -08:00
Changho Hwang	105239fc6c	Use `GpuIpcMem` for NVLS connections (#719 ) * Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast memory handling. * Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API handles this automatically). * Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a shared pointer with custom deleter for unmapping, which prevents misuse of raw pointers and reduces states to be stored in the `GpuIpcMem` instance. * Now for `RuntimeIpc` type handles, for consistency with other types, `cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of the ctor of `GpuIpcMem`. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-15 13:16:04 +08:00
Changho Hwang	c2a87302bd	Reduce CI build time (#723 ) Specify GPU architecture during CI build to reduce build time	2026-01-15 10:45:40 +08:00
Changho Hwang	a02ba3b1bd	Add `GpuIpcMemHandle` (#704 ) Add `GpuIpcMemHandle` that is a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes issues that fail to properly fallback to a feasible type of memory handle on the importing environment. It also consolidates code for creating or destroying various memory handles into a single RAII wrapper. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-14 10:49:31 +08:00
Changho Hwang	4dd075602c	Bypassing SSCA alerts (#721 ) Remove default image tags to bypass SSCA alerts	2026-01-12 23:46:27 +08:00
Changho Hwang	b8a1b0a134	Add CUDA 13.0 Docker images (#720 ) * Updated Dockerfiles and the build script to support CUDA 13.0 * Added Python3 venv which is required since Python 3.12 * Updated the default MLNX-OFED version to the LTS version * Added docker push instruction for multi-arch manifest	2026-01-09 19:03:33 +08:00
Binyang Li	eab2afb8b9	Update container images for pipeline (#717 ) - Remove cuda11 support for nccl-test pipeline, since the nccl build failed for cuda11. - Update to cuda12.9 for CI pipeline. Will consider dropping cuda11 support and adding cuda13 support in the near future	2026-01-07 14:10:49 +08:00


921 Commits