mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-04-19 22:39:11 +00:00

Author	SHA1	Message	Date
Binyang Li	39865c218b	address flagBuffer ownership issue (#749) This pull request updates the handling of the default flag buffer in the C++ and Python bindings to ensure proper memory management when interfacing with Python. It makes sure the buffer is not deallocated when ownership is transferred from C++ to Python.	2026-02-20 13:42:29 -08:00
Binyang Li	4701ae3a95	Update dtype name (#748 ) - Change FP8_E4M3/FP8_E5M2 to FLOAT8_E4M3/FLOAT8_E5M2 - Add torch.uint8 to DataType.uint8 mapping	2026-02-18 10:35:44 -08:00
Binyang Li	d0d5a8c034	Add new CI pipeline for RCCL test (#746) Add rccl allreduce/allgather tests to the CI pipeline. Fix the hang issue introduced by PR #741. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-13 10:50:10 -08:00
Qinghua Zhou	edc9c38751	Support uint8 data type for Allreduce (#736) Support uint8 data type for Allreduce. Current limitation: uint8 is not supported for NVLS. Performance results with RCCL-test with MSCCLPP on MI300X:

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 512 | half | sum | -1 | 5.39 | 0.19 | 0.33 | 0 | 5.45 | 0.19 | 0.33 | 0 |
| 2048 | 1024 | half | sum | -1 | 5.53 | 0.37 | 0.65 | 0 | 5.63 | 0.36 | 0.64 | 0 |
| 4096 | 2048 | half | sum | -1 | 5.55 | 0.74 | 1.29 | 0 | 5.56 | 0.74 | 1.29 | 0 |
| 8192 | 4096 | half | sum | -1 | 5.8 | 1.41 | 2.47 | 0 | 5.84 | 1.4 | 2.46 | 0 |
| 16384 | 8192 | half | sum | -1 | 6.57 | 2.49 | 4.36 | 0 | 6.56 | 2.5 | 4.37 | 0 |
| 32768 | 16384 | half | sum | -1 | 8.02 | 4.09 | 7.15 | 0 | 8.06 | 4.07 | 7.11 | 0 |
| 65536 | 32768 | half | sum | -1 | 8.77 | 7.47 | 13.07 | 0 | 8.82 | 7.43 | 13 | 0 |
| 131072 | 65536 | half | sum | -1 | 9.61 | 13.64 | 23.87 | 0 | 9.78 | 13.4 | 23.45 | 0 |
| 262144 | 131072 | half | sum | -1 | 11.68 | 22.44 | 39.27 | 0 | 12.1 | 21.67 | 37.93 | 0 |
| 524288 | 262144 | half | sum | -1 | 13.77 | 38.08 | 66.64 | 0 | 13.87 | 37.79 | 66.13 | 0 |
| 1048576 | 524288 | half | sum | -1 | 19.11 | 54.87 | 96.03 | 0 | 19.27 | 54.42 | 95.24 | 0 |
| 2097152 | 1048576 | half | sum | -1 | 24.1 | 87 | 152.26 | 0 | 24.24 | 86.52 | 151.41 | 0 |
| 4194304 | 2097152 | half | sum | -1 | 37.16 | 112.87 | 197.52 | 0 | 37.44 | 112.03 | 196.06 | 0 |
| 8388608 | 4194304 | half | sum | -1 | 61.53 | 136.33 | 238.58 | 0 | 61.68 | 135.99 | 237.99 | 0 |
| 16777216 | 8388608 | half | sum | -1 | 108.8 | 154.22 | 269.88 | 0 | 109.2 | 153.6 | 268.79 | 0 |
| 33554432 | 16777216 | half | sum | -1 | 197.8 | 169.68 | 296.94 | 0 | 198.6 | 168.92 | 295.61 | 0 |
| 67108864 | 33554432 | half | sum | -1 | 384.6 | 174.51 | 305.39 | 0 | 385.1 | 174.27 | 304.98 | 0 |
| 134217728 | 67108864 | half | sum | -1 | 754.1 | 177.99 | 311.48 | 0 | 754.9 | 177.78 | 311.12 | 0 |
| 268435456 | 134217728 | half | sum | -1 | 1491.8 | 179.94 | 314.89 | 0 | 1493.2 | 179.77 | 314.6 | 0 |
| 536870912 | 268435456 | half | sum | -1 | 2979.6 | 180.18 | 315.31 | 0 | 2983.9 | 179.92 | 314.87 | 0 |

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 1024 | fp8_e4m3 | sum | -1 | 5.4 | 0.19 | 0.33 | 0 | 5.45 | 0.19 | 0.33 | 0 |
| 2048 | 2048 | fp8_e4m3 | sum | -1 | 5.5 | 0.37 | 0.65 | 0 | 5.6 | 0.37 | 0.64 | 0 |
| 4096 | 4096 | fp8_e4m3 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.68 | 0.72 | 1.26 | 0 |
| 8192 | 8192 | fp8_e4m3 | sum | -1 | 5.96 | 1.38 | 2.41 | 0 | 5.98 | 1.37 | 2.4 | 0 |
| 16384 | 16384 | fp8_e4m3 | sum | -1 | 6.49 | 2.52 | 4.42 | 0 | 6.58 | 2.49 | 4.36 | 0 |
| 32768 | 32768 | fp8_e4m3 | sum | -1 | 8.09 | 4.05 | 7.09 | 0 | 8.15 | 4.02 | 7.03 | 0 |
| 65536 | 65536 | fp8_e4m3 | sum | -1 | 8.58 | 7.64 | 13.37 | 0 | 8.7 | 7.53 | 13.18 | 0 |
| 131072 | 131072 | fp8_e4m3 | sum | -1 | 9.44 | 13.88 | 24.29 | 0 | 9.62 | 13.63 | 23.85 | 0 |
| 262144 | 262144 | fp8_e4m3 | sum | -1 | 10.12 | 25.9 | 45.32 | 0 | 10.37 | 25.27 | 44.22 | 0 |
| 524288 | 524288 | fp8_e4m3 | sum | -1 | 13.73 | 38.19 | 66.82 | 0 | 13.89 | 37.74 | 66.04 | 0 |
| 1048576 | 1048576 | fp8_e4m3 | sum | -1 | 18.66 | 56.2 | 98.34 | 0 | 18.92 | 55.41 | 96.97 | 0 |
| 2097152 | 2097152 | fp8_e4m3 | sum | -1 | 24.54 | 85.46 | 149.56 | 0 | 24.63 | 85.16 | 149.03 | 0 |
| 4194304 | 4194304 | fp8_e4m3 | sum | -1 | 37.79 | 110.98 | 194.21 | 0 | 38.05 | 110.22 | 192.88 | 0 |
| 8388608 | 8388608 | fp8_e4m3 | sum | -1 | 62.22 | 134.82 | 235.94 | 0 | 62.63 | 133.94 | 234.4 | 0 |
| 16777216 | 16777216 | fp8_e4m3 | sum | -1 | 109.9 | 152.62 | 267.09 | 0 | 110.4 | 151.9 | 265.83 | 0 |
| 33554432 | 33554432 | fp8_e4m3 | sum | -1 | 201.1 | 166.82 | 291.94 | 0 | 202.3 | 165.84 | 290.22 | 0 |
| 67108864 | 67108864 | fp8_e4m3 | sum | -1 | 390 | 172.06 | 301.11 | 0 | 390.2 | 171.99 | 300.99 | 0 |
| 134217728 | 134217728 | fp8_e4m3 | sum | -1 | 763.9 | 175.7 | 307.47 | 0 | 764.2 | 175.62 | 307.34 | 0 |
| 268435456 | 268435456 | fp8_e4m3 | sum | -1 | 1509.5 | 177.83 | 311.2 | 0 | 1510.1 | 177.76 | 311.08 | 0 |
| 536870912 | 536870912 | fp8_e4m3 | sum | -1 | 3010.2 | 178.35 | 312.11 | 0 | 3014.2 | 178.11 | 311.7 | 0 |

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 1024 | fp8_e5m2 | sum | -1 | 5.41 | 0.19 | 0.33 | 0 | 5.44 | 0.19 | 0.33 | 0 |
| 2048 | 2048 | fp8_e5m2 | sum | -1 | 5.5 | 0.37 | 0.65 | 0 | 5.67 | 0.36 | 0.63 | 0 |
| 4096 | 4096 | fp8_e5m2 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.69 | 0.72 | 1.26 | 0 |
| 8192 | 8192 | fp8_e5m2 | sum | -1 | 5.96 | 1.37 | 2.4 | 0 | 6 | 1.36 | 2.39 | 0 |
| 16384 | 16384 | fp8_e5m2 | sum | -1 | 6.63 | 2.47 | 4.32 | 0 | 6.59 | 2.49 | 4.35 | 0 |
| 32768 | 32768 | fp8_e5m2 | sum | -1 | 8.07 | 4.06 | 7.1 | 0 | 8.16 | 4.02 | 7.03 | 0 |
| 65536 | 65536 | fp8_e5m2 | sum | -1 | 8.62 | 7.61 | 13.31 | 0 | 8.73 | 7.51 | 13.14 | 0 |
| 131072 | 131072 | fp8_e5m2 | sum | -1 | 9.43 | 13.9 | 24.33 | 0 | 9.6 | 13.66 | 23.9 | 0 |
| 262144 | 262144 | fp8_e5m2 | sum | -1 | 10.11 | 25.94 | 45.39 | 0 | 10.38 | 25.26 | 44.21 | 0 |
| 524288 | 524288 | fp8_e5m2 | sum | -1 | 13.73 | 38.19 | 66.84 | 0 | 13.87 | 37.79 | 66.13 | 0 |
| 1048576 | 1048576 | fp8_e5m2 | sum | -1 | 18.65 | 56.22 | 98.39 | 0 | 18.93 | 55.38 | 96.92 | 0 |
| 2097152 | 2097152 | fp8_e5m2 | sum | -1 | 24.54 | 85.47 | 149.57 | 0 | 24.63 | 85.16 | 149.03 | 0 |
| 4194304 | 4194304 | fp8_e5m2 | sum | -1 | 37.84 | 110.83 | 193.96 | 0 | 38.01 | 110.36 | 193.12 | 0 |
| 8388608 | 8388608 | fp8_e5m2 | sum | -1 | 62.32 | 134.61 | 235.58 | 0 | 62.55 | 134.12 | 234.71 | 0 |
| 16777216 | 16777216 | fp8_e5m2 | sum | -1 | 110 | 152.58 | 267.01 | 0 | 110.3 | 152.12 | 266.21 | 0 |
| 33554432 | 33554432 | fp8_e5m2 | sum | -1 | 201.1 | 166.9 | 292.07 | 0 | 201.8 | 166.26 | 290.96 | 0 |
| 67108864 | 67108864 | fp8_e5m2 | sum | -1 | 390 | 172.07 | 301.12 | 0 | 390.5 | 171.87 | 300.78 | 0 |
| 134217728 | 134217728 | fp8_e5m2 | sum | -1 | 763.9 | 175.69 | 307.46 | 0 | 764.5 | 175.56 | 307.23 | 0 |
| 268435456 | 268435456 | fp8_e5m2 | sum | -1 | 1509.4 | 177.84 | 311.22 | 0 | 1509.8 | 177.8 | 311.14 | 0 |
| 536870912 | 536870912 | fp8_e5m2 | sum | -1 | 3013 | 178.18 | 311.82 | 0 | 3018 | 177.89 | 311.31 | 0 |

| size (B) | count (elements) | type | redop | root | out-of-place time (us) | out-of-place algbw (GB/s) | out-of-place busbw (GB/s) | out-of-place #wrong | in-place time (us) | in-place algbw (GB/s) | in-place busbw (GB/s) | in-place #wrong |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1024 | 1024 | uint8 | sum | -1 | 5.46 | 0.19 | 0.33 | 0 | 5.46 | 0.19 | 0.33 | 0 |
| 2048 | 2048 | uint8 | sum | -1 | 5.54 | 0.37 | 0.65 | 0 | 5.63 | 0.36 | 0.64 | 0 |
| 4096 | 4096 | uint8 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.63 | 0.73 | 1.27 | 0 |
| 8192 | 8192 | uint8 | sum | -1 | 5.9 | 1.39 | 2.43 | 0 | 5.9 | 1.39 | 2.43 | 0 |
| 16384 | 16384 | uint8 | sum | -1 | 6.6 | 2.48 | 4.35 | 0 | 6.64 | 2.47 | 4.32 | 0 |
| 32768 | 32768 | uint8 | sum | -1 | 8.99 | 3.65 | 6.38 | 0 | 8.99 | 3.64 | 6.38 | 0 |
| 65536 | 65536 | uint8 | sum | -1 | 9.44 | 6.94 | 12.15 | 0 | 9.58 | 6.84 | 11.98 | 0 |
| 131072 | 131072 | uint8 | sum | -1 | 11.72 | 11.18 | 19.57 | 0 | 11.83 | 11.08 | 19.4 | 0 |
| 262144 | 262144 | uint8 | sum | -1 | 12.29 | 21.32 | 37.31 | 0 | 12.45 | 21.05 | 36.84 | 0 |
| 524288 | 524288 | uint8 | sum | -1 | 13.87 | 37.8 | 66.15 | 0 | 13.93 | 37.64 | 65.88 | 0 |
| 1048576 | 1048576 | uint8 | sum | -1 | 19.11 | 54.88 | 96.04 | 0 | 19.3 | 54.33 | 95.08 | 0 |
| 2097152 | 2097152 | uint8 | sum | -1 | 24.38 | 86.01 | 150.51 | 0 | 24.52 | 85.53 | 149.67 | 0 |
| 4194304 | 4194304 | uint8 | sum | -1 | 37.52 | 111.78 | 195.61 | 0 | 37.76 | 111.08 | 194.39 | 0 |
| 8388608 | 8388608 | uint8 | sum | -1 | 62.4 | 134.44 | 235.26 | 0 | 62.56 | 134.1 | 234.67 | 0 |
| 16777216 | 16777216 | uint8 | sum | -1 | 110.2 | 152.22 | 266.39 | 0 | 110.3 | 152.04 | 266.08 | 0 |
| 33554432 | 33554432 | uint8 | sum | -1 | 199.8 | 167.94 | 293.9 | 0 | 197.5 | 169.88 | 297.29 | 0 |
| 67108864 | 67108864 | uint8 | sum | -1 | 386.3 | 173.73 | 304.03 | 0 | 378.4 | 177.37 | 310.39 | 0 |
| 134217728 | 134217728 | uint8 | sum | -1 | 758 | 177.07 | 309.87 | 0 | 741.1 | 181.12 | 316.95 | 0 |
| 268435456 | 268435456 | uint8 | sum | -1 | 1500.1 | 178.95 | 313.16 | 0 | 1466.2 | 183.09 | 320.4 | 0 |
| 536870912 | 536870912 | uint8 | sum | -1 | 2991.7 | 179.45 | 314.04 | 0 | 2924.8 | 183.56 | 321.23 | 0 |

--------- Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>	2026-02-13 10:49:25 -08:00
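The algbw and busbw figures reported for commit edc9c38751 can be cross-checked with a short sketch. It assumes the usual nccl-tests/rccl-tests convention for allreduce, where busbw = algbw × 2(n − 1)/n. The rank count n = 8 is an inference from the roughly 1.75 ratio between the two columns; the commit message does not state it. The function names are illustrative only.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s.
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for allreduce under the nccl-tests convention."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# Largest out-of-place "half" row: 536870912 B in 2979.6 us.
# n_ranks=8 is an assumption inferred from the busbw/algbw ratio.
alg = algbw_gbps(536870912, 2979.6)
bus = allreduce_busbw_gbps(alg, n_ranks=8)
print(f"algbw={alg:.2f} GB/s busbw={bus:.2f} GB/s")
```

The result agrees with the reported 180.18 and 315.31 GB/s to within the rounding of the displayed time.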
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741) This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables. 1. Algorithm Selection Refactoring: Use a separate class for algo selection. This allows more complex selection logic based on message size, arch, whether CUDA graph is enabled, and the memory allocation method. 2. Symmetric Memory Support: Introduced the symmetricMemory parameter in algorithm context key generation. Removed the disableChannelCache env as it is ambiguous. 3. New args for build_default_algorithms: Add flag_buffer and flag_buffer_size args to build the default algorithms. Different algorithms can then share a unified flag buffer, which avoids the application hanging when switching algos for different message sizes. --------- Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-12 19:06:18 -08:00
Caio Rocha	dff3bc7bbb	Support Fusion for ReadPutPacket Operation at DSL (#742) Adds support for fusing the ReadPutPacket operation in the DSL, which reduces the overhead caused by reading packet data multiple times from the scratch buffer. Fusion occurs when two rppkt operations are executed consecutively with the same src_buffer: rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1]) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-12 17:27:20 -08:00
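The fusion rule described in commit dff3bc7bbb can be sketched as a single pass over an operation list. This is a minimal illustration, not the MSCCL++ DSL implementation: the `ReadPutPacket` class and `fuse_read_put_packets` function are hypothetical names, and buffers are modelled as plain strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadPutPacket:
    """Hypothetical stand-in for an rppkt op: one source, many destinations."""
    src: str
    dsts: List[str]


def fuse_read_put_packets(ops: List[ReadPutPacket]) -> List[ReadPutPacket]:
    """Merge consecutive rppkt ops that read from the same source buffer."""
    fused: List[ReadPutPacket] = []
    for op in ops:
        if fused and fused[-1].src == op.src:
            # Same source as the previous op: append destinations so the
            # source packet data is read once instead of once per op.
            fused[-1].dsts.extend(op.dsts)
        else:
            # Copy so the caller's op list is left unmodified.
            fused.append(ReadPutPacket(op.src, list(op.dsts)))
    return fused


ops = [
    ReadPutPacket("scratch", ["dst0"]),
    ReadPutPacket("scratch", ["dst1"]),
    ReadPutPacket("other", ["dst2"]),
]
print(fuse_read_put_packets(ops))
```

Only adjacent operations are merged here, matching the "executed consecutively" condition in the commit message; an op with a different source in between ends the fused run.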
Changho Hwang	42be3660e0	Add a new IB stack impl that doesn't use RDMA atomics (#728) * Added configurable InfiniBand (IB) signaling mode. `EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`, `HostNoAtomic`). `Default` is equivalent to `Host` unless specified otherwise by the environment variable `MSCCLPP_IBV_MODE`. `Host` corresponds to the previous implementation using RDMA atomics for signaling, while `HostNoAtomic` uses write-with-immediate instead. * Regarding updates in Python bindings and API.	2026-02-10 01:07:53 +00:00
Binyang Li	c12822a7af	create CI pipeline for rocm (#718 ) Create CI pipeline for AMD GPU.	2026-02-09 16:55:16 -08:00
Changho Hwang	d7925448f3	Update `copilot-instructions.md` (#722 )	2026-02-06 11:27:01 -08:00
Qinghua Zhou	620378b4fb	Fix cpplint error in main branch (#740 ) Fix the legacy cpplint error in main branch. --------- Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-05 09:25:12 -08:00
Binyang Li	dc747b1522	Refactor reduce kernel (#738) - Move the common reduce kernel to reduce_kernel.hpp - Implement operator overloading for the vector type - Clean up the duplicated code in `executor_kernel.hpp` and `allreduce/common.hpp`	2026-02-05 09:23:43 -08:00
Binyang Li	e21513791a	Address comments for PR #692 (#733 ) Rename nanobind-exposed C++ types to Cpp* Replace MSCCLPP_EXECUTION_PLAN_DIR / MSCCLPP_NATIVE_CACHE_DIR with MSCCLPP_CACHE_DIR across C++ and Python.	2026-02-03 10:13:20 -08:00
Changho Hwang	03b1936ddb	Support multi-node in `MemoryChannel` tutorial (#726 ) Co-authored-by: mahdiehghazim <mahdiehghazi@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-02 15:50:45 -08:00
Qinghua Zhou	41bf96abc2	Fix the relative path extraction on github page (#739 ) Fix missing 'mscclpp' base directory during version switching on GitHub Pages. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-02 13:16:11 -08:00
Qinghua Zhou	f0441ee4ea	Update document versioning for PR #724 (#735) This PR fixes the issue of generating docs when we take https://github.com/microsoft/mscclpp/pull/724 into the main branch. Build docs for the main branch separately. Use a HEAD request instead of GET to check if a page exists. Filter out versions before v0.4.0 in generate_versions.py. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-01 19:52:01 -08:00
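Two ideas from commit f0441ee4ea, probing a page with HEAD rather than GET and dropping versions older than v0.4.0, can be sketched as follows. This is not the contents of generate_versions.py: the function names and the `vMAJOR.MINOR.PATCH` tag format are assumptions made for illustration.

```python
import urllib.error
import urllib.request
from typing import List, Tuple

MIN_VERSION = (0, 4, 0)  # versions before v0.4.0 are filtered out


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn a tag such as 'v0.4.0' into (0, 4, 0). Assumes numeric parts."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def filter_versions(tags: List[str]) -> List[str]:
    """Keep only tags at or above MIN_VERSION, preserving input order."""
    return [tag for tag in tags if parse_version(tag) >= MIN_VERSION]


def page_exists(url: str, timeout: float = 5.0) -> bool:
    """Check a URL with a HEAD request so the response body is not downloaded."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError):
        # HTTPError (e.g. 404) is a subclass of URLError.
        return False


print(filter_versions(["v0.3.0", "v0.4.0", "v0.5.2", "v0.10.1"]))
```

Comparing parsed tuples rather than raw strings keeps `v0.10.1` ordered after `v0.4.0`, which a plain string comparison would get wrong.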
mahdiehghazim	08589bf332	Use native GPU architecture when NVIDIA GPU is detected; otherwise fall back to multi-arch build. (#732 ) This change makes MSCCL++ automatically select CUDA architectures based on the build environment. If an NVIDIA GPU is detected, the build targets the native GPU architecture for optimal performance; otherwise, it falls back to building for multiple architectures for portability. When building for the native architecture, FP8 support is automatically enabled for “a-series” GPUs (e.g., sm_100a), allowing the appropriate optimized code paths to be picked up.	2026-01-26 15:53:36 -05:00
Qinghua Zhou	cc797abc87	Revert "Support versioning for mscclpp document (#724 )" (#734 ) This PR reverts commit 69d3b7 to avoid the github page issue.	2026-01-23 16:42:54 -08:00
Qinghua Zhou	69d3b79ecd	Support versioning for mscclpp document (#724 ) Show all the versions of mscclpp document on the webpage https://microsoft.github.io/mscclpp/ Add sphinx-multiversion to generate documents for different versions. Add version selector on document webpage. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-23 09:45:41 -08:00
mahdiehghazim	071dc92d38	fp8 nvls support (e5m2 and e4m3) (#730 ) This PR adds FP8 support to the nvls code. For compilation, we need to add this flag to the cmake command: -DMSCCLPP_GPU_ARCHS=100a	2026-01-23 10:38:38 -05:00
Binyang Li	a707273701	Torch integration (#692) Reorganize the current native algorithm implementation and DSL algorithm implementation. Provide a unified API for DSL algos and native algos, and provide an interface to tune the algo. Provide an interface for PyTorch integration with the native API and DSL. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	78ce9fac8d	Fix ci pipeline failure (#729 )	2026-01-21 13:28:14 -05:00
Binyang Li	abbdb7f630	Fix ci issue (#727) Solve the CI failure that occurs when the CUDA version is newer than the driver version	2026-01-15 22:21:02 -08:00
Changho Hwang	105239fc6c	Use `GpuIpcMem` for NVLS connections (#719 ) * Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast memory handling. * Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API handles this automatically). * Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a shared pointer with custom deleter for unmapping, which prevents misuse of raw pointers and reduces states to be stored in the `GpuIpcMem` instance. * Now for `RuntimeIpc` type handles, for consistency with other types, `cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of the ctor of `GpuIpcMem`. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-15 13:16:04 +08:00
Changho Hwang	c2a87302bd	Reduce CI build time (#723 ) Specify GPU architecture during CI build to reduce build time	2026-01-15 10:45:40 +08:00
Changho Hwang	a02ba3b1bd	Add `GpuIpcMemHandle` (#704) Add `GpuIpcMemHandle`, a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes issues where the code failed to properly fall back to a feasible type of memory handle in the importing environment. It also consolidates code for creating or destroying various memory handles into a single RAII wrapper. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-14 10:49:31 +08:00
Changho Hwang	4dd075602c	Bypassing SSCA alerts (#721 ) Remove default image tags to bypass SSCA alerts	2026-01-12 23:46:27 +08:00
Changho Hwang	b8a1b0a134	Add CUDA 13.0 Docker images (#720 ) * Updated Dockerfiles and the build script to support CUDA 13.0 * Added Python3 venv which is required since Python 3.12 * Updated the default MLNX-OFED version to the LTS version * Added docker push instruction for multi-arch manifest	2026-01-09 19:03:33 +08:00
Binyang Li	eab2afb8b9	Update container images for pipeline (#717) - Remove cuda11 support for the nccl-test pipeline, since the nccl build failed for cuda11. - Update to cuda12.9 for the CI pipeline. Will consider dropping cuda11 support and adding cuda13 support in the near future	2026-01-07 14:10:49 +08:00
Qinghua Zhou	168a6c7037	Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 (#694) Tune the nThreadsPerBlock for message sizes in the 32KB to 256KB range for FP8 and Half datatypes on MI300. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-06 08:59:59 +08:00
Changho Hwang	fc221e234d	Remove UB `std::` declarations (#709) Remove custom declarations inside `std::` whose behavior is undefined by the standard	2026-01-05 11:11:46 +08:00
Changho Hwang	2cf14ff723	Minor fixes (#715 )	2026-01-05 11:09:48 +08:00
Changho Hwang	bb555277ad	Rename `P2P` log subsys into `GPU` (#716 )	2026-01-05 11:08:43 +08:00
Binyang Li	ca6a4a3274	Replace `__HIP_PLATFORM_AMD__` to use internal macro (#712 ) Replacing most of checks for `__HIP_PLATFORM_AMD__` with `MSCCLPP_DEVICE_HIP` for device and `MSCCLPP_USE_ROCM` for host source file.	2026-01-04 04:47:58 -08:00
qishilu	b2d96e8ba5	Use uncached memory on Rocm platform to avoid hang (#711 ) MSCCLPP_DEVICE_HIP is undefined because it is defined in device.hpp. Use __HIP_PLATFORM_AMD__ here.	2025-12-24 10:49:36 +08:00
Changho Hwang	7b18a42274	Add copilot-instructions.md (#602 )	2025-12-22 22:15:40 -08:00
Binyang Li	eda74a7f29	Add handle cache for AMD platform (#698) Introduce a handle cache for the AMD platform to avoid reaching the handle limit when too many IPC handles are opened. NVIDIA does not need this feature, since it counts handle references internally and reuses the same handle if it is already open. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-21 18:39:12 -08:00
Caio Rocha	8d998820a3	Improve DSL Documentation (#707 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-19 15:17:08 -08:00
Changho Hwang	9e076da3d4	Make IB more configurable (#703 ) * Added `port` and `gidIndex` field in the IB endpoint config (and `deviceIndex` field for future usages) * Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so * Added `--ib_gid_index` CLI option to `mp_unit_tests` * Other minor fixes	2025-12-18 13:21:07 -08:00
Caio Rocha	11b7b35832	Creating Documentation Section for MSCCL++ DSL (#706 )	2025-12-15 15:07:01 -08:00
Changho Hwang	da60eb7f46	Add an IB multi-node tutorial (#702 )	2025-12-11 15:15:58 -08:00
Changho Hwang	51a86630ff	Build fixes (#696 ) * Fix CMake build for CUDA 13 * Add a missing header file	2025-11-26 20:02:01 -08:00
Changho Hwang	8b75634d31	Optimized logger (#693 ) * Leverage constant folding * Use `shouldLog()` function for early exit * Per-thread timestamp caching to remove mutex	2025-11-25 08:58:17 -08:00
Changho Hwang	ddf84a6b9d	Add `CudaDeviceGuard` (#691) Add an RAII guard that sets the proper GPU device before a CUDA API call. We may make this stateful in the future to minimize `cudaGetDevice()` calls. This PR fixes a bug in tutorial 01.	2025-11-24 13:38:44 -08:00
Caio Rocha	17247cd695	DSL Quick Start (#689 ) Fix #675 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-21 14:45:49 -08:00
Changho Hwang	8b8593ba51	Fix Python bindings and tests (#690 ) Minimal fix to make things work. We need a more careful look at preventing silent fallback of nanobind when it fails to (properly) construct a C++ STL object with mscclpp instances.	2025-11-21 12:53:12 -08:00
Caio Rocha	060c35fec6	No IB Env CI Test (#687 )	2025-11-19 11:13:03 -08:00
Caio Rocha	bbdeafb3ca	Fix Error in Non IB Env at Executor (#686 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-17 16:35:57 -08:00
Qinghua Zhou	b9428341a2	Revise the mscclpp datatype (#671 ) Use mscclpp::DataType to replace the following types in API interface: int dtype; ncclDataType_t dtype; Add data type conversion: Convert ncclDataType_t to mscclpp::DataType	2025-11-17 12:58:47 -08:00
Caio Rocha	a19bca9738	Fix Minor Issue Proxy Python Interface (#685 )	2025-11-17 09:03:00 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00


899 Commits