Commit Graph

174 Commits

Author SHA1 Message Date
Xingbo Wu
69565a2f32 Do threadInit/cudaSetDevice before other cuda calls (#757)
I recently encountered a weird memory usage issue.
After starting the proxy service on a cuda device X > 0, I notice an
unexpected thread entity apprear on both the GPU X and GPU 0, where GPU
0's share is about 500MB. Note that when the device is 0, there is no
extra memory usage.
The image clearly shows that when 8 ranks each using one GPU and
starting proxies, the GPU 0 sees 7 extra threads, each consuming 500MB
extra memory.
<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153"
src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4"
/>


After tracking down to when it happens, I identified the root cause in
Proxy thread initialization.

    // never capture in a proxy thread
    auto mode = cudaStreamCaptureModeRelaxed;
    MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

    pimpl_->threadInit();

The call to cudaThreadExchangeStreamCaptureMode() actually triggers some
resource allocation on the "current device" which is still 0 for the
starting thread.
The later threadInit() is too late to set the correct GPU number.

The fix is simple: call threadInit() before the first cuda call:

    pimpl_->threadInit();
    // never capture in a proxy thread
    auto mode = cudaStreamCaptureModeRelaxed;
    MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

This guarantees that the current device is properly set before calling
any resource-allocating cuda functions.

This is the memory usage after the fix. The extra memory usages are
gone.

<img width="1242" height="459" alt="Image (1)"
src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9"
/>

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-03-02 15:53:59 -08:00
Caio Rocha
4bc1999001 Adding Support to Setting Message Size Range in Native Algorithm API (#758) 2026-02-27 17:50:43 -08:00
Binyang Li
39865c218b address flagBuffer ownership issue (#749)
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.

Make sure the buffer will not be deallocated when transfer ownership
from cpp to python
2026-02-20 13:42:29 -08:00
Binyang Li
4701ae3a95 Update dtype name (#748)
- Change FP8_E4M3/FP8_E5M2 to FLOAT8_E4M3/FLOAT8_E5M2
- Add torch.uint8 to DataType.uint8 mapping
2026-02-18 10:35:44 -08:00
Qinghua Zhou
edc9c38751 Support uint8 data type for Allreduce (#736)
Support uint8 data type for Allreduce.
Current limitation: uint8 is not supported for NVLS.

Performance results with RCCL-test with MSCCLPP on MI300X:


\# out-of-place in-place
\# size count type redop root time algbw busbw #wrong time algbw busbw
#wrong
\# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 | 512 | half | sum | -1 | 5.39 | 0.19 | 0.33 | 0 | 5.45 | 0.19 |
0.33 | 0
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
2048 | 1024 | half | sum | -1 | 5.53 | 0.37 | 0.65 | 0 | 5.63 | 0.36 |
0.64 | 0
4096 | 2048 | half | sum | -1 | 5.55 | 0.74 | 1.29 | 0 | 5.56 | 0.74 |
1.29 | 0
8192 | 4096 | half | sum | -1 | 5.8 | 1.41 | 2.47 | 0 | 5.84 | 1.4 |
2.46 | 0
16384 | 8192 | half | sum | -1 | 6.57 | 2.49 | 4.36 | 0 | 6.56 | 2.5 |
4.37 | 0
32768 | 16384 | half | sum | -1 | 8.02 | 4.09 | 7.15 | 0 | 8.06 | 4.07 |
7.11 | 0
65536 | 32768 | half | sum | -1 | 8.77 | 7.47 | 13.07 | 0 | 8.82 | 7.43
| 13 | 0
131072 | 65536 | half | sum | -1 | 9.61 | 13.64 | 23.87 | 0 | 9.78 |
13.4 | 23.45 | 0
262144 | 131072 | half | sum | -1 | 11.68 | 22.44 | 39.27 | 0 | 12.1 |
21.67 | 37.93 | 0
524288 | 262144 | half | sum | -1 | 13.77 | 38.08 | 66.64 | 0 | 13.87 |
37.79 | 66.13 | 0
1048576 | 524288 | half | sum | -1 | 19.11 | 54.87 | 96.03 | 0 | 19.27 |
54.42 | 95.24 | 0
2097152 | 1048576 | half | sum | -1 | 24.1 | 87 | 152.26 | 0 | 24.24 |
86.52 | 151.41 | 0
4194304 | 2097152 | half | sum | -1 | 37.16 | 112.87 | 197.52 | 0 |
37.44 | 112.03 | 196.06 | 0
8388608 | 4194304 | half | sum | -1 | 61.53 | 136.33 | 238.58 | 0 |
61.68 | 135.99 | 237.99 | 0
16777216 | 8388608 | half | sum | -1 | 108.8 | 154.22 | 269.88 | 0 |
109.2 | 153.6 | 268.79 | 0
33554432 | 16777216 | half | sum | -1 | 197.8 | 169.68 | 296.94 | 0 |
198.6 | 168.92 | 295.61 | 0
67108864 | 33554432 | half | sum | -1 | 384.6 | 174.51 | 305.39 | 0 |
385.1 | 174.27 | 304.98 | 0
134217728 | 67108864 | half | sum | -1 | 754.1 | 177.99 | 311.48 | 0 |
754.9 | 177.78 | 311.12 | 0
268435456 | 134217728 | half | sum | -1 | 1491.8 | 179.94 | 314.89 | 0 |
1493.2 | 179.77 | 314.6 | 0
536870912 | 268435456 | half | sum | -1 | 2979.6 | 180.18 | 315.31 | 0 |
2983.9 | 179.92 | 314.87 | 0


\# out-of-place in-place
\# size count type redop root time algbw busbw #wrong time algbw busbw
#wrong
\# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 | 1024 | fp8_e4m3 | sum | -1 | 5.4 | 0.19 | 0.33 | 0 | 5.45 | 0.19
| 0.33 | 0
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
2048 | 2048 | fp8_e4m3 | sum | -1 | 5.5 | 0.37 | 0.65 | 0 | 5.6 | 0.37 |
0.64 | 0
4096 | 4096 | fp8_e4m3 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.68 | 0.72
| 1.26 | 0
8192 | 8192 | fp8_e4m3 | sum | -1 | 5.96 | 1.38 | 2.41 | 0 | 5.98 | 1.37
| 2.4 | 0
16384 | 16384 | fp8_e4m3 | sum | -1 | 6.49 | 2.52 | 4.42 | 0 | 6.58 |
2.49 | 4.36 | 0
32768 | 32768 | fp8_e4m3 | sum | -1 | 8.09 | 4.05 | 7.09 | 0 | 8.15 |
4.02 | 7.03 | 0
65536 | 65536 | fp8_e4m3 | sum | -1 | 8.58 | 7.64 | 13.37 | 0 | 8.7 |
7.53 | 13.18 | 0
131072 | 131072 | fp8_e4m3 | sum | -1 | 9.44 | 13.88 | 24.29 | 0 | 9.62
| 13.63 | 23.85 | 0
262144 | 262144 | fp8_e4m3 | sum | -1 | 10.12 | 25.9 | 45.32 | 0 | 10.37
| 25.27 | 44.22 | 0
524288 | 524288 | fp8_e4m3 | sum | -1 | 13.73 | 38.19 | 66.82 | 0 |
13.89 | 37.74 | 66.04 | 0
1048576 | 1048576 | fp8_e4m3 | sum | -1 | 18.66 | 56.2 | 98.34 | 0 |
18.92 | 55.41 | 96.97 | 0
2097152 | 2097152 | fp8_e4m3 | sum | -1 | 24.54 | 85.46 | 149.56 | 0 |
24.63 | 85.16 | 149.03 | 0
4194304 | 4194304 | fp8_e4m3 | sum | -1 | 37.79 | 110.98 | 194.21 | 0 |
38.05 | 110.22 | 192.88 | 0
8388608 | 8388608 | fp8_e4m3 | sum | -1 | 62.22 | 134.82 | 235.94 | 0 |
62.63 | 133.94 | 234.4 | 0
16777216 | 16777216 | fp8_e4m3 | sum | -1 | 109.9 | 152.62 | 267.09 | 0
| 110.4 | 151.9 | 265.83 | 0
33554432 | 33554432 | fp8_e4m3 | sum | -1 | 201.1 | 166.82 | 291.94 | 0
| 202.3 | 165.84 | 290.22 | 0
67108864 | 67108864 | fp8_e4m3 | sum | -1 | 390 | 172.06 | 301.11 | 0 |
390.2 | 171.99 | 300.99 | 0
134217728 | 134217728 | fp8_e4m3 | sum | -1 | 763.9 | 175.7 | 307.47 | 0
| 764.2 | 175.62 | 307.34 | 0
268435456 | 268435456 | fp8_e4m3 | sum | -1 | 1509.5 | 177.83 | 311.2 |
0 | 1510.1 | 177.76 | 311.08 | 0
536870912 | 536870912 | fp8_e4m3 | sum | -1 | 3010.2 | 178.35 | 312.11 |
0 | 3014.2 | 178.11 | 311.7 | 0



\# out-of-place in-place
\# size count type redop root time algbw busbw #wrong time algbw busbw
#wrong
\# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 | 1024 | fp8_e5m2 | sum | -1 | 5.41 | 0.19 | 0.33 | 0 | 5.44 | 0.19
| 0.33 | 0
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
2048 | 2048 | fp8_e5m2 | sum | -1 | 5.5 | 0.37 | 0.65 | 0 | 5.67 | 0.36
| 0.63 | 0
4096 | 4096 | fp8_e5m2 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.69 | 0.72
| 1.26 | 0
8192 | 8192 | fp8_e5m2 | sum | -1 | 5.96 | 1.37 | 2.4 | 0 | 6 | 1.36 |
2.39 | 0
16384 | 16384 | fp8_e5m2 | sum | -1 | 6.63 | 2.47 | 4.32 | 0 | 6.59 |
2.49 | 4.35 | 0
32768 | 32768 | fp8_e5m2 | sum | -1 | 8.07 | 4.06 | 7.1 | 0 | 8.16 |
4.02 | 7.03 | 0
65536 | 65536 | fp8_e5m2 | sum | -1 | 8.62 | 7.61 | 13.31 | 0 | 8.73 |
7.51 | 13.14 | 0
131072 | 131072 | fp8_e5m2 | sum | -1 | 9.43 | 13.9 | 24.33 | 0 | 9.6 |
13.66 | 23.9 | 0
262144 | 262144 | fp8_e5m2 | sum | -1 | 10.11 | 25.94 | 45.39 | 0 |
10.38 | 25.26 | 44.21 | 0
524288 | 524288 | fp8_e5m2 | sum | -1 | 13.73 | 38.19 | 66.84 | 0 |
13.87 | 37.79 | 66.13 | 0
1048576 | 1048576 | fp8_e5m2 | sum | -1 | 18.65 | 56.22 | 98.39 | 0 |
18.93 | 55.38 | 96.92 | 0
2097152 | 2097152 | fp8_e5m2 | sum | -1 | 24.54 | 85.47 | 149.57 | 0 |
24.63 | 85.16 | 149.03 | 0
4194304 | 4194304 | fp8_e5m2 | sum | -1 | 37.84 | 110.83 | 193.96 | 0 |
38.01 | 110.36 | 193.12 | 0
8388608 | 8388608 | fp8_e5m2 | sum | -1 | 62.32 | 134.61 | 235.58 | 0 |
62.55 | 134.12 | 234.71 | 0
16777216 | 16777216 | fp8_e5m2 | sum | -1 | 110 | 152.58 | 267.01 | 0 |
110.3 | 152.12 | 266.21 | 0
33554432 | 33554432 | fp8_e5m2 | sum | -1 | 201.1 | 166.9 | 292.07 | 0 |
201.8 | 166.26 | 290.96 | 0
67108864 | 67108864 | fp8_e5m2 | sum | -1 | 390 | 172.07 | 301.12 | 0 |
390.5 | 171.87 | 300.78 | 0
134217728 | 134217728 | fp8_e5m2 | sum | -1 | 763.9 | 175.69 | 307.46 |
0 | 764.5 | 175.56 | 307.23 | 0
268435456 | 268435456 | fp8_e5m2 | sum | -1 | 1509.4 | 177.84 | 311.22 |
0 | 1509.8 | 177.8 | 311.14 | 0
536870912 | 536870912 | fp8_e5m2 | sum | -1 | 3013 | 178.18 | 311.82 | 0
| 3018 | 177.89 | 311.31 | 0


\# out-of-place in-place
\# size count type redop root time algbw busbw #wrong time algbw busbw
#wrong
\# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 | 1024 | uint8 | sum | -1 | 5.46 | 0.19 | 0.33 | 0 | 5.46 | 0.19 |
0.33 | 0
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
2048 | 2048 | uint8 | sum | -1 | 5.54 | 0.37 | 0.65 | 0 | 5.63 | 0.36 |
0.64 | 0
4096 | 4096 | uint8 | sum | -1 | 5.61 | 0.73 | 1.28 | 0 | 5.63 | 0.73 |
1.27 | 0
8192 | 8192 | uint8 | sum | -1 | 5.9 | 1.39 | 2.43 | 0 | 5.9 | 1.39 |
2.43 | 0
16384 | 16384 | uint8 | sum | -1 | 6.6 | 2.48 | 4.35 | 0 | 6.64 | 2.47 |
4.32 | 0
32768 | 32768 | uint8 | sum | -1 | 8.99 | 3.65 | 6.38 | 0 | 8.99 | 3.64
| 6.38 | 0
65536 | 65536 | uint8 | sum | -1 | 9.44 | 6.94 | 12.15 | 0 | 9.58 | 6.84
| 11.98 | 0
131072 | 131072 | uint8 | sum | -1 | 11.72 | 11.18 | 19.57 | 0 | 11.83 |
11.08 | 19.4 | 0
262144 | 262144 | uint8 | sum | -1 | 12.29 | 21.32 | 37.31 | 0 | 12.45 |
21.05 | 36.84 | 0
524288 | 524288 | uint8 | sum | -1 | 13.87 | 37.8 | 66.15 | 0 | 13.93 |
37.64 | 65.88 | 0
1048576 | 1048576 | uint8 | sum | -1 | 19.11 | 54.88 | 96.04 | 0 | 19.3
| 54.33 | 95.08 | 0
2097152 | 2097152 | uint8 | sum | -1 | 24.38 | 86.01 | 150.51 | 0 |
24.52 | 85.53 | 149.67 | 0
4194304 | 4194304 | uint8 | sum | -1 | 37.52 | 111.78 | 195.61 | 0 |
37.76 | 111.08 | 194.39 | 0
8388608 | 8388608 | uint8 | sum | -1 | 62.4 | 134.44 | 235.26 | 0 |
62.56 | 134.1 | 234.67 | 0
16777216 | 16777216 | uint8 | sum | -1 | 110.2 | 152.22 | 266.39 | 0 |
110.3 | 152.04 | 266.08 | 0
33554432 | 33554432 | uint8 | sum | -1 | 199.8 | 167.94 | 293.9 | 0 |
197.5 | 169.88 | 297.29 | 0
67108864 | 67108864 | uint8 | sum | -1 | 386.3 | 173.73 | 304.03 | 0 |
378.4 | 177.37 | 310.39 | 0
134217728 | 134217728 | uint8 | sum | -1 | 758 | 177.07 | 309.87 | 0 |
741.1 | 181.12 | 316.95 | 0
268435456 | 268435456 | uint8 | sum | -1 | 1500.1 | 178.95 | 313.16 | 0
| 1466.2 | 183.09 | 320.4 | 0
536870912 | 536870912 | uint8 | sum | -1 | 2991.7 | 179.45 | 314.04 | 0
| 2924.8 | 183.56 | 321.23 | 0

---------

Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>
2026-02-13 10:49:25 -08:00
Binyang Li
bd68319e3e Refactor algo selection logic and introduce symmetric_memory env (#741)
This PR refactors the algorithm selection logic in MSCCL++ and
introduces support for symmetric memory configuration through
environment variables.


1. Algorithm Selection Refactoring
Use separate class for algo selection. Could introduce more complex
logic for algo selection based on message size, arch, if cuda graph is
enabled and memory allocation method

2. Symmetric Memory Support
Introduced symmetricMemory parameter in algorithm context key
generation. Remove disableChannelCache env as is ambiguous

3. Add new args for build_default_algorithms 
Add flag_buffer, and flag_buffer_size args to build default algorithm.
Then we could use unified flag buffer for different algorithms, avoid
application hanging when switch algo for different message size.

---------

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2026-02-12 19:06:18 -08:00
Changho Hwang
42be3660e0 Add a new IB stack impl that doesn't use RDMA atomics (#728)
* Added configurable InfiniBand (IB) signaling mode.
`EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`,
`HostNoAtomic`). `Default` is equivalent to `Host` unless specified
different by envrionment `MSCCLPP_IBV_MODE`. `Host` corresponds to the
previous implementation using RDMA atomics for signaling, while
`HostNoAtomic` uses write-with-immediate instead.
* Regarding updates in Python bindings and API.
2026-02-10 01:07:53 +00:00
Qinghua Zhou
620378b4fb Fix cpplint error in main branch (#740)
Fix the legacy cpplint error in main branch.

---------

Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-05 09:25:12 -08:00
Binyang Li
dc747b1522 Refactor reduce kernel (#738)
- Put the common reduce kernel to reduce_kernel.hpp
- Implement operator overloading for the vector type
- Clean up the duplicated code at `executor_ kernel.hpp` and
`allreduce/common.hpp`
2026-02-05 09:23:43 -08:00
Binyang Li
e21513791a Address comments for PR #692 (#733)
Rename nanobind-exposed C++ types to Cpp*
Replace MSCCLPP_EXECUTION_PLAN_DIR / MSCCLPP_NATIVE_CACHE_DIR with
MSCCLPP_CACHE_DIR across C++ and Python.
2026-02-03 10:13:20 -08:00
mahdiehghazim
071dc92d38 fp8 nvls support (e5m2 and e4m3) (#730)
This PR adds FP8 support to the nvls code. For compilation, we need to
add this flag to the cmake command:

-DMSCCLPP_GPU_ARCHS=100a
2026-01-23 10:38:38 -05:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize current native algorithm implementation and DSL algorithm
implementation.
Provide unified API for DSL algo and native algo and provide interface
to tune the algo
Provide interface for pytorch integration with native API and DSL

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Changho Hwang
105239fc6c Use GpuIpcMem for NVLS connections (#719)
* Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast
memory handling.
* Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API
handles this automatically).
* Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a
shared pointer with custom deleter for unmapping, which prevents misuse
of raw pointers and reduces states to be stored in the `GpuIpcMem`
instance.
* Now for `RuntimeIpc` type handles, for consistency with other types,
`cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of
the ctor of `GpuIpcMem`.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-01-15 13:16:04 +08:00
Changho Hwang
a02ba3b1bd Add GpuIpcMemHandle (#704)
Add `GpuIpcMemHandle` that is a generic GPU memory handle that covers
all existing methods for GPU memory mapping. This PR fixes issues that
fail to properly fallback to a feasible type of memory handle on the
importing environment. It also consolidates code for creating or
destroying various memory handles into a single RAII wrapper.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-01-14 10:49:31 +08:00
Changho Hwang
fc221e234d Remove UB std:: declarations (#709)
Remove custom delcarations inside `std::` of which behaviors are
undefined by the standard
2026-01-05 11:11:46 +08:00
Changho Hwang
bb555277ad Rename P2P log subsys into GPU (#716) 2026-01-05 11:08:43 +08:00
Binyang Li
ca6a4a3274 Replace __HIP_PLATFORM_AMD__ to use internal macro (#712)
Replacing most of checks for `__HIP_PLATFORM_AMD__` with
`MSCCLPP_DEVICE_HIP` for device and `MSCCLPP_USE_ROCM` for host source
file.
2026-01-04 04:47:58 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` field in the IB endpoint config (and
`deviceIndex` field for future usages)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
2025-12-18 13:21:07 -08:00
Changho Hwang
ddf84a6b9d Add CudaDeviceGuard (#691)
Add an RAII guard that sets a proper GPU device before a CUDA API call.
We may change this stateful in the future to minimize `cudaGetDevice()`
calls. This PR fixes a bug of the tutorial 01.
2025-11-24 13:38:44 -08:00
Qinghua Zhou
b9428341a2 Revise the mscclpp datatype (#671)
Use mscclpp::DataType to replace the following types in API interface:
int dtype; 
ncclDataType_t dtype;

Add data type conversion:
Convert ncclDataType_t to mscclpp::DataType
2025-11-17 12:58:47 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
7eb3ff701a Supporting New Packet Kernel Operation at Executor (#677)
This PR introduces three new operations to enhance flexibility and
performance at executor.

One operation can be invoked directly via the DSL API and two operations
are created through fusion of existing operations, reducing overhead and
improving efficiency.

1. Port Channel Put Packet (Direct DSL API Call): Sends data from pkt
format to the remote side in pkt format via the port channel. Both
source and destination buffers must be scratch.

2. Reduce Copy Packet (Fusion):
Reduce Packet+Copy Packet=Reduce Copy Packet
Triggered when the destination buffer of Reduce Packet matches the
source buffer of Copy Packet.
Purpose: Combine reduction and copy into a single step for better
performance.

3. Reduce Copy Send Packet (Fusion):
Reduce Copy Packet+Put Packet=Reduce Copy Send Packet (when dst buffer
of Reduce Copy Packet matches src buffer of Put Packet)
Reduce Copy Packet+Read Put Packet=Reduce Copy Send Packet (when dst pkt
buffer of Reduce Copy Packet matches src buffer of Read Put Packet)
Purpose: Combine reduction, copy, and send operations into one optimized
pipeline.


Fusion Diagram
Reduce Packet + Copy Packet → Reduce Copy Packet
Reduce Copy Packet + Put Packet → Reduce Copy Send Packet
Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet

Beyond this, this PR adjust the AllReduce 2 Node algorithm:

Message Size  |  Latency (µs)
        1K            |     15.34
        2K            |     15.88
        4K            |     15.71
        8K            |     16.01
        16K          |     15.88
        32K          |     16.21
        64K          |     16.90
        128K        |     18.24
        256K        |     20.39
        512K        |     25.26
        1M           |     32.74
        2M           |     53.64
2025-11-13 14:08:44 -08:00
Caio Rocha
eb202780f5 Support Synchronous Initialization for Proxy Service (#679) 2025-11-12 18:35:57 -08:00
Changho Hwang
ffafcaf6d6 IB stack enhancements & bug fixes (#673)
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
2025-11-07 14:26:17 -08:00
Changho Hwang
960a8ddebd Add a new logger (#668)
* Add `logger.hpp` that will gradually replace `debug.h`
* Minor fixes in `ib.cc`

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2025-11-04 10:32:46 -08:00
Binyang Li
5acac93dbc Integrate MSCCL++ DSL to torch workload (#620)
Provides two integration ways for MSCCL++ DSL.
1. Integrate with customized communication group
2. Integrate with NCCL API

Introduce new Python APIs to make it work:
```python
mscclpp.compile # compile dsl to json based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery
mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size....
```
Fix #556

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-29 15:39:00 -07:00
Qinghua Zhou
a38c2ee784 FP8 support for Allreduce (#646)
Add FP8 support for Allreduce on both NVIDIA and AMD platform.
Add new data type: fp8_e4m3 and fp8_e5m2

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-10-27 14:51:48 -07:00
Binyang Li
2b05908635 Add token pool for cuCreate API (#628)
Create a tokenPool to allocate token. This feature is used to support
inter node NVL and try to reduce the footprint caused by cuCreate

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-27 11:19:21 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Changho Hwang
200cdf946e Update EndpointConfig interfaces (#651)
* Separate IB-specific options into a nested struct
* Enable `connect()` by an `Endpoint`, not only by `EndpointConfig`
* Other minor changes
2025-10-22 10:39:39 -07:00
Binyang Li
3d94383696 Add MSCCLPP_GIT_COMMIT micro (#640)
- Add MSCCLPP_GIT_COMMIT micro
- Update docs
2025-10-06 15:57:28 -07:00
Binyang Li
70b8297c56 Revise NCCL API implementation (#617)
- Make nccl interface extensible. Customer can register their own algo
to NCCL API. User can provide customized algo selection function.
- Fallback to NCCL/RCCL if no algo is selected based on algo selection
function
- MSCCLPP interfaces now works for any scale
2025-09-26 10:08:12 -07:00
Binyang Li
5ac427610d Address teardown issue (#638)
Ignore cuda/cu errors during teardown. Some pointer may be invalid at this point
2025-09-25 12:12:40 -07:00
Changho Hwang
43f160c8e6 Fix for safe process teardown (#633)
* `gpuFree*()` functions are usually called during process teardown, so
we let them ignore regarding errors.
* `AvoidCudaGraphCaptureGuard` is constructed in `gpuFree*()` functions,
so it needs the same fix.
2025-09-10 20:28:04 -07:00
Changho Hwang
571fee16fb Add FifoDeviceHandle::poll() (#630) 2025-09-09 23:32:35 +00:00
Changho Hwang
c4d8781390 Fix memory exchange within a single process (#624) 2025-09-04 12:53:51 -07:00
Binyang Li
bb76d27553 all2all implementation (#609)
Implement single node all2all via MSCCL++ C++API
perf kernel 3:
```
       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.41   44.78   39.19      0
     2097152         65536                                     23.95   87.56   76.61      0
     4194304        131072                                     27.50  152.51  133.45      0
     8388608        262144                                     35.14  238.73  208.89      0
    16777216        524288                                     57.54  291.55  255.11      0
    33554432       1048576                                     109.7  305.81  267.59      0
    67108864       2097152                                     212.3  316.07  276.56      0
   134217728       4194304                                     410.9  326.64  285.81      0
   268435456       8388608                                     784.9  341.99  299.24      0
```

kernel 2
```

#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.42   44.77   39.17      0
     2097152         65536                                     24.96   84.02   73.52      0
     4194304        131072                                     28.53  147.03  128.65      0
     8388608        262144                                     36.75  228.28  199.75      0
    16777216        524288                                     58.01  289.20  253.05      0
    33554432       1048576                                     110.4  303.83  265.85      0
    67108864       2097152                                     212.4  315.99  276.49      0
   134217728       4194304                                     407.8  329.12  287.98      0
   268435456       8388608                                     797.4  336.64  294.56      0
```

NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        524288      half    none      -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576      half    none      -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152      half    none      -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304      half    none      -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608      half    none      -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216      half    none      -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432      half    none      -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
2025-08-14 11:30:40 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains following changes:
Python side:
- Channel based DSL implementation: decouple channel with chunk.
- Users create channel explicitly, only need local_rank, remote_rank and
channel_type
- Adjust executor json file, add remote_buffer fields, different op can
use different channel and remote buffers combination.
- Reimplement operation fusion, data dependency check mechanism
- Add new op such as semaphore, pipeline 
- Clean code and enhance document
C++ side: 
- Support new execution file json format
- Support semaphore and pipeline operation
- code clean, support non-zero copy scenario

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Changho Hwang
542a10f69e Merge ChannelTrigger with ProxyTrigger (#601) 2025-08-08 19:07:50 +00:00
Changho Hwang
9650e5c37e Update documentation (#576)
Documentation overhaul
2025-08-07 15:37:37 -07:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Change to use smart pointer for IB structure. Registered memory will own
ibMr, ibCtx will not held the reference
- Use smart pointer for IbQp and IbMr
- Update memoryChannel API, keep localRegisteredMemory
- Close fd when registedMemory released

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Changho Hwang
d55ac96f5e Fixed the local channel test (#597) 2025-08-05 15:33:48 -07:00
Changho Hwang
334b232e36 Fix GpuStreamPool to be aware of the device ID of streams (#590)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-04 11:07:31 -07:00
Changho Hwang
c580e4c503 Support CudaIpc connection within a single process (#593)
* Allow CudaIpc connection between GPUs in a single process
* Added an example of connection in a single process
* Minor interface updates

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-08-02 12:59:36 +08:00
Changho Hwang
aa28b06bf5 Fix relaxedWait() (#594) 2025-08-01 12:51:30 +08:00
Binyang Li
5e991cf5c8 update readme & bump version (#550)
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-07-12 01:00:18 -07:00
Changho Hwang
199468bc47 Revise NVLS interface (#458)
* Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel`
* Minor interface improvements
2025-07-12 00:33:03 +00:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Changho Hwang
3de6d5b63a Fix #557 (#560)
* Page-locking `Host2DeviceSemaphore::outboundSemaphore_` caused
unexpected performance issues so reverting it back. We may revisit this
later.
* Removed reference to connections from context as now connections refer
to context.
2025-06-30 11:33:19 -07:00