## Summary
Fix a use-after-free where the CUDA allocation handle
(`CUmemGenericAllocationHandle`) was released prematurely while the
exported fabric handle still referenced it.
## Problem
Unlike POSIX FD handles (where the kernel keeps the allocation alive via
the open file descriptor), fabric handles do not hold their own
reference to the underlying allocation. The original code called
`cuMemRelease(allocHandle)` immediately after exporting the fabric
handle, freeing the allocation. When a remote process later tries to
`cuMemImportFromShareableHandle` using that fabric handle, it references
a freed allocation — a **use-after-free**.
This affected both code paths:
1. **`GpuIpcMemHandle::create()`**: The local `allocHandle` obtained via
`cuMemRetainAllocationHandle` was released right after fabric export,
leaving the fabric handle dangling.
2. **`GpuIpcMemHandle::createMulticast()`**: The `allocHandle` from
`cuMulticastCreate` was unconditionally released, even when it was the
only thing keeping the multicast object alive for the fabric handle.
## Fix
- **Added `allocHandle` field** to the `fabric` struct in
`GpuIpcMemHandle` to store the allocation handle and keep it alive for
the lifetime of the `GpuIpcMemHandle`.
- **`create()`**: Retain an additional reference via
`cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when
a fabric handle is successfully exported.
- **`createMulticast()`**: Store the `allocHandle` directly in
`fabric.allocHandle` instead of unconditionally releasing it. Only
release if fabric export was not used.
- **`deleter()`**: Release `fabric.allocHandle` via `cuMemRelease` when
the handle type includes `Fabric`, ensuring proper cleanup.
- **`GpuIpcMem` constructor (importer side)**: Clear
`fabric.allocHandle` after importing, since the importer gets its own
handle via `cuMemImportFromShareableHandle` and should not release the
exporter's allocation handle.
## Files Changed
- `src/core/include/gpu_ipc_mem.hpp` — Added
`CUmemGenericAllocationHandle allocHandle` to fabric struct.
- `src/core/gpu_ipc_mem.cc` — Retain/release allocation handle properly
across create, createMulticast, deleter, and importer paths.
## Summary
This PR addresses a multicast resource leak, fixes `cuMemMap` offset
handling for multicast handles, renames NVLS allreduce algorithm classes
for clarity, and adds a new unit test for `SwitchChannel`.
### Bug Fixes
#### 1. Fix multicast allocation handle leak in `createMulticast()`
(`gpu_ipc_mem.cc`)
`GpuIpcMemHandle::createMulticast()` called
`cuMulticastCreate(&allocHandle, ...)` but never released the local
`allocHandle` after exporting it to shareable handles (POSIX FD /
Fabric). This caused a reference count leak — the multicast object was
never freed even after all mappings and imported handles were released.
Per the [CUDA Driver API docs for
`cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html):
> *"The memory allocation will be freed when all outstanding mappings to
the memory are unmapped and when all outstanding references to the
handle (including its shareable counterparts) are also released."*
The fix adds `cuMemRelease(allocHandle)` after export, matching the
existing pattern used for regular allocations in
`GpuIpcMemHandle::create()`.
**Impact:** Without this fix, repeated creation/destruction of NVLS
connections causes OOM after ~120 iterations when allocating 1GB
multicast buffers on H100.
#### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`)
`cuMemMap` requires `offset=0` for multicast handles. Previously, the
code attempted to map at a non-zero offset within the multicast object,
leading to errors when binding multiple buffers to the same
`NvlsConnection`. The fix maps the entire range `[0, mcOffset +
bufferSize)` and returns the pointer offset by `mcOffset`. This only
consumes extra virtual address space; no additional physical memory is
used.
### Refactoring
#### 3. Rename NVLS allreduce algorithm classes
Renamed for clarity:
- `AllreduceNvls` → `AllreduceNvlsZeroCopy`
- `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline`
- `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline`
Updated all references in builder, selector, docs, and examples.
#### 4. Move `nvlsConnections` setup to `initialize()`
Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this
member) to individual algorithm class members, initialized in their
`initialize()` methods.
### Tests
#### 5. Add `TwoChannelsSameConnection` test
New unit test that creates two `SwitchChannel` instances from the same
`NvlsConnection`, performs reduce operations on both, and verifies
correctness. This exercises the multi-bind path that triggered the
`cuMemMap` offset fix.
### Files Changed
- `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap
offset fix
- `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` —
nvlsConnections fix
- `src/ext/collectives/include/allreduce/*.hpp` — renamed headers
- `src/ext/collectives/algorithm_collection_builder.cc` — updated
references
- `src/ext/nccl/algorithm_selector.cc` — updated algorithm names
- `test/mp_unit/switch_channel_tests.cu` — new test
- `docs/guide/mscclpp-torch-integration.md` — updated names
- `examples/torch-integration/customized_comm_with_default_algo.py` —
updated names
I recently encountered a weird memory usage issue.
After starting the proxy service on a cuda device X > 0, I notice an
unexpected thread entity apprear on both the GPU X and GPU 0, where GPU
0's share is about 500MB. Note that when the device is 0, there is no
extra memory usage.
The image clearly shows that when 8 ranks each using one GPU and
starting proxies, the GPU 0 sees 7 extra threads, each consuming 500MB
extra memory.
<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153"
src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4"
/>
After tracking down to when it happens, I identified the root cause in
Proxy thread initialization.
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
pimpl_->threadInit();
The call to cudaThreadExchangeStreamCaptureMode() actually triggers some
resource allocation on the "current device" which is still 0 for the
starting thread.
The later threadInit() is too late to set the correct GPU number.
The fix is simple: call threadInit() before the first cuda call:
pimpl_->threadInit();
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
This guarantees that the current device is properly set before calling
any resource-allocating cuda functions.
This is the memory usage after the fix. The extra memory usages are
gone.
<img width="1242" height="459" alt="Image (1)"
src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9"
/>
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
## Summary
Add CI pipeline support for testing in environments without InfiniBand
(IB) hardware.
## Changes
### IB stubs for no-IB builds (`src/core/ib.cc`)
- Added stub implementations for `IbMr` and `IbQp` classes in the `#else
// !defined(USE_IBVERBS)` block so the library links successfully when
built with `-DMSCCLPP_USE_IB=OFF`.
### Environment variable to disable IB tests
(`MSCCLPP_DISABLE_IB_TESTS`)
- Added `disableIbTests` field to the `Env` class
(`include/mscclpp/env.hpp`, `src/core/env.cpp`), reading from
`MSCCLPP_DISABLE_IB_TESTS` env var.
- Exposed as `disable_ib_tests` in Python bindings
(`python/csrc/env_py.cpp`).
- Updated `python/test/test_mscclpp.py` to skip IB-dependent tests
(`create_group_and_connection` with IB transport, `test_h2h_semaphores`,
`test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is
true.
### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`)
The no-IB environment pipeline runs two phases:
1. **No-IB build phase**: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run
unit tests, multi-process unit tests, and pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`).
2. **IB build phase**: Rebuild with IB enabled (default), stop the
existing container, redeploy, and run pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build
works correctly in a non-IB environment when IB tests are skipped.
Also increased the job timeout from 40 to 60 minutes to accommodate the
two-phase pipeline.
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.
Make sure the buffer will not be deallocated when transfer ownership
from cpp to python
This PR refactors the algorithm selection logic in MSCCL++ and
introduces support for symmetric memory configuration through
environment variables.
1. Algorithm Selection Refactoring
Use separate class for algo selection. Could introduce more complex
logic for algo selection based on message size, arch, if cuda graph is
enabled and memory allocation method
2. Symmetric Memory Support
Introduced symmetricMemory parameter in algorithm context key
generation. Remove disableChannelCache env as is ambiguous
3. Add new args for build_default_algorithms
Add flag_buffer, and flag_buffer_size args to build default algorithm.
Then we could use unified flag buffer for different algorithms, avoid
application hanging when switch algo for different message size.
---------
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Support is being added for fusing the ReadPutPacket operation on DSL,
which reduces the overhead caused by reading packet data multiple times
in the scratch buffer. Fusion will occur when two rppkt operations are
executed consecutively with the same src_buffer:
rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1]
Co-authored-by: Binyang Li <binyli@microsoft.com>
* Added configurable InfiniBand (IB) signaling mode.
`EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`,
`HostNoAtomic`). `Default` is equivalent to `Host` unless specified
different by envrionment `MSCCLPP_IBV_MODE`. `Host` corresponds to the
previous implementation using RDMA atomics for signaling, while
`HostNoAtomic` uses write-with-immediate instead.
* Regarding updates in Python bindings and API.
- Put the common reduce kernel to reduce_kernel.hpp
- Implement operator overloading for the vector type
- Clean up the duplicated code at `executor_ kernel.hpp` and
`allreduce/common.hpp`