## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency
### Performance (H100, 2 ranks, signal+wait round-trip)
```
SemaphorePerfTest.SignalPingPong:
Store-based (old): 2.595 us/iter
Red-based (new): 2.345 us/iter
Speedup: 1.11x
```
## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter
"SemaphorePerfTest"` — 1.11x speedup
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- `--exclude-perf-tests` flag and substring-based negative filtering
- `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to
Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancies & make more readable
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
## Summary
This PR addresses a multicast resource leak, fixes `cuMemMap` offset
handling for multicast handles, renames NVLS allreduce algorithm classes
for clarity, and adds a new unit test for `SwitchChannel`.
### Bug Fixes
#### 1. Fix multicast allocation handle leak in `createMulticast()`
(`gpu_ipc_mem.cc`)
`GpuIpcMemHandle::createMulticast()` called
`cuMulticastCreate(&allocHandle, ...)` but never released the local
`allocHandle` after exporting it to shareable handles (POSIX FD /
Fabric). This caused a reference count leak — the multicast object was
never freed even after all mappings and imported handles were released.
Per the [CUDA Driver API docs for
`cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html):
> *"The memory allocation will be freed when all outstanding mappings to
the memory are unmapped and when all outstanding references to the
handle (including its shareable counterparts) are also released."*
The fix adds `cuMemRelease(allocHandle)` after export, matching the
existing pattern used for regular allocations in
`GpuIpcMemHandle::create()`.
**Impact:** Without this fix, repeated creation/destruction of NVLS
connections causes OOM after ~120 iterations when allocating 1GB
multicast buffers on H100.
#### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`)
`cuMemMap` requires `offset=0` for multicast handles. Previously, the
code attempted to map at a non-zero offset within the multicast object,
leading to errors when binding multiple buffers to the same
`NvlsConnection`. The fix maps the entire range `[0, mcOffset +
bufferSize)` and returns the pointer offset by `mcOffset`. This only
consumes extra virtual address space; no additional physical memory is
used.
### Refactoring
#### 3. Rename NVLS allreduce algorithm classes
Renamed for clarity:
- `AllreduceNvls` → `AllreduceNvlsZeroCopy`
- `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline`
- `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline`
Updated all references in builder, selector, docs, and examples.
#### 4. Move `nvlsConnections` setup to `initialize()`
Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this
member) to individual algorithm class members, initialized in their
`initialize()` methods.
### Tests
#### 5. Add `TwoChannelsSameConnection` test
New unit test that creates two `SwitchChannel` instances from the same
`NvlsConnection`, performs reduce operations on both, and verifies
correctness. This exercises the multi-bind path that triggered the
`cuMemMap` offset fix.
### Files Changed
- `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap
offset fix
- `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` —
nvlsConnections fix
- `src/ext/collectives/include/allreduce/*.hpp` — renamed headers
- `src/ext/collectives/algorithm_collection_builder.cc` — updated
references
- `src/ext/nccl/algorithm_selector.cc` — updated algorithm names
- `test/mp_unit/switch_channel_tests.cu` — new test
- `docs/guide/mscclpp-torch-integration.md` — updated names
- `examples/torch-integration/customized_comm_with_default_algo.py` —
updated names
- Remove test/perf/ directory (fifo_test.cu, framework.{cc,hpp}, CMakeLists.txt)
- Remove add_subdirectory(perf) from test/CMakeLists.txt
- Performance tests now integrated into unit_tests as fifo_perf_tests.cu
- Fix mp_unit_tests.cc to use framework functions without ::testing:: namespace
- Fix bootstrap_tests.cc ErrorCode comparison to use ASSERT_TRUE
- Fix switch_channel_tests.cu to not use streaming with ASSERT_EQ
- Add missing #include <unistd.h> to executor_tests.cc
All perf test functionality is now in unit_tests and can be filtered
with --exclude-perf-tests flag. The standalone test/perf/ directory
is no longer needed.
Verified builds:
- unit_tests: ✅
- mp_unit_tests: ✅
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Make MPI REQUIRED for test builds (clearer error messages)
- Add project include directories to test_framework library
- Fix core_tests.cc to use custom framework correctly
- Fix mp_unit_tests.hpp to use mscclpp::test namespace
- Add FAIL() macro with streaming support for test messages
- Building tests now works in Docker environment with GPU bypass
Tests can now be built using:
docker run --rm -v $(pwd):/workspace -w /workspace \
ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
"mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
-DMSCCLPP_USE_CUDA=ON .. && make -j"
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Modified test/mp_unit/mp_unit_tests.hpp to use ../framework.hpp instead of gtest/gtest.h
- Enhanced test/framework.hpp with GTest-compatible APIs:
- Added Environment base class for global test setup/teardown
- Added TestInfo and UnitTest classes for test metadata access
- Added GTEST_SKIP macro support via SkipHelper class
- Added namespace alias 'testing' for compatibility
- Added InitGoogleTest and AddGlobalTestEnvironment helper functions
- Updated test/framework.cc with implementations for new classes
- All mp_unit test files now use framework.hpp through mp_unit_tests.hpp
- Formatting applied via lint.sh
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
* Added configurable InfiniBand (IB) signaling mode.
`EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`,
`HostNoAtomic`). `Default` is equivalent to `Host` unless specified
different by envrionment `MSCCLPP_IBV_MODE`. `Host` corresponds to the
previous implementation using RDMA atomics for signaling, while
`HostNoAtomic` uses write-with-immediate instead.
* Regarding updates in Python bindings and API.
Introduce handle cache for AMD platform.
Avoid reaching handle limitation if we open too much IPC handles
For nvidia, we don't need this feature since nvidia will count the
handle reference internally and reuse the same handle if already be
opened
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
* Added `port` and `gidIndex` field in the IB endpoint config (and
`deviceIndex` field for future usages)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
* Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB
on/off
* Fix `nvidia-peermem` check; no need for DMABUF supported systems
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
Use unix socket to share fd to other processes. Used for nvls handle sharing
Update nccl interface to support worldSize=1
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The PR contains following changes:
Python side:
- Channel based DSL implementation: decouple channel with chunk.
- Users create channel explicitly, only need local_rank, remote_rank and
channel_type
- Adjust executor json file, add remote_buffer fields, different op can
use different channel and remote buffers combination.
- Reimplement operation fusion, data dependency check mechanism
- Add new op such as semaphore, pipeline
- Clean code and enhance document
C++ side:
- Support new execution file json format
- Support semaphore and pipeline operation
- code clean, support non-zero copy scenario
---------
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Change to use smart pointer for IB structure. Registered memory will own
ibMr, ibCtx will not held the reference
- Use smart pointer for IbQp and IbMr
- Update memoryChannel API, keep localRegisteredMemory
- Close fd when registedMemory released
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.