The prior commit skipped r==rank in the semaphore and port-channel
build loops on the theory that the self-slot handshake skew was the
cause of LL direction asymmetry. That was wrong (the real bug was
int32 atomic alignment), and skipping self breaks other code paths
that assume every rank slot is represented -- cross-node HT and LL
failed with cudaErrorInvalidResourceHandle at the first barrier after
Buffer init. Restore the self-inclusive loop.
Dropping the self ipc_cfg connection caused cudaErrorInvalidResourceHandle
on multi-node launches. Keep the self connection (needed by other code
paths that assume every rank is in the connections map) but continue to
skip the self slot in the semaphore + port-channel construction loops so
the kernel's [local_expert*num_ranks + dst_rank] indexing hits only peer
handles; the self slot is a zero-initialized placeholder since the
kernel's same-rank branch uses a direct warp copy.
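For illustration, the indexing contract looks roughly like this (helper and variable names are hypothetical, not the actual kernel code):
```
// Hypothetical sketch: the self slot exists only so the stride stays
// num_ranks; the same-rank branch never dereferences its handle.
int slot = local_expert * num_ranks + dst_rank;
if (dst_rank == rank) {
  warp_copy(local_dst, local_src, num_bytes);  // direct warp copy, no channel
} else if (lane_id == 0) {
  port_channel_handles[slot].put(dst_offset, src_offset, num_bytes);
}
```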
The low-latency dispatch/combine kernels signal recv counts via MSCCL++
PortChannel.atomicAdd, which lowers to IB IBV_WR_ATOMIC_FETCH_AND_ADD.
That opcode requires the remote address to be 8-byte aligned, but
LowLatencyLayout packed the per-expert signaling slots as int32. Odd
slots landed at offset % 8 == 4; the NIC silently dropped those atomics
and the target rank spun forever in recv_hook (observed: even->odd
direction works, odd->even does not, across all tested topologies
including 2-rank intra-node, 8-rank intra-node, and 2-node 1-GPU-each).
Widen dispatch_rdma_recv_count_buffer / combine_rdma_recv_flag_buffer to
int64_t, update clean kernel + kernel signatures + next_clean pointers
accordingly, and add int64_t overloads for st_na_release /
ld_acquire_sys_global in utils.cuh.
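For reference, a minimal sketch of the int64_t overload shape (illustrative only; the real utils.cuh helpers may carry extra cache hints such as L1::no_allocate to mirror the existing int32 versions):
```
__device__ __forceinline__ int64_t ld_acquire_sys_global(const int64_t* ptr) {
  int64_t v;
  asm volatile("ld.acquire.sys.global.u64 %0, [%1];" : "=l"(v) : "l"(ptr) : "memory");
  return v;
}

__device__ __forceinline__ void st_na_release(int64_t* ptr, int64_t v) {
  asm volatile("st.release.sys.global.u64 [%0], %1;" ::"l"(ptr), "l"(v) : "memory");
}
```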
Also drop the bogus self CUDA-IPC connection in Buffer::sync() that was
previously skewing the cross-rank buildAndAddSemaphore handshake order;
the kernel's same-rank branch uses a direct warp copy and never touches
the self port-channel slot (filled with a zero-initialized placeholder
so the [local_expert*num_ranks + dst_rank] indexing still holds).
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).
Applies to intranode HT, internode HT, and the (currently unreachable on
intra-node 8-GPU) LL test. Internode HT keeps the sync+barrier guard
between dispatch and combine but excludes it from either phase's timing.
Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports
per-iter latency (max across ranks, CUDA-event timed) and aggregate
effective bandwidth (sum across ranks, dispatch+combine payload bytes).
Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN.
The bench reuses the Buffer allocated for the correctness phase and skips
itself if the requested hidden size exceeds the per-peer NVL/RDMA budget.
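The per-phase timing follows the standard CUDA-event pattern; a minimal C++ sketch of the idea (the actual benchmark lives in the Python tests; the launcher name is a placeholder):
```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream);
for (int i = 0; i < iters; ++i) run_dispatch(stream);  // hypothetical per-phase launcher
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
float dispatch_us_per_iter = ms * 1000.0f / iters;     // then max-reduced across ranks
```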
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
DeepEP's original filter was safe because its LL path used NVSHMEM; this
port drives LL via PortChannel so the kernel indexes
port_channel_handles[local_expert*num_ranks + dst_rank] for every
dst_rank. All peers now get a real memory/connection/semaphore/port
channel entry.
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
functional round-trip, BF16 only). Works cross-node in DeepEP's
1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
intra-node 8-GPU LL currently hangs because every peer transfer routes
through the CPU proxy over IB loopback between distinct HCAs on the same
host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op, which
is misaligned for the 32-bit rdma_recv_count slots when used for same-node
peers. The proper fix needs a mixed-transport LL variant (MemoryChannel for
same-node, PortChannel for cross-node) or 64-bit counters.
Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).
- src/ext/ep/README.md: consolidate the previously duplicated README
into a single document; mark intranode and internode HT dispatch and
combine as validated in the status table; add a 'Running the tests'
section with torchrun examples for both the intranode and the 2x8
internode setups; record the dispatch->combine
torch.cuda.synchronize() + dist.barrier() requirement under Known
limitations; mark Phase 2 DONE and keep Phase 3 (LL) as structural
port, untested.
- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
Buffer constructor docstring to say internode HT is validated and
clarify that LL mode is untested on multi-node hardware.
- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported'
and 'low-latency paths still stubbed' comments. mscclpp_ep does not
use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
paths are a structural port that is present but untested, not stubbed.
Note validation on 2x H100x8 in the internode section header.
Two issues prevented internode HT combine from completing on 2x8 H100:
1. Wrong prefix matrices passed to internode_combine. Combine runs in the
reverse direction of dispatch, so it must consume the receiver-side
matrices returned by dispatch (recv_rdma_channel_prefix_matrix,
recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the
sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix.
This matches DeepEP's deep_ep/buffer.py::internode_combine handle
unpacking. Without the fix the NVL forwarder's 'NVL check' timed out
because token_start_idx/token_end_idx were computed against the wrong
per-channel layout.
2. Cross-rank race between dispatch and combine. Even with the correct
matrices, launching combine immediately after dispatch deadlocked the
forwarder NVL check (tail stuck one short of expected_head) because
peers still had in-flight dispatch proxy traffic while fast ranks had
already started combine. A torch.cuda.synchronize() + dist.barrier()
between the two calls makes the test pass deterministically on 16
ranks (combine diff == 0, max|expected| up to 60.0).
The barrier in the test is a workaround; the real fix belongs in
Buffer::internode_dispatch / Buffer::internode_combine so the
dispatch->combine handoff fully fences outstanding proxy work across
ranks. Marked with an XXX comment in the test.
The `internode` kernels index device-side port channel handles as
`port_channel_handles[channel_id * num_ranks + peer_rank]`, where
`peer_rank` is a global rank in [0, num_ranks). `Buffer::sync` was
building that table by iterating `std::unordered_map<int, MemoryId>`
(and similarly for connections/semaphores), which yields hash order
rather than ascending rank order. Once the cross-node fan-out grew
beyond a single peer, a local rank's trigger for peer `r` landed on
the semaphore/memory pair of a different peer, so RDMA puts and
atomic tail updates went to the wrong destination and the forwarder
spun on a tail counter that never advanced.
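The gist of the fix (helper names are hypothetical; the concrete changes are listed below):
```
// Rebuild the handle table in ascending rank order. A zero-initialized
// placeholder keeps the stride at num_ranks for ranks excluded by LL mode.
// handles: std::vector<mscclpp::PortChannelDeviceHandle>, uploaded to the device afterwards.
handles.resize(num_channels * num_ranks);
for (int ch = 0; ch < num_channels; ++ch) {
  for (int r = 0; r < num_ranks; ++r) {
    handles[ch * num_ranks + r] =
        is_excluded(r) ? mscclpp::PortChannelDeviceHandle{}
                       : build_port_channel_handle(ch, r);  // looks up connection/memory for rank r
  }
}
```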
Changes:
- Build `sema_ids` and `port_channel_handles` by iterating
`for (int r = 0; r < num_ranks; ++r)` and looking up the
connection / memory id for rank `r`, skipping ranks excluded by
low-latency mode (inserting a placeholder handle so the stride
stays `num_ranks`).
- Tag the RDMA-phase `sendMemory`/`recvMemory`/`connect` calls with
`kRdmaTag = 1` so they do not collide with NVL-phase tag-0
traffic between the same pair of ranks.
- Drop an unused `r` local in the NVL setup loop.
With this fix and a matched `libmscclpp.so` on both nodes, the
2-node x 8-GPU internode HT dispatch path completes successfully
(`[dispatch] OK`). Combine is still under investigation.
Also adds `test/python/ext/ep/test_internode_multirank.py`, a
torchrun-based 2-node functional test that exercises
`get_dispatch_layout` -> `internode_dispatch` -> `internode_combine`
and validates per-source-rank token values end-to-end.
Three issues blocked end-to-end intranode validation across multiple
ranks. This commit fixes them and adds a 2/4/8-rank functional test.
1. Combine receiver: OOB __shared__ read
In the combine receiver warp, the wait loop evaluated
`channel_tail_idx[recv_lane_id] <= expected_head` before the
`expected_head >= 0` guard. `channel_tail_idx` is a shared array
of size `kNumRanks`, but the loop runs on all 32 lanes of a warp,
so lanes with `recv_lane_id >= kNumRanks` indexed out of bounds.
compute-sanitizer reported "Invalid __shared__ read of size 4
bytes" at combine<bf16,2,768>+0xdd0, surfaced asynchronously as
cudaErrorIllegalAddress at the kernel launch site. Swap the
operands so the rank-bounds check short-circuits the shared read
(see the sketch after item 3).
2. Python bindings: UniqueId ABI
`mscclpp::UniqueId` is a `std::array<uint8_t, N>` which pybind11
auto-converts to a Python `list`, silently overriding any
`py::class_<UniqueId>` wrapper. Expose `create_unique_id` /
`connect` as lambdas that produce/consume `py::bytes` and memcpy
into a local `UniqueId`. Also coerce `bytes`->`bytearray` at the
Python call site for `sync()` whose signature expects
`pybind11::bytearray`.
3. Python frontend: communicator required for NVL-only sync
`Buffer::sync()` uses `communicator->connect(ipc_config, ...)` on
the pure-NVLink path, so the communicator must be initialized
even when `num_rdma_ranks == 1` and `low_latency_mode == False`.
Always broadcast the unique id and call `runtime.connect()`
before `sync()`.
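For reference, a minimal sketch of the issue-1 reordering (the condition shape follows the description above; the real kernel differs in detail):
```
// Before: the shared read executes on all 32 lanes, including lanes whose
// recv_lane_id >= kNumRanks, which indexes channel_tail_idx[] out of bounds.
while (channel_tail_idx[recv_lane_id] <= expected_head && expected_head >= 0);

// After: the guard (false for lanes without a valid rank slot) short-circuits
// the shared read, so out-of-range lanes never touch the array.
while (expected_head >= 0 && channel_tail_idx[recv_lane_id] <= expected_head);
```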
Validation on a single H100x8 node via torchrun:
- 2 ranks: dispatch 195 tokens, combine diff=0
- 4 ranks: dispatch 371 tokens, combine diff=0
- 8 ranks: dispatch 456 tokens, combine diff=0
Test harness added at test/python/ext/ep/test_intranode_multirank.py.
Port DeepEP's pure-RDMA low-latency (LL) MoE kernels from
csrc/kernels/internode_ll.cu (branch chhwang/dev-atomic-add-cleanup)
into the MSCCL++ EP extension. NVSHMEM / IBGDA device primitives are
replaced with MSCCL++ PortChannelDeviceHandle operations:
nvshmemx_barrier_all_block() -> port-channel signal+wait ring
nvshmemi_ibgda_put_nbi_warp(...) -> lane-0 PortChannel.put(...)
nvshmemi_ibgda_amo_nonfetch_add(...) -> lane-0 PortChannel.atomicAdd(...)
The atomicAdd path relies on the MSCCL++ Connection::atomicAdd /
PortChannelDeviceHandle::atomicAdd API cherry-picked from branch
chhwang/new-atomic-add; the LL dispatch path uses a signed delta
(-num_tokens_sent - 1) which the new int64_t signature supports.
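A sketch of the translated count signal (names illustrative; the offset/value signature follows the new int64_t atomicAdd API described above):
```
// NVSHMEM -> MSCCL++ translation of the per-expert count signal (sketch).
// Was: nvshmemi_ibgda_amo_nonfetch_add(count_ptr, -num_tokens_sent - 1, dst_rank, ...)
if (lane_id == 0) {
  port_channel.atomicAdd(recv_count_offset, (int64_t)(-num_tokens_sent - 1));
}
```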
Changes:
* New file src/ext/ep/kernels/internode_ll.cu (~530 lines) with the
three kernels clean_low_latency_buffer, dispatch<kUseFP8,...>,
combine<...> plus their launchers. rdma_buffer_ptr is threaded
through the launchers so the kernel can translate virtual addresses
into registered-memory offsets expected by MSCCL++.
* kernels/api.cuh: replace the single stub signature with full LL
launcher prototypes.
* buffer.cc: replace the four LL throw-stubs
(clean_low_latency_buffer, low_latency_dispatch,
low_latency_combine, get_next_low_latency_combine_buffer) with
torch-Tensor implementations ported from DeepEP/csrc/deep_ep.cpp.
* Drop src/ext/ep/internode_stub.cc and its CMake entry.
* python/mscclpp/ext/ep/buffer.py: remove the low_latency_mode=True
NotImplementedError guard; update docstring.
* test/python/ext/ep/test_ep_smoke.py: rename
test_low_latency_rejected -> test_low_latency_buffer_construct
to reflect that LL construction is now accepted.
* src/ext/ep/README.md: update status matrix, document the
NVSHMEM -> MSCCL++ translation table, and list the known
limitations.
This is a structural port: the kernels compile, link, and pass the
single-rank smoke tests, but end-to-end behaviour on multi-node H100
is not yet validated. Two known caveats:
1. Performance will NOT match IBGDA because MSCCL++ port channels
use a CPU proxy; this port is for functional parity, not latency.
2. Buffer::sync() in LL mode only connects peers that share the
same local GPU id (DeepEP convention), so the LL kernels assume
a one-GPU-per-node topology (num_ranks == num_rdma_ranks).
Multi-GPU-per-node LL layouts will need a follow-up in sync().
Tested:
cmake --build build -j --target mscclpp_ep_cpp # builds clean
pytest test/python/ext/ep/test_ep_smoke.py # 3 passed
Port DeepEP's high-throughput MoE dispatch/combine kernels onto MSCCL++
as an optional build target `mscclpp_ep_cpp`, gated by -DMSCCLPP_BUILD_EXT_EP
(OFF by default). Sources are lifted from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and rebased onto upstream MSCCL++ APIs;
the NVSHMEM / IBGDA dependencies are replaced with `PortChannel` +
`MemoryChannel` + the new `Connection::atomicAdd` primitive.
Scope
-----
Intranode (NVLink-only):
* `Buffer` ctor/dtor: cudaMalloc nvl workspace, export IPC handle,
allocate FIFO + peer-pointer tables, start `ProxyService`.
* `sync()`: import peer IPC handles, upload peer pointer table,
build `MemoryDevice2DeviceSemaphore` + `MemoryChannel` per peer.
* `get_dispatch_layout`, `intranode_dispatch`, `intranode_combine`
ported verbatim (torch::Tensor ABI preserved).
Internode HT (NVLink + RDMA):
* `sync()` RDMA branch: cudaMalloc RDMA buffer + `bootstrap->barrier()`
(replacing NVSHMEM symmetric-heap allocation); register with
`all_transport`, exchange via `sendMemory`/`recvMemory`, build 12 IB
QPs/peer + 16 semaphores/peer + 16 port channels/peer.
* Full `internode.cu` port (notify_dispatch / dispatch / cached_notify
/ combine / get_dispatch_layout). The 4 raw `ChannelTrigger` atomic
sites are rewritten to call the new
`PortChannelDeviceHandle::atomicAdd(offset, value)` API; the single
`nvshmem_fence()` is replaced with `__threadfence_system()` (remote
visibility guaranteed by the subsequent port-channel barrier).
* `internode_dispatch` / `internode_combine` host code ported, with
the torch tensor marshalling and CPU spin-wait on mapped counters.
Low-latency (pure RDMA):
* Not ported. `low_latency_dispatch`, `low_latency_combine`,
`clean_low_latency_buffer`, `get_next_low_latency_combine_buffer`
throw `std::runtime_error`; the Python frontend refuses to
construct a Buffer with `low_latency_mode=True`.
Python layer
------------
* New pybind11 + libtorch Python extension `mscclpp_ep_cpp` (separate
from the nanobind `_mscclpp` because the EP ABI carries
`torch::Tensor` / `at::cuda::CUDAStream`).
* `mscclpp.ext.ep.Buffer` mirrors `deep_ep.Buffer`; exchanges device
IDs, IPC handles and the bootstrap UniqueId over the user's
`torch.distributed` process group before calling `sync()`.
* `mscclpp.ext` auto-imports `ep` if the extension is built.
Build
-----
* `src/ext/ep/CMakeLists.txt`: finds Python + Torch; warns and skips if
`CMAKE_PREFIX_PATH` doesn't point at `torch.utils.cmake_prefix_path`.
Falls back to Torch's bundled pybind11 if a standalone pybind11 is not
installed. Links `libtorch_python` explicitly (without it, `import
mscclpp_ep_cpp` fails with `undefined symbol: THPDtypeType`).
* Top-level `CMakeLists.txt` exposes the `MSCCLPP_BUILD_EXT_EP` option
(default OFF).
Tests
-----
* `test/python/ext/ep/test_ep_smoke.py`: skipped if the extension isn't
built. Covers Config round-trip, low-latency size hint, and the LL
construction guard. Multi-rank functional tests still to do on H100.
Notes
-----
* Builds against the preceding "atomic add" commit which adds
`Connection::atomicAdd` and `PortChannelDeviceHandle::atomicAdd` to
upstream MSCCL++.
* Intranode path verified end-to-end (build + import + smoke tests).
* Internode HT is code-complete but requires real IB hardware to
validate; see `src/ext/ep/README.md` for the detailed port plan and
remaining LL migration.
## Support Python wheel build
This PR modernizes the Python packaging for MSCCL++ by defining
dependencies and optional extras in `pyproject.toml`, enabling proper
wheel builds with `pip install ".[cuda12]"`.
### Changes
**`pyproject.toml`**
- Add `dependencies` (numpy, blake3, pybind11, sortedcontainers)
- Add `optional-dependencies` for platform-specific CuPy (`cuda11`,
`cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras
- Bump minimum Python version from 3.8 to 3.10
**`test/deploy/setup.sh`**
- Use `pip install ".[<platform>,benchmark,test]"` instead of separate
`pip install -r requirements_*.txt` + `pip install .` steps
- Add missing CUDA 13 case
**`docs/quickstart.md`**
- Update install instructions to use extras (e.g., `pip install
".[cuda12]"`)
- Document all available extras and clarify that `rocm6` builds CuPy
from source
- Update Python version references to 3.10
**`python/csrc/CMakeLists.txt`**, **`python/test/CMakeLists.txt`**
- Update `find_package(Python)` from 3.8 to 3.10
### Notes
- The `requirements_*.txt` files are kept for Docker base image builds
where only dependencies (not the project itself) should be installed.
- CuPy is intentionally not in base dependencies — users must specify a
platform extra to get the correct pre-built wheel (or source build for
ROCm).
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem
`nccl-test.yml` was the only CI template calling `deploy.yml` without
passing `gpuArch`. Since the CI build machine has no GPU, CMake fell
back to building for **all** supported architectures (`80;90;100;120`),
unnecessarily slowing down CI builds.
## Fix
- Add `gpuArch` parameter to `nccl-test.yml` and forward it to
`deploy.yml`
- Pass `gpuArch: '80'` (A100) and `gpuArch: '90'` (H100) from
`nccl-api-test.yml`
All other templates were already passing `gpuArch` correctly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- **Multi-node H100 CI setup**: Improve architecture detection and GPU
configuration
- **Remove hardcoded VMSS hostnames** from deploy files
- **Fix CUDA compat library issue**: Remove stale compat paths from
Docker image for CUDA 12+. Instead, `peer_access_test` now returns a
distinct exit code (2) for CUDA init failure, and `setup.sh`
conditionally adds compat libs only when needed. This fixes
`cudaErrorSystemNotReady` (error 803) when the host driver is newer than
the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with
tar+scp+untar to avoid per-file SSH overhead.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major enhancements to the IB signal forwarding mechanisms
(`host-no-atomic` mode): add support for GDRCopy and MLX5 Direct Verbs,
and refactor the signal forwarding path. The changes fix memory
consistency issues and reduce signaling latency.
- GDRCopy and MLX5 Direct Verbs MR integration
- Signal forwarding path redesign
- Semaphore and connection API updates
- Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates
The reduce-send operation in the DSL essentially combines the reduce and put
operations. The put operation carries the channel-type information, whereas
previously we were using the channel type from the reduce operation.
The v0.9.0 conf.py (introduced in #775) dynamically loads the version
from python/mscclpp/_version.py.
This file is generated at build time by setuptools_scm and is listed in
.gitignore — it is never committed to the repo. Earlier tags (v0.8.0 and
below) used a hardcoded release (e.g., "v0.8.0") in conf.py, so they had
no dependency on generated files.
sphinx-multiversion checks out each tag using git archive, which only
extracts committed files.
Since _version.py is not committed, the v0.9.0 checkout is missing it,
and conf.py crashes on import. All future tags will have this same
problem.
**Three changes:**
1. docs/build_multiversion.py (new): A wrapper around
sphinx-multiversion that monkey-patches copy_tree to generate
_version.py in each tag checkout after extraction. The version string is
parsed from the tag name (e.g., v0.9.0 → __version__ = "0.9.0").
2. Makefile: The multiversion target now calls build_multiversion.py
instead of sphinx-multiversion directly.
3. conf.py: Added a fallback so that if _version.py doesn't exist, it
reads the version from the VERSION file instead. This makes conf.py
resilient for any future scenario where _version.py is missing.
**Testing**
Verified locally:
- `make multiversion` now successfully builds all 11 versions (v0.4.0 through v0.9.0)
- v0.9.0 docs are correctly generated under `_build/html/v0.9.0/`
- The version selector shows v0.9.0 as latest
## Summary
Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including
optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using
inline assembly.
Extends the allpair packet and fullmesh allreduce kernels to support
higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data,
improving numerical accuracy.
Adds Python tests to verify that higher-precision accumulation is at
least as accurate as native FP8 accumulation across all algorithm
variants.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- **Add `fp8_e4m3b15` datatype**: A software-defined FP8 type with 4
exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375).
Implemented entirely in software with no HW dependency, using
Triton-style bit manipulation through fp16 as intermediate for efficient
conversion.
- **Add mixed-precision accumulation for allreduce**: All allreduce
algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and
others) now support a configurable `accumDtype` parameter, enabling FP8
inputs to be reduced in float16 or float32 for higher accuracy.
- **Propagate `accumDtype` through the full API**: The new parameter is
threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc`
→ dispatch → CUDA kernels, with `DataType::AUTO` as the default
(resolves to input dtype at runtime).
- **Add FP8 accumulation correctness tests**: New `test_fp8_accum.py`
validates that higher-precision accumulation produces results at least
as accurate as native FP8 accumulation across multiple algorithms and
sizes. Skipped on CUDA SM < 89 (pre-Ada GPUs such as the A100); runs on HIP/ROCm.
- **Add `test_fp8_accum.py` to CI**: Azure Pipeline `ut.yml` now runs
FP8 accumulation tests alongside existing pytests.
- **NCCL shim logging cleanup**: Migrated `printf`-style `WARN`/`INFO`
calls to streaming-style logging.
## Key files
| Area | Files |
|------|-------|
| New datatype + vector ops | `include/mscclpp/gpu_data_types.hpp` |
| Accumulation reduce helpers | `src/core/include/reduce_kernel.hpp` |
| Algorithm API (`accumDtype`) | `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` |
| Allreduce kernels | `src/ext/collectives/allreduce/*.cu` |
| Dispatch + common | `src/ext/collectives/include/allreduce/common.hpp` |
| Python bindings | `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` |
| Tests | `python/test/test_fp8_accum.py` |
| CI | `.azure-pipelines/templates/ut.yml` |
## Test plan
- [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15
accumulation tests
- [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
- [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP
- [x] Existing `test_mscclpp.py` tests continue to pass
- [x] NCCL shim builds and runs correctly with new `accumDtype` defaults
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- Fix `run-remote.sh` to correctly execute multi-command scripts (e.g.,
multiple `mpirun` calls)
- The old approach piped decoded script through `base64 -d | bash`,
which feeds the script via bash's **stdin**. When `mpirun` (or its child
processes) runs, it can consume the remaining stdin, causing bash to
never see subsequent commands — only the first command would execute.
- The fix decodes the script to a **temp file** and runs `bash -euxo
pipefail "$TMP"` instead, so bash reads commands from the file and
`mpirun` consuming stdin has no effect.
- Applied to both the docker path (pssh + docker exec) and the
non-docker path (pssh only).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction (sketched after this list)
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency
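A minimal sketch of the NVIDIA-side fire-and-forget add (the AMD path keeps the `(void)fetch_add` form noted above; names are illustrative):
```
// Remote increment with release ordering at system scope and no return value.
__device__ __forceinline__ void signalAdd(uint64_t* remoteInboundToken, uint64_t value) {
  asm volatile("red.release.sys.global.add.u64 [%0], %1;"
               ::"l"(remoteInboundToken), "l"(value)
               : "memory");
}
```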
### Performance (H100, 2 ranks, signal+wait round-trip)
```
SemaphorePerfTest.SignalPingPong:
Store-based (old): 2.595 us/iter
Red-based (new): 2.345 us/iter
Speedup: 1.11x
```
## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter
"SemaphorePerfTest"` — 1.11x speedup
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### Summary
Update the installer to place bundled default execution plans under
`<MSCCLPP_CACHE_DIR>/default`, which is where the runtime already looks
for bundled plans.
### Background
The C++ runtime treats `MSCCLPP_CACHE_DIR` as the cache *root* and loads
bundled default plans from `<cache root>/default`.
When `MSCCLPP_CACHE_DIR` was set, the installer instead wrote bundled
plans
directly into the cache root, causing the runtime to miss them.
This surfaced while running benchmarking tests with a non-default
`MSCCLPP_CACHE_DIR`, where the bundled plans were not being discovered.
### Change
This PR updates the installer to always install bundled default plans
into
`<MSCCLPP_CACHE_DIR>/default`, preserving the existing runtime contract.
### Scope
- Installer-only change
- No runtime behavior changes
### Validation
- Manual inspection of the updated install path
- Successful build
---------
Co-authored-by: Ekow Wellington <t-ekoww@microsoft.com>
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- `--exclude-perf-tests` flag and substring-based negative filtering
- `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to
Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancies & make more readable
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
## Summary
Fix a use-after-free where the CUDA allocation handle
(`CUmemGenericAllocationHandle`) was released prematurely while the
exported fabric handle still referenced it.
## Problem
Unlike POSIX FD handles (where the kernel keeps the allocation alive via
the open file descriptor), fabric handles do not hold their own
reference to the underlying allocation. The original code called
`cuMemRelease(allocHandle)` immediately after exporting the fabric
handle, freeing the allocation. When a remote process later tries to
`cuMemImportFromShareableHandle` using that fabric handle, it references
a freed allocation — a **use-after-free**.
This affected both code paths:
1. **`GpuIpcMemHandle::create()`**: The local `allocHandle` obtained via
`cuMemRetainAllocationHandle` was released right after fabric export,
leaving the fabric handle dangling.
2. **`GpuIpcMemHandle::createMulticast()`**: The `allocHandle` from
`cuMulticastCreate` was unconditionally released, even when it was the
only thing keeping the multicast object alive for the fabric handle.
## Fix
- **Added `allocHandle` field** to the `fabric` struct in
`GpuIpcMemHandle` to store the allocation handle and keep it alive for
the lifetime of the `GpuIpcMemHandle`.
- **`create()`**: Retain an additional reference via
`cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when
a fabric handle is successfully exported (see the sketch after this list).
- **`createMulticast()`**: Store the `allocHandle` directly in
`fabric.allocHandle` instead of unconditionally releasing it. Only
release if fabric export was not used.
- **`deleter()`**: Release `fabric.allocHandle` via `cuMemRelease` when
the handle type includes `Fabric`, ensuring proper cleanup.
- **`GpuIpcMem` constructor (importer side)**: Clear
`fabric.allocHandle` after importing, since the importer gets its own
handle via `cuMemImportFromShareableHandle` and should not release the
exporter's allocation handle.
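A sketch of the exporter-side pattern described above (error handling elided; struct field names are illustrative):
```
// create(): keep a reference to the allocation alive for as long as the
// exported fabric handle can still be imported by a remote process.
CUmemGenericAllocationHandle allocHandle;
cuMemRetainAllocationHandle(&allocHandle, basePtr);                 // +1 reference
cuMemExportToShareableHandle(&handle.fabric.handle, allocHandle,
                             CU_MEM_HANDLE_TYPE_FABRIC, 0);
handle.fabric.allocHandle = allocHandle;                            // released by deleter()
// Importer side: clear fabric.allocHandle after cuMemImportFromShareableHandle,
// so the importer never releases the exporter's reference.
```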
## Files Changed
- `src/core/include/gpu_ipc_mem.hpp` — Added
`CUmemGenericAllocationHandle allocHandle` to fabric struct.
- `src/core/gpu_ipc_mem.cc` — Retain/release allocation handle properly
across create, createMulticast, deleter, and importer paths.
## Summary
This PR addresses a multicast resource leak, fixes `cuMemMap` offset
handling for multicast handles, renames NVLS allreduce algorithm classes
for clarity, and adds a new unit test for `SwitchChannel`.
### Bug Fixes
#### 1. Fix multicast allocation handle leak in `createMulticast()`
(`gpu_ipc_mem.cc`)
`GpuIpcMemHandle::createMulticast()` called
`cuMulticastCreate(&allocHandle, ...)` but never released the local
`allocHandle` after exporting it to shareable handles (POSIX FD /
Fabric). This caused a reference count leak — the multicast object was
never freed even after all mappings and imported handles were released.
Per the [CUDA Driver API docs for
`cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html):
> *"The memory allocation will be freed when all outstanding mappings to
the memory are unmapped and when all outstanding references to the
handle (including its shareable counterparts) are also released."*
The fix adds `cuMemRelease(allocHandle)` after export, matching the
existing pattern used for regular allocations in
`GpuIpcMemHandle::create()`.
**Impact:** Without this fix, repeated creation/destruction of NVLS
connections causes OOM after ~120 iterations when allocating 1GB
multicast buffers on H100.
#### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`)
`cuMemMap` requires `offset=0` for multicast handles. Previously, the
code attempted to map at a non-zero offset within the multicast object,
leading to errors when binding multiple buffers to the same
`NvlsConnection`. The fix maps the entire range `[0, mcOffset +
bufferSize)` and returns the pointer offset by `mcOffset`. This only
consumes extra virtual address space; no additional physical memory is
used.
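A sketch of the mapping scheme (error handling elided; variable names are illustrative):
```
// Map the whole multicast range at offset 0 (required for multicast handles),
// then hand back the pointer advanced by mcOffset. Only virtual address space
// is over-reserved; no extra physical memory is consumed.
CUdeviceptr base;
size_t mapSize = mcOffset + bufferSize;
cuMemAddressReserve(&base, mapSize, granularity, /*addr*/ 0, /*flags*/ 0);
cuMemMap(base, mapSize, /*offset*/ 0, mcHandle, /*flags*/ 0);
cuMemSetAccess(base, mapSize, &accessDesc, 1);
void* ptr = reinterpret_cast<void*>(base + mcOffset);
```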
### Refactoring
#### 3. Rename NVLS allreduce algorithm classes
Renamed for clarity:
- `AllreduceNvls` → `AllreduceNvlsZeroCopy`
- `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline`
- `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline`
Updated all references in builder, selector, docs, and examples.
#### 4. Move `nvlsConnections` setup to `initialize()`
Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this
member) to individual algorithm class members, initialized in their
`initialize()` methods.
### Tests
#### 5. Add `TwoChannelsSameConnection` test
New unit test that creates two `SwitchChannel` instances from the same
`NvlsConnection`, performs reduce operations on both, and verifies
correctness. This exercises the multi-bind path that triggered the
`cuMemMap` offset fix.
### Files Changed
- `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap
offset fix
- `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` —
nvlsConnections fix
- `src/ext/collectives/include/allreduce/*.hpp` — renamed headers
- `src/ext/collectives/algorithm_collection_builder.cc` — updated
references
- `src/ext/nccl/algorithm_selector.cc` — updated algorithm names
- `test/mp_unit/switch_channel_tests.cu` — new test
- `docs/guide/mscclpp-torch-integration.md` — updated names
- `examples/torch-integration/customized_comm_with_default_algo.py` —
updated names
## Summary
Fix NCCL fallback communicator cleanup errors and update CI to use
stable NCCL releases.
## Problem
When using `LD_PRELOAD=libmscclpp_nccl.so` with NCCL fallback enabled,
the following warnings appear at program exit:
```
NCCL WARN commReclaim: cleanup comm 0x55a0dcadaa90 rank 3 failed in destroy/abort, error 3
```
This is caused by three bugs in the NCCL fallback communicator lifecycle
management.
## Root Causes & Fixes
### 1. Symbol interposition during NCCL cleanup (`RTLD_DEEPBIND`)
**Root cause:** When the fallback NCCL library is loaded via `dlopen`,
its internal calls to its own public API functions (e.g.,
`ncclCommWindowDeregister`, `ncclMemFree`) during `commFree` cleanup are
intercepted by our `LD_PRELOAD`'d stub functions, which return errors.
This causes NCCL's `commReclaim` to report `error 3`
(`ncclSystemError`).
**Fix:** Add `RTLD_DEEPBIND` to the `dlopen` flags. This makes the
dlopen'd NCCL library resolve its own symbols internally first,
bypassing our interposition layer for internal calls.
### 2. Missing `ncclCommFinalize` forwarding
**Root cause:** `CommFinalize` was not in the `mscclppNcclOps_t` struct
and was never loaded via `dlsym`. So `ncclCommFinalize` never forwarded
to the real NCCL's finalize, which is required before `ncclCommDestroy`
in NCCL 2.29+.
**Fix:** Add `CommFinalize` to the ops struct and load it via `dlsym`.
Forward the call in `ncclCommFinalize`.
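A combined sketch of fixes 1 and 2 (flags other than RTLD_DEEPBIND, and all names besides the NCCL symbols, are illustrative):
```
#include <dlfcn.h>
#include <nccl.h>

using CommFinalizeFn = ncclResult_t (*)(ncclComm_t);

static CommFinalizeFn loadRealCommFinalize(const char* ncclLibPath) {
  // Fix 1: RTLD_DEEPBIND makes the dlopen'd NCCL resolve its own public symbols
  // internally first, so its commFree cleanup no longer bounces through the
  // LD_PRELOAD'd interposition stubs.
  void* ncclLib = dlopen(ncclLibPath, RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (ncclLib == nullptr) return nullptr;
  // Fix 2: load ncclCommFinalize alongside the other entry points so our
  // ncclCommFinalize can forward to the real library before ncclCommDestroy.
  return reinterpret_cast<CommFinalizeFn>(dlsym(ncclLib, "ncclCommFinalize"));
}
```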
### 3. CI: Use latest NCCL release tag
The CI pipeline was cloning the NCCL default branch (which may contain
unreleased/unstable code). Updated to fetch the latest release tag via
GitHub API and clone that specific tag.
## Testing
Verified with the exact CI command:
```bash
mpirun -np 8 --bind-to numa --allow-run-as-root \
-x LD_PRELOAD=libmscclpp_nccl.so \
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE \
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce" \
-x MSCCLPP_NCCL_LIB_PATH=/root/nccl/build/lib/libnccl.so \
all_reduce_perf -b 1K -e 1G -f 2 -d half -G 1 -w 10 -n 100
```
- **Before:** `commReclaim: error 3` warnings on all 8 ranks at exit
- **After:** Clean exit, no warnings, correct results
## Files Changed
- `src/ext/nccl/nccl.cc` — Fix comm destroy lifecycle (RTLD_DEEPBIND,
CommFinalize forwarding, destroy order)
- `.azure-pipelines/templates/nccl-test.yaml` — Use latest NCCL release
tag in CI
I recently encountered a weird memory usage issue.
After starting the proxy service on a CUDA device X > 0, I noticed an
unexpected thread entity appear on both GPU X and GPU 0, where GPU 0's
share is about 500 MB. Note that when the device is 0, there is no
extra memory usage.
The image clearly shows that when 8 ranks each use one GPU and start
proxies, GPU 0 sees 7 extra threads, each consuming about 500 MB of
extra memory.
<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153"
src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4"
/>
After tracking down when it happens, I identified the root cause in the
Proxy thread initialization:
```
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
pimpl_->threadInit();
```
The call to cudaThreadExchangeStreamCaptureMode() actually triggers some
resource allocation on the "current device", which is still 0 for the
starting thread. The later threadInit() is then too late to set the
correct device. The fix is simple: call threadInit() before the first
CUDA call:
```
pimpl_->threadInit();
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
```
This guarantees that the current device is properly set before calling
any resource-allocating cuda functions.
This is the memory usage after the fix; the extra memory usage is gone.
![Screenshot: GPU memory usage after the fix](https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9)
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
## Summary
Add CI pipeline support for testing in environments without InfiniBand
(IB) hardware.
## Changes
### IB stubs for no-IB builds (`src/core/ib.cc`)
- Added stub implementations for `IbMr` and `IbQp` classes in the `#else
// !defined(USE_IBVERBS)` block so the library links successfully when
built with `-DMSCCLPP_USE_IB=OFF`.
### Environment variable to disable IB tests
(`MSCCLPP_DISABLE_IB_TESTS`)
- Added `disableIbTests` field to the `Env` class
(`include/mscclpp/env.hpp`, `src/core/env.cpp`), reading from
`MSCCLPP_DISABLE_IB_TESTS` env var.
- Exposed as `disable_ib_tests` in Python bindings
(`python/csrc/env_py.cpp`).
- Updated `python/test/test_mscclpp.py` to skip IB-dependent tests
(`create_group_and_connection` with IB transport, `test_h2h_semaphores`,
`test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is
true.
### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`)
The no-IB environment pipeline runs two phases:
1. **No-IB build phase**: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run
unit tests, multi-process unit tests, and pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`).
2. **IB build phase**: Rebuild with IB enabled (default), stop the
existing container, redeploy, and run pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build
works correctly in a non-IB environment when IB tests are skipped.
Also increased the job timeout from 40 to 60 minutes to accommodate the
two-phase pipeline.
This PR adds example code for switch channel testing. It validates the
switch channel in single-node and multi-node environments. We still need
to add a description of the algorithms and an explanation of the code
under docs.
Example outputs:

rank 0:
```
./bidir_switch_channel 10.0.5.233:45571 0 0
Rank 0 (GPU 0): Preparing for tests ...
Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s
Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s
Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s
Rank 0: Succeed!
```
rank 1:
```
./bidir_switch_channel 10.0.5.233:45571 1 0
Rank 1 (GPU 0): Preparing for tests ...
Rank 1: Succeed!
```
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
Addresses an installation issue seen in some environments.
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.
This makes sure the buffer is not deallocated when ownership is
transferred from C++ to Python.