## Summary
Fix NCCL fallback communicator cleanup errors and update CI to use
stable NCCL releases.
## Problem
When using `LD_PRELOAD=libmscclpp_nccl.so` with NCCL fallback enabled,
the following warnings appear at program exit:
```
NCCL WARN commReclaim: cleanup comm 0x55a0dcadaa90 rank 3 failed in destroy/abort, error 3
```
This is caused by two bugs in the NCCL fallback communicator lifecycle
management, compounded by CI building against unreleased NCCL code.
## Root Causes & Fixes
### 1. Symbol interposition during NCCL cleanup (`RTLD_DEEPBIND`)
**Root cause:** When the fallback NCCL library is loaded via `dlopen`,
its internal calls to its own public API functions (e.g.,
`ncclCommWindowDeregister`, `ncclMemFree`) during `commFree` cleanup are
intercepted by our `LD_PRELOAD`'d stub functions, which return errors.
This causes NCCL's `commReclaim` to report `error 3`
(`ncclSystemError`).
**Fix:** Add `RTLD_DEEPBIND` to the `dlopen` flags. This makes the
dlopen'd NCCL library resolve its own symbols internally first,
bypassing our interposition layer for internal calls.
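For illustration, a minimal sketch of the loading pattern, assuming the
library path comes from `MSCCLPP_NCCL_LIB_PATH` (the helper name is
hypothetical; the actual code in `src/ext/nccl/nccl.cc` is structured
differently):
```cpp
// g++ defines _GNU_SOURCE by default, which exposes RTLD_DEEPBIND in <dlfcn.h>.
#include <dlfcn.h>
#include <stdexcept>
#include <string>

// Hypothetical helper illustrating the flag change.
void* loadFallbackNccl(const std::string& path) {
  // RTLD_DEEPBIND makes the loaded library prefer its own definitions of
  // its public symbols (e.g. ncclMemFree), so internal cleanup calls in
  // commFree no longer route through our LD_PRELOAD'd stubs.
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (handle == nullptr) {
    throw std::runtime_error(std::string("dlopen failed: ") + dlerror());
  }
  return handle;
}
```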
### 2. Missing `ncclCommFinalize` forwarding
**Root cause:** `CommFinalize` was not in the `mscclppNcclOps_t` struct
and was never loaded via `dlsym`. So `ncclCommFinalize` never forwarded
to the real NCCL's finalize, which is required before `ncclCommDestroy`
in NCCL 2.29+.
**Fix:** Add `CommFinalize` to the ops struct and load it via `dlsym`.
Forward the call in `ncclCommFinalize`.
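A hedged sketch of the forwarding pattern (types come from `nccl.h`; the
real `mscclppNcclOps_t` layout, symbol loading, and fallback gating
differ):
```cpp
#include <dlfcn.h>
#include <nccl.h>  // ncclComm_t, ncclResult_t, ncclSuccess

struct mscclppNcclOps_t {
  // ... other forwarded NCCL entry points ...
  ncclResult_t (*CommFinalize)(ncclComm_t);
};

static mscclppNcclOps_t gNcclOps;  // hypothetical global, filled at library load

void loadFallbackSymbols(void* handle) {
  // ... existing dlsym lookups ...
  gNcclOps.CommFinalize = reinterpret_cast<ncclResult_t (*)(ncclComm_t)>(
      dlsym(handle, "ncclCommFinalize"));
}

// Our interposed entry point forwards to the real NCCL; since 2.29,
// finalize must run before ncclCommDestroy for a clean shutdown.
ncclResult_t ncclCommFinalize(ncclComm_t comm) {
  if (gNcclOps.CommFinalize != nullptr) return gNcclOps.CommFinalize(comm);
  return ncclSuccess;
}
```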
### 3. CI: Use latest NCCL release tag
The CI pipeline was cloning the NCCL default branch, which may contain
unreleased/unstable code. It now fetches the latest release tag via the
GitHub API and clones that specific tag.
## Testing
Verified with the exact CI command:
```bash
mpirun -np 8 --bind-to numa --allow-run-as-root \
-x LD_PRELOAD=libmscclpp_nccl.so \
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE \
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce" \
-x MSCCLPP_NCCL_LIB_PATH=/root/nccl/build/lib/libnccl.so \
all_reduce_perf -b 1K -e 1G -f 2 -d half -G 1 -w 10 -n 100
```
- **Before:** `commReclaim: error 3` warnings on all 8 ranks at exit
- **After:** Clean exit, no warnings, correct results
## Files Changed
- `src/ext/nccl/nccl.cc` — Fix comm destroy lifecycle (RTLD_DEEPBIND,
CommFinalize forwarding, destroy order)
- `.azure-pipelines/templates/nccl-test.yaml` — Use latest NCCL release
tag in CI
I recently encountered an odd memory usage issue.
After starting the proxy service on a CUDA device X > 0, an unexpected
thread entry appears on both GPU X and GPU 0, where GPU 0's share is
about 500 MB. Note that when the device is 0, there is no extra memory
usage.
The image below shows that with 8 ranks, each using one GPU and starting
a proxy, GPU 0 sees 7 extra threads, each consuming about 500 MB of
extra memory.
<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153"
src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4"
/>
After tracking down when it happens, I identified the root cause in the
proxy thread initialization:
```cpp
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

pimpl_->threadInit();
```
The call to `cudaThreadExchangeStreamCaptureMode()` triggers resource
allocation on the "current device", which is still device 0 for the
newly started thread. The later `threadInit()` comes too late to set the
correct GPU. The fix is simple: call `threadInit()` before the first
CUDA call:
```cpp
pimpl_->threadInit();

// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
```
This guarantees that the current device is properly set before any
resource-allocating CUDA function is called.
This is the memory usage after the fix; the extra memory usage is gone.
<img width="1242" height="459" alt="Image (1)"
src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9"
/>
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
## Summary
Add CI pipeline support for testing in environments without InfiniBand
(IB) hardware.
## Changes
### IB stubs for no-IB builds (`src/core/ib.cc`)
- Added stub implementations for `IbMr` and `IbQp` classes in the `#else
// !defined(USE_IBVERBS)` block so the library links successfully when
built with `-DMSCCLPP_USE_IB=OFF`.
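A minimal sketch of the stub pattern, with hypothetical member functions
(the real `IbMr`/`IbQp` interfaces are larger): each body throws instead
of calling into libibverbs, so a no-IB build links, and any accidental
IB use fails loudly.
```cpp
#if !defined(USE_IBVERBS)

#include <stdexcept>

namespace mscclpp {

class IbMr {
 public:
  // getBuff() is a hypothetical member shown for illustration.
  void* getBuff() const { throw std::runtime_error("built without ibverbs"); }
};

class IbQp {
 public:
  // postSend() is likewise hypothetical.
  void postSend() { throw std::runtime_error("built without ibverbs"); }
};

}  // namespace mscclpp

#endif  // !defined(USE_IBVERBS)
```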
### Environment variable to disable IB tests
(`MSCCLPP_DISABLE_IB_TESTS`)
- Added a `disableIbTests` field to the `Env` class
(`include/mscclpp/env.hpp`, `src/core/env.cpp`), read from the
`MSCCLPP_DISABLE_IB_TESTS` env var (a minimal sketch follows this list).
- Exposed as `disable_ib_tests` in Python bindings
(`python/csrc/env_py.cpp`).
- Updated `python/test/test_mscclpp.py` to skip IB-dependent tests
(`create_group_and_connection` with IB transport, `test_h2h_semaphores`,
`test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is
true.
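A minimal sketch of the env plumbing, assuming a simple boolean parser
(the actual helpers and `Env` layout in `src/core/env.cpp` may differ):
```cpp
#include <cstdlib>
#include <string>

namespace {
// Hypothetical parser: treats "1"/"true"/"TRUE" as enabled.
bool readBoolEnv(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr) return false;
  std::string s(value);
  return s == "1" || s == "true" || s == "TRUE";
}
}  // namespace

struct Env {
  // ... existing fields ...
  const bool disableIbTests;
  Env() : disableIbTests(readBoolEnv("MSCCLPP_DISABLE_IB_TESTS")) {}
};
```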
### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`)
The no-IB environment pipeline runs two phases:
1. **No-IB build phase**: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run
unit tests, multi-process unit tests, and pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`).
2. **IB build phase**: Rebuild with IB enabled (default), stop the
existing container, redeploy, and run pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build
works correctly in a non-IB environment when IB tests are skipped.
Also increased the job timeout from 40 to 60 minutes to accommodate the
two-phase pipeline.
This PR adds example code for switch channel testing. It validates the
switch channel in single-node and multi-node environments. The
description of the algorithms and an explanation of the code still need
to be added under the docs.
Example outputs:

rank 0:
```
./bidir_switch_channel 10.0.5.233:45571 0 0
Rank 0 (GPU 0): Preparing for tests ...
Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s
Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s
Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s
Rank 0: Succeed!
```

rank 1:
```
./bidir_switch_channel 10.0.5.233:45571 1 0
Rank 1 (GPU 0): Preparing for tests ...
Rank 1: Succeed!
```
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
This also addresses an installation issue seen in some environments.
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.
This makes sure the buffer is not deallocated when ownership is
transferred from C++ to Python.
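A minimal pybind11-style sketch of the ownership pattern (the project's
actual binding code and buffer type differ; `FlagBuffer` and
`defaultFlagArray` are hypothetical names): the owning `shared_ptr` is
stashed in a capsule and passed as the array's base object, so Python
keeps the C++ buffer alive for as long as any view of it exists.
```cpp
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>

#include <cstdint>
#include <memory>
#include <vector>

namespace py = pybind11;

using FlagBuffer = std::vector<uint64_t>;  // hypothetical buffer type

py::array_t<uint64_t> defaultFlagArray(std::shared_ptr<FlagBuffer> buf) {
  // Heap-allocate a copy of the shared_ptr; the capsule destructor
  // releases it only when Python drops its last reference, so the
  // underlying memory cannot be deallocated prematurely.
  auto* keepAlive = new std::shared_ptr<FlagBuffer>(buf);
  py::capsule owner(keepAlive, [](void* p) {
    delete static_cast<std::shared_ptr<FlagBuffer>*>(p);
  });
  return py::array_t<uint64_t>({buf->size()}, {sizeof(uint64_t)},
                               buf->data(), owner);
}
```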