Commit Graph

968 Commits

Author SHA1 Message Date
Changho Hwang
bcb392ffdf updates 2026-03-08 03:33:51 +00:00
Changho Hwang
375bc13831 fix 2026-03-07 02:53:54 +00:00
Changho Hwang
c40a233f55 fix 2026-03-07 02:48:08 +00:00
Changho Hwang
e0c7ddb5ff fix 2026-03-07 02:33:20 +00:00
Changho Hwang
75ac8be225 fix 2026-03-07 02:31:51 +00:00
Changho Hwang
284d9139c9 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-03-06 18:26:02 -08:00
Changho Hwang
c699b8a784 az pipeline refactoring 2026-03-07 02:23:30 +00:00
Binyang Li
3751f0299b Fix NCCL fallback comm destroy and use latest NCCL release in CI (#760)
## Summary

Fix NCCL fallback communicator cleanup errors and update CI to use
stable NCCL releases.

## Problem

When using `LD_PRELOAD=libmscclpp_nccl.so` with NCCL fallback enabled,
the following warnings appear at program exit:

```
NCCL WARN commReclaim: cleanup comm 0x55a0dcadaa90 rank 3 failed in destroy/abort, error 3
```

This is caused by three bugs in the NCCL fallback communicator lifecycle
management.

## Root Causes & Fixes

### 1. Symbol interposition during NCCL cleanup (`RTLD_DEEPBIND`)

**Root cause:** When the fallback NCCL library is loaded via `dlopen`,
its internal calls to its own public API functions (e.g.,
`ncclCommWindowDeregister`, `ncclMemFree`) during `commFree` cleanup are
intercepted by our `LD_PRELOAD`'d stub functions, which return errors.
This causes NCCL's `commReclaim` to report `error 3`
(`ncclSystemError`).

**Fix:** Add `RTLD_DEEPBIND` to the `dlopen` flags. This makes the
dlopen'd NCCL library resolve its own symbols internally first,
bypassing our interposition layer for internal calls.
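
As a rough illustration of this fix, the sketch below opens a library with `RTLD_DEEPBIND` set. The helper name `openDeepBound` and the companion flags `RTLD_NOW | RTLD_LOCAL` are assumptions made for the example, not the actual code in `src/ext/nccl/nccl.cc`. `RTLD_DEEPBIND` itself is a glibc extension, so this only builds on glibc-based Linux.

```cpp
#include <dlfcn.h>  // dlopen, dlerror; RTLD_DEEPBIND is a glibc extension

#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the code in src/ext/nccl/nccl.cc).
// Opens a shared library so that it resolves symbols against its own
// definitions first, ahead of same-named symbols already present in the
// process (for example, stubs injected through LD_PRELOAD).
inline void* openDeepBound(const std::string& path) {
  assert(!path.empty());
  // RTLD_NOW and RTLD_LOCAL are assumed here; only RTLD_DEEPBIND is
  // named in the description above.
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (handle == nullptr) {
    const char* err = dlerror();
    throw std::runtime_error("dlopen failed for " + path + ": " +
                             (err != nullptr ? err : "unknown error"));
  }
  return handle;
}
```

Calling `openDeepBound` with the path from `MSCCLPP_NCCL_LIB_PATH` would be one way to use it; the returned handle is then passed to `dlsym` as usual.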

### 2. Missing `ncclCommFinalize` forwarding

**Root cause:** `CommFinalize` was not in the `mscclppNcclOps_t` struct
and was never loaded via `dlsym`. As a result, `ncclCommFinalize` never
forwarded to the real NCCL finalize, which is required before
`ncclCommDestroy` in NCCL 2.29+.

**Fix:** Add `CommFinalize` to the ops struct and load it via `dlsym`.
Forward the call in `ncclCommFinalize`.
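
The pattern can be sketched roughly as follows. `FallbackOps`, `FakeComm`, `FakeResult`, `loadSymbol`, `loadOps`, and `forwardCommFinalize` are all hypothetical stand-ins invented for this example; the real struct is `mscclppNcclOps_t` and the real types come from the NCCL headers. Only the symbol names `ncclCommFinalize` and `ncclCommDestroy` are taken from the description above.

```cpp
#include <dlfcn.h>  // dlsym

#include <cassert>

// Hypothetical stand-ins for ncclComm_t and ncclResult_t.
using FakeComm = void*;
using FakeResult = int;
constexpr FakeResult kSketchSuccess = 0;
constexpr FakeResult kSketchError = 1;

// Table of entry points resolved from the fallback library.
struct FallbackOps {
  FakeResult (*CommFinalize)(FakeComm) = nullptr;
  FakeResult (*CommDestroy)(FakeComm) = nullptr;
};

// Resolve one symbol and store it with the matching function-pointer type.
template <typename Fn>
bool loadSymbol(void* handle, const char* name, Fn& out) {
  assert(name != nullptr);
  out = reinterpret_cast<Fn>(dlsym(handle, name));
  return out != nullptr;
}

// Fill the table; returns false as soon as a symbol is missing.
inline bool loadOps(void* handle, FallbackOps& ops) {
  return loadSymbol(handle, "ncclCommFinalize", ops.CommFinalize) &&
         loadSymbol(handle, "ncclCommDestroy", ops.CommDestroy);
}

// Forward a finalize request to the fallback library if it was loaded.
inline FakeResult forwardCommFinalize(const FallbackOps& ops, FakeComm comm) {
  if (ops.CommFinalize == nullptr) return kSketchError;
  return ops.CommFinalize(comm);
}
```

If an entry is never added to the table, the wrapper has nothing to forward to, which is the situation this fix addresses.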

### 3. CI: Use latest NCCL release tag

The CI pipeline was cloning the NCCL default branch (which may contain
unreleased/unstable code). Updated to fetch the latest release tag via
GitHub API and clone that specific tag.

## Testing

Verified with the exact CI command:
```bash
mpirun -np 8 --bind-to numa --allow-run-as-root \
  -x LD_PRELOAD=libmscclpp_nccl.so \
  -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE \
  -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce" \
  -x MSCCLPP_NCCL_LIB_PATH=/root/nccl/build/lib/libnccl.so \
  all_reduce_perf -b 1K -e 1G -f 2 -d half -G 1 -w 10 -n 100
```

- **Before:** `commReclaim: error 3` warnings on all 8 ranks at exit
- **After:** Clean exit, no warnings, correct results

## Files Changed

- `src/ext/nccl/nccl.cc` — Fix comm destroy lifecycle (RTLD_DEEPBIND,
CommFinalize forwarding, destroy order)
- `.azure-pipelines/templates/nccl-test.yaml` — Use latest NCCL release
tag in CI
2026-03-06 16:33:35 -08:00
Changho Hwang
00583da21b separate pipeline for codecov 2026-03-06 21:31:04 +00:00
Changho Hwang
60ff32c014 updates 2026-03-06 19:40:34 +00:00
Changho Hwang
bbb9c10a1e Update Docker image 2026-03-06 19:15:04 +00:00
Changho Hwang
f4b8574a1c Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-03-03 15:49:01 -08:00
Xingbo Wu
69565a2f32 Do threadInit/cudaSetDevice before other cuda calls (#757)
I recently encountered an unexpected memory usage issue.
After starting the proxy service on a CUDA device X > 0, I noticed an
unexpected thread entry appear on both GPU X and GPU 0, where GPU 0's
share is about 500MB. Note that when the device is 0, there is no
extra memory usage.
The image clearly shows that when 8 ranks each use one GPU and start
proxies, GPU 0 sees 7 extra threads, each consuming 500MB of extra
memory.
<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153"
src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4"
/>


After tracking down when it happens, I identified the root cause in
Proxy thread initialization.

    // never capture in a proxy thread
    auto mode = cudaStreamCaptureModeRelaxed;
    MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

    pimpl_->threadInit();

The call to cudaThreadExchangeStreamCaptureMode() actually triggers some
resource allocation on the "current device", which is still 0 for the
starting thread.
The later threadInit() call comes too late to set the correct GPU number.

The fix is simple: call threadInit() before the first CUDA call:

    pimpl_->threadInit();
    // never capture in a proxy thread
    auto mode = cudaStreamCaptureModeRelaxed;
    MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

This guarantees that the current device is properly set before calling
any resource-allocating CUDA functions.

This is the memory usage after the fix. The extra memory usage is gone.

<img width="1242" height="459" alt="Image (1)"
src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9"
/>

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-03-02 15:53:59 -08:00
Caio Rocha
4bc1999001 Adding Support to Setting Message Size Range in Native Algorithm API (#758) 2026-02-27 17:50:43 -08:00
Binyang Li
ab49386839 Add doc for perf tunning (#756) 2026-02-27 10:59:36 -08:00
Changho Hwang
8c3a4362cd update CI 2026-02-26 19:37:06 -08:00
Changho Hwang
eb99a266e6 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-26 19:25:58 -08:00
Binyang Li
25435acf5d Add new algos for GB200 (#747)
- Add new algos (allreduce_rsag, allreduce_rsag_pipeline and
allreduce_rsag_zero_copy) for GB200.
- Add IB stubs for non-IB environments
- Provide an example of algorithm tuning with different nblocks/nthreads

Perf for allreduce_rsag
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
     1048576        262144     float     sum      -1    25.16   41.67   62.51       0    23.73   44.18   66.27       0
     2097152        524288     float     sum      -1    26.06   80.47  120.71       0    25.31   82.86  124.29       0
     4194304       1048576     float     sum      -1    31.09  134.93  202.39       0    30.75  136.39  204.58       0
     8388608       2097152     float     sum      -1    45.52  184.29  276.43       0    45.13  185.87  278.80       0
    16777216       4194304     float     sum      -1    75.73  221.53  332.30       0    75.51  222.18  333.27       0
    33554432       8388608     float     sum      -1   137.25  244.48  366.72       0   137.22  244.54  366.81       0
    67108864      16777216     float     sum      -1   271.34  247.32  370.99       0   270.86  247.76  371.65       0
   134217728      33554432     float     sum      -1   534.25  251.22  376.84       0   534.43  251.14  376.71       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 264.454 
#
# Collective test concluded: all_reduce_perf
```

Perf for allreduce_rsag_pipeline
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
     1048576        262144     float     sum      -1    61.57   17.03   25.55       0    61.51   17.05   25.57       0
     2097152        524288     float     sum      -1    61.31   34.20   51.31       0    61.23   34.25   51.38       0
     4194304       1048576     float     sum      -1    61.62   68.06  102.10       0    61.84   67.83  101.74       0
     8388608       2097152     float     sum      -1    61.97  135.37  203.06       0    61.89  135.53  203.30       0
    16777216       4194304     float     sum      -1    63.15  265.65  398.48       0    62.89  266.76  400.15       0
    33554432       8388608     float     sum      -1   100.63  333.46  500.19       0    99.76  336.34  504.51       0
    67108864      16777216     float     sum      -1   180.04  372.75  559.13       0   179.75  373.34  560.01       0
   134217728      33554432     float     sum      -1   339.60  395.23  592.84       0   338.16  396.91  595.36       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 304.665 
#
# Collective test concluded: all_reduce_perf
```

Perf for allreduce_rsag_zero_copy
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
     1048576        262144     float     sum      -1    14.99   69.93  104.90       0    14.44   72.61  108.92       0
     2097152        524288     float     sum      -1    16.19  129.56  194.33       0    15.85  132.32  198.48       0
     4194304       1048576     float     sum      -1    21.19  197.98  296.97       0    20.64  203.20  304.81       0
     8388608       2097152     float     sum      -1    31.04  270.27  405.41       0    30.68  273.44  410.16       0
    16777216       4194304     float     sum      -1    50.34  333.26  499.89       0    50.15  334.51  501.77       0
    33554432       8388608     float     sum      -1    89.58  374.56  561.84       0    88.65  378.48  567.73       0
    67108864      16777216     float     sum      -1   165.69  405.03  607.54       0   163.64  410.10  615.16       0
   134217728      33554432     float     sum      -1   323.19  415.28  622.93       0   318.01  422.05  633.07       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 414.619 
#
# Collective test concluded: all_reduce_perf
```
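
For reading the tables above, the two bandwidth columns can be reproduced from the size and time columns. The sketch below follows the usual nccl-tests convention for all-reduce (`busbw = algbw * 2(n-1)/n`). The rank count of 4 is not stated in this description; it is inferred from the busbw/algbw ratio of 1.5 in every row. The function names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Algorithm bandwidth in GB/s (1 GB = 1e9 bytes), from the message size in
// bytes and the time per iteration in microseconds.
inline double algBwGBps(std::size_t bytes, double timeUs) {
  assert(timeUs > 0.0);
  return static_cast<double>(bytes) / timeUs / 1e3;
}

// Bus bandwidth for all-reduce: algorithm bandwidth scaled by 2(n-1)/n,
// where n is the number of ranks.
inline double allReduceBusBwGBps(double algBw, int nRanks) {
  assert(nRanks > 0);
  return algBw * 2.0 * (nRanks - 1) / nRanks;
}
```

For the first allreduce_rsag row (1048576 bytes in 25.16 us), `algBwGBps` gives roughly 41.7 GB/s and `allReduceBusBwGBps(..., 4)` gives roughly 62.5 GB/s, in line with the reported 41.67 and 62.51.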

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2026-02-24 16:43:23 -08:00
Binyang Li
184dcbf9d7 Add CI pipeline for no-IB environment testing (#755)
## Summary

Add CI pipeline support for testing in environments without InfiniBand
(IB) hardware.

## Changes

### IB stubs for no-IB builds (`src/core/ib.cc`)
- Added stub implementations for `IbMr` and `IbQp` classes in the `#else
// !defined(USE_IBVERBS)` block so the library links successfully when
built with `-DMSCCLPP_USE_IB=OFF`.

### Environment variable to disable IB tests
(`MSCCLPP_DISABLE_IB_TESTS`)
- Added `disableIbTests` field to the `Env` class
(`include/mscclpp/env.hpp`, `src/core/env.cpp`), reading from
`MSCCLPP_DISABLE_IB_TESTS` env var.
- Exposed as `disable_ib_tests` in Python bindings
(`python/csrc/env_py.cpp`).
- Updated `python/test/test_mscclpp.py` to skip IB-dependent tests
(`create_group_and_connection` with IB transport, `test_h2h_semaphores`,
`test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is
true.
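
A minimal sketch of how such a boolean flag could be read from the environment is shown below. This is not the actual `Env` implementation in `src/core/env.cpp`; the helper name `readBoolEnv` and the accepted spellings (`1` and case-insensitive `true`) are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns defaultValue when the variable is unset,
// true for "1" or "true" (any letter case), and false for anything else.
inline bool readBoolEnv(const char* name, bool defaultValue = false) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr) return defaultValue;
  std::string value(raw);
  std::transform(value.begin(), value.end(), value.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return value == "1" || value == "true";
}
```

A test harness could then call `readBoolEnv("MSCCLPP_DISABLE_IB_TESTS")` once at startup and skip IB-dependent cases when it returns true.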

### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`)
The no-IB environment pipeline runs two phases:

1. **No-IB build phase**: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run
unit tests, multi-process unit tests, and pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`).
2. **IB build phase**: Rebuild with IB enabled (default), stop the
existing container, redeploy, and run pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build
works correctly in a non-IB environment when IB tests are skipped.

Also increased the job timeout from 40 to 60 minutes to accommodate the
two-phase pipeline.
2026-02-24 15:55:59 -08:00
Changho Hwang
11e27e2978 Update coverage report commands to handle errors and adjust paths 2026-02-23 18:33:11 -08:00
Changho Hwang
d88ee8de9c Refine coverage report to include only mscclpp source and include directories 2026-02-23 18:27:14 -08:00
Changho Hwang
2f27d7d7fe Update coverage report to exclude additional directories in lcov command 2026-02-23 18:25:10 -08:00
Changho Hwang
2adf4a48e2 use variable group 2026-02-23 16:49:39 -08:00
Changho Hwang
2f02d383c4 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-23 16:43:35 -08:00
Caio Rocha
7738603d63 Adjusting Communicator in Python API (#752) 2026-02-23 16:33:52 -08:00
Changho Hwang
edda25df6b Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-23 14:48:04 -08:00
Changho Hwang
d0c709ea82 Fix Codecov token usage in coverage upload step 2026-02-23 14:45:29 -08:00
Caio Rocha
b5256032fe Disabling Nanobind Memory Leak Warnings in Release Builds (#745)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-23 11:55:17 -08:00
Changho Hwang
6c2bc8f4b3 coverage fix 2026-02-23 11:32:50 -08:00
Changho Hwang
04ebd9ba6e fix coverage file path 2026-02-23 10:39:39 -08:00
Changho Hwang
c4afbe12d9 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-23 10:25:22 -08:00
mahdiehghazim
2a6f1c1192 Mahdieh/switchchannel test clean (#751)
This PR adds example code for switch channel testing. It validates the
switch channel in single-node and multi-node environments. We still need
to add a description of the algorithms and an explanation of the code
under doc.

Example outputs:

rank0:

    ./bidir_switch_channel 10.0.5.233:45571 0 0
    Rank 0 (GPU 0): Preparing for tests ...
    Rank 0 (GPU 0): bytes 4096, elapsed 0.0062328 ms/iter, BW 0.657169 GB/s
    Rank 0 (GPU 0): bytes 4.1943e+06, elapsed 0.0164577 ms/iter, BW 254.854 GB/s
    Rank 0 (GPU 0): bytes 1.34218e+08, elapsed 0.33628 ms/iter, BW 399.125 GB/s
    Rank 0: Succeed!

rank1:

    ./bidir_switch_channel 10.0.5.233:45571 1 0
    Rank 1 (GPU 0): Preparing for tests ...
    Rank 1: Succeed!
2026-02-20 22:46:32 -05:00
Binyang Li
3962574bcb Address installation issue in some env (#750)
This pull request changes how the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and the test
configuration files.
This addresses an installation issue seen in some environments.
2026-02-20 16:11:16 -08:00
Caio Rocha
e2acf7f1c8 Removing MPI Dependency (#743) 2026-02-20 16:04:12 -08:00
Changho Hwang
41695bab94 Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-20 14:04:27 -08:00
Changho Hwang
b9609f83a0 add coverage flags 2026-02-20 14:03:54 -08:00
Changho Hwang
caeec7590a updates 2026-02-20 13:43:32 -08:00
Binyang Li
39865c218b address flagBuffer ownership issue (#749)
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.

Make sure the buffer is not deallocated when ownership is transferred
from C++ to Python.
2026-02-20 13:42:29 -08:00
Changho Hwang
dcdd3febd1 update UT CI 2026-02-20 13:35:32 -08:00
Changho Hwang
b64536f28e Merge branch 'main' into copilot/remove-gtest-use-custom-framework 2026-02-18 20:35:34 -08:00
Changho Hwang
2b4adcc4ad fix lint 2026-02-18 20:33:57 -08:00
Changho Hwang
b693d1b3fc lint issue 2026-02-18 20:31:25 -08:00
Changho Hwang
4d9aceac6f badge 2026-02-18 20:25:50 -08:00
Changho Hwang
bed85b56cb codecov upload 2026-02-18 20:23:42 -08:00
Changho Hwang
e40c72bd2b license text update 2026-02-18 20:12:32 -08:00
Changho Hwang
4afbf780ed minor 2026-02-18 19:54:37 -08:00
Changho Hwang
d2efc2fd3b coverage update 2026-02-18 19:48:29 -08:00
Changho Hwang
b6ce0f2ede simplify 2026-02-18 19:16:21 -08:00
Changho Hwang
30b9891180 simplifying 2026-02-18 18:35:33 -08:00
Binyang Li
4701ae3a95 Update dtype name (#748)
- Change FP8_E4M3/FP8_E5M2 to FLOAT8_E4M3/FLOAT8_E5M2
- Add torch.uint8 to DataType.uint8 mapping
2026-02-18 10:35:44 -08:00