
150 Commits

Author SHA1 Message Date
Caio Rocha
b6d0ca13ca Adding CI Test to DSL Executor (#782) 2026-04-13 13:55:45 -07:00
Changho Hwang
d63f9403c0 IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753)
Major enhancements to the IB signal forwarding mechanisms
(`host-no-atomic` mode), primarily adding support for GDRCopy and MLX5
Direct Verbs, and refactoring the signal forwarding path for IB
HostNoAtomic mode. The changes fix memory consistency issues and reduce
signaling latency.
- GDRCopy and MLX5 Direct Verbs MR integration
- Signal forwarding path redesign
- Semaphore and connection API updates
- Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates
2026-04-09 09:24:30 +00:00
Binyang Li
fa95e82e18 Fix CI/CD pipeline issues (#773)
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 08:41:51 -07:00
Binyang Li
be9126ca1b Fix run-remote.sh to support multi-command scripts (#770)
## Summary
- Fix `run-remote.sh` to correctly execute multi-command scripts (e.g.,
multiple `mpirun` calls)
- The old approach piped decoded script through `base64 -d | bash`,
which feeds the script via bash's **stdin**. When `mpirun` (or its child
processes) runs, it can consume the remaining stdin, causing bash to
never see subsequent commands — only the first command would execute.
- The fix decodes the script to a **temp file** and runs `bash -euxo
pipefail "$TMP"` instead, so bash reads commands from the file and
`mpirun` consuming stdin has no effect.
- Applied to both the docker path (pssh + docker exec) and the
non-docker path (pssh only).
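
The stdin hazard described above can be reproduced without `mpirun`. The following is a minimal Python sketch, not the actual `run-remote.sh` logic: `cat > /dev/null` stands in for a stdin-consuming `mpirun`, and the helper names `run_via_stdin` and `run_via_tempfile` are made up for illustration.

```python
import os
import subprocess
import tempfile

# Two-command script; `cat > /dev/null` is a stand-in for a process that
# drains stdin the way mpirun (or its children) may do.
SCRIPT = "echo first\ncat > /dev/null\necho second\n"


def run_via_stdin(script: str) -> str:
    # Old approach: bash reads its commands from stdin, which child
    # processes inherit and can consume.
    result = subprocess.run(["bash"], input=script, capture_output=True, text=True)
    return result.stdout


def run_via_tempfile(script: str) -> str:
    # Fixed approach: write the script to a temp file and let bash read
    # commands from the file, so stdin is no longer the command source.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        path = f.name
    try:
        result = subprocess.run(
            ["bash", "-euo", "pipefail", path],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
        )
        return result.stdout
    finally:
        os.unlink(path)
```

With the stdin variant, `cat` swallows the `echo second` line before bash can read it, so only the first command runs; the temp-file variant runs both.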


🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-01 16:25:19 -07:00
Changho Hwang
d2f7056cf4 Add unit testing framework readme (#766) 2026-04-01 05:30:35 +00:00
Binyang Li
4f3638b60d Use PTX red for D2D semaphore signal (#768)
## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency
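
As a rough illustration of why the two schemes are interchangeable from the waiter's point of view, here is a toy Python model. It is only a sketch: the class names are hypothetical, plain integers stand in for the atomics, and nothing here reflects the real device code or memory ordering.

```python
class RemoteCounter:
    """Stands in for the peer's inbound semaphore word."""

    def __init__(self):
        self.value = 0


class StoreSignaler:
    """Old scheme: bump a local outbound counter, then store it remotely."""

    def __init__(self, remote: RemoteCounter):
        self.remote = remote
        self.outbound = 0

    def signal(self):
        self.outbound += 1                 # local fetch-add
        self.remote.value = self.outbound  # remote store of the new count


class RedSignaler:
    """New scheme: one remote add with no local counter and no return value."""

    def __init__(self, remote: RemoteCounter):
        self.remote = remote

    def signal(self):
        self.remote.value += 1             # models the fire-and-forget add
```

After the same number of `signal()` calls, both variants leave the same value in the remote counter; the second simply needs one operation per signal instead of two.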

### Performance (H100, 2 ranks, signal+wait round-trip)

```
SemaphorePerfTest.SignalPingPong:
  Store-based (old): 2.595 us/iter
  Red-based   (new): 2.345 us/iter
  Speedup:           1.11x
```

## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter
"SemaphorePerfTest"` — 1.11x speedup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 15:34:43 -07:00
Copilot
93f6eeaa6b Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744)
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- `--exclude-perf-tests` flag and substring-based negative filtering
- `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to
Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancies & make more readable
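
Substring-based filtering with exclusions can be sketched as follows. This is an assumption-laden illustration in Python, not the code in `test/framework.*`: the function `select_tests` and the test names other than `SemaphorePerfTest.SignalPingPong` are hypothetical, and the real command-line syntax is not shown here.

```python
def select_tests(names, include="", exclude=()):
    """Keep names that contain `include` and none of the `exclude` substrings."""
    return [
        name
        for name in names
        if include in name and not any(pattern in name for pattern in exclude)
    ]


# Hypothetical test list for illustration.
ALL_TESTS = [
    "SemaphorePerfTest.SignalPingPong",
    "SemaphoreTest.Basic",
    "FifoTest.PushPop",
]
```

For example, `select_tests(ALL_TESTS, include="Semaphore", exclude=("Perf",))` keeps only `SemaphoreTest.Basic`.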

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2026-03-24 23:34:38 -04:00
Binyang Li
bf946ea51e Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759)
## Summary

This PR addresses a multicast resource leak, fixes `cuMemMap` offset
handling for multicast handles, renames NVLS allreduce algorithm classes
for clarity, and adds a new unit test for `SwitchChannel`.

### Bug Fixes

#### 1. Fix multicast allocation handle leak in `createMulticast()`
(`gpu_ipc_mem.cc`)

`GpuIpcMemHandle::createMulticast()` called
`cuMulticastCreate(&allocHandle, ...)` but never released the local
`allocHandle` after exporting it to shareable handles (POSIX FD /
Fabric). This caused a reference count leak — the multicast object was
never freed even after all mappings and imported handles were released.

Per the [CUDA Driver API docs for
`cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html):
> *"The memory allocation will be freed when all outstanding mappings to
the memory are unmapped and when all outstanding references to the
handle (including its shareable counterparts) are also released."*

The fix adds `cuMemRelease(allocHandle)` after export, matching the
existing pattern used for regular allocations in
`GpuIpcMemHandle::create()`.

**Impact:** Without this fix, repeated creation/destruction of NVLS
connections causes OOM after ~120 iterations when allocating 1GB
multicast buffers on H100.

#### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`)

`cuMemMap` requires `offset=0` for multicast handles. Previously, the
code attempted to map at a non-zero offset within the multicast object,
leading to errors when binding multiple buffers to the same
`NvlsConnection`. The fix maps the entire range `[0, mcOffset +
bufferSize)` and returns the pointer offset by `mcOffset`. This only
consumes extra virtual address space; no additional physical memory is
used.
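
The address arithmetic behind this fix can be sketched in a few lines of Python. This is only an illustration of the description above: `plan_multicast_map` is a hypothetical helper, the addresses are made up, and alignment to the allocation granularity is ignored.

```python
MiB = 1 << 20


def plan_multicast_map(va_base: int, mc_offset: int, buffer_size: int):
    """Return (map offset, map size, user pointer) for one bound buffer."""
    map_offset = 0                       # multicast handles must be mapped at offset 0
    map_size = mc_offset + buffer_size   # covers the range [0, mcOffset + bufferSize)
    user_ptr = va_base + mc_offset       # pointer handed back to the caller
    return map_offset, map_size, user_ptr
```

For a 1 MiB buffer bound at `mcOffset` = 2 MiB, the mapping spans 3 MiB of virtual address space and the returned pointer sits 2 MiB past the mapping base.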

### Refactoring

#### 3. Rename NVLS allreduce algorithm classes

Renamed for clarity:
- `AllreduceNvls` → `AllreduceNvlsZeroCopy`
- `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline`
- `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline`

Updated all references in builder, selector, docs, and examples.

#### 4. Move `nvlsConnections` setup to `initialize()`

Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this
member) to individual algorithm class members, initialized in their
`initialize()` methods.

### Tests

#### 5. Add `TwoChannelsSameConnection` test

New unit test that creates two `SwitchChannel` instances from the same
`NvlsConnection`, performs reduce operations on both, and verifies
correctness. This exercises the multi-bind path that triggered the
`cuMemMap` offset fix.

### Files Changed

- `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap
offset fix
- `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu`
(renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` —
nvlsConnections fix
- `src/ext/collectives/include/allreduce/*.hpp` — renamed headers
- `src/ext/collectives/algorithm_collection_builder.cc` — updated
references
- `src/ext/nccl/algorithm_selector.cc` — updated algorithm names
- `test/mp_unit/switch_channel_tests.cu` — new test
- `docs/guide/mscclpp-torch-integration.md` — updated names
- `examples/torch-integration/customized_comm_with_default_algo.py` —
updated names
2026-03-09 10:22:45 -07:00
Binyang Li
3962574bcb Address installation issue in some env (#750)
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
This addresses an installation issue seen in some environments.
2026-02-20 16:11:16 -08:00
Binyang Li
bd68319e3e Refactor algo selection logic and introduce symmetric_memory env (#741)
This PR refactors the algorithm selection logic in MSCCL++ and
introduces support for symmetric memory configuration through
environment variables.


1. Algorithm Selection Refactoring
A separate class now handles algorithm selection. This leaves room for
more complex selection logic based on message size, architecture, whether
CUDA graph is enabled, and the memory allocation method.

2. Symmetric Memory Support
Introduced the `symmetricMemory` parameter in algorithm context key
generation. Removed the `disableChannelCache` env because it is ambiguous.

3. Add new args for `build_default_algorithms`
Added `flag_buffer` and `flag_buffer_size` args for building the default
algorithms. Different algorithms can then share a unified flag buffer,
which avoids application hangs when switching algorithms for different
message sizes.

---------

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2026-02-12 19:06:18 -08:00
Changho Hwang
42be3660e0 Add a new IB stack impl that doesn't use RDMA atomics (#728)
* Added configurable InfiniBand (IB) signaling mode.
`EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`,
`HostNoAtomic`). `Default` is equivalent to `Host` unless specified
otherwise by the environment variable `MSCCLPP_IBV_MODE`. `Host`
corresponds to the previous implementation using RDMA atomics for
signaling, while `HostNoAtomic` uses write-with-immediate instead.
* Related updates in the Python bindings and API.
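
The mode resolution described above could look roughly like the following Python sketch. Several details are assumptions rather than facts from this commit: the accepted string values of `MSCCLPP_IBV_MODE` (`host`, `host-no-atomic`), the rule that an explicitly chosen mode is not overridden by the environment, and the helper name `resolve_mode`.

```python
import os
from enum import Enum


class Mode(Enum):
    DEFAULT = "default"
    HOST = "host"                      # RDMA atomics for signaling
    HOST_NO_ATOMIC = "host-no-atomic"  # write-with-immediate instead


def resolve_mode(mode: Mode, environ=None) -> Mode:
    """Resolve `Mode.DEFAULT` using MSCCLPP_IBV_MODE (assumed value strings)."""
    environ = os.environ if environ is None else environ
    if mode is not Mode.DEFAULT:
        return mode                    # assumption: explicit choice wins
    raw = environ.get("MSCCLPP_IBV_MODE", "").strip().lower()
    if not raw:
        return Mode.HOST               # Default behaves like Host
    resolved = Mode(raw)               # raises ValueError on unknown values
    return Mode.HOST if resolved is Mode.DEFAULT else resolved
```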
2026-02-10 01:07:53 +00:00
Binyang Li
c12822a7af create CI pipeline for rocm (#718)
Create CI pipeline for AMD GPU.
2026-02-09 16:55:16 -08:00
Qinghua Zhou
620378b4fb Fix cpplint error in main branch (#740)
Fix the legacy cpplint error in main branch.

---------

Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-05 09:25:12 -08:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize the current native and DSL algorithm implementations.
Provide a unified API for DSL and native algorithms, plus an interface
for tuning them.
Provide an interface for PyTorch integration with the native API and DSL.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Changho Hwang
2cf14ff723 Minor fixes (#715) 2026-01-05 11:09:48 +08:00
Binyang Li
eda74a7f29 Add handle cache for AMD platform (#698)
Introduce a handle cache for the AMD platform to avoid hitting the handle
limit when too many IPC handles are opened.

NVIDIA does not need this feature, since it counts handle references
internally and reuses the same handle if it is already open.
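
A reference-counted handle cache of this kind can be sketched as below. This is a generic Python illustration, not the implementation in this commit: `HandleCache`, `open_fn`, and `close_fn` are hypothetical names, with the two callbacks standing in for the platform calls that open and close an IPC handle.

```python
class HandleCache:
    """Open each distinct IPC handle once and share the mapping by refcount."""

    def __init__(self, open_fn, close_fn):
        self._open = open_fn
        self._close = close_fn
        self._entries = {}  # handle bytes -> [mapped pointer, refcount]

    def open(self, handle: bytes):
        entry = self._entries.get(handle)
        if entry is None:
            # First request for this handle: actually open it.
            entry = [self._open(handle), 0]
            self._entries[handle] = entry
        entry[1] += 1
        return entry[0]

    def close(self, handle: bytes):
        entry = self._entries[handle]
        entry[1] -= 1
        if entry[1] == 0:
            # Last user is gone: release the underlying mapping.
            self._close(entry[0])
            del self._entries[handle]
```

Repeated `open()` calls on the same handle return the same pointer and consume only one underlying handle, which is the property that keeps the process under the platform limit.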

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-21 18:39:12 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` fields in the IB endpoint config (and a
`deviceIndex` field for future use)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
2025-12-18 13:21:07 -08:00
Caio Rocha
060c35fec6 No IB Env CI Test (#687) 2025-11-19 11:13:03 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
eb202780f5 Support Synchronous Initialization for Proxy Service (#679) 2025-11-12 18:35:57 -08:00
Changho Hwang
ffafcaf6d6 IB stack enhancements & bug fixes (#673)
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
2025-11-07 14:26:17 -08:00
Changho Hwang
9994f53cea Fixes for no-IB systems (#667)
* Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB
on/off
* Fix `nvidia-peermem` check; it is not needed on systems that support DMABUF
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
2025-10-29 10:03:02 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Changho Hwang
a2f1279c60 Test peer accessibility after deployment (#661)
Test GPUs' peer accessibility before integration testing to distinguish
VM issues.
2025-10-24 11:09:36 -07:00
Binyang Li
610db6f023 Fix test script (#655)
Fixes #654: addresses the `correctness_test.py` crash.

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-21 19:57:17 +00:00
Changho Hwang
2f7d74b281 Fix lint.sh (#652)
Exit 1 upon any errors from clang-format or black
2025-10-20 17:23:01 -07:00
Binyang Li
b1a88d755e Pipeline fix (#645)
Co-authored-by: github-actions <github-actions@github.com>
2025-10-10 11:26:33 -07:00
Binyang Li
5ac427610d Address teardown issue (#638)
Ignore cuda/cu errors during teardown. Some pointers may be invalid at this point.
2025-09-25 12:12:40 -07:00
Binyang Li
bf8d424ae3 use unix socket to share fd (#634)
Use a Unix socket to share file descriptors with other processes; this is used for NVLS handle sharing.
Update the NCCL interface to support worldSize=1.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-25 11:40:54 -07:00
Abhinav Jangda
5d062b7038 Fix Illegal Memory Access in nvls_test for CUDA12.9 (#631)
Running the NVLS test, `test/nvls_test.cu`, with CUDA 12.9 leads to an
illegal memory access at line 151 of `test/nvls_test.cu` (as of commit
571fee16fb).
This PR addresses the error by moving `cudaMemset` after the memory mapping.
2025-09-10 09:46:18 -07:00
Binyang Li
ba4c4aaeb8 Integrate MSCCL++ with torch workload (#626)
Integrate MSCCL++ with torch.
Introduce the `NCCL audit shim library`; users can run the following
commands to launch a torch workload with it. Also avoid breaking the
build pipeline on CPU-only machines.
```bash
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
```
2025-09-09 13:28:32 -07:00
Changho Hwang
547a9ae65c Fixed cpp linter (#619) 2025-08-25 12:15:45 -07:00
Binyang Li
2b40fe37b3 add torch test (#612)
Simple torch test
2025-08-15 10:27:21 -07:00
Binyang Li
03c0ff2a91 Fix for multi-nodes test (#614)
Fix multi-node test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-14 20:44:43 -07:00
Binyang Li
bb76d27553 all2all implementation (#609)
Implement single-node all2all via the MSCCL++ C++ API.
Perf of kernel 3:
```
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.41   44.78   39.19      0
     2097152         65536                                     23.95   87.56   76.61      0
     4194304        131072                                     27.50  152.51  133.45      0
     8388608        262144                                     35.14  238.73  208.89      0
    16777216        524288                                     57.54  291.55  255.11      0
    33554432       1048576                                     109.7  305.81  267.59      0
    67108864       2097152                                     212.3  316.07  276.56      0
   134217728       4194304                                     410.9  326.64  285.81      0
   268435456       8388608                                     784.9  341.99  299.24      0
```

Kernel 2:
```
#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.42   44.77   39.17      0
     2097152         65536                                     24.96   84.02   73.52      0
     4194304        131072                                     28.53  147.03  128.65      0
     8388608        262144                                     36.75  228.28  199.75      0
    16777216        524288                                     58.01  289.20  253.05      0
    33554432       1048576                                     110.4  303.83  265.85      0
    67108864       2097152                                     212.4  315.99  276.49      0
   134217728       4194304                                     407.8  329.12  287.98      0
   268435456       8388608                                     797.4  336.64  294.56      0
```

NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        524288      half    none      -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576      half    none      -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152      half    none      -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304      half    none      -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608      half    none      -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216      half    none      -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432      half    none      -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
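
The `algbw` and `busbw` columns follow from `size` and `time`. The sketch below assumes the nccl-tests convention for all-to-all (bus bandwidth is algorithm bandwidth scaled by (n-1)/n) and 8 ranks; the rank count is not stated in this commit but matches the busbw/algbw ratio of 7/8 seen in the tables. The function name is made up for illustration.

```python
def alltoall_bandwidth(size_bytes: int, time_us: float, n_ranks: int):
    """Return (algbw, busbw) in GB/s for one all-to-all measurement."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9  # bytes per second, in GB/s
    busbw = algbw * (n_ranks - 1) / n_ranks      # all-to-all correction factor
    return algbw, busbw


# First row of the kernel 3 table: 1048576 B in 23.41 us.
algbw, busbw = alltoall_bandwidth(1048576, 23.41, 8)
```

Up to rounding of the reported time, this reproduces the first row of the kernel 3 table (about 44.8 GB/s algbw and 39.2 GB/s busbw).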
2025-08-14 11:30:40 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains the following changes:
Python side:
- Channel-based DSL implementation: decouple channels from chunks.
- Users create channels explicitly and only need `local_rank`,
`remote_rank`, and `channel_type`.
- Adjust the executor JSON file: add `remote_buffer` fields so that
different ops can use different combinations of channels and remote
buffers.
- Reimplement operation fusion and the data dependency check mechanism.
- Add new ops such as semaphore and pipeline.
- Clean up code and improve documentation.
C++ side:
- Support the new execution file JSON format.
- Support semaphore and pipeline operations.
- Code cleanup; support the non-zero-copy scenario.

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Switch to smart pointers for IB structures. Registered memory now owns
`ibMr`; `ibCtx` no longer holds the reference.
- Use smart pointers for `IbQp` and `IbMr`
- Update the memoryChannel API to keep `localRegisteredMemory`
- Close the fd when the registered memory is released

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Binyang Li
01e72f3aca Fix multinode test failure (#574)
Add a CPU-based connection to fix the multi-node test failure.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-23 10:33:44 -07:00
Binyang Li
5e991cf5c8 update readme & bump version (#550)
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-07-12 01:00:18 -07:00
Changho Hwang
199468bc47 Revise NVLS interface (#458)
* Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel`
* Minor interface improvements
2025-07-12 00:33:03 +00:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add FIFO test code that reproduces a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Changho Hwang
22e8db4885 Support connection between local endpoints (#561) 2025-07-02 13:02:44 -07:00
Changho Hwang
b4dde38db8 FIFO improvements (#557)
* Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default.
* Optimize `FifoDeviceHandle`.
* Do not use `cudaHostAllocWriteCombined` that increases latency.
* Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`.
* Fix proxy NUMA binding issues.
* Prevent graph capture inside proxy threads.
* Now `CudaIpcConnection` skips stream sync when unnecessary.
* Now any type of connection needs to hold a shared pointer to the
context for memory safety.
* A context should now always be managed by a shared pointer for memory
safety.
* Minor docs & interface improvements.
* Minor fix in `mscclpp-test` correctness test.
2025-06-24 09:50:28 -07:00
Changho Hwang
2796cfa5ba New FIFO test (#558)
Comprehensive FIFO testing
2025-06-23 15:42:44 -07:00
Changho Hwang
125d6f5809 Multi-stream CUDA IPC (#326)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
2025-06-04 10:31:04 -07:00
Changho Hwang
253a1ba1a9 Use a stream pool for gpuCalloc*() (#509)
Previously, `gpuCalloc*()` created a new stream for each allocation, which
cluttered the timeline in profiler traces. Now `GpuStreamPool` allows
reusing the temporary streams.
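
The pooling idea can be sketched generically in Python. This is not the `GpuStreamPool` implementation: `StreamPool` and `create_stream` are hypothetical names, and the callback stands in for whatever creates a real GPU stream.

```python
class StreamPool:
    """Hand out idle streams when available; create new ones only on demand."""

    def __init__(self, create_stream):
        self._create = create_stream
        self._idle = []

    def acquire(self):
        # Reuse an idle stream if there is one, otherwise create a new stream.
        return self._idle.pop() if self._idle else self._create()

    def release(self, stream):
        # Return the stream to the pool instead of destroying it.
        self._idle.append(stream)
```

A sequence of allocations that each acquire and release a stream then touches a single stream instead of creating one per allocation.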
2025-06-04 10:07:20 -07:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Binyang Li
a18e91cee4 Set Up a CI Pipeline for H100 (#526)
Set Up a CI Pipeline for H100
2025-05-15 14:50:23 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* When the same `tag` is used for receiving data from the same remote
rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior
of receiving data in the order of the method calls. Since the original
behavior is more intuitive, this PR restores it. Now when `get()` is
called on a future, the async function first calls `wait()` on the latest
previously returned future. Recursively, this calls `wait()` on all
previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
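
The chained-wait behavior from the first item can be modeled with a small Python sketch. It is a simplified stand-in, not the `Communicator` code: `DeferredRecv` and `Receiver` are hypothetical classes, and a `deque` plays the role of messages arriving from one remote rank under one tag.

```python
from collections import deque


class DeferredRecv:
    """A lazy future that receives only after every earlier future has."""

    def __init__(self, inbox, prev=None):
        self._inbox = inbox
        self._prev = prev
        self._done = False
        self._value = None

    def wait(self):
        if self._done:
            return
        if self._prev is not None:
            self._prev.wait()   # recursively completes all earlier futures
            self._prev = None
        self._value = self._inbox.popleft()
        self._done = True

    def get(self):
        self.wait()
        return self._value


class Receiver:
    def __init__(self, messages):
        self._inbox = deque(messages)
        self._last = None

    def recv(self):
        # Each new future remembers the latest previously returned one.
        future = DeferredRecv(self._inbox, self._last)
        self._last = future
        return future
```

Even if `get()` is called on the third future first, each future still receives the message matching the order of its `recv()` call.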
2025-05-13 13:43:35 -07:00