Commit Graph

190 Commits

Author SHA1 Message Date
Qinghua Zhou
e87c66a85d ext/ep: apply clang-format and black to fix CI lint failures
Run `tools/lint.sh cpp` (clang-format 14) and `tools/lint.sh py`
(black) over the EP extension files added by this PR. No functional
changes; pure reformatting to satisfy the cpplint and pylint CI jobs.
2026-05-06 04:12:20 +00:00
Qinghua Zhou
23e8ce6dbe ext/ep: add pragma once to event.hpp and update validation docs
- Add #pragma once to src/ext/ep/event.hpp; including it in multiple TUs
  would otherwise redefine EventHandle.
- python/mscclpp/ext/ep/buffer.py: low-latency internode is now validated
  on 2x H100x8; remove the 'untested on multi-node H100' note.
- src/ext/ep/kernels/internode_ll.cu: replace the untested-on-multi-node
  WARNING with the current validated-on-2x-H100x8 status.

Addresses Copilot review comments on PR #796.
2026-05-06 03:24:34 +00:00
Qinghua Zhou
89f17dab5b Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-05 19:28:37 -07:00
Qinghua Zhou
f0a72263c8 ext/ep: unfilter LL sync + add LL multirank test (intra-node WIP)
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
  DeepEP's original filter was safe because its LL path used NVSHMEM; this
  port drives LL via PortChannel so the kernel indexes
  port_channel_handles[local_expert*num_ranks + dst_rank] for every
  dst_rank. All peers now get a real memory/connection/semaphore/port
  channel entry.
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
  functional round-trip, BF16 only). Works cross-node in DeepEP's
  1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
  intra-node 8-GPU LL currently hangs because every peer transfer routes
  through the CPU proxy over IB loopback between distinct HCAs on the same
  host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op which
  mis-aligns the 32-bit rdma_recv_count slots when used for same-node
  peers. Proper fix needs a mixed-transport LL variant (MemoryChannel for
  same-node, PortChannel for cross-node) or 64-bit counters.
2026-04-22 06:11:30 +00:00
Qinghua Zhou
9e96bf3b5d ep: document internode HT validation on 2x H100x8
Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).

- src/ext/ep/README.md: consolidate the previously duplicated README
  into a single document; mark intranode and internode HT dispatch and
  combine as validated in the status table; add a 'Running the tests'
  section with torchrun examples for both the intranode and the 2x8
  internode setups; record the dispatch->combine
  torch.cuda.synchronize() + dist.barrier() requirement under Known
  limitations; mark Phase 2 DONE and keep Phase 3 (LL) as structural
  port, untested.

- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
  Buffer constructor docstring to say internode HT is validated and
  clarify that LL mode is untested on multi-node hardware.

- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported'
  and 'low-latency paths still stubbed' comments. mscclpp_ep does not
  use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
  paths are a structural port that is present but untested, not stubbed.
  Note validation on 2x H100x8 in the internode section header.
2026-04-22 03:56:16 +00:00
Qinghua Zhou
a6af3a4454 ext/ep: fix multi-rank intranode dispatch+combine
Three issues blocked end-to-end intranode validation across multiple
ranks. This commit fixes them and adds a 2/4/8-rank functional test.

1. Combine receiver: OOB __shared__ read

   In the combine receiver warp, the wait loop evaluated
   `channel_tail_idx[recv_lane_id] <= expected_head` before the
   `expected_head >= 0` guard. `channel_tail_idx` is a shared array
   of size `kNumRanks`, but the loop runs on all 32 lanes of a warp,
   so lanes with `recv_lane_id >= kNumRanks` indexed out of bounds.
   compute-sanitizer reported "Invalid __shared__ read of size 4
   bytes" at combine<bf16,2,768>+0xdd0, surfaced asynchronously as
   cudaErrorIllegalAddress at the kernel launch site. Swap the
   operands so the rank-bounds check short-circuits the shared read.

2. Python bindings: UniqueId ABI

   `mscclpp::UniqueId` is a `std::array<uint8_t, N>` which pybind11
   auto-converts to a Python `list`, silently overriding any
   `py::class_<UniqueId>` wrapper. Expose `create_unique_id` /
   `connect` as lambdas that produce/consume `py::bytes` and memcpy
   into a local `UniqueId`. Also coerce `bytes`->`bytearray` at the
   Python call site for `sync()` whose signature expects
   `pybind11::bytearray`.

3. Python frontend: communicator required for NVL-only sync

   `Buffer::sync()` uses `communicator->connect(ipc_config, ...)` on
   the pure-NVLink path, so the communicator must be initialized
   even when `num_rdma_ranks == 1` and `low_latency_mode == False`.
   Always broadcast the unique id and call `runtime.connect()`
   before `sync()`.

Validation on a single H100x8 node via torchrun:
- 2 ranks: dispatch 195 tokens, combine diff=0
- 4 ranks: dispatch 371 tokens, combine diff=0
- 8 ranks: dispatch 456 tokens, combine diff=0

Test harness added at test/python/ext/ep/test_intranode_multirank.py.
2026-04-21 02:03:55 +00:00
Qinghua Zhou
453160cc06 src/ext/ep: port low-latency dispatch/combine kernels
Port DeepEP's pure-RDMA low-latency (LL) MoE kernels from
csrc/kernels/internode_ll.cu (branch chhwang/dev-atomic-add-cleanup)
into the MSCCL++ EP extension. NVSHMEM / IBGDA device primitives are
replaced with MSCCL++ PortChannelDeviceHandle operations:

  nvshmemx_barrier_all_block()            -> port-channel signal+wait ring
  nvshmemi_ibgda_put_nbi_warp(...)        -> lane-0 PortChannel.put(...)
  nvshmemi_ibgda_amo_nonfetch_add(...)    -> lane-0 PortChannel.atomicAdd(...)

The atomicAdd path relies on the MSCCL++ Connection::atomicAdd /
PortChannelDeviceHandle::atomicAdd API cherry-picked from branch
chhwang/new-atomic-add; the LL dispatch path uses a signed delta
(-num_tokens_sent - 1) which the new int64_t signature supports.

Changes:
* New file src/ext/ep/kernels/internode_ll.cu (~530 lines) with the
  three kernels clean_low_latency_buffer, dispatch<kUseFP8,...>,
  combine<...> plus their launchers. rdma_buffer_ptr is threaded
  through the launchers so the kernel can translate virtual addresses
  into registered-memory offsets expected by MSCCL++.
* kernels/api.cuh: replace the single stub signature with full LL
  launcher prototypes.
* buffer.cc: replace the four LL throw-stubs
  (clean_low_latency_buffer, low_latency_dispatch,
  low_latency_combine, get_next_low_latency_combine_buffer) with
  torch-Tensor implementations ported from DeepEP/csrc/deep_ep.cpp.
* Drop src/ext/ep/internode_stub.cc and its CMake entry.
* python/mscclpp/ext/ep/buffer.py: remove the low_latency_mode=True
  NotImplementedError guard; update docstring.
* test/python/ext/ep/test_ep_smoke.py: rename
  test_low_latency_rejected -> test_low_latency_buffer_construct
  to reflect that LL construction is now accepted.
* src/ext/ep/README.md: update status matrix, document the
  NVSHMEM -> MSCCL++ translation table, and list the known
  limitations.

This is a structural port: the kernels compile, link, and pass the
single-rank smoke tests, but end-to-end behaviour on multi-node H100
is not yet validated. Two known caveats:

  1. Performance will NOT match IBGDA because MSCCL++ port channels
     use a CPU proxy; this port is for functional parity, not latency.
  2. Buffer::sync() in LL mode only connects peers that share the
     same local GPU id (DeepEP convention), so the LL kernels assume
     a one-GPU-per-node topology (num_ranks == num_rdma_ranks).
     Multi-GPU-per-node LL layouts will need a follow-up in sync().

Tested:
  cmake --build build -j --target mscclpp_ep_cpp   # builds clean
  pytest test/python/ext/ep/test_ep_smoke.py        # 3 passed
2026-04-20 21:46:00 +00:00
Qinghua Zhou
88425a6771 Add Expert-Parallel (MoE dispatch/combine) extension under src/ext/ep
Port DeepEP's high-throughput MoE dispatch/combine kernels onto MSCCL++
as an optional build target `mscclpp_ep_cpp`, gated by -DMSCCLPP_BUILD_EXT_EP
(OFF by default). Sources are lifted from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and rebased onto upstream MSCCL++ APIs;
the NVSHMEM / IBGDA dependencies are replaced with `PortChannel` +
`MemoryChannel` + the new `Connection::atomicAdd` primitive.

Scope
-----
Intranode (NVLink-only):
  * `Buffer` ctor/dtor: cudaMalloc nvl workspace, export IPC handle,
    allocate FIFO + peer-pointer tables, start `ProxyService`.
  * `sync()`: import peer IPC handles, upload peer pointer table,
    build `MemoryDevice2DeviceSemaphore` + `MemoryChannel` per peer.
  * `get_dispatch_layout`, `intranode_dispatch`, `intranode_combine`
    ported verbatim (torch::Tensor ABI preserved).

Internode HT (NVLink + RDMA):
  * `sync()` RDMA branch: cudaMalloc RDMA buffer + `bootstrap->barrier()`
    (replacing NVSHMEM symmetric-heap allocation); register with
    `all_transport`, exchange via `sendMemory`/`recvMemory`, build 12 IB
    QPs/peer + 16 semaphores/peer + 16 port channels/peer.
  * Full `internode.cu` port (notify_dispatch / dispatch / cached_notify
    / combine / get_dispatch_layout). The 4 raw `ChannelTrigger` atomic
    sites are rewritten to call the new
    `PortChannelDeviceHandle::atomicAdd(offset, value)` API; the single
    `nvshmem_fence()` is replaced with `__threadfence_system()` (remote
    visibility guaranteed by the subsequent port-channel barrier).
  * `internode_dispatch` / `internode_combine` host code ported, with
    the torch tensor marshalling and CPU spin-wait on mapped counters.

Low-latency (pure RDMA):
  * Not ported. `low_latency_dispatch`, `low_latency_combine`,
    `clean_low_latency_buffer`, `get_next_low_latency_combine_buffer`
    throw `std::runtime_error`; the Python frontend refuses to
    construct a Buffer with `low_latency_mode=True`.

Python layer
------------
* New pybind11 + libtorch Python extension `mscclpp_ep_cpp` (separate
  from the nanobind `_mscclpp` because the EP ABI carries
  `torch::Tensor` / `at::cuda::CUDAStream`).
* `mscclpp.ext.ep.Buffer` mirrors `deep_ep.Buffer`; exchanges device
  IDs, IPC handles and the bootstrap UniqueId over the user's
  `torch.distributed` process group before calling `sync()`.
* `mscclpp.ext` auto-imports `ep` if the extension is built.

Build
-----
* `src/ext/ep/CMakeLists.txt`: finds Python + Torch; warns and skips if
  `CMAKE_PREFIX_PATH` doesn't point at `torch.utils.cmake_prefix_path`.
  Falls back to Torch's bundled pybind11 if a standalone pybind11 is not
  installed. Links `libtorch_python` explicitly (without it, `import
  mscclpp_ep_cpp` fails with `undefined symbol: THPDtypeType`).
* Top-level `CMakeLists.txt` exposes the `MSCCLPP_BUILD_EXT_EP` option
  (default OFF).

Tests
-----
* `test/python/ext/ep/test_ep_smoke.py`: skipped if the extension isn't
  built. Covers Config round-trip, low-latency size hint, and the LL
  construction guard. Multi-rank functional tests still to do on H100.

Notes
-----
* Builds against the preceding "atomic add" commit which adds
  `Connection::atomicAdd` and `PortChannelDeviceHandle::atomicAdd` to
  upstream MSCCL++.
* Intranode path verified end-to-end (build + import + smoke tests).
* Internode HT is code-complete but requires real IB hardware to
  validate; see `src/ext/ep/README.md` for the detailed port plan and
  remaining LL migration.
2026-04-20 20:15:23 +00:00
Binyang Li
eeea00b298 Support python wheel build (#787)
## Support Python wheel build

This PR modernizes the Python packaging for MSCCL++ by defining
dependencies and optional extras in `pyproject.toml`, enabling proper
wheel builds with `pip install ".[cuda12]"`.

### Changes

**`pyproject.toml`**
- Add `dependencies` (numpy, blake3, pybind11, sortedcontainers)
- Add `optional-dependencies` for platform-specific CuPy (`cuda11`,
`cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras
- Bump minimum Python version from 3.8 to 3.10

**`test/deploy/setup.sh`**
- Use `pip install ".[<platform>,benchmark,test]"` instead of separate
`pip install -r requirements_*.txt` + `pip install .` steps
- Add missing CUDA 13 case

**`docs/quickstart.md`**
- Update install instructions to use extras (e.g., `pip install
".[cuda12]"`)
- Document all available extras and clarify that `rocm6` builds CuPy
from source
- Update Python version references to 3.10

**`python/csrc/CMakeLists.txt`**, **`python/test/CMakeLists.txt`**
- Update `find_package(Python)` from 3.8 to 3.10

### Notes
- The `requirements_*.txt` files are kept for Docker base image builds
where only dependencies (not the project itself) should be installed.
- CuPy is intentionally not in base dependencies — users must specify a
platform extra to get the correct pre-built wheel (or source build for
ROCm).

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 21:24:45 -07:00
Binyang Li
5380a4ac6e Add MSCCLPP_IB_GID_INDEX env (#780)
Use MSCCLPP_IB_GID_INDEX to control ib gid index

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 09:59:42 -07:00
Caio Rocha
feda338595 Adjusting Torch Integration Example (#779)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-04-10 13:57:14 -07:00
Caio Rocha
a7273047e9 Fix TBG on DSL Get Operation (#778) 2026-04-08 17:02:07 -07:00
Caio Rocha
3e5c41c98a Adding Channel Type in ReduceSend Operation on DSL (#777)
The reduce send operation in DSL essentially combines the reduce and put
operations. The put operation carry the information about the channel
type, whereas previously, we were using the channel type from the reduce
operation.
2026-04-08 16:59:08 -07:00
Binyang Li
8896cd909a Add ROCm FP8 E4M3B15 support (#774)
## Summary

Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including
optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using
inline assembly.

Extends the allpair packet and fullmesh allreduce kernels to support
higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data,
improving numerical accuracy.

Adds Python tests to verify that higher-precision accumulation is at
least as accurate as native FP8 accumulation across all algorithm
variants.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 09:53:45 -07:00
Binyang Li
96a72bbd3e Support E4M3B15 datatype (#765)
## Summary

- **Add `fp8_e4m3b15` datatype**: A software-defined FP8 type with 4
exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375).
Implemented entirely in software with no HW dependency, using
Triton-style bit manipulation through fp16 as intermediate for efficient
conversion.
- **Add mixed-precision accumulation for allreduce**: All allreduce
algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and
others) now support a configurable `accumDtype` parameter, enabling FP8
inputs to be reduced in float16 or float32 for higher accuracy.
- **Propagate `accumDtype` through the full API**: The new parameter is
threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc`
→ dispatch → CUDA kernels, with `DataType::AUTO` as the default
(resolves to input dtype at runtime).
- **Add FP8 accumulation correctness tests**: New `test_fp8_accum.py`
validates that higher-precision accumulation produces results at least
as accurate as native FP8 accumulation across multiple algorithms and
sizes. Skipped on CUDA SM < 89 (pre-Hopper); runs on HIP/ROCm.
- **Add `test_fp8_accum.py` to CI**: Azure Pipeline `ut.yml` now runs
FP8 accumulation tests alongside existing pytests.
- **NCCL shim logging cleanup**: Migrated `printf`-style `WARN`/`INFO`
calls to streaming-style logging.

## Key files

| Area | Files |
|------|-------|
| New datatype + vector ops | `include/mscclpp/gpu_data_types.hpp` |
| Accumulation reduce helpers | `src/core/include/reduce_kernel.hpp` |
| Algorithm API (`accumDtype`) | `include/mscclpp/algorithm.hpp`,
`src/core/algorithm.cc` |
| Allreduce kernels | `src/ext/collectives/allreduce/*.cu` |
| Dispatch + common | `src/ext/collectives/include/allreduce/common.hpp`
|
| Python bindings | `python/csrc/algorithm.cpp`,
`python/mscclpp/_core/algorithm.py` |
| Tests | `python/test/test_fp8_accum.py` |
| CI | `.azure-pipelines/templates/ut.yml` |

## Test plan

- [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15
accumulation tests
- [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
- [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP
- [x] Existing `test_mscclpp.py` tests continue to pass
- [x] NCCL shim builds and runs correctly with new `accumDtype` defaults

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 13:37:02 -07:00
Binyang Li
4f3638b60d Use PTX red for D2D semaphore signal (#768)
## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency

### Performance (H100, 2 ranks, signal+wait round-trip)

```
SemaphorePerfTest.SignalPingPong:
  Store-based (old): 2.595 us/iter
  Red-based   (new): 2.345 us/iter
  Speedup:           1.11x
```

## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter
"SemaphorePerfTest"` — 1.11x speedup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 15:34:43 -07:00
Ekow Wellington
fd76507e9a Install default plans under MSCCLPP_CACHE_DIR/default (#769)
### Summary
Update the installer to place bundled default execution plans under
`<MSCCLPP_CACHE_DIR>/default`, which is where the runtime already looks
for bundled plans.

### Background
The C++ runtime treats `MSCCLPP_CACHE_DIR` as the cache *root* and loads
bundled default plans from `<cache root>/default`.
When `MSCCLPP_CACHE_DIR` was set, the installer instead wrote bundled
plans
directly into the cache root, causing the runtime to miss them.

This surfaced while running benchmarking tests with a non-default
`MSCCLPP_CACHE_DIR`, where the bundled plans were not being discovered.

### Change
This PR updates the installer to always install bundled default plans
into
`<MSCCLPP_CACHE_DIR>/default`, preserving the existing runtime contract.

### Scope
- Installer-only change
- No runtime behavior changes

### Validation
Manual inspection of the updated install path.
Successful build

---------

Co-authored-by: Ekow Wellington <t-ekoww@microsoft.com>
2026-03-31 14:27:33 -05:00
Caio Rocha
4bc1999001 Adding Support to Setting Message Size Range in Native Algorithm API (#758) 2026-02-27 17:50:43 -08:00
Binyang Li
184dcbf9d7 Add CI pipeline for no-IB environment testing (#755)
## Summary

Add CI pipeline support for testing in environments without InfiniBand
(IB) hardware.

## Changes

### IB stubs for no-IB builds (`src/core/ib.cc`)
- Added stub implementations for `IbMr` and `IbQp` classes in the `#else
// !defined(USE_IBVERBS)` block so the library links successfully when
built with `-DMSCCLPP_USE_IB=OFF`.

### Environment variable to disable IB tests
(`MSCCLPP_DISABLE_IB_TESTS`)
- Added `disableIbTests` field to the `Env` class
(`include/mscclpp/env.hpp`, `src/core/env.cpp`), reading from
`MSCCLPP_DISABLE_IB_TESTS` env var.
- Exposed as `disable_ib_tests` in Python bindings
(`python/csrc/env_py.cpp`).
- Updated `python/test/test_mscclpp.py` to skip IB-dependent tests
(`create_group_and_connection` with IB transport, `test_h2h_semaphores`,
`test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is
true.

### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`)
The no-IB environment pipeline runs two phases:

1. **No-IB build phase**: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run
unit tests, multi-process unit tests, and pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`).
2. **IB build phase**: Rebuild with IB enabled (default), stop the
existing container, redeploy, and run pytests (with
`MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build
works correctly in a non-IB environment when IB tests are skipped.

Also increased the job timeout from 40 to 60 minutes to accommodate the
two-phase pipeline.
2026-02-24 15:55:59 -08:00
Caio Rocha
7738603d63 Adjusting Communicator in Python API (#752) 2026-02-23 16:33:52 -08:00
Caio Rocha
b5256032fe Disabling Nanobind Memory Leak Warnings in Release Builds (#745)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-23 11:55:17 -08:00
Caio Rocha
e2acf7f1c8 Removing MPI Dependency (#743) 2026-02-20 16:04:12 -08:00
Binyang Li
39865c218b address flagBuffer ownership issue (#749)
This pull request updates the handling of the default flag buffer in the
C++ and Python bindings to ensure proper memory management when
interfacing with Python.

Make sure the buffer will not be deallocated when transfer ownership
from cpp to python
2026-02-20 13:42:29 -08:00
Binyang Li
4701ae3a95 Update dtype name (#748)
- Change FP8_E4M3/FP8_E5M2 to FLOAT8_E4M3/FLOAT8_E5M2
- Add torch.uint8 to DataType.uint8 mapping
2026-02-18 10:35:44 -08:00
Binyang Li
bd68319e3e Refactor algo selection logic and introduce symmetric_memory env (#741)
This PR refactors the algorithm selection logic in MSCCL++ and
introduces support for symmetric memory configuration through
environment variables.


1. Algorithm Selection Refactoring
Use separate class for algo selection. Could introduce more complex
logic for algo selection based on message size, arch, if cuda graph is
enabled and memory allocation method

2. Symmetric Memory Support
Introduced symmetricMemory parameter in algorithm context key
generation. Remove disableChannelCache env as is ambiguous

3. Add new args for build_default_algorithms 
Add flag_buffer, and flag_buffer_size args to build default algorithm.
Then we could use unified flag buffer for different algorithms, avoid
application hanging when switch algo for different message size.

---------

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2026-02-12 19:06:18 -08:00
Caio Rocha
dff3bc7bbb Support Fusion for ReadPutPacket Operation at DSL (#742)
Support is being added for fusing the ReadPutPacket operation on DSL,
which reduces the overhead caused by reading packet data multiple times
in the scratch buffer. Fusion will occur when two rppkt operations are
executed consecutively with the same src_buffer:

rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1]

Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-12 17:27:20 -08:00
Changho Hwang
42be3660e0 Add a new IB stack impl that doesn't use RDMA atomics (#728)
* Added configurable InfiniBand (IB) signaling mode.
`EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`,
`HostNoAtomic`). `Default` is equivalent to `Host` unless specified
different by envrionment `MSCCLPP_IBV_MODE`. `Host` corresponds to the
previous implementation using RDMA atomics for signaling, while
`HostNoAtomic` uses write-with-immediate instead.
* Regarding updates in Python bindings and API.
2026-02-10 01:07:53 +00:00
Binyang Li
c12822a7af create CI pipeline for rocm (#718)
Create CI pipeline for AMD GPU.
2026-02-09 16:55:16 -08:00
Qinghua Zhou
620378b4fb Fix cpplint error in main branch (#740)
Fix the legacy cpplint error in main branch.

---------

Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-05 09:25:12 -08:00
Binyang Li
e21513791a Address comments for PR #692 (#733)
Rename nanobind-exposed C++ types to Cpp*
Replace MSCCLPP_EXECUTION_PLAN_DIR / MSCCLPP_NATIVE_CACHE_DIR with
MSCCLPP_CACHE_DIR across C++ and Python.
2026-02-03 10:13:20 -08:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize current native algorithm implementation and DSL algorithm
implementation.
Provide unified API for DSL algo and native algo and provide interface
to tune the algo
Provide interface for pytorch integration with native API and DSL

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Binyang Li
78ce9fac8d Fix ci pipeline failure (#729) 2026-01-21 13:28:14 -05:00
Changho Hwang
105239fc6c Use GpuIpcMem for NVLS connections (#719)
* Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast
memory handling.
* Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API
handles this automatically).
* Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a
shared pointer with custom deleter for unmapping, which prevents misuse
of raw pointers and reduces states to be stored in the `GpuIpcMem`
instance.
* Now for `RuntimeIpc` type handles, for consistency with other types,
`cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of
the ctor of `GpuIpcMem`.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-01-15 13:16:04 +08:00
Changho Hwang
b8a1b0a134 Add CUDA 13.0 Docker images (#720)
* Updated Dockerfiles and the build script to support CUDA 13.0
* Added Python3 venv which is required since Python 3.12
* Updated the default MLNX-OFED version to the LTS version
* Added docker push instruction for multi-arch manifest
2026-01-09 19:03:33 +08:00
Changho Hwang
fc221e234d Remove UB std:: declarations (#709)
Remove custom delcarations inside `std::` of which behaviors are
undefined by the standard
2026-01-05 11:11:46 +08:00
Binyang Li
ca6a4a3274 Replace __HIP_PLATFORM_AMD__ to use internal macro (#712)
Replacing most of checks for `__HIP_PLATFORM_AMD__` with
`MSCCLPP_DEVICE_HIP` for device and `MSCCLPP_USE_ROCM` for host source
file.
2026-01-04 04:47:58 -08:00
Binyang Li
eda74a7f29 Add handle cache for AMD platform (#698)
Introduce handle cache for AMD platform.
Avoid reaching handle limitation if we open too much IPC handles

For nvidia, we don't need this feature since nvidia will count the
handle reference internally and reuse the same handle if already be
opened

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-21 18:39:12 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` field in the IB endpoint config (and
`deviceIndex` field for future usages)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
2025-12-18 13:21:07 -08:00
Changho Hwang
8b8593ba51 Fix Python bindings and tests (#690)
Minimal fix to make things work. We need a more careful look at
preventing silent fallback of nanobind when it fails to (properly)
construct a C++ STL object with mscclpp instances.
2025-11-21 12:53:12 -08:00
Caio Rocha
a19bca9738 Fix Minor Issue Proxy Python Interface (#685) 2025-11-17 09:03:00 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
7eb3ff701a Supporting New Packet Kernel Operation at Executor (#677)
This PR introduces three new operations to enhance flexibility and
performance at executor.

One operation can be invoked directly via the DSL API and two operations
are created through fusion of existing operations, reducing overhead and
improving efficiency.

1. Port Channel Put Packet (Direct DSL API Call): Sends data from pkt
format to the remote side in pkt format via the port channel. Both
source and destination buffers must be scratch.

2. Reduce Copy Packet (Fusion):
Reduce Packet+Copy Packet=Reduce Copy Packet
Triggered when the destination buffer of Reduce Packet matches the
source buffer of Copy Packet.
Purpose: Combine reduction and copy into a single step for better
performance.

3. Reduce Copy Send Packet (Fusion):
Reduce Copy Packet+Put Packet=Reduce Copy Send Packet (when dst buffer
of Reduce Copy Packet matches src buffer of Put Packet)
Reduce Copy Packet+Read Put Packet=Reduce Copy Send Packet (when dst pkt
buffer of Reduce Copy Packet matches src buffer of Read Put Packet)
Purpose: Combine reduction, copy, and send operations into one optimized
pipeline.


Fusion Diagram
Reduce Packet + Copy Packet → Reduce Copy Packet
Reduce Copy Packet + Put Packet → Reduce Copy Send Packet
Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet

Beyond this, this PR adjust the AllReduce 2 Node algorithm:

Message Size  |  Latency (µs)
        1K            |     15.34
        2K            |     15.88
        4K            |     15.71
        8K            |     16.01
        16K          |     15.88
        32K          |     16.21
        64K          |     16.90
        128K        |     18.24
        256K        |     20.39
        512K        |     25.26
        1M           |     32.74
        2M           |     53.64
2025-11-13 14:08:44 -08:00
Binyang Li
5acac93dbc Integrate MSCCL++ DSL to torch workload (#620)
Provides two integration ways for MSCCL++ DSL.
1. Integrate with customized communication group
2. Integrate with NCCL API

Introduce new Python APIs to make it work:
```python
mscclpp.compile # compile dsl to json based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery
mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size....
```
Fix #556

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-29 15:39:00 -07:00
Changho Hwang
09219c1f6a Fix #651 (#662)
* Python cannot distinguish `Communicator::connect(const Endpoint&,
...)` from `Communicator::connect(const EndpointConfig&, ...)`.
Temporarily removed the former one.
* A few other fixes in Python bindings.
2025-10-24 14:25:51 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Changho Hwang
200cdf946e Update EndpointConfig interfaces (#651)
* Separate IB-specific options into a nested struct
* Enable `connect()` by an `Endpoint`, not only by `EndpointConfig`
* Other minor changes
2025-10-22 10:39:39 -07:00
Binyang Li
b1a88d755e Pipeline fix (#645)
Co-authored-by: github-actions <github-actions@github.com>
2025-10-10 11:26:33 -07:00
Binyang Li
ddca185add Address corner case when generating version file (#641)
Address corner case for version file generation

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
2025-10-07 14:32:33 -07:00
Binyang Li
3d94383696 Add MSCCLPP_GIT_COMMIT micro (#640)
- Add MSCCLPP_GIT_COMMIT micro
- Update docs
2025-10-06 15:57:28 -07:00
Caio Rocha
b76f3ebf39 Add 2 Node AllReduce DSL Algorithm (#636)
This PR creates two allreduce algorithms designed for a 2-node
environment. These algorithms are in-place and non-zero copy.
2025-10-01 17:00:17 -07:00