Commit Graph

179 Commits

Author SHA1 Message Date
Qinghua Zhou
935cc70534 fix: resolve illegal memory access and kernel correctness issues in alltoallv
1. Fix pinned buffer race condition (alltoallv_single.py):
   - The shared pinned CPU buffer was reused for 4 sequential non_blocking
     H2D copies, so the GPU's DMA read stale data after the CPU overwrote
     the buffer with the next field, corrupting sendCounts/recvCounts and
     causing the kernel to write to wrong addresses. Fixed by using 5
     dedicated pinned buffers — one per field (send_counts, send_displs,
     recv_counts, recv_displs, remote_recv_displs); see the sketch after
     item 4.

2. Remove C++ periodic reset (alltoallv_fullmesh.cu):
   - A hardcoded static counter reset destroyed MemoryChannels and
     semaphores every 1000 kernel calls while inter-GPU signaling was
     still in progress, causing semaphore epoch mismatch and illegal
     memory access.

3. Fix semaphore wait (alltoallv_kernel.hpp):
   - Make wait() unconditional after signal(). Skipping wait() when
     recvCounts==0 desynced the semaphore epoch counter — on subsequent
     calls, wait() returned immediately, before the peer finished writing.

4. Add memory fence (alltoallv_kernel.hpp):
   - Add __threadfence_system() after wait() outside the primary-block
     guard so ALL thread blocks execute it before kernel exit. Ensures
     NVLink remote writes from put() are globally visible to subsequent
     kernels on the receiving GPU.
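
Taken together, fixes 1, 3, and 4 can be summarized in a single hedged
sketch (fix 2 is a pure deletion). Everything below is illustrative —
PinnedMeta, Semaphore, and alltoallvEpilogue are invented names, and the
semaphore is a simplified stand-in for mscclpp's epoch-based device
semaphore:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <initializer_list>

constexpr int kMaxPeers = 8;  // illustrative bound

// Fix 1: one dedicated pinned buffer per metadata field, so a later
// cudaMemcpyAsync never overwrites bytes an in-flight H2D copy still reads.
struct PinnedMeta {
  size_t *sendCounts, *sendDispls, *recvCounts, *recvDispls, *remoteRecvDispls;
  void alloc() {
    for (size_t** p : {&sendCounts, &sendDispls, &recvCounts, &recvDispls,
                       &remoteRecvDispls})
      cudaMallocHost(reinterpret_cast<void**>(p), kMaxPeers * sizeof(size_t));
  }
};

// Simplified epoch-based semaphore: signal() bumps the peer's inbound
// counter over NVLink; wait() spins until our inbound reaches the epoch
// we expect.
struct Semaphore {
  volatile uint64_t* inbound;  // incremented remotely by the peer
  uint64_t* peerInbound;       // our remote-write target on the peer
  uint64_t* expected;          // persisted in device memory across launches

  __device__ void signal() {
    __threadfence_system();  // make payload writes visible before the flag
    atomicAdd(reinterpret_cast<unsigned long long*>(peerInbound), 1ULL);
  }
  // Fix 3: wait() must run on EVERY call, even when recvCounts == 0.
  // Skipping it leaves `expected` behind the peer's signal count, so a
  // later wait() returns before the peer has finished writing.
  __device__ void wait() {
    uint64_t e = ++(*expected);
    while (*inbound < e) {
    }
  }
};

__global__ void alltoallvEpilogue(Semaphore sem) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    sem.signal();
    sem.wait();  // unconditional — keeps epochs in lock-step with the peer
  }
  // Fix 4: outside the primary-block guard, so every thread block fences
  // before kernel exit, making the peer's NVLink writes visible to
  // subsequent kernels on this GPU.
  __threadfence_system();
}
```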
2026-04-20 17:18:05 +00:00
Qinghua Zhou
1d271f4cc7 Merge latest multinode branch 2026-04-08 23:03:12 +00:00
Qinghua Zhou
9be576578d Merge multinode branch 2026-03-25 02:51:24 +00:00
Qinghua Zhou
ec011f14ea Add detection of torch.baseline and debug info 2026-03-25 01:52:24 +00:00
Qinghua Zhou
7e1cb7b8cf Support cross-node CudaIPC 2026-03-21 10:41:32 +00:00
Qinghua Zhou
9ef1fb7cee Make the multinode test pass 2026-03-18 17:08:22 +00:00
Qinghua Zhou
bdb30b56a5 Broadcast UniqueId via TCP; Detect whether torch comparison is possible 2026-03-16 10:01:35 +00:00
Qinghua Zhou
f47e97659d Update the benchmark to improve the rank mapping, communicator creation, backend selection 2026-03-16 09:25:34 +00:00
Qinghua Zhou
b7b180df24 Exchange recv displacement arrays between all ranks via bootstrap allGather 2026-03-05 15:19:20 +00:00
Qinghua Zhou
d5743e2d6c Integrate with MoE training flow 2026-03-03 15:17:20 +00:00
Qinghua Zhou
d00713d3c2 Add more real moe workloads for alltoallv 2026-03-02 12:51:21 +00:00
Qinghua Zhou
ee843d445f Add test of real MoE workloads 2026-02-25 12:39:48 +00:00
Qinghua Zhou
ae59eab6a2 Add unified benchmarking function to test all_to_all_single of mscclpp and torch 2026-02-24 07:17:17 +00:00
Qinghua Zhou
715ecd91cf Add baseline test of torch.distributed.all_to_all_single 2026-02-24 06:51:10 +00:00
Qinghua Zhou
98be0def08 Use variable sizes in the performance test 2026-02-24 06:29:46 +00:00
Qinghua Zhou
6292b6ab33 Report unidirectional bandwidth 2026-02-24 06:02:33 +00:00
Qinghua Zhou
f803eff8b9 Use multiple thread blocks; Add peer-parallel kernels 2026-02-24 04:05:01 +00:00
Qinghua Zhou
21e3f1ebb3 Get correct remote receive displacements for peers 2026-02-23 14:22:30 +00:00
Qinghua Zhou
7ba83e20dd PyTorch-compatible all_to_all_single API using mscclpp kernels 2026-02-23 09:51:51 +00:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize the current native algorithm implementation and the DSL
algorithm implementation.
Provide a unified API for DSL and native algorithms, plus an interface to
tune the algorithm.
Provide an interface for PyTorch integration with the native API and DSL.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Binyang Li
78ce9fac8d Fix ci pipeline failure (#729) 2026-01-21 13:28:14 -05:00
Changho Hwang
105239fc6c Use GpuIpcMem for NVLS connections (#719)
* Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast
memory handling.
* Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API
handles this automatically).
* Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a
shared pointer with a custom deleter for unmapping, which prevents misuse
of raw pointers and reduces the state stored in the `GpuIpcMem` instance
(see the sketch below).
* Now for `RuntimeIpc` type handles, for consistency with other types,
`cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of
the ctor of `GpuIpcMem`.
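
As a hedged illustration of the custom-deleter pattern in the second
bullet — the function name and exact signature here are assumptions, not
the library's actual API:

```cuda
#include <cuda_runtime.h>
#include <memory>

// Map an IPC handle and tie the unmap to the returned pointer's lifetime,
// so callers can neither leak nor double-close the mapping.
std::shared_ptr<void> mapIpcHandle(const cudaIpcMemHandle_t& handle) {
  void* ptr = nullptr;
  cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess);
  return std::shared_ptr<void>(ptr, [](void* p) {
    if (p) cudaIpcCloseMemHandle(p);  // unmap when the last reference drops
  });
}
```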

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-01-15 13:16:04 +08:00
Changho Hwang
b8a1b0a134 Add CUDA 13.0 Docker images (#720)
* Updated the Dockerfiles and the build script to support CUDA 13.0
* Added a Python3 venv, which is required since Python 3.12
* Updated the default MLNX-OFED version to the LTS version
* Added a docker push instruction for the multi-arch manifest
2026-01-09 19:03:33 +08:00
Changho Hwang
fc221e234d Remove UB std:: declarations (#709)
Remove custom declarations inside `std::` whose behavior is undefined by
the standard.
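
For illustration, a minimal sketch (MyType is hypothetical): the standard
permits explicit specializations of certain standard templates for
program-defined types, while any other declaration inside `std::` is
undefined behavior — the kind this commit removes.

```cuda
#include <cstddef>
#include <functional>

struct MyType { int v; };

// Allowed: explicit specialization of a standard template for a
// program-defined type.
template <>
struct std::hash<MyType> {
  std::size_t operator()(const MyType& t) const noexcept {
    return static_cast<std::size_t>(t.v);
  }
};

// Undefined behavior — the kind of declaration this commit removes:
// namespace std { void customHelper(); }
```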
2026-01-05 11:11:46 +08:00
Binyang Li
ca6a4a3274 Replace __HIP_PLATFORM_AMD__ to use internal macro (#712)
Replace most checks for `__HIP_PLATFORM_AMD__` with `MSCCLPP_DEVICE_HIP`
for device code and `MSCCLPP_USE_ROCM` for host source files.
2026-01-04 04:47:58 -08:00
Binyang Li
eda74a7f29 Add handle cache for AMD platform (#698)
Introduce a handle cache for the AMD platform to avoid hitting the handle
limit when too many IPC handles are opened (a sketch follows below).

On NVIDIA this is unnecessary: the driver reference-counts handles
internally and reuses a handle that is already open.
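
A minimal sketch of such a cache, in CUDA syntax for illustration (the
actual change targets HIP/ROCm, and the class name and layout here are
assumptions):

```cuda
#include <cuda_runtime.h>
#include <map>
#include <string>

// Cache opened IPC handles by their raw bytes so each unique handle is
// opened at most once, keeping the process under the driver's handle limit.
class IpcHandleCache {
 public:
  void* open(const cudaIpcMemHandle_t& h) {
    std::string key(reinterpret_cast<const char*>(&h), sizeof(h));
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;  // reuse the existing mapping
    void* ptr = nullptr;
    cudaIpcOpenMemHandle(&ptr, h, cudaIpcMemLazyEnablePeerAccess);
    cache_.emplace(std::move(key), ptr);
    return ptr;
  }

 private:
  std::map<std::string, void*> cache_;
};
```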

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-21 18:39:12 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` fields in the IB endpoint config (and a
`deviceIndex` field for future use); see the sketch below
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
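
An illustrative sketch of the configuration shape (field names come from
the PR description; the struct layout and defaults are assumptions):

```cuda
// IB-specific endpoint options, roughly as this PR shapes them.
struct IbEndpointConfig {
  int port = 1;         // IB port number on the selected device
  int gidIndex = 0;     // GID table index (also settable via the new
                        // --ib_gid_index option of mp_unit_tests)
  int deviceIndex = 0;  // reserved for future use
};
```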
2025-12-18 13:21:07 -08:00
Changho Hwang
8b8593ba51 Fix Python bindings and tests (#690)
Minimal fix to make things work. We need a more careful look at
preventing nanobind's silent fallback when it fails to (properly)
construct a C++ STL object with mscclpp instances.
2025-11-21 12:53:12 -08:00
Caio Rocha
a19bca9738 Fix Minor Issue Proxy Python Interface (#685) 2025-11-17 09:03:00 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is to handle all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* The `Connection` class is now a wrapper around a `BaseConnection` class
that is equivalent to the previous `Connection` class (see the sketch
below)
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
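
A hedged sketch of the wrapper pattern (class layouts are assumptions;
only the ownership idea comes from the PR):

```cuda
#include <memory>

class BaseConnection { /* equivalent to the previous Connection */ };

// Connection is now a value type that hides the shared_ptr, so user code
// never manages the smart pointer directly.
class Connection {
 public:
  explicit Connection(std::shared_ptr<BaseConnection> impl)
      : impl_(std::move(impl)) {}

 private:
  std::shared_ptr<BaseConnection> impl_;  // lifetime handled internally
};

// connect() now returns Connection by value rather than
// std::shared_ptr<Connection>.
Connection connect(/* endpoint arguments */);
```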
2025-11-15 11:40:40 -08:00
Caio Rocha
7eb3ff701a Supporting New Packet Kernel Operation at Executor (#677)
This PR introduces three new operations to enhance flexibility and
performance at the executor.

One operation can be invoked directly via the DSL API; the other two are
created through fusion of existing operations, reducing overhead and
improving efficiency.

1. Port Channel Put Packet (Direct DSL API Call): Sends data in packet
format to the remote side, also in packet format, via the port channel.
Both source and destination buffers must be scratch buffers.

2. Reduce Copy Packet (Fusion):
Reduce Packet + Copy Packet = Reduce Copy Packet
Triggered when the destination buffer of Reduce Packet matches the
source buffer of Copy Packet.
Purpose: Combine reduction and copy into a single step for better
performance.

3. Reduce Copy Send Packet (Fusion):
Reduce Copy Packet + Put Packet = Reduce Copy Send Packet (when the dst
buffer of Reduce Copy Packet matches the src buffer of Put Packet)
Reduce Copy Packet + Read Put Packet = Reduce Copy Send Packet (when the
dst packet buffer of Reduce Copy Packet matches the src buffer of Read
Put Packet)
Purpose: Combine reduction, copy, and send operations into one optimized
pipeline.


Fusion Diagram
Reduce Packet + Copy Packet → Reduce Copy Packet
Reduce Copy Packet + Put Packet → Reduce Copy Send Packet
Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet
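
A hedged sketch of the matching rule the diagram encodes — the op
representation and field names below are invented for illustration, not
the executor's actual IR:

```cuda
enum class OpType {
  ReducePacket, CopyPacket, PutPacket, ReadPutPacket,
  ReduceCopyPacket, ReduceCopySendPacket
};

struct Op { OpType type; int srcBuf; int dstBuf; };

// Fuse two adjacent ops when the producer's destination buffer is the
// consumer's source buffer, following the rules in the diagram above.
bool tryFuse(const Op& a, const Op& b, Op& fused) {
  if (a.dstBuf != b.srcBuf) return false;
  if (a.type == OpType::ReducePacket && b.type == OpType::CopyPacket) {
    fused = {OpType::ReduceCopyPacket, a.srcBuf, b.dstBuf};
    return true;
  }
  if (a.type == OpType::ReduceCopyPacket &&
      (b.type == OpType::PutPacket || b.type == OpType::ReadPutPacket)) {
    fused = {OpType::ReduceCopySendPacket, a.srcBuf, b.dstBuf};
    return true;
  }
  return false;
}
```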

Beyond this, this PR adjusts the AllReduce 2 Node algorithm:

Message Size | Latency (µs)
1K           | 15.34
2K           | 15.88
4K           | 15.71
8K           | 16.01
16K          | 15.88
32K          | 16.21
64K          | 16.90
128K         | 18.24
256K         | 20.39
512K         | 25.26
1M           | 32.74
2M           | 53.64
2025-11-13 14:08:44 -08:00
Binyang Li
5acac93dbc Integrate MSCCL++ DSL to torch workload (#620)
Provides two integration paths for the MSCCL++ DSL.
1. Integrate with customized communication group
2. Integrate with NCCL API

Introduce new Python APIs to make it work:
```python
mscclpp.compile  # compile the DSL to a JSON-based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan)  # register the compiled plan with the ExecutionPlanRegistry
mscclpp.ExecutionPlanRegistry.set_selector(selector)  # the selector returns the best execution plan based on collective, message size, world size, etc.
```
Fix #556

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-29 15:39:00 -07:00
Changho Hwang
09219c1f6a Fix #651 (#662)
* Python cannot distinguish `Communicator::connect(const Endpoint&,
...)` from `Communicator::connect(const EndpointConfig&, ...)`.
Temporarily removed the former one.
* A few other fixes in Python bindings.
2025-10-24 14:25:51 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Changho Hwang
200cdf946e Update EndpointConfig interfaces (#651)
* Separate IB-specific options into a nested struct
* Enable `connect()` by an `Endpoint`, not only by `EndpointConfig`
* Other minor changes
2025-10-22 10:39:39 -07:00
Binyang Li
b1a88d755e Pipeline fix (#645)
Co-authored-by: github-actions <github-actions@github.com>
2025-10-10 11:26:33 -07:00
Binyang Li
ddca185add Address corner case when generating version file (#641)
Address corner case for version file generation

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
2025-10-07 14:32:33 -07:00
Binyang Li
3d94383696 Add MSCCLPP_GIT_COMMIT macro (#640)
- Add MSCCLPP_GIT_COMMIT macro
- Update docs
2025-10-06 15:57:28 -07:00
Caio Rocha
b76f3ebf39 Add 2 Node AllReduce DSL Algorithm (#636)
This PR creates two allreduce algorithms designed for a 2-node
environment. These algorithms are in-place and non-zero copy.
2025-10-01 17:00:17 -07:00
Qinghua Zhou
16a96ea77b Support detailed version tracking that captures git repository information (#639)
#### Version Format

The package version includes the git commit hash directly in the version
string for development builds:
- **Release version**: `0.7.0`
- **Development version**: `0.7.0.dev36+g6e2360d69` (includes short
commit hash)
- **Development with uncommitted changes**:
`0.7.0.dev36+g6e2360d69.dirty`

#### Checking Version Information

After installation, you can check the version information in several
ways:

**From Python:**
```python
import mscclpp

# Access individual attributes
print(f"Version: {mscclpp.__version__}")  # Full version with commit
# Version: 0.7.0.dev36+g6e2360d69

# Get as dictionary
print(mscclpp.version())
# {'version': '0.7.0.dev46+gb0d27c58f', 'base_version': '0.7.0', 'git_commit': 'b0d27c58f'}
```

#### Version Information Details

The version tracking captures:
- **Package Version** (`mscclpp.__version__`): Full version string
including git commit (e.g., `0.7.0.dev36+g6e2360d69`)

This information is embedded during the package build process and
remains accessible even after distribution, making it easier to debug
issues and ensure reproducibility.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-09-30 09:00:33 -07:00
Caio Rocha
c3473b1794 Thread Block Group DSL (#621)
Supporting the creation of a group of thread blocks to perform an
operation.
2025-09-03 14:58:40 -07:00
Caio Rocha
4d9bb9f015 Adding Channel Id Field DSL Port Channel Operations (#615) 2025-08-15 16:10:52 -07:00
Caio Rocha
9261b1d278 AlltoAll Test Support (#606)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-08-15 16:00:41 -07:00
Binyang Li
699cc45eed Fix ut (#613)
Fix pytest
2025-08-14 17:15:28 -07:00
Changho Hwang
2eadbaf86f python doc auto generation (#605)
Add Python API references
2025-08-11 10:34:29 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
This PR contains the following changes:
Python side:
- Channel-based DSL implementation: decouple channels from chunks.
- Users create channels explicitly, needing only local_rank, remote_rank,
and channel_type
- Adjust the executor JSON file, adding remote_buffer fields; different
ops can use different combinations of channels and remote buffers.
- Reimplement the operation fusion and data dependency check mechanisms
- Add new ops such as semaphore and pipeline
- Clean up code and enhance documentation
C++ side:
- Support the new execution-plan JSON format
- Support semaphore and pipeline operations
- Code cleanup; support non-zero-copy scenarios

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Changho Hwang
542a10f69e Merge ChannelTrigger with ProxyTrigger (#601) 2025-08-08 19:07:50 +00:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Change to use smart pointers for the IB structures. Registered memory now
owns the ibMr; ibCtx no longer holds the reference (see the sketch below).
- Use smart pointers for IbQp and IbMr
- Update the memoryChannel API, keeping localRegisteredMemory
- Close the fd when registeredMemory is released
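
A rough sketch of the ownership change (type shapes are assumptions; only
the direction of ownership comes from the PR):

```cuda
#include <cstddef>
#include <memory>

struct IbMr { /* wraps ibv_mr; deregisters in its destructor */ };

struct IbCtx {
  // Creates MRs but no longer retains references to them.
  std::shared_ptr<IbMr> registerMemory(void* buf, std::size_t size);
};

struct RegisteredMemory {
  std::shared_ptr<IbMr> mr;  // owns the MR; dereg happens on release
};
```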

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Binyang Li
658411ccc4 Update pytest and Python API to fix unit-test failure (#598)
Update pytest and the Python API to fix a unit-test failure.
2025-08-05 15:17:33 -07:00
Changho Hwang
c580e4c503 Support CudaIpc connection within a single process (#593)
* Allow CudaIpc connection between GPUs in a single process
* Added an example of connection in a single process
* Minor interface updates

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-08-02 12:59:36 +08:00