Commit Graph

855 Commits

Author SHA1 Message Date
Changho Hwang
8b8593ba51 Fix Python bindings and tests (#690)
Minimal fix to make things work. We need a more careful look at
preventing silent fallback of nanobind when it fails to (properly)
construct a C++ STL object with mscclpp instances.
2025-11-21 12:53:12 -08:00
Caio Rocha
060c35fec6 No IB Env CI Test (#687) 2025-11-19 11:13:03 -08:00
Caio Rocha
bbdeafb3ca Fix Error in Non IB Env at Executor (#686)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-11-17 16:35:57 -08:00
Qinghua Zhou
b9428341a2 Revise the mscclpp datatype (#671)
Use mscclpp::DataType to replace the following types in API interface:
int dtype; 
ncclDataType_t dtype;

Add data type conversion:
Convert ncclDataType_t to mscclpp::DataType
2025-11-17 12:58:47 -08:00
Caio Rocha
a19bca9738 Fix Minor Issue Proxy Python Interface (#685) 2025-11-17 09:03:00 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
7eb3ff701a Supporting New Packet Kernel Operation at Executor (#677)
This PR introduces three new operations to enhance flexibility and
performance at executor.

One operation can be invoked directly via the DSL API and two operations
are created through fusion of existing operations, reducing overhead and
improving efficiency.

1. Port Channel Put Packet (Direct DSL API Call): Sends data from pkt
format to the remote side in pkt format via the port channel. Both
source and destination buffers must be scratch.

2. Reduce Copy Packet (Fusion):
Reduce Packet+Copy Packet=Reduce Copy Packet
Triggered when the destination buffer of Reduce Packet matches the
source buffer of Copy Packet.
Purpose: Combine reduction and copy into a single step for better
performance.

3. Reduce Copy Send Packet (Fusion):
Reduce Copy Packet+Put Packet=Reduce Copy Send Packet (when dst buffer
of Reduce Copy Packet matches src buffer of Put Packet)
Reduce Copy Packet+Read Put Packet=Reduce Copy Send Packet (when dst pkt
buffer of Reduce Copy Packet matches src buffer of Read Put Packet)
Purpose: Combine reduction, copy, and send operations into one optimized
pipeline.


Fusion Diagram
Reduce Packet + Copy Packet → Reduce Copy Packet
Reduce Copy Packet + Put Packet → Reduce Copy Send Packet
Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet
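
The fusion rules above can be sketched as a small peephole pass over an operation list. This is only an illustration: the `Op` class, the lowercase operation names, and the single `src`/`dst` buffer per operation are assumptions made for the example, not the executor's actual data structures.

```python
from dataclasses import dataclass

# Hypothetical operation names that mirror the commit message,
# not the executor's real identifiers.
FUSION_RULES = {
    ("reduce_packet", "copy_packet"): "reduce_copy_packet",
    ("reduce_copy_packet", "put_packet"): "reduce_copy_send_packet",
    ("reduce_copy_packet", "read_put_packet"): "reduce_copy_send_packet",
}


@dataclass(frozen=True)
class Op:
    """Simplified operation: one source buffer and one destination buffer."""
    name: str
    src: str
    dst: str


def fuse(ops):
    """Merge adjacent ops when a rule exists and the buffers line up."""
    fused = []
    for op in ops:
        if fused:
            prev = fused[-1]
            fused_name = FUSION_RULES.get((prev.name, op.name))
            # Fuse only when the previous destination feeds the next source.
            if fused_name is not None and prev.dst == op.src:
                fused[-1] = Op(fused_name, prev.src, op.dst)
                continue
        fused.append(op)
    return fused


pipeline = [
    Op("reduce_packet", "scratch0", "scratch1"),
    Op("copy_packet", "scratch1", "output"),
    Op("put_packet", "output", "remote"),
]
result = fuse(pipeline)
print([op.name for op in result])
```

In this sketch the three operations collapse into one fused operation because each destination buffer matches the next source buffer; if the buffers did not match, the operations would be left as they are.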

Beyond this, this PR adjusts the AllReduce 2 Node algorithm:

| Message Size | Latency (µs) |
|--------------|--------------|
| 1K           | 15.34        |
| 2K           | 15.88        |
| 4K           | 15.71        |
| 8K           | 16.01        |
| 16K          | 15.88        |
| 32K          | 16.21        |
| 64K          | 16.90        |
| 128K         | 18.24        |
| 256K         | 20.39        |
| 512K         | 25.26        |
| 1M           | 32.74        |
| 2M           | 53.64        |
2025-11-13 14:08:44 -08:00
Caio Rocha
eb202780f5 Support Synchronous Initialization for Proxy Service (#679) 2025-11-12 18:35:57 -08:00
Changho Hwang
ffafcaf6d6 IB stack enhancements & bug fixes (#673)
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
2025-11-07 14:26:17 -08:00
Binyang Li
9eb958183c upgrade codeql to v3 (#676) 2025-11-06 16:58:19 -08:00
Changho Hwang
960a8ddebd Add a new logger (#668)
* Add `logger.hpp` that will gradually replace `debug.h`
* Minor fixes in `ib.cc`

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2025-11-04 10:32:46 -08:00
Binyang Li
5acac93dbc Integrate MSCCL++ DSL to torch workload (#620)
Provides two ways to integrate the MSCCL++ DSL:
1. Integrate with a customized communication group
2. Integrate with the NCCL API

Introduce new Python APIs to make it work:
```python
mscclpp.compile # compile the DSL to a JSON-based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan with ExecutionPlanRegistry
mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, which returns the best execution plan based on collective, message size, world size, etc.
```
Fix #556
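
To illustrate what a selector might do, here is a minimal, self-contained sketch. It does not use the MSCCL++ API: the `Plan` class, its fields, and the `select_plan` signature are all assumptions for the example, since the commit message does not show the callback signature that `set_selector` expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a compiled execution plan."""
    name: str
    collective: str
    min_bytes: int
    max_bytes: int
    world_size: int


def select_plan(plans, collective, message_bytes, world_size):
    """Return the matching plan with the narrowest size range, or None."""
    candidates = [
        p for p in plans
        if p.collective == collective
        and p.world_size == world_size
        and p.min_bytes <= message_bytes <= p.max_bytes
    ]
    if not candidates:
        return None
    # Prefer the plan tuned for the tightest message-size window.
    return min(candidates, key=lambda p: p.max_bytes - p.min_bytes)


plans = [
    Plan("allreduce_small", "allreduce", 0, 32 * 1024, 8),
    Plan("allreduce_any", "allreduce", 0, 1 << 30, 8),
]
chosen = select_plan(plans, "allreduce", 4096, 8)
print(chosen.name if chosen else "no plan")
```

Returning `None` when nothing matches leaves room for the caller to fall back to another implementation.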

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-29 15:39:00 -07:00
Changho Hwang
9994f53cea Fixes for no-IB systems (#667)
* Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB
on/off
* Fix `nvidia-peermem` check; no need for DMABUF supported systems
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
2025-10-29 10:03:02 -07:00
Caio Rocha
2b987cf8e8 Resolve IBVerbs Loading Issues (#648)
Some systems do not include `libibverbs.so` when installing ibverbs;
instead, they only provide `libibverbs.so.1`. This PR updates the CMake
file to search for this library and modifies the wrapper to load it.
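
The fallback idea can be sketched in Python with `ctypes`. This is not the wrapper from the PR (which is C++); the helper name `load_first_available` and the candidate order are assumptions, and the versioned name is assumed to be `libibverbs.so.1`.

```python
import ctypes


def load_first_available(candidates, loader=ctypes.CDLL):
    """Try each library name in order and return (name, handle) for the first that loads."""
    errors = []
    for name in candidates:
        try:
            return name, loader(name)
        except OSError as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{name}: {exc}")
    raise OSError("could not load any of: " + "; ".join(errors))
```

A caller would pass the unversioned name first and the versioned one second, for example `load_first_available(["libibverbs.so", "libibverbs.so.1"])`, and handle `OSError` when neither is installed.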

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-28 18:14:53 -07:00
Qinghua Zhou
a38c2ee784 FP8 support for Allreduce (#646)
Add FP8 support for Allreduce on both NVIDIA and AMD platforms.
Add new data types: fp8_e4m3 and fp8_e5m2

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-10-27 14:51:48 -07:00
Changho Hwang
fc0aaaf1b4 Auto-detect CUDA arch in CMake GPU check (#666)
Compute capability 60 support is dropped from CUDA 13

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-10-27 11:25:24 -07:00
Binyang Li
2b05908635 Add token pool for cuCreate API (#628)
Create a tokenPool to allocate tokens. This feature is used to support
inter-node NVL and tries to reduce the footprint caused by cuCreate

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-27 11:19:21 -07:00
Changho Hwang
09219c1f6a Fix #651 (#662)
* Python cannot distinguish `Communicator::connect(const Endpoint&,
...)` from `Communicator::connect(const EndpointConfig&, ...)`.
Temporarily removed the former one.
* A few other fixes in Python bindings.
2025-10-24 14:25:51 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Changho Hwang
a2f1279c60 Test peer accessibility after deployment (#661)
Test GPUs' peer accessibility before integration testing to distinguish
VM issues.
2025-10-24 11:09:36 -07:00
Changho Hwang
4d4f087c11 Add exclude paths under pipeline triggers (#664) 2025-10-23 18:27:36 -07:00
Caio Rocha
d7b99e9c9d Improving DSL documentation (#650) 2025-10-23 17:50:33 -07:00
Changho Hwang
f7d1fb4492 Exclude irrelevant files from workflow triggers (#663) 2025-10-23 15:52:19 -07:00
Changho Hwang
58996b5c51 Fix docs version (#659)
Fetch full history of the repo for accurate version info
2025-10-23 11:14:27 -07:00
Changho Hwang
a48421872e Fix docs (#656)
* Fix Python doc generation
* Remove `ChannelTrigger` and fix `ProxyTrigger`
* Fixed package versions for consistency
2025-10-23 00:34:53 +00:00
Binyang Li
cbf448b012 New allreduce algo for small message size (#647)
New algo for message size < 32KB, command: `mpirun --allow-run-as-root
-tag-output -np 8 -x
LD_PRELOAD=/root/mscclpp/build/apps/nccl/libmscclpp_nccl.so -x
MSCCLPP_DISABLE_CHANNEL_CACHE=1 ./build/all_reduce_perf -b 1K -e 32K -f
2 -c 1 -G 1 -n 100 -d half`
Tested on H100
Perf:
```
[1,0]<stdout>:#                                                              out-of-place                       in-place          
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:        1024           512      half     sum      -1     4.65    0.22    0.39      0     4.60    0.22    0.39      0
[1,0]<stdout>:        2048          1024      half     sum      -1     4.96    0.41    0.72      0     4.93    0.42    0.73      0
[1,0]<stdout>:        4096          2048      half     sum      -1     5.12    0.80    1.40      0     5.12    0.80    1.40      0
[1,0]<stdout>:        8192          4096      half     sum      -1     5.11    1.60    2.81      0     5.08    1.61    2.82      0
[1,0]<stdout>:       16384          8192      half     sum      -1     5.47    3.00    5.24      0     5.44    3.01    5.27      0
[1,0]<stdout>:       32768         16384      half     sum      -1     6.24    5.25    9.19      0     6.28    5.22    9.14      0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 3.29145
```

Old:
```
[1,0]<stdout>:#                                                              out-of-place                       in-place          
[1,0]<stdout>:#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
[1,0]<stdout>:#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
[1,0]<stdout>:        1024           512      half     sum      -1     5.02    0.20    0.36      0     5.12    0.20    0.35      0
[1,0]<stdout>:        2048          1024      half     sum      -1     5.28    0.39    0.68      0     5.29    0.39    0.68      0
[1,0]<stdout>:        4096          2048      half     sum      -1     5.45    0.75    1.32      0     5.46    0.75    1.31      0
[1,0]<stdout>:        8192          4096      half     sum      -1     5.50    1.49    2.61      0     5.51    1.49    2.60      0
[1,0]<stdout>:       16384          8192      half     sum      -1     5.79    2.83    4.95      0     5.80    2.82    4.94      0
[1,0]<stdout>:       32768         16384      half     sum      -1     7.36    4.45    7.79      0     7.36    4.46    7.80      0
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 2.94887 
```
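
As a quick way to compare the two runs, the following sketch computes the relative latency reduction from the out-of-place `time` columns of the tables above. The numbers are copied from those tables; the helper name is arbitrary.

```python
# Out-of-place times in microseconds, copied from the "Perf" and "Old" tables.
sizes = [1024, 2048, 4096, 8192, 16384, 32768]
new_us = [4.65, 4.96, 5.12, 5.11, 5.47, 6.24]
old_us = [5.02, 5.28, 5.45, 5.50, 5.79, 7.36]


def latency_reduction_pct(old, new):
    """Percentage by which the new latency is lower than the old one."""
    return (old - new) / old * 100.0


reductions = {
    size: latency_reduction_pct(old, new)
    for size, old, new in zip(sizes, old_us, new_us)
}
for size, pct in reductions.items():
    print(f"{size:>6} B: {pct:.1f}% lower latency")
```

The largest gain is at 32768 bytes, where the time drops from 7.36 µs to 6.24 µs.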
2025-10-22 17:28:53 -07:00
Changho Hwang
200cdf946e Update EndpointConfig interfaces (#651)
* Separate IB-specific options into a nested struct
* Enable `connect()` by an `Endpoint`, not only by `EndpointConfig`
* Other minor changes
2025-10-22 10:39:39 -07:00
Binyang Li
610db6f023 Fix test script (#655)
Fix: #654. Address correctness_test.py crash issue

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-21 19:57:17 +00:00
Changho Hwang
b8f61cb761 Update the port channel tutorial doc (#653) 2025-10-21 11:52:15 -07:00
Changho Hwang
2f7d74b281 Fix lint.sh (#652)
Exit 1 upon any errors from clang-format or black
2025-10-20 17:23:01 -07:00
Binyang Li
b1a88d755e Pipeline fix (#645)
Co-authored-by: github-actions <github-actions@github.com>
v0.8.0
2025-10-10 11:26:33 -07:00
Binyang Li
ddca185add Address corner case when generating version file (#641)
Address corner case for version file generation

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
2025-10-07 14:32:33 -07:00
Binyang Li
3d94383696 Add MSCCLPP_GIT_COMMIT micro (#640)
- Add MSCCLPP_GIT_COMMIT macro
- Update docs
2025-10-06 15:57:28 -07:00
Binyang Li
da84c6e4cc Reduce memory footprint for allreduce8 and allgather6 (#644)
Reduce memory footprint for allreduce8 and allgather6
Remove libLoadSucceed check for group start
2025-10-03 15:08:56 -07:00
Binyang Li
fe65c045c4 Make ncclReduce/ncclSend/ncclRecv work (#643)
Fall back to NCCL for ncclReduce/ncclSend/ncclRecv

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-02 17:05:00 -07:00
Caio Rocha
b76f3ebf39 Add 2 Node AllReduce DSL Algorithm (#636)
This PR creates two allreduce algorithms designed for a 2-node
environment. These algorithms are in-place and non-zero-copy.
2025-10-01 17:00:17 -07:00
Binyang Li
79e9e613a8 Fix Rocm build issue (#642)
Fix cross-compiling issue. Add `target_compile_options` to make sure the
compile option is correct
2025-10-01 14:20:45 -07:00
Qinghua Zhou
16a96ea77b Support detailed version tracking that captures git repository information (#639)
#### Version Format

The package version includes the git commit hash directly in the version
string for development builds:
- **Release version**: `0.7.0`
- **Development version**: `0.7.0.dev36+g6e2360d69` (includes short
commit hash)
- **Development with uncommitted changes**:
`0.7.0.dev36+g6e2360d69.dirty`
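
The three formats above can be told apart with a small parser. This is a sketch based only on the example strings shown here; the regular expression and the field names (`dev_number`, `dirty`) are assumptions, not part of the package.

```python
import re

# Matches "0.7.0", "0.7.0.dev36+g6e2360d69" and "0.7.0.dev36+g6e2360d69.dirty".
VERSION_RE = re.compile(
    r"^(?P<base>\d+\.\d+\.\d+)"
    r"(?:\.dev(?P<dev>\d+)\+g(?P<commit>[0-9a-f]+)(?P<dirty>\.dirty)?)?$"
)


def parse_version(version):
    """Split a version string into its base version and optional git details."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return {
        "base_version": match.group("base"),
        "dev_number": int(match.group("dev")) if match.group("dev") else None,
        "git_commit": match.group("commit"),
        "dirty": match.group("dirty") is not None,
    }


for example in ("0.7.0", "0.7.0.dev36+g6e2360d69", "0.7.0.dev36+g6e2360d69.dirty"):
    print(example, "->", parse_version(example))
```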

#### Checking Version Information

After installation, you can check the version information in several
ways:

**From Python:**
```python
import mscclpp

# Access individual attributes
print(f"Version: {mscclpp.__version__}")  # Full version with commit
# Example output:
# Version: 0.7.0.dev36+g6e2360d69

# Get as dictionary
mscclpp.version()
# Example return value:
# {'version': '0.7.0.dev46+gb0d27c58f', 'base_version': '0.7.0', 'git_commit': 'b0d27c58f'}
```

#### Version Information Details

The version tracking captures:
- **Package Version** (`mscclpp.__version__`): Full version string
including git commit (e.g., `0.7.0.dev36+g6e2360d69`)

This information is embedded during the package build process and
remains accessible even after distribution, making it easier to debug
issues and ensure reproducibility.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-09-30 09:00:33 -07:00
Binyang Li
70b8297c56 Revise NCCL API implementation (#617)
- Make the NCCL interface extensible. Users can register their own
algorithms with the NCCL API and provide a customized algorithm selection function.
- Fall back to NCCL/RCCL if the algorithm selection function selects no
algorithm
- MSCCLPP interfaces now work at any scale
2025-09-26 10:08:12 -07:00
Binyang Li
5ac427610d Address teardown issue (#638)
Ignore cuda/cu errors during teardown. Some pointers may be invalid at this point
2025-09-25 12:12:40 -07:00
Binyang Li
bf8d424ae3 use unix socket to share fd (#634)
Use a Unix socket to share file descriptors with other processes. This is used for NVLS handle sharing.
Update the NCCL interface to support worldSize=1
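
For readers unfamiliar with passing file descriptors over a Unix socket, here is a minimal Python sketch of the mechanism (Python 3.9+ on a Unix-like system). It is not the code from this PR; the helper names are made up, and a `socketpair` within one process stands in for two separate processes.

```python
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a connected Unix-domain socket."""
    # At least one byte of regular data must accompany the descriptor.
    socket.send_fds(sock, [b"fd"], [fd])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


# Demo: pass the write end of a pipe through a socket pair.
read_end, write_end = os.pipe()
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
send_fd(left, write_end)
received = recv_fd(right)
os.write(received, b"hello")
print(os.read(read_end, 5))
```

The received descriptor is a new number that refers to the same open file, which is why writing to it is visible on the original pipe.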

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-25 11:40:54 -07:00
Changho Hwang
43f160c8e6 Fix for safe process teardown (#633)
* `gpuFree*()` functions are usually called during process teardown, so
we let them ignore any resulting errors.
* `AvoidCudaGraphCaptureGuard` is constructed in `gpuFree*()` functions,
so it needs the same fix.
2025-09-10 20:28:04 -07:00
Binyang Li
d946c45ebd Adapt with torch 2.6 (#632) 2025-09-10 20:40:39 +00:00
Abhinav Jangda
5d062b7038 Fix Illegal Memory Access in nvls_test for CUDA12.9 (#631)
Running the NVLS test, `test/nvls_test.cu`, in CUDA 12.9 leads to an illegal
memory access at 571fee16fb/test/nvls_test.cu (L151).
This PR addresses this error by moving cudaMemset after memory mapping.
2025-09-10 09:46:18 -07:00
Changho Hwang
571fee16fb Add FifoDeviceHandle::poll() (#630) 2025-09-09 23:32:35 +00:00
Binyang Li
ba4c4aaeb8 Integrate MSCCL++ with torch workload (#626)
Integrate MSCCL++ with torch
Introduce the `NCCL audit shim library`; users can use the following commands to
launch a torch workload. Also avoid breaking the build pipeline on CPU machines
```bash
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
```
2025-09-09 13:28:32 -07:00
Binyang Li
4bbe16b351 Fix hang issue in logging submodule (#625)
Fix: #622, using std::recursive_mutex to allow acquiring `lock`
recursively in the same thread
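
The difference a recursive lock makes can be shown with Python's `threading.RLock`, which plays a role analogous to `std::recursive_mutex`. This is an illustration of the concept only, not the logging code from the fix; the function name and the timeout are assumptions chosen so the example cannot hang.

```python
import threading


def nested_acquire(lock, depth=2, timeout=0.2):
    """Acquire the same lock `depth` times from one thread; return True on success."""
    if not lock.acquire(timeout=timeout):
        # A non-recursive lock already held by this thread ends up here.
        return False
    try:
        if depth > 1:
            return nested_acquire(lock, depth - 1, timeout)
        return True
    finally:
        lock.release()


print("RLock:", nested_acquire(threading.RLock()))
print("Lock: ", nested_acquire(threading.Lock()))
```

With a plain lock and no timeout, the second acquisition would block forever, which matches the kind of hang described in the commit.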
2025-09-05 09:18:28 -07:00
Changho Hwang
c4d8781390 Fix memory exchange within a single process (#624) 2025-09-04 12:53:51 -07:00
Caio Rocha
c3473b1794 Thread Block Group DSL (#621)
Support creating a group of thread blocks to perform an
operation.
2025-09-03 14:58:40 -07:00
Changho Hwang
547a9ae65c Fixed cpp linter (#619) 2025-08-25 12:15:45 -07:00