58 Commits

Author SHA1 Message Date
Binyang Li
3962574bcb Address installation issue in some env (#750)
This pull request updates the way the `nlohmann/json` library is fetched
and upgrades it to a newer version in both the main build and test
configuration files.
Addressed installation issue in some env
2026-02-20 16:11:16 -08:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize current native algorithm implementation and DSL algorithm
implementation.
Provide unified API for DSL algo and native algo and provide interface
to tune the algo
Provide interface for pytorch integration with native API and DSL

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
eb202780f5 Support Synchronous Initialization for Proxy Service (#679) 2025-11-12 18:35:57 -08:00
Changho Hwang
547a9ae65c Fixed cpp linter (#619) 2025-08-25 12:15:45 -07:00
Binyang Li
03c0ff2a91 Fix for multi-nodes test (#614)
Fix multi-node test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-14 20:44:43 -07:00
Binyang Li
bb76d27553 all2all implementation (#609)
Implement single node all2all via MSCCL++ C++API
perf kernel 3:
```
       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.41   44.78   39.19      0
     2097152         65536                                     23.95   87.56   76.61      0
     4194304        131072                                     27.50  152.51  133.45      0
     8388608        262144                                     35.14  238.73  208.89      0
    16777216        524288                                     57.54  291.55  255.11      0
    33554432       1048576                                     109.7  305.81  267.59      0
    67108864       2097152                                     212.3  316.07  276.56      0
   134217728       4194304                                     410.9  326.64  285.81      0
   268435456       8388608                                     784.9  341.99  299.24      0
```

kernel 2
```

#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.42   44.77   39.17      0
     2097152         65536                                     24.96   84.02   73.52      0
     4194304        131072                                     28.53  147.03  128.65      0
     8388608        262144                                     36.75  228.28  199.75      0
    16777216        524288                                     58.01  289.20  253.05      0
    33554432       1048576                                     110.4  303.83  265.85      0
    67108864       2097152                                     212.4  315.99  276.49      0
   134217728       4194304                                     407.8  329.12  287.98      0
   268435456       8388608                                     797.4  336.64  294.56      0
```

NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        524288      half    none      -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576      half    none      -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152      half    none      -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304      half    none      -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608      half    none      -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216      half    none      -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432      half    none      -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
2025-08-14 11:30:40 -07:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Change to use smart pointer for IB structure. Registered memory will own
ibMr, ibCtx will not held the reference
- Use smart pointer for IbQp and IbMr
- Update memoryChannel API, keep localRegisteredMemory
- Close fd when registedMemory released

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Changho Hwang
b4dde38db8 FIFO improvements (#557)
* Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default.
* Optimize `FifoDeviceHandle`.
* Do not use `cudaHostAllocWriteCombined` that increases latency.
* Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`.
* Fix proxy NUMA binding issues.
* Prevent graph capture inside proxy threads.
* Now `CudaIpcConnection` skips stream sync when unnecessary.
* Now any type of connection needs to hold a shared pointer to the
context for memory safety.
* Now a context should be always managed by a shared pointer for memory
safety.
* Minor docs & interface improvements.
* Minor fix in `mscclpp-test` correctness test.
2025-06-24 09:50:28 -07:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
2025-05-13 13:43:35 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Changho Hwang
74130c7c5e Use IB transport flags only when an IB device exists (#355) 2024-09-19 07:13:11 +00:00
Binyang Li
bc465aefcd Add __launch_bounds__ for mscclpp-test (#273) 2024-03-25 15:55:37 -07:00
Binyang Li
4734d8718f Fix multi-node ci pipeline (#272)
Add `__launch_bounds__` to fix perf regression issue in CI pipeline
2024-03-12 09:39:00 -07:00
Changho Hwang
cdaf3aea3d New packet format & optimizations (#256)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-20 20:01:37 -08:00
Changho Hwang
70e28b3c76 Do not check value of __HIP_PLATFORM_AMD__ (#240)
According to the
[document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary),
`__HIP_PLATFORM_AMD__` is effective only by definition.
2023-12-25 13:51:18 +08:00
Changho Hwang
5ff8bc5ef2 Fix & improve perf for ROCm (#232)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-12-18 11:30:08 +08:00
Changho Hwang
544ff0c21d ROCm support (#213)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-11-24 16:41:56 +08:00
Changho Hwang
060fda12e6 mscclpp-test in Python (#204)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Changho Hwang
3521fb0280 Clear minor warnings (#214)
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Binyang2014
6f43282c1d Fix allreduce bug (#197)
Fix allreduce correctness issue
2023-10-18 23:16:57 +08:00
Changho Hwang
8c0f9e84d0 v0.3.0 (#171) 2023-10-11 22:35:54 +08:00
Changho Hwang
6c0ee72916 Construct ProxyChannel with shared pointers (#184) 2023-09-18 05:46:23 +00:00
Changho Hwang
d2f13f1e54 Fix #174 (#180)
Added `extern "C"` based on another specification in
`/usr/local/cuda/include/crt/common_functions.h`.
2023-09-15 06:44:41 +00:00
Binyang2014
952f2da9cc Improve single node allreduce performance (#169)
Improve all reduce performance for single node.
New number:
|   n_ctx | size    |  target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
|       1 | 24.0kB  |            7.7 |            |        7.23|
|       2 | 48.0kB  |            7.7 |            |        7.69|
|       4 | 96.0kB  |            8   |            |        8.34|
|       8 | 192.0kB |           12.6 |            |        9.75|
|      12 | 288.0kB |           13   |            |       11.34|
|      16 | 384.0kB |           13.3 |            |       12.99|
|     768 | 18.0MB  |          158.7 |       160.3|            |
|     896 | 21.0MB  |          184.5 |       183.8|            |
|    1024 | 24.0MB  |          209.5 |       207.5|            |
|    1152 | 27.0MB  |          234.3 |       231.9|            |
|    1280 | 30.0MB  |          260   |       255.6|            |
|    1408 | 33.0MB  |          284.9 |       278.7|            |
|    1536 | 36.0MB  |          310.3 |       302.0|            |
|    1664 | 39.0MB  |          336.2 |       325.3|            |
|    1792 | 42.0MB  |          361.4 |       348.8|            |
|    1920 | 45.0MB  |          384.6 |       372.2|            |
|    2048 | 48.0MB  |          409.1 |       395.4|            |

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-09-13 14:30:08 +00:00
Binyang2014
097aa8843a Fix pytest unstable issue. (#170)
- remove `#include <cstdint>` from `poll.hpp`. To make it only contains
device-side code
- Fix compilation issue, which will cause pytest fail randomly. Reuse
the compiled result for same kernel with different arguments
2023-09-06 17:09:04 -07:00
Olli Saarikivi
828be48b21 Add Context and Endpoint classes to enable non-Communicator use-cases (#166)
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.

The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because with the way `Context` is
implemented a `Connection` is now fully initialized on construction.

Some smaller breaking API changes from this change are that
`RegisteredMemory` no longer has a `rank()` function (as there maybe no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.

A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.

Miscellaneous changes:
-Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
-Enables ctest to be run from the build root directory.
2023-09-06 13:10:04 +08:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Binyang2014
a58e2e9623 Make sure the semaphore not be released during the lifecycle of SmChannel (#131)
Fix #126

 - Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into the `SmChannel` 
 - add a `DeviceHandle` struct in `SmChannel`
 - add `DeviceHandle` template
 
Users need to write code like this to use channel in device side:
```
using DeviceHandle = mscclpp::DeviceHandle<T>;
__device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__device__ DeviceHandle<mscclpp::SmChannel> smChannel;
```

To cover a channel to deviceHandle, need to call this function:
`mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)`

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-07-20 12:18:22 +08:00
Saeed Maleki
e7d5e652df Python bindings (#125)
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Changho Hwang
1d71715d19 Separate mscclpp-test kernels (#122)
Separate different kernel implementations in mscclpp-test to reduce the
number of registers required by the kernels.
2023-07-10 10:11:20 -07:00
Changho Hwang
4114d65c60 Documents & minor updates (#119)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-07 17:35:05 +08:00
Changho Hwang
bb7b85a810 2-node AllReduce improvements (#118)
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel
3) and 99GB/s for 48MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
Changho Hwang
6ec585f3d8 Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00
Changho Hwang
21eed722af Add license comments (#106) 2023-06-25 12:40:12 +08:00
Binyang2014
2640578b22 Add performance check for mscclpp-test (#110)
- Add ndmv4 perf baseline
- change mscclpp-test to output perf number into a json file
- add python script to check the perf result with the baseline
2023-06-21 07:42:53 +00:00
Changho Hwang
60b3dd5a61 Bug fixes & resolve warnings (#107)
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00
Binyang2014
8410fcd8fc Fix allgather kernel 2 perf bug (#108)
Fix #105
2023-06-16 15:36:20 +08:00
Binyang2014
8efacae332 update pipeline (#103)
Update Azure pipeline:
- Using mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-nodes
2023-06-14 20:14:57 +08:00
Changho Hwang
4d0b0a650f Remove vulnerable sscanf (#101) 2023-06-14 10:02:46 +08:00
Binyang2014
b1ce368656 Implement host offload algorithm for allgather (#84)
Implement host offload algorithm for allgather
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            32    73.02    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16    90.30    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
2023-06-13 10:01:58 +00:00
Binyang2014
6ee4e80317 Create Azure pipeline for multi-node tests (#97)
Create Azure pipeline to run mscclpp-test on multi-nodes
2023-06-13 06:34:07 +00:00
Changho Hwang
76718e4015 Saemal/atomic signal (#96)
* code complelete

* fix correctness issue

* Fix correctness issuee

* fix lint

* ass compile

* Fix build issue

* Fix runtime error

* Fix correctness issue

* Fix crash issue

* minor change

* Fix memory leak

* Fix review comments

* Finish allgather

* address comments

* load element to register first then store to remote address

* Finish allGather

* init

* Build connections

* allreduce_test works

* Bug fix

* Add CUDA flags

* Add packet copy (LL)

* Lint

* Set tmpPtr from constructors

* Lint

* Multiple blocks per peer

* Beautify

* Temporal ring reduce

* Ring reduce works correctly

* Overlapping

* Fix overlapping

* Improve vector sum

* figuring out how to use atomics

* working now

* wip

* Enhance LL AllReduce

* Support multiple blocks per peer

* Fix a ring reduce bug

* Fix a AllReduce kernel 2 bug

* Bug fix

* wip

* Make it compilable

* Lint

* Lint

* Minor changes

* Unit test to reproduce memory consistency bugs

* Unit test bug fixes

* Fixes

* Typo

* wip

* done with core

* wip

* wip

* compiles

* only the atomic is failing

* almost working

* all tests pass now

* clang-12

* More jailbreaks

* bug fix for common.cu

* adding stdint to concurrency.hpp

* Out-of-place for AllReduce kernel 2

* Optimize `sync()`

* Fix mp_unit_tests

* Init TestEngine with TestArgs

* Change common.cu into common.cc

* Cleanup common.hpp

* Lint

* fixes to the mscclpp-tests

* fixed common.cc

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-12 21:38:06 -07:00