Commit Graph

64 Commits

Author SHA1 Message Date
Qinghua Zhou
f803eff8b9 Use multiple thread blocks; Add peer-parallel kernels 2026-02-24 04:05:01 +00:00
Qinghua Zhou
21e3f1ebb3 Get correct remote receive displacements for peers 2026-02-23 14:22:30 +00:00
Qinghua Zhou
97426b3483 Use same chaneel for both signal and wait for ring kernel; Add pipelined kernel for imbalannced worloads 2026-02-18 12:13:30 +00:00
Qinghua Zhou
43980da455 Use maximum threads (1024) for best bandwidth utilization 2026-02-18 03:00:29 +00:00
Qinghua Zhou
b7485762a5 Improve with memory channels for intra-node communication 2026-02-17 13:55:15 +00:00
Qinghua Zhou
c42579e900 Move the alltoallv kernel to the src directory; Utilize the kernel in mscclpp-test 2026-02-06 02:57:34 +00:00
Qinghua Zhou
ac3e770c42 Add alltoallv kernel and test 2026-02-05 07:41:35 +00:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize current native algorithm implementation and DSL algorithm
implementation.
Provide unified API for DSL algo and native algo and provide interface
to tune the algo
Provide interface for pytorch integration with native API and DSL

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
eb202780f5 Support Synchronous Initialization for Proxy Service (#679) 2025-11-12 18:35:57 -08:00
Changho Hwang
547a9ae65c Fixed cpp linter (#619) 2025-08-25 12:15:45 -07:00
Binyang Li
03c0ff2a91 Fix for multi-nodes test (#614)
Fix multi-node test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-14 20:44:43 -07:00
Binyang Li
bb76d27553 all2all implementation (#609)
Implement single node all2all via MSCCL++ C++API
perf kernel 3:
```
       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.41   44.78   39.19      0
     2097152         65536                                     23.95   87.56   76.61      0
     4194304        131072                                     27.50  152.51  133.45      0
     8388608        262144                                     35.14  238.73  208.89      0
    16777216        524288                                     57.54  291.55  255.11      0
    33554432       1048576                                     109.7  305.81  267.59      0
    67108864       2097152                                     212.3  316.07  276.56      0
   134217728       4194304                                     410.9  326.64  285.81      0
   268435456       8388608                                     784.9  341.99  299.24      0
```

kernel 2
```

#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.42   44.77   39.17      0
     2097152         65536                                     24.96   84.02   73.52      0
     4194304        131072                                     28.53  147.03  128.65      0
     8388608        262144                                     36.75  228.28  199.75      0
    16777216        524288                                     58.01  289.20  253.05      0
    33554432       1048576                                     110.4  303.83  265.85      0
    67108864       2097152                                     212.4  315.99  276.49      0
   134217728       4194304                                     407.8  329.12  287.98      0
   268435456       8388608                                     797.4  336.64  294.56      0
```

NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        524288      half    none      -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576      half    none      -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152      half    none      -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304      half    none      -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608      half    none      -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216      half    none      -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432      half    none      -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
2025-08-14 11:30:40 -07:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Change to use smart pointer for IB structure. Registered memory will own
ibMr, ibCtx will not held the reference
- Use smart pointer for IbQp and IbMr
- Update memoryChannel API, keep localRegisteredMemory
- Close fd when registedMemory released

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Changho Hwang
b4dde38db8 FIFO improvements (#557)
* Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default.
* Optimize `FifoDeviceHandle`.
* Do not use `cudaHostAllocWriteCombined` that increases latency.
* Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`.
* Fix proxy NUMA binding issues.
* Prevent graph capture inside proxy threads.
* Now `CudaIpcConnection` skips stream sync when unnecessary.
* Now any type of connection needs to hold a shared pointer to the
context for memory safety.
* Now a context should be always managed by a shared pointer for memory
safety.
* Minor docs & interface improvements.
* Minor fix in `mscclpp-test` correctness test.
2025-06-24 09:50:28 -07:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
2025-05-13 13:43:35 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Changho Hwang
74130c7c5e Use IB transport flags only when an IB device exists (#355) 2024-09-19 07:13:11 +00:00
Binyang Li
bc465aefcd Add __launch_bounds__ for mscclpp-test (#273) 2024-03-25 15:55:37 -07:00
Binyang Li
4734d8718f Fix multi-node ci pipeline (#272)
Add `__launch_bounds__` to fix perf regression issue in CI pipeline
2024-03-12 09:39:00 -07:00
Changho Hwang
cdaf3aea3d New packet format & optimizations (#256)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-20 20:01:37 -08:00
Changho Hwang
70e28b3c76 Do not check value of __HIP_PLATFORM_AMD__ (#240)
According to the
[document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary),
`__HIP_PLATFORM_AMD__` is effective only by definition.
2023-12-25 13:51:18 +08:00
Changho Hwang
5ff8bc5ef2 Fix & improve perf for ROCm (#232)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-12-18 11:30:08 +08:00
Changho Hwang
544ff0c21d ROCm support (#213)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-11-24 16:41:56 +08:00
Changho Hwang
060fda12e6 mscclpp-test in Python (#204)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Changho Hwang
3521fb0280 Clear minor warnings (#214)
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Binyang2014
6f43282c1d Fix allreduce bug (#197)
Fix allreduce correctness issue
2023-10-18 23:16:57 +08:00
Changho Hwang
8c0f9e84d0 v0.3.0 (#171) 2023-10-11 22:35:54 +08:00
Changho Hwang
6c0ee72916 Construct ProxyChannel with shared pointers (#184) 2023-09-18 05:46:23 +00:00
Changho Hwang
d2f13f1e54 Fix #174 (#180)
Added `extern "C"` based on another specification in
`/usr/local/cuda/include/crt/common_functions.h`.
2023-09-15 06:44:41 +00:00
Binyang2014
952f2da9cc Improve single node allreduce performance (#169)
Improve all reduce performance for single node.
New number:
|   n_ctx | size    |  target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
|       1 | 24.0kB  |            7.7 |            |        7.23|
|       2 | 48.0kB  |            7.7 |            |        7.69|
|       4 | 96.0kB  |            8   |            |        8.34|
|       8 | 192.0kB |           12.6 |            |        9.75|
|      12 | 288.0kB |           13   |            |       11.34|
|      16 | 384.0kB |           13.3 |            |       12.99|
|     768 | 18.0MB  |          158.7 |       160.3|            |
|     896 | 21.0MB  |          184.5 |       183.8|            |
|    1024 | 24.0MB  |          209.5 |       207.5|            |
|    1152 | 27.0MB  |          234.3 |       231.9|            |
|    1280 | 30.0MB  |          260   |       255.6|            |
|    1408 | 33.0MB  |          284.9 |       278.7|            |
|    1536 | 36.0MB  |          310.3 |       302.0|            |
|    1664 | 39.0MB  |          336.2 |       325.3|            |
|    1792 | 42.0MB  |          361.4 |       348.8|            |
|    1920 | 45.0MB  |          384.6 |       372.2|            |
|    2048 | 48.0MB  |          409.1 |       395.4|            |

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-09-13 14:30:08 +00:00
Binyang2014
097aa8843a Fix pytest unstable issue. (#170)
- remove `#include <cstdint>` from `poll.hpp`. To make it only contains
device-side code
- Fix compilation issue, which will cause pytest fail randomly. Reuse
the compiled result for same kernel with different arguments
2023-09-06 17:09:04 -07:00
Olli Saarikivi
828be48b21 Add Context and Endpoint classes to enable non-Communicator use-cases (#166)
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.

The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because with the way `Context` is
implemented a `Connection` is now fully initialized on construction.

Some smaller breaking API changes from this change are that
`RegisteredMemory` no longer has a `rank()` function (as there maybe no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.

A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.

Miscellaneous changes:
-Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
-Enables ctest to be run from the build root directory.
2023-09-06 13:10:04 +08:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Binyang2014
a58e2e9623 Make sure the semaphore not be released during the lifecycle of SmChannel (#131)
Fix #126

 - Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into the `SmChannel` 
 - add a `DeviceHandle` struct in `SmChannel`
 - add `DeviceHandle` template
 
Users need to write code like this to use channel in device side:
```
using DeviceHandle = mscclpp::DeviceHandle<T>;
__device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__device__ DeviceHandle<mscclpp::SmChannel> smChannel;
```

To cover a channel to deviceHandle, need to call this function:
`mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)`

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-07-20 12:18:22 +08:00
Saeed Maleki
e7d5e652df Python bindings (#125)
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Changho Hwang
1d71715d19 Separate mscclpp-test kernels (#122)
Separate different kernel implementations in mscclpp-test to reduce the
number of registers required by the kernels.
2023-07-10 10:11:20 -07:00
Changho Hwang
4114d65c60 Documents & minor updates (#119)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-07 17:35:05 +08:00
Changho Hwang
bb7b85a810 2-node AllReduce improvements (#118)
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel
3) and 99GB/s for 48MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
Changho Hwang
6ec585f3d8 Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00
Changho Hwang
21eed722af Add license comments (#106) 2023-06-25 12:40:12 +08:00
Binyang2014
2640578b22 Add performance check for mscclpp-test (#110)
- Add ndmv4 perf baseline
- change mscclpp-test to output perf number into a json file
- add python script to check the perf result with the baseline
2023-06-21 07:42:53 +00:00
Changho Hwang
60b3dd5a61 Bug fixes & resolve warnings (#107)
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00