Commit Graph

49 Commits

Author SHA1 Message Date
Changho Hwang
2cf14ff723 Minor fixes (#715) 2026-01-05 11:09:48 +08:00
Binyang Li
eda74a7f29 Add handle cache for AMD platform (#698)
Introduce handle cache for AMD platform.
Avoid reaching handle limitation if we open too much IPC handles

For nvidia, we don't need this feature since nvidia will count the
handle reference internally and reuse the same handle if already be
opened

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-21 18:39:12 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` field in the IB endpoint config (and
`deviceIndex` field for future usages)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
2025-12-18 13:21:07 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Changho Hwang
ffafcaf6d6 IB stack enhancements & bug fixes (#673)
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
2025-11-07 14:26:17 -08:00
Changho Hwang
9994f53cea Fixes for no-IB systems (#667)
* Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB
on/off
* Fix `nvidia-peermem` check; no need for DMABUF supported systems
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
2025-10-29 10:03:02 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Binyang Li
bf8d424ae3 use unix socket to share fd (#634)
Use unix socket to share fd to other processes. Used for nvls handle sharing
Update nccl interface to support worldSize=1

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-25 11:40:54 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains following changes:
Python side:
- Channel based DSL implementation: decouple channel with chunk.
- Users create channel explicitly, only need local_rank, remote_rank and
channel_type
- Adjust executor json file, add remote_buffer fields, different op can
use different channel and remote buffers combination.
- Reimplement operation fusion, data dependency check mechanism
- Add new op such as semaphore, pipeline 
- Clean code and enhance document
C++ side: 
- Support new execution file json format
- Support semaphore and pipeline operation
- code clean, support non-zero copy scenario

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Change to use smart pointer for IB structure. Registered memory will own
ibMr, ibCtx will not held the reference
- Use smart pointer for IbQp and IbMr
- Update memoryChannel API, keep localRegisteredMemory
- Close fd when registedMemory released

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Binyang Li
01e72f3aca Fix multinode test failure (#574)
Add CPU based connection to fix multi-node test failure issue

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-23 10:33:44 -07:00
Binyang Li
5e991cf5c8 update readme & bump version (#550)
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-07-12 01:00:18 -07:00
Changho Hwang
199468bc47 Revise NVLS interface (#458)
* Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel`
* Minor interface improvements
2025-07-12 00:33:03 +00:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* In cases when the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order of
`std::shared_future::get()` is called, instead of the original behvaior
that receive data in the order of the method calls. Since the original
behavior is more intuitive, we get that back. Now when `get()` is called
on a future, the async function will first call `wait()` on the latest
previously returned future. In a recursive manner, this will call
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced into the new ones.
2025-05-13 13:43:35 -07:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method that is lacking use cases.
2025-04-25 00:02:56 +00:00
Changho Hwang
3565bfdf6d Renaming channels (#436)
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Binyang Li
88d28e07a7 Select algo according to json config (#396)
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root -x
LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN
-x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files
/root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w
10 -n 20
2024-12-03 22:39:20 +00:00
Binyang Li
b30bb260e3 Tune threads per block for mscclpp executor (#345) 2024-09-18 17:21:47 -07:00
Changho Hwang
1e82dd444f Make ibverbs optional at compile time (#340)
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-08-21 12:47:05 -07:00
Ziyue Yang
76328fe623 Add NPKit GPU event support (#310) 2024-06-13 13:59:50 +08:00
Binyang Li
6226556ce2 Optimized the execution kernel (#294) 2024-05-03 11:54:50 -07:00
Changho Hwang
d4ede480f4 Ethernet support (#284)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-04-25 11:06:43 -07:00
Binyang Li
64d837f9ab Add executor to execute schedule-plan file (#283)
Add executor to execute the JSON schedule file generated by msccl-tools

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
5ba6ce00c7 Fix bootstrapping mechanism (#278)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Changho Hwang
cdaf3aea3d New packet format & optimizations (#256)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-20 20:01:37 -08:00
Changho Hwang
4eb0a08b8c Add putWithSignal() latency tests (#246) 2024-01-24 01:10:35 +00:00
Changho Hwang
544ff0c21d ROCm support (#213)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-11-24 16:41:56 +08:00
Changho Hwang
11ac824cc7 Align interfaces of put/get/putPackets/getPackets (#185) 2023-10-07 22:18:26 +08:00
Changho Hwang
b3d0fdb8df Add an atomic signal perf test (#183) 2023-09-18 08:12:14 +00:00
Changho Hwang
a6b24dcbed Fix #163 (#182)
The bug was caused as frequent calls of initialize() temporarily exhaust
all available ephemeral ports. Fixed by retrying `bind()` after a while
upon `EADDRINUSE`.
2023-09-15 08:35:01 +00:00
Changho Hwang
3aa72098d9 Add poll() for semaphores (#181) 2023-09-15 07:40:44 +00:00
Saeed Maleki
015e29c138 adding signal for atomic op (#178)
This address [this](https://github.com/microsoft/mscclpp/issues/177).
2023-09-11 10:46:25 -07:00
Binyang2014
097aa8843a Fix pytest unstable issue. (#170)
- remove `#include <cstdint>` from `poll.hpp`. To make it only contains
device-side code
- Fix compilation issue, which will cause pytest fail randomly. Reuse
the compiled result for same kernel with different arguments
2023-09-06 17:09:04 -07:00
Olli Saarikivi
828be48b21 Add Context and Endpoint classes to enable non-Communicator use-cases (#166)
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.

The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because with the way `Context` is
implemented a `Connection` is now fully initialized on construction.

Some smaller breaking API changes from this change are that
`RegisteredMemory` no longer has a `rank()` function (as there maybe no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.

A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.

Miscellaneous changes:
-Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
-Enables ctest to be run from the build root directory.
2023-09-06 13:10:04 +08:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Binyang2014
a58e2e9623 Make sure the semaphore not be released during the lifecycle of SmChannel (#131)
Fix #126

 - Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into the `SmChannel` 
 - add a `DeviceHandle` struct in `SmChannel`
 - add `DeviceHandle` template
 
Users need to write code like this to use channel in device side:
```
using DeviceHandle = mscclpp::DeviceHandle<T>;
__device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__device__ DeviceHandle<mscclpp::SmChannel> smChannel;
```

To cover a channel to deviceHandle, need to call this function:
`mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)`

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-07-20 12:18:22 +08:00
Saeed Maleki
e7d5e652df Python bindings (#125)
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Changho Hwang
4114d65c60 Documents & minor updates (#119)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-07 17:35:05 +08:00
Changho Hwang
bb7b85a810 2-node AllReduce improvements (#118)
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel
3) and 99GB/s for 48MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
Changho Hwang
6ec585f3d8 Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00
Changho Hwang
21eed722af Add license comments (#106) 2023-06-25 12:40:12 +08:00
Saeed Maleki
cd7797fd5e FIFO optimization (#112)
This saves 2us on IB latency
2023-06-19 05:36:56 +00:00
Changho Hwang
60b3dd5a61 Bug fixes & resolve warnings (#107)
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00
Changho Hwang
6cd8960394 DirectChannel Unit Tests (#102)
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
2023-06-15 20:55:57 +08:00