mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 17:26:04 +00:00

Author	SHA1	Message	Date
Changho Hwang	d63f9403c0	IB `host-no-atomic`: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753 ) Major enhancements to the IB signal forwarding mechanisms (`host-no-atomic` mode), primarily adding support for GDRCopy and MLX5 Direct Verbs, and refactoring the signal forwarding path for IB HostNoAtomic mode. The changes fix memory consistency issues and reduce signaling latency. - GDRCopy and MLX5 Direct Verbs MR integration - Signal forwarding path redesign - Semaphore and connection API updates - Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates	2026-04-09 09:24:30 +00:00
Binyang Li	4f3638b60d	Use PTX red for D2D semaphore signal (#768 ) ## Summary - Replace the two-step `signal()` implementation (`incOutbound()` + `atomicStore()`) with a single fire-and-forget PTX `red.release.sys.global.add.u64` instruction - This eliminates one local atomic fetch-add and replaces a remote store with a remote atomic add that has no return value — more efficient on both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to fire-and-forget `flat_atomic_add_x2`) - Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait ping-pong latency ### Performance (H100, 2 ranks, signal+wait round-trip) ``` SemaphorePerfTest.SignalPingPong: Store-based (old): 2.595 us/iter Red-based (new): 2.345 us/iter Speedup: 1.11x ``` ## Test plan - [x] Builds successfully (`make mp_unit_tests`) - [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter "SemaphorePerfTest"` — 1.11x speedup 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 15:34:43 -07:00
Copilot	93f6eeaa6b	Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744 ) - Removes the GTest dependency, replacing it with a minimal custom framework (`test/framework.`) that covers only what the tests actually use — a unified `TEST()` macro with SFINAE-based fixture auto-detection, `EXPECT_`/`ASSERT_*` assertions, environments, and setup/teardown. - `--exclude-perf-tests` flag and substring-based negative filtering - `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to Codecov - Merges standalone `test/perf/` into main test targets - Refactors Azure pipelines to reduce redundancies & make more readable --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-03-24 23:34:38 -04:00
Binyang Li	bf946ea51e	Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759 ) ## Summary This PR addresses a multicast resource leak, fixes `cuMemMap` offset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test for `SwitchChannel`. ### Bug Fixes #### 1. Fix multicast allocation handle leak in `createMulticast()` (`gpu_ipc_mem.cc`) `GpuIpcMemHandle::createMulticast()` called `cuMulticastCreate(&allocHandle, ...)` but never released the local `allocHandle` after exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak — the multicast object was never freed even after all mappings and imported handles were released. Per the [CUDA Driver API docs for `cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html): > "The memory allocation will be freed when all outstanding mappings to the memory are unmapped and when all outstanding references to the handle (including its shareable counterparts) are also released." The fix adds `cuMemRelease(allocHandle)` after export, matching the existing pattern used for regular allocations in `GpuIpcMemHandle::create()`. Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100. #### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`) `cuMemMap` requires `offset=0` for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the same `NvlsConnection`. The fix maps the entire range `[0, mcOffset + bufferSize)` and returns the pointer offset by `mcOffset`. This only consumes extra virtual address space; no additional physical memory is used. ### Refactoring #### 3. Rename NVLS allreduce algorithm classes Renamed for clarity: - `AllreduceNvls` → `AllreduceNvlsZeroCopy` - `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline` - `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline` Updated all references in builder, selector, docs, and examples. #### 4. Move `nvlsConnections` setup to `initialize()` Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this member) to individual algorithm class members, initialized in their `initialize()` methods. ### Tests #### 5. Add `TwoChannelsSameConnection` test New unit test that creates two `SwitchChannel` instances from the same `NvlsConnection`, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered the `cuMemMap` offset fix. ### Files Changed - `src/core/gpu_ipc_mem.cc` — multicast handle leak fix + cuMemMap offset fix - `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu` (renamed) - `src/ext/collectives/allreduce/allreduce_nvls_packet.cu` — nvlsConnections fix - `src/ext/collectives/include/allreduce/*.hpp` — renamed headers - `src/ext/collectives/algorithm_collection_builder.cc` — updated references - `src/ext/nccl/algorithm_selector.cc` — updated algorithm names - `test/mp_unit/switch_channel_tests.cu` — new test - `docs/guide/mscclpp-torch-integration.md` — updated names - `examples/torch-integration/customized_comm_with_default_algo.py` — updated names	2026-03-09 10:22:45 -07:00
Changho Hwang	42be3660e0	Add a new IB stack impl that doesn't use RDMA atomics (#728 ) * Added configurable InfiniBand (IB) signaling mode. `EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`, `HostNoAtomic`). `Default` is equivalent to `Host` unless specified different by envrionment `MSCCLPP_IBV_MODE`. `Host` corresponds to the previous implementation using RDMA atomics for signaling, while `HostNoAtomic` uses write-with-immediate instead. * Regarding updates in Python bindings and API.	2026-02-10 01:07:53 +00:00
Changho Hwang	2cf14ff723	Minor fixes (#715 )	2026-01-05 11:09:48 +08:00
Binyang Li	eda74a7f29	Add handle cache for AMD platform (#698 ) Introduce handle cache for AMD platform. Avoid reaching handle limitation if we open too much IPC handles For nvidia, we don't need this feature since nvidia will count the handle reference internally and reuse the same handle if already be opened --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-21 18:39:12 -08:00
Changho Hwang	9e076da3d4	Make IB more configurable (#703 ) * Added `port` and `gidIndex` field in the IB endpoint config (and `deviceIndex` field for future usages) * Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so * Added `--ib_gid_index` CLI option to `mp_unit_tests` * Other minor fixes	2025-12-18 13:21:07 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00
Changho Hwang	ffafcaf6d6	IB stack enhancements & bug fixes (#673 ) * Always use `ibv_reg_dmabuf_mr` when DMABUF is supported * Do not check `nvidia-peermem` when unnecessary * More rigorous check on IB port availability * Fixed ibverbs wrappers * Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test	2025-11-07 14:26:17 -08:00
Changho Hwang	9994f53cea	Fixes for no-IB systems (#667 ) * Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB on/off * Fix `nvidia-peermem` check; no need for DMABUF supported systems * Fix `mp_unit_tests` to skip all IB tests when built with `-DMSCCLPP_USE_IB=OFF`	2025-10-29 10:03:02 -07:00
Changho Hwang	68b1f151f0	Rename nvls* files (#660 ) Rename nvls* files to switch_channel*	2025-10-24 11:34:26 -07:00
Binyang Li	bf8d424ae3	use unix socket to share fd (#634 ) Use unix socket to share fd to other processes. Used for nvls handle sharing Update nccl interface to support worldSize=1 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-09-25 11:40:54 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains following changes: Python side: - Channel based DSL implementation: decouple channel with chunk. - Users create channel explicitly, only need local_rank, remote_rank and channel_type - Adjust executor json file, add remote_buffer fields, different op can use different channel and remote buffers combination. - Reimplement operation fusion, data dependency check mechanism - Add new op such as semaphore, pipeline - Clean code and enhance document C++ side: - Support new execution file json format - Support semaphore and pipeline operation - code clean, support non-zero copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585 ) Change to use smart pointer for IB structure. Registered memory will own ibMr, ibCtx will not held the reference - Use smart pointer for IbQp and IbMr - Update memoryChannel API, keep localRegisteredMemory - Close fd when registedMemory released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Binyang Li	01e72f3aca	Fix multinode test failure (#574 ) Add CPU based connection to fix multi-node test failure issue --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-07-23 10:33:44 -07:00
Binyang Li	5e991cf5c8	update readme & bump version (#550 ) Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-07-12 01:00:18 -07:00
Changho Hwang	199468bc47	Revise NVLS interface (#458 ) * Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel` * Minor interface improvements	2025-07-12 00:33:03 +00:00
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Changho Hwang	83356957bd	Improved documentation & minor interface revision (#541 )	2025-06-03 14:26:27 -07:00
Changho Hwang	de664ad200	Fix #514 (#521 ) * In cases when the same `tag` is used for receiving data from the same remote rank, #514 changed the behavior of `Communicator::connect` and `Communicator::recvMemory` to receive data in the order of `std::shared_future::get()` is called, instead of the original behvaior that receive data in the order of the method calls. Since the original behavior is more intuitive, we get that back. Now when `get()` is called on a future, the async function will first call `wait()` on the latest previously returned future. In a recursive manner, this will call `wait()` on all previous futures that are not yet ready. * Removed all deprecated API calls and replaced into the new ones.	2025-05-13 13:43:35 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Changho Hwang	1e82dd444f	Make ibverbs optional at compile time (#340 ) Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-08-21 12:47:05 -07:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294 )	2024-05-03 11:54:50 -07:00
Changho Hwang	d4ede480f4	Ethernet support (#284 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-04-25 11:06:43 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Changho Hwang	cdaf3aea3d	New packet format & optimizations (#256 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-20 20:01:37 -08:00
Changho Hwang	4eb0a08b8c	Add `putWithSignal()` latency tests (#246 )	2024-01-24 01:10:35 +00:00
Changho Hwang	544ff0c21d	ROCm support (#213 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	11ac824cc7	Align interfaces of put/get/putPackets/getPackets (#185 )	2023-10-07 22:18:26 +08:00
Changho Hwang	b3d0fdb8df	Add an atomic signal perf test (#183 )	2023-09-18 08:12:14 +00:00
Changho Hwang	a6b24dcbed	Fix #163 (#182 ) The bug was caused as frequent calls of initialize() temporarily exhaust all available ephemeral ports. Fixed by retrying `bind()` after a while upon `EADDRINUSE`.	2023-09-15 08:35:01 +00:00
Changho Hwang	3aa72098d9	Add `poll()` for semaphores (#181 )	2023-09-15 07:40:44 +00:00
Saeed Maleki	015e29c138	adding signal for atomic op (#178 ) This address [this](https://github.com/microsoft/mscclpp/issues/177).	2023-09-11 10:46:25 -07:00
Binyang2014	097aa8843a	Fix pytest unstable issue. (#170 ) - remove `#include <cstdint>` from `poll.hpp`. To make it only contains device-side code - Fix compilation issue, which will cause pytest fail randomly. Reuse the compiled result for same kernel with different arguments	2023-09-06 17:09:04 -07:00
Olli Saarikivi	828be48b21	Add Context and Endpoint classes to enable non-Communicator use-cases (#166 ) This PR implements and closes #137. The new `Endpoint` and `Context` classes expose the connection establishing functionality from `Communicator`, which now is only responsible for tying together the bootstrapper with a context. The largest breaking change here is that `Communicator.connectOnSetup(...)` now returns the `Connection` wrapped inside a `NonblockingFuture`. This is because with the way `Context` is implemented a `Connection` is now fully initialized on construction. Some smaller breaking API changes from this change are that `RegisteredMemory` no longer has a `rank()` function (as there maybe no concept of rank), and similarly `Connection` has no `remoteRank()` and `tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf` functions in `Communicator`. A new `EndpointConfig` class is introduced to avoid duplication of the IB configuration parameters in the APIs of `Context` and `Communicator`. The usual usage pattern of just passing in a `Transport` still works due to an implicit conversion into `EndpointConfig`. Miscellaneous changes: -Cleans up how the PIMPL pattern is applied by making both the `Impl` struct and the `pimpl_` pointers private for all relevant classes in the core API. -Enables ctest to be run from the build root directory.	2023-09-06 13:10:04 +08:00
Saeed Maleki	8d1b984bed	Change device handle interfaces & others (#142 ) * Changed device handle interfaces * Changed proxy service interfaces * Move device code into separate files * Fixed FIFO polling issues * Add configuration arguments in several interface functions --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>	2023-08-16 20:00:56 +08:00
Binyang2014	a58e2e9623	Make sure the semaphore not be released during the lifecycle of SmChannel (#131 ) Fix #126 - Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into the `SmChannel` - add a `DeviceHandle` struct in `SmChannel` - add `DeviceHandle` template Users need to write code like this to use channel in device side: ``` using DeviceHandle = mscclpp::DeviceHandle<T>; __device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel; __device__ DeviceHandle<mscclpp::SmChannel> smChannel; ``` To cover a channel to deviceHandle, need to call this function: `mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)` --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-07-20 12:18:22 +08:00
Saeed Maleki	e7d5e652df	Python bindings (#125 ) Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-07-19 15:35:54 +08:00
Changho Hwang	4114d65c60	Documents & minor updates (#119 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-07-07 17:35:05 +08:00
Changho Hwang	bb7b85a810	2-node AllReduce improvements (#118 ) * Added `get()` interfaces to `SmChannel` * Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel 3) and 99GB/s for 48MB (kernel 4) * Fixed a FIFO perf bug * Several fixes & validations in mscclpp-test --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-07-07 07:05:46 +00:00
Changho Hwang	6ec585f3d8	Packet copy for IB (#109 ) * Extend channels to support LL with IB * Rename classes and interfaces	2023-06-28 10:39:31 -07:00

1 2

54 Commits