mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-24 23:06:17 +00:00

Author	SHA1	Message	Date
Changho Hwang	a02ba3b1bd	Add `GpuIpcMemHandle` (#704 ) Add `GpuIpcMemHandle`, a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes failures to properly fall back to a feasible type of memory handle in the importing environment. It also consolidates the code for creating or destroying the various memory handles into a single RAII wrapper. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-14 10:49:31 +08:00
Changho Hwang	fc221e234d	Remove UB `std::` declarations (#709 ) Remove custom declarations inside `std::` whose behavior is undefined by the standard	2026-01-05 11:11:46 +08:00
Changho Hwang	2cf14ff723	Minor fixes (#715 )	2026-01-05 11:09:48 +08:00
Changho Hwang	bb555277ad	Rename `P2P` log subsys into `GPU` (#716 )	2026-01-05 11:08:43 +08:00
Binyang Li	ca6a4a3274	Replace `__HIP_PLATFORM_AMD__` to use internal macro (#712 ) Replace most checks for `__HIP_PLATFORM_AMD__` with `MSCCLPP_DEVICE_HIP` for device code and `MSCCLPP_USE_ROCM` for host source files.	2026-01-04 04:47:58 -08:00
qishilu	b2d96e8ba5	Use uncached memory on Rocm platform to avoid hang (#711 ) MSCCLPP_DEVICE_HIP is undefined because it is defined in device.hpp. Use __HIP_PLATFORM_AMD__ here.	2025-12-24 10:49:36 +08:00
Binyang Li	eda74a7f29	Add handle cache for AMD platform (#698 ) Introduce a handle cache for the AMD platform. This avoids reaching the handle limit when too many IPC handles are opened. For NVIDIA, we don't need this feature, since NVIDIA counts handle references internally and reuses the same handle if it has already been opened. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-21 18:39:12 -08:00
Changho Hwang	9e076da3d4	Make IB more configurable (#703 ) * Added `port` and `gidIndex` field in the IB endpoint config (and `deviceIndex` field for future usages) * Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so * Added `--ib_gid_index` CLI option to `mp_unit_tests` * Other minor fixes	2025-12-18 13:21:07 -08:00
Changho Hwang	51a86630ff	Build fixes (#696 ) * Fix CMake build for CUDA 13 * Add a missing header file	2025-11-26 20:02:01 -08:00
Changho Hwang	8b75634d31	Optimized logger (#693 ) * Leverage constant folding * Use `shouldLog()` function for early exit * Per-thread timestamp caching to remove mutex	2025-11-25 08:58:17 -08:00
Changho Hwang	ddf84a6b9d	Add `CudaDeviceGuard` (#691 ) Add an RAII guard that sets the proper GPU device before a CUDA API call. We may make this stateful in the future to minimize `cudaGetDevice()` calls. This PR fixes a bug in tutorial 01.	2025-11-24 13:38:44 -08:00
Changho Hwang	8b8593ba51	Fix Python bindings and tests (#690 ) Minimal fix to make things work. We need a more careful look at preventing silent fallback of nanobind when it fails to (properly) construct a C++ STL object with mscclpp instances.	2025-11-21 12:53:12 -08:00
Caio Rocha	bbdeafb3ca	Fix Error in Non IB Env at Executor (#686 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-17 16:35:57 -08:00
Qinghua Zhou	b9428341a2	Revise the mscclpp datatype (#671 ) Use mscclpp::DataType to replace the following types in API interface: int dtype; ncclDataType_t dtype; Add data type conversion: Convert ncclDataType_t to mscclpp::DataType	2025-11-17 12:58:47 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00
Caio Rocha	7eb3ff701a	Supporting New Packet Kernel Operation at Executor (#677 ) This PR introduces three new operations to enhance flexibility and performance in the executor. One operation can be invoked directly via the DSL API, and two are created through fusion of existing operations, reducing overhead and improving efficiency. 1. Port Channel Put Packet (direct DSL API call): Sends data in pkt format to the remote side, also in pkt format, via the port channel. Both source and destination buffers must be scratch buffers. 2. Reduce Copy Packet (fusion): Reduce Packet + Copy Packet = Reduce Copy Packet. Triggered when the destination buffer of Reduce Packet matches the source buffer of Copy Packet. Purpose: combine reduction and copy into a single step for better performance. 3. Reduce Copy Send Packet (fusion): Reduce Copy Packet + Put Packet = Reduce Copy Send Packet (when the dst buffer of Reduce Copy Packet matches the src buffer of Put Packet); Reduce Copy Packet + Read Put Packet = Reduce Copy Send Packet (when the dst pkt buffer of Reduce Copy Packet matches the src buffer of Read Put Packet). Purpose: combine reduction, copy, and send operations into one optimized pipeline. Fusion diagram: Reduce Packet + Copy Packet → Reduce Copy Packet; Reduce Copy Packet + Put Packet → Reduce Copy Send Packet; Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet. Beyond this, this PR adjusts the AllReduce 2 Node algorithm. Latency (µs) by message size: 1K: 15.34, 2K: 15.88, 4K: 15.71, 8K: 16.01, 16K: 15.88, 32K: 16.21, 64K: 16.90, 128K: 18.24, 256K: 20.39, 512K: 25.26, 1M: 32.74, 2M: 53.64	2025-11-13 14:08:44 -08:00
Caio Rocha	eb202780f5	Support Synchronous Initialization for Proxy Service (#679 )	2025-11-12 18:35:57 -08:00
Changho Hwang	ffafcaf6d6	IB stack enhancements & bug fixes (#673 ) * Always use `ibv_reg_dmabuf_mr` when DMABUF is supported * Do not check `nvidia-peermem` when unnecessary * More rigorous check on IB port availability * Fixed ibverbs wrappers * Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test	2025-11-07 14:26:17 -08:00
Changho Hwang	960a8ddebd	Add a new logger (#668 ) * Add `logger.hpp` that will gradually replace `debug.h` * Minor fixes in `ib.cc` --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2025-11-04 10:32:46 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two ways to integrate the MSCCL++ DSL: 1. Integrate with a customized communication group. 2. Integrate with the NCCL API. Introduces new Python APIs to make it work: ```python mscclpp.compile # compile dsl to json based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to ExecutionPlanRegistry mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Changho Hwang	9994f53cea	Fixes for no-IB systems (#667 ) * Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB on/off * Fix `nvidia-peermem` check; no need for DMABUF supported systems * Fix `mp_unit_tests` to skip all IB tests when built with `-DMSCCLPP_USE_IB=OFF`	2025-10-29 10:03:02 -07:00
Caio Rocha	2b987cf8e8	Resolve IBVerbs Loading Issues (#648 ) Some systems do not include libibverbs.so when installing ibverbs; instead, they only provide libibverbs.so.1. This PR updates the CMake file to search for this library and modifies the wrapper to load it. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-28 18:14:53 -07:00
Qinghua Zhou	a38c2ee784	FP8 support for Allreduce (#646 ) Add FP8 support for Allreduce on both NVIDIA and AMD platform. Add new data type: fp8_e4m3 and fp8_e5m2 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-10-27 14:51:48 -07:00
Binyang Li	2b05908635	Add token pool for cuCreate API (#628 ) Create a tokenPool to allocate tokens. This feature is used to support inter-node NVL and to reduce the footprint caused by cuCreate --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-27 11:19:21 -07:00
Changho Hwang	68b1f151f0	Rename nvls* files (#660 ) Rename nvls* files to switch_channel*	2025-10-24 11:34:26 -07:00
Changho Hwang	200cdf946e	Update `EndpointConfig` interfaces (#651 ) * Separate IB-specific options into a nested struct * Enable `connect()` by an `Endpoint`, not only by `EndpointConfig` * Other minor changes	2025-10-22 10:39:39 -07:00
Changho Hwang	2f7d74b281	Fix lint.sh (#652 ) Exit 1 upon any errors from clang-format or black	2025-10-20 17:23:01 -07:00
Binyang Li	70b8297c56	Revise NCCL API implementation (#617 ) - Make the NCCL interface extensible. Users can register their own algorithms with the NCCL API and provide a customized algorithm selection function. - Fall back to NCCL/RCCL if the algorithm selection function selects no algorithm - MSCCLPP interfaces now work at any scale	2025-09-26 10:08:12 -07:00
Binyang Li	5ac427610d	Address teardown issue (#638 ) Ignore cuda/cu errors during teardown. Some pointers may be invalid at this point	2025-09-25 12:12:40 -07:00
Binyang Li	bf8d424ae3	use unix socket to share fd (#634 ) Use a unix socket to share fds with other processes. Used for nvls handle sharing. Update the nccl interface to support worldSize=1 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-09-25 11:40:54 -07:00
Changho Hwang	43f160c8e6	Fix for safe process teardown (#633 ) * `gpuFree()` functions are usually called during process teardown, so we let them ignore the related errors. `AvoidCudaGraphCaptureGuard` is constructed in `gpuFree*()` functions, so it needs the same fix.	2025-09-10 20:28:04 -07:00
Binyang Li	ba4c4aaeb8	Integrate MSCCL++ with torch workload (#626 ) Integrate MSCCL++ with torch. Introduce the `NCCL audit shim library`; users can use the following commands to launch a torch script. Also avoid breaking the build pipeline on CPU machines. ```bash export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH torchrun --nnodes=1 --nproc_per_node=8 your_script.py ```	2025-09-09 13:28:32 -07:00
Binyang Li	4bbe16b351	Fix hang issue in logging submodule (#625 ) Fix: #622, using std::recursive_mutex to allow acquiring `lock` recursively in the same thread	2025-09-05 09:18:28 -07:00
Changho Hwang	c4d8781390	Fix memory exchange within a single process (#624 )	2025-09-04 12:53:51 -07:00
Caio Rocha	c3473b1794	Thread Block Group DSL (#621 ) Support creating a group of thread blocks to perform an operation.	2025-09-03 14:58:40 -07:00
Changho Hwang	547a9ae65c	Fixed cpp linter (#619 )	2025-08-25 12:15:45 -07:00
Caio Rocha	f839184a27	Fix deadlock in Executor channel setup (#616 ) In cases where we have circular channel creation, such as: creating channel 0 <-> 1 creating channel 1 <-> 2 creating channel 2 <-> 3 creating channel 3 <-> 0 creating channel 0 <-> 3 creating channel 1 <-> 0 creating channel 2 <-> 1 creating channel 3 <-> 2 This setup can result in a deadlock during the first channel creation for each rank. The current code requires sharing the semaphore for the first channel before moving on, which leads to the following sequence: creating channel 0 <-> 1 creating channel 1 <-> 2 creating channel 2 <-> 3 creating channel 3 <-> 0 <-- HANG ISSUE --> The process hangs because, for example, rank 0 will only share the semaphore with rank 3 after receiving it from rank 1. However, rank 1 is waiting for a semaphore from rank 2, rank 2 is waiting for one from rank 3, and rank 3 is waiting for one from rank 0. The solution is to make this creation asynchronous and only retrieve the semaphore after all semaphores have been requested.	2025-08-19 16:44:44 -07:00
Caio Rocha	9261b1d278	AlltoAll Test Support (#606 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-15 16:00:41 -07:00
Binyang Li	03c0ff2a91	Fix for multi-nodes test (#614 ) Fix multi-node test --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-14 20:44:43 -07:00
Binyang Li	671b688bb3	Create ib mr for per ib transport (#611 )	2025-08-14 19:54:48 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains following changes: Python side: - Channel based DSL implementation: decouple channel with chunk. - Users create channel explicitly, only need local_rank, remote_rank and channel_type - Adjust executor json file, add remote_buffer fields, different op can use different channel and remote buffers combination. - Reimplement operation fusion, data dependency check mechanism - Add new op such as semaphore, pipeline - Clean code and enhance document C++ side: - Support new execution file json format - Support semaphore and pipeline operation - code clean, support non-zero copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Changho Hwang	1cc1b827f4	MNNVL fix (#604 )	2025-08-08 19:23:55 +00:00
Changho Hwang	542a10f69e	Merge ChannelTrigger with ProxyTrigger (#601 )	2025-08-08 19:07:50 +00:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585 ) Change to use smart pointers for IB structures. Registered memory will own ibMr; ibCtx will not hold the reference - Use smart pointers for IbQp and IbMr - Update memoryChannel API, keep localRegisteredMemory - Close fd when registered memory is released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Changho Hwang	d55ac96f5e	Fixed the local channel test (#597 )	2025-08-05 15:33:48 -07:00
Changho Hwang	334b232e36	Fix GpuStreamPool to be aware of the device ID of streams (#590 ) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-04 11:07:31 -07:00
Changho Hwang	c580e4c503	Support CudaIpc connection within a single process (#593 ) * Allow CudaIpc connection between GPUs in a single process * Added an example of connection in a single process * Minor interface updates --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-02 12:59:36 +08:00
Changho Hwang	199468bc47	Revise NVLS interface (#458 ) * Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel` * Minor interface improvements	2025-07-12 00:33:03 +00:00
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Changho Hwang	20eca28942	Fix a FIFO correctness bug (#549 ) * Add a FIFO test code that reproduced a correctness issue * Fix the correctness issue by using pinned memory instead of cudaMemcpy --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-07-11 23:53:59 +00:00

484 Commits