mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-04-20 14:59:29 +00:00

Author	SHA1	Message	Date
Binyang Li	a707273701	Torch integration (#692 ) Reorganize the current native and DSL algorithm implementations. Provide a unified API for DSL and native algorithms, plus an interface to tune the algorithm. Provide an interface for PyTorch integration with the native API and the DSL. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	78ce9fac8d	Fix ci pipeline failure (#729 )	2026-01-21 13:28:14 -05:00
Binyang Li	abbdb7f630	Fix ci issue (#727 ) Fix the CI failure that occurs when the CUDA version is newer than the driver version	2026-01-15 22:21:02 -08:00
Changho Hwang	105239fc6c	Use `GpuIpcMem` for NVLS connections (#719 ) * Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast memory handling. * Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API handles this automatically). * Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a shared pointer with custom deleter for unmapping, which prevents misuse of raw pointers and reduces states to be stored in the `GpuIpcMem` instance. * Now for `RuntimeIpc` type handles, for consistency with other types, `cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of the ctor of `GpuIpcMem`. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-15 13:16:04 +08:00
Changho Hwang	c2a87302bd	Reduce CI build time (#723 ) Specify GPU architecture during CI build to reduce build time	2026-01-15 10:45:40 +08:00
Changho Hwang	a02ba3b1bd	Add `GpuIpcMemHandle` (#704 ) Add `GpuIpcMemHandle`, a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes failures to properly fall back to a feasible type of memory handle in the importing environment. It also consolidates code for creating or destroying various memory handles into a single RAII wrapper. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-14 10:49:31 +08:00
Changho Hwang	4dd075602c	Bypassing SSCA alerts (#721 ) Remove default image tags to bypass SSCA alerts	2026-01-12 23:46:27 +08:00
Changho Hwang	b8a1b0a134	Add CUDA 13.0 Docker images (#720 ) * Updated Dockerfiles and the build script to support CUDA 13.0 * Added Python3 venv which is required since Python 3.12 * Updated the default MLNX-OFED version to the LTS version * Added docker push instruction for multi-arch manifest	2026-01-09 19:03:33 +08:00
Binyang Li	eab2afb8b9	Update container images for pipeline (#717 ) - Remove cuda11 support for nccl-test pipeline, since nccl build failed for cuda11. - Update to cuda12.9 for CI pipeline. Will consider dropping cuda11 support and adding cuda13 support in the near future	2026-01-07 14:10:49 +08:00
Qinghua Zhou	168a6c7037	Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 (#694 ) Tune the nThreadsPerBlock for message size in 32KB to 256KB range for FP8 and Half datatype on MI300. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-06 08:59:59 +08:00
Changho Hwang	fc221e234d	Remove UB `std::` declarations (#709 ) Remove custom declarations inside `std::`, whose behavior is undefined by the standard	2026-01-05 11:11:46 +08:00
Changho Hwang	2cf14ff723	Minor fixes (#715 )	2026-01-05 11:09:48 +08:00
Changho Hwang	bb555277ad	Rename `P2P` log subsys into `GPU` (#716 )	2026-01-05 11:08:43 +08:00
Binyang Li	ca6a4a3274	Replace `__HIP_PLATFORM_AMD__` to use internal macro (#712 ) Replacing most of checks for `__HIP_PLATFORM_AMD__` with `MSCCLPP_DEVICE_HIP` for device and `MSCCLPP_USE_ROCM` for host source file.	2026-01-04 04:47:58 -08:00
qishilu	b2d96e8ba5	Use uncached memory on Rocm platform to avoid hang (#711 ) MSCCLPP_DEVICE_HIP is undefined because it is defined in device.hpp. Use __HIP_PLATFORM_AMD__ here.	2025-12-24 10:49:36 +08:00
Changho Hwang	7b18a42274	Add copilot-instructions.md (#602 )	2025-12-22 22:15:40 -08:00
Binyang Li	eda74a7f29	Add handle cache for AMD platform (#698 ) Introduce a handle cache for the AMD platform, to avoid hitting the handle limit when too many IPC handles are opened. NVIDIA does not need this feature, since it counts handle references internally and reuses the same handle if it is already open. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-21 18:39:12 -08:00
Caio Rocha	8d998820a3	Improve DSL Documentation (#707 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-19 15:17:08 -08:00
Changho Hwang	9e076da3d4	Make IB more configurable (#703 ) * Added `port` and `gidIndex` fields in the IB endpoint config (and a `deviceIndex` field for future use) * Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so * Added `--ib_gid_index` CLI option to `mp_unit_tests` * Other minor fixes	2025-12-18 13:21:07 -08:00
Caio Rocha	11b7b35832	Creating Documentation Section for MSCCL++ DSL (#706 )	2025-12-15 15:07:01 -08:00
Changho Hwang	da60eb7f46	Add an IB multi-node tutorial (#702 )	2025-12-11 15:15:58 -08:00
Changho Hwang	51a86630ff	Build fixes (#696 ) * Fix CMake build for CUDA 13 * Add a missing header file	2025-11-26 20:02:01 -08:00
Changho Hwang	8b75634d31	Optimized logger (#693 ) * Leverage constant folding * Use `shouldLog()` function for early exit * Per-thread timestamp caching to remove mutex	2025-11-25 08:58:17 -08:00
Changho Hwang	ddf84a6b9d	Add `CudaDeviceGuard` (#691 ) Add an RAII guard that sets a proper GPU device before a CUDA API call. We may make this stateful in the future to minimize `cudaGetDevice()` calls. This PR fixes a bug in tutorial 01.	2025-11-24 13:38:44 -08:00
Caio Rocha	17247cd695	DSL Quick Start (#689 ) Fix #675 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-21 14:45:49 -08:00
Changho Hwang	8b8593ba51	Fix Python bindings and tests (#690 ) Minimal fix to make things work. We need a more careful look at preventing silent fallback of nanobind when it fails to (properly) construct a C++ STL object with mscclpp instances.	2025-11-21 12:53:12 -08:00
Caio Rocha	060c35fec6	No IB Env CI Test (#687 )	2025-11-19 11:13:03 -08:00
Caio Rocha	bbdeafb3ca	Fix Error in Non IB Env at Executor (#686 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-17 16:35:57 -08:00
Qinghua Zhou	b9428341a2	Revise the mscclpp datatype (#671 ) Use mscclpp::DataType to replace the following types in API interface: int dtype; ncclDataType_t dtype; Add data type conversion: Convert ncclDataType_t to mscclpp::DataType	2025-11-17 12:58:47 -08:00
Caio Rocha	a19bca9738	Fix Minor Issue Proxy Python Interface (#685 )	2025-11-17 09:03:00 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00
Caio Rocha	7eb3ff701a	Supporting New Packet Kernel Operation at Executor (#677 ) This PR introduces three new operations to enhance flexibility and performance in the executor. One operation can be invoked directly via the DSL API, and two are created through fusion of existing operations, reducing overhead and improving efficiency. 1. Port Channel Put Packet (Direct DSL API Call): Sends data from pkt format to the remote side in pkt format via the port channel. Both source and destination buffers must be scratch. 2. Reduce Copy Packet (Fusion): Reduce Packet + Copy Packet = Reduce Copy Packet. Triggered when the destination buffer of Reduce Packet matches the source buffer of Copy Packet. Purpose: Combine reduction and copy into a single step for better performance. 3. Reduce Copy Send Packet (Fusion): Reduce Copy Packet + Put Packet = Reduce Copy Send Packet (when dst buffer of Reduce Copy Packet matches src buffer of Put Packet); Reduce Copy Packet + Read Put Packet = Reduce Copy Send Packet (when dst pkt buffer of Reduce Copy Packet matches src buffer of Read Put Packet). Purpose: Combine reduction, copy, and send operations into one optimized pipeline. Fusion Diagram: Reduce Packet + Copy Packet → Reduce Copy Packet; Reduce Copy Packet + Put Packet → Reduce Copy Send Packet; Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet. Beyond this, this PR adjusts the AllReduce 2 Node algorithm. Latency (µs) by message size: 1K: 15.34, 2K: 15.88, 4K: 15.71, 8K: 16.01, 16K: 15.88, 32K: 16.21, 64K: 16.90, 128K: 18.24, 256K: 20.39, 512K: 25.26, 1M: 32.74, 2M: 53.64	2025-11-13 14:08:44 -08:00
Caio Rocha	eb202780f5	Support Synchronous Initialization for Proxy Service (#679 )	2025-11-12 18:35:57 -08:00
Changho Hwang	ffafcaf6d6	IB stack enhancements & bug fixes (#673 ) * Always use `ibv_reg_dmabuf_mr` when DMABUF is supported * Do not check `nvidia-peermem` when unnecessary * More rigorous check on IB port availability * Fixed ibverbs wrappers * Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test	2025-11-07 14:26:17 -08:00
Binyang Li	9eb958183c	upgrade codeql to v3 (#676 )	2025-11-06 16:58:19 -08:00
Changho Hwang	960a8ddebd	Add a new logger (#668 ) * Add `logger.hpp` that will gradually replace `debug.h` * Minor fixes in `ib.cc` --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2025-11-04 10:32:46 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two ways to integrate the MSCCL++ DSL: 1. Integrate with a customized communication group. 2. Integrate with the NCCL API. Introduces new Python APIs to make it work: ```python mscclpp.compile # compile dsl to json based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to ExecutionPlanRegistry mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collective, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Changho Hwang	9994f53cea	Fixes for no-IB systems (#667 ) * Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB on/off * Fix `nvidia-peermem` check; no need for DMABUF supported systems * Fix `mp_unit_tests` to skip all IB tests when built with `-DMSCCLPP_USE_IB=OFF`	2025-10-29 10:03:02 -07:00
Caio Rocha	2b987cf8e8	Resolve IBVerbs Loading Issues (#648 ) Some systems do not include libibverbs.so when installing ibverbs; instead, they only provide libibverbs.so.1. This PR updates the CMake file to search for this library and modifies the wrapper to load it. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-28 18:14:53 -07:00
Qinghua Zhou	a38c2ee784	FP8 support for Allreduce (#646 ) Add FP8 support for Allreduce on both NVIDIA and AMD platforms. Add new data types: fp8_e4m3 and fp8_e5m2 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-10-27 14:51:48 -07:00
Changho Hwang	fc0aaaf1b4	Auto-detect CUDA arch in CMake GPU check (#666 ) Compute capability 60 support is dropped from CUDA 13 Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-10-27 11:25:24 -07:00
Binyang Li	2b05908635	Add token pool for cuCreate API (#628 ) Create a tokenPool to allocate tokens. This feature is used to support inter-node NVL and to reduce the footprint caused by cuCreate --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-27 11:19:21 -07:00
Changho Hwang	09219c1f6a	Fix #651 (#662 ) * Python cannot distinguish `Communicator::connect(const Endpoint&, ...)` from `Communicator::connect(const EndpointConfig&, ...)`. Temporarily removed the former one. * A few other fixes in Python bindings.	2025-10-24 14:25:51 -07:00
Changho Hwang	68b1f151f0	Rename nvls* files (#660 ) Rename nvls* files to switch_channel*	2025-10-24 11:34:26 -07:00
Changho Hwang	a2f1279c60	Test peer accessibility after deployment (#661 ) Test GPUs' peer accessibility before integration testing to distinguish VM issues.	2025-10-24 11:09:36 -07:00
Changho Hwang	4d4f087c11	Add exclude paths under pipeline triggers (#664 )	2025-10-23 18:27:36 -07:00
Caio Rocha	d7b99e9c9d	Improving DSL documentation (#650 )	2025-10-23 17:50:33 -07:00
Changho Hwang	f7d1fb4492	Exclude irrelevant files from workflow triggers (#663 )	2025-10-23 15:52:19 -07:00
Changho Hwang	58996b5c51	Fix docs version (#659 ) Fetch full history of the repo for accurate version info	2025-10-23 11:14:27 -07:00
Changho Hwang	a48421872e	Fix docs (#656 ) * Fix Python doc generation * Remove `ChannelTrigger` and fix `ProxyTrigger` * Fixed package versions for consistency	2025-10-23 00:34:53 +00:00
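
The fusion rules described in commit 7eb3ff701a (#677) can be read as a small peephole pass over an operation list. The sketch below is an illustration only and is not the MSCCL++ executor code: the `Op` class, its field names, the operation-kind strings, and the `fuse` function are all hypothetical, and the model assumes two adjacent operations fuse whenever the destination buffer of the first equals the source buffer of the second.

```python
from dataclasses import dataclass

# Hypothetical model of an executor operation (not an MSCCL++ type).
@dataclass(frozen=True)
class Op:
    kind: str                 # e.g. "reduce_packet"
    src: str                  # source buffer name
    dst: str                  # final destination buffer name
    also_writes: tuple = ()   # intermediate buffers a fused op still writes

# (first kind, second kind) -> fused kind, following the commit's fusion diagram.
FUSION_RULES = {
    ("reduce_packet", "copy_packet"): "reduce_copy_packet",
    ("reduce_copy_packet", "put_packet"): "reduce_copy_send_packet",
    ("reduce_copy_packet", "read_put_packet"): "reduce_copy_send_packet",
}

def fuse(ops):
    """Fuse adjacent ops when a rule exists and first.dst == second.src."""
    fused = []
    for op in ops:
        if fused:
            prev = fused[-1]
            kind = FUSION_RULES.get((prev.kind, op.kind))
            if kind is not None and prev.dst == op.src:
                # Replace the previous op with the fused one; remember the
                # buffer that used to connect the two ops.
                fused[-1] = Op(kind, prev.src, op.dst,
                               prev.also_writes + (prev.dst,))
                continue
        fused.append(op)
    return fused

if __name__ == "__main__":
    plan = [
        Op("reduce_packet", "scratch0", "scratch1"),
        Op("copy_packet", "scratch1", "output"),
        Op("put_packet", "output", "remote"),
    ]
    for op in fuse(plan):
        print(op)
```

Because the pass compares each new operation against the last already-fused one, a Reduce Packet, Copy Packet, Put Packet sequence collapses in two steps into a single Reduce Copy Send Packet, which matches the chain shown in the fusion diagram.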

880 Commits
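
The handle cache introduced in commit eda74a7f29 (#698) is described only at a high level: reuse an already opened IPC handle instead of opening it again, so the per-process handle limit is not reached. A minimal reference-counted sketch of that idea follows. It is an assumption about the general technique, not the MSCCL++ implementation: `HandleCache`, `acquire`, `release`, and the injected `open_fn`/`close_fn` callables are hypothetical names standing in for the platform's real open and close calls.

```python
class HandleCache:
    """Hypothetical reference-counted cache of opened IPC handles."""

    def __init__(self, open_fn, close_fn):
        self._open_fn = open_fn      # stand-in for the platform "open handle" call
        self._close_fn = close_fn    # stand-in for the platform "close handle" call
        self._entries = {}           # handle key -> [mapping, refcount]

    def acquire(self, key):
        """Return the mapping for key, opening it only on first use."""
        entry = self._entries.get(key)
        if entry is None:
            entry = [self._open_fn(key), 0]
            self._entries[key] = entry
        entry[1] += 1
        return entry[0]

    def release(self, key):
        """Drop one reference; close the mapping when no user remains."""
        entry = self._entries[key]
        entry[1] -= 1
        if entry[1] == 0:
            self._close_fn(entry[0])
            del self._entries[key]


if __name__ == "__main__":
    opened, closed = [], []
    cache = HandleCache(lambda k: opened.append(k) or f"map({k})", closed.append)
    a = cache.acquire("h1")
    b = cache.acquire("h1")   # reuses the existing mapping, no second open
    cache.release("h1")
    cache.release("h1")       # last reference: the mapping is closed here
    print(a, b, opened, closed)
```

Injecting the open and close functions keeps the sketch independent of any GPU runtime, so the counting logic can be exercised without hardware.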