mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-07-13 02:37:20 +00:00

Author	SHA1	Message	Date
Qinghua Zhou	715ecd91cf	Add baseline test of torch.distributed.all_to_all_single	2026-02-24 06:51:10 +00:00
Qinghua Zhou	98be0def08	Use variable sizes in the performance test	2026-02-24 06:29:46 +00:00
Qinghua Zhou	6292b6ab33	Report unidirectional bandwidth	2026-02-24 06:02:33 +00:00
Qinghua Zhou	f803eff8b9	Use multiple thread blocks; Add peer-parallel kernels	2026-02-24 04:05:01 +00:00
Qinghua Zhou	21e3f1ebb3	Get correct remote receive displacements for peers	2026-02-23 14:22:30 +00:00
Qinghua Zhou	7ba83e20dd	PyTorch-compatible all_to_all_single API using mscclpp kernels	2026-02-23 09:51:51 +00:00
Qinghua Zhou	b04df484b6	Add size-adaptive algorithm selection based on message size and world size	2026-02-18 14:24:00 +00:00
Qinghua Zhou	97426b3483	Use the same channel for both signal and wait in the ring kernel; Add pipelined kernel for imbalanced workloads	2026-02-18 12:13:30 +00:00
Qinghua Zhou	43980da455	Use maximum threads (1024) for best bandwidth utilization	2026-02-18 03:00:29 +00:00
Qinghua Zhou	b7485762a5	Improve with memory channels for intra-node communication	2026-02-17 13:55:15 +00:00
Qinghua Zhou	c42579e900	Move the alltoallv kernel to the src directory; Utilize the kernel in mscclpp-test	2026-02-06 02:57:34 +00:00
Qinghua Zhou	ac3e770c42	Add alltoallv kernel and test	2026-02-05 07:41:35 +00:00
Qinghua Zhou	f0441ee4ea	Update document versioning for PR #724 (#735) This PR fixes the issue of generating docs when we take https://github.com/microsoft/mscclpp/pull/724 into the main branch. Build docs for the main branch separately. Use a HEAD request instead of GET to check if a page exists. Filter out versions before v0.4.0 in generate_versions.py. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-01 19:52:01 -08:00
mahdiehghazim	08589bf332	Use native GPU architecture when NVIDIA GPU is detected; otherwise fall back to multi-arch build. (#732 ) This change makes MSCCL++ automatically select CUDA architectures based on the build environment. If an NVIDIA GPU is detected, the build targets the native GPU architecture for optimal performance; otherwise, it falls back to building for multiple architectures for portability. When building for the native architecture, FP8 support is automatically enabled for “a-series” GPUs (e.g., sm_100a), allowing the appropriate optimized code paths to be picked up.	2026-01-26 15:53:36 -05:00
Qinghua Zhou	cc797abc87	Revert "Support versioning for mscclpp document (#724 )" (#734 ) This PR reverts commit 69d3b7 to avoid the github page issue.	2026-01-23 16:42:54 -08:00
Qinghua Zhou	69d3b79ecd	Support versioning for mscclpp document (#724 ) Show all the versions of mscclpp document on the webpage https://microsoft.github.io/mscclpp/ Add sphinx-multiversion to generate documents for different versions. Add version selector on document webpage. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-23 09:45:41 -08:00
mahdiehghazim	071dc92d38	fp8 nvls support (e5m2 and e4m3) (#730 ) This PR adds FP8 support to the nvls code. For compilation, we need to add this flag to the cmake command: -DMSCCLPP_GPU_ARCHS=100a	2026-01-23 10:38:38 -05:00
Binyang Li	a707273701	Torch integration (#692) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo. Provide interface for PyTorch integration with native API and DSL. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	78ce9fac8d	Fix ci pipeline failure (#729 )	2026-01-21 13:28:14 -05:00
Binyang Li	abbdb7f630	Fix ci issue (#727) Solve the CI failure when the CUDA version is newer than the driver version	2026-01-15 22:21:02 -08:00
Changho Hwang	105239fc6c	Use `GpuIpcMem` for NVLS connections (#719 ) * Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast memory handling. * Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API handles this automatically). * Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a shared pointer with custom deleter for unmapping, which prevents misuse of raw pointers and reduces states to be stored in the `GpuIpcMem` instance. * Now for `RuntimeIpc` type handles, for consistency with other types, `cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of the ctor of `GpuIpcMem`. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-15 13:16:04 +08:00
Changho Hwang	c2a87302bd	Reduce CI build time (#723 ) Specify GPU architecture during CI build to reduce build time	2026-01-15 10:45:40 +08:00
Changho Hwang	a02ba3b1bd	Add `GpuIpcMemHandle` (#704 ) Add `GpuIpcMemHandle` that is a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes issues that fail to properly fallback to a feasible type of memory handle on the importing environment. It also consolidates code for creating or destroying various memory handles into a single RAII wrapper. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-14 10:49:31 +08:00
Changho Hwang	4dd075602c	Bypassing SSCA alerts (#721 ) Remove default image tags to bypass SSCA alerts	2026-01-12 23:46:27 +08:00
Changho Hwang	b8a1b0a134	Add CUDA 13.0 Docker images (#720 ) * Updated Dockerfiles and the build script to support CUDA 13.0 * Added Python3 venv which is required since Python 3.12 * Updated the default MLNX-OFED version to the LTS version * Added docker push instruction for multi-arch manifest	2026-01-09 19:03:33 +08:00
Binyang Li	eab2afb8b9	Update container images for pipeline (#717) - Remove cuda11 support for nccl-test pipeline, since nccl build failed for cuda11. - Update to cuda12.9 for CI pipeline. Will consider dropping cuda11 support and adding cuda13 support in the near future	2026-01-07 14:10:49 +08:00
Qinghua Zhou	168a6c7037	Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 (#694 ) Tune the nThreadsPerBlock for message size in 32KB to 256KB range for FP8 and Half datatype on MI300. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-06 08:59:59 +08:00
Changho Hwang	fc221e234d	Remove UB `std::` declarations (#709) Remove custom declarations inside `std::` whose behavior is undefined by the standard	2026-01-05 11:11:46 +08:00
Changho Hwang	2cf14ff723	Minor fixes (#715 )	2026-01-05 11:09:48 +08:00
Changho Hwang	bb555277ad	Rename `P2P` log subsys into `GPU` (#716 )	2026-01-05 11:08:43 +08:00
Binyang Li	ca6a4a3274	Replace `__HIP_PLATFORM_AMD__` to use internal macro (#712 ) Replacing most of checks for `__HIP_PLATFORM_AMD__` with `MSCCLPP_DEVICE_HIP` for device and `MSCCLPP_USE_ROCM` for host source file.	2026-01-04 04:47:58 -08:00
qishilu	b2d96e8ba5	Use uncached memory on Rocm platform to avoid hang (#711 ) MSCCLPP_DEVICE_HIP is undefined because it is defined in device.hpp. Use __HIP_PLATFORM_AMD__ here.	2025-12-24 10:49:36 +08:00
Changho Hwang	7b18a42274	Add copilot-instructions.md (#602 )	2025-12-22 22:15:40 -08:00
Binyang Li	eda74a7f29	Add handle cache for AMD platform (#698) Introduce a handle cache for the AMD platform. This avoids reaching the handle limit if we open too many IPC handles. For NVIDIA, we don't need this feature, since NVIDIA counts handle references internally and reuses the same handle if it has already been opened. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-21 18:39:12 -08:00
Caio Rocha	8d998820a3	Improve DSL Documentation (#707 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-19 15:17:08 -08:00
Changho Hwang	9e076da3d4	Make IB more configurable (#703 ) * Added `port` and `gidIndex` field in the IB endpoint config (and `deviceIndex` field for future usages) * Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so * Added `--ib_gid_index` CLI option to `mp_unit_tests` * Other minor fixes	2025-12-18 13:21:07 -08:00
Caio Rocha	11b7b35832	Creating Documentation Section for MSCCL++ DSL (#706 )	2025-12-15 15:07:01 -08:00
Changho Hwang	da60eb7f46	Add an IB multi-node tutorial (#702 )	2025-12-11 15:15:58 -08:00
Changho Hwang	51a86630ff	Build fixes (#696 ) * Fix CMake build for CUDA 13 * Add a missing header file	2025-11-26 20:02:01 -08:00
Changho Hwang	8b75634d31	Optimized logger (#693 ) * Leverage constant folding * Use `shouldLog()` function for early exit * Per-thread timestamp caching to remove mutex	2025-11-25 08:58:17 -08:00
Changho Hwang	ddf84a6b9d	Add `CudaDeviceGuard` (#691) Add an RAII guard that sets a proper GPU device before a CUDA API call. We may make this stateful in the future to minimize `cudaGetDevice()` calls. This PR fixes a bug in tutorial 01.	2025-11-24 13:38:44 -08:00
Caio Rocha	17247cd695	DSL Quick Start (#689 ) Fix #675 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-21 14:45:49 -08:00
Changho Hwang	8b8593ba51	Fix Python bindings and tests (#690 ) Minimal fix to make things work. We need a more careful look at preventing silent fallback of nanobind when it fails to (properly) construct a C++ STL object with mscclpp instances.	2025-11-21 12:53:12 -08:00
Caio Rocha	060c35fec6	No IB Env CI Test (#687 )	2025-11-19 11:13:03 -08:00
Caio Rocha	bbdeafb3ca	Fix Error in Non IB Env at Executor (#686 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-17 16:35:57 -08:00
Qinghua Zhou	b9428341a2	Revise the mscclpp datatype (#671 ) Use mscclpp::DataType to replace the following types in API interface: int dtype; ncclDataType_t dtype; Add data type conversion: Convert ncclDataType_t to mscclpp::DataType	2025-11-17 12:58:47 -08:00
Caio Rocha	a19bca9738	Fix Minor Issue Proxy Python Interface (#685 )	2025-11-17 09:03:00 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00
Caio Rocha	7eb3ff701a	Supporting New Packet Kernel Operation at Executor (#677) This PR introduces three new operations to enhance flexibility and performance at executor. One operation can be invoked directly via the DSL API and two operations are created through fusion of existing operations, reducing overhead and improving efficiency. 1. Port Channel Put Packet (Direct DSL API Call): Sends data from pkt format to the remote side in pkt format via the port channel. Both source and destination buffers must be scratch. 2. Reduce Copy Packet (Fusion): Reduce Packet+Copy Packet=Reduce Copy Packet Triggered when the destination buffer of Reduce Packet matches the source buffer of Copy Packet. Purpose: Combine reduction and copy into a single step for better performance. 3. Reduce Copy Send Packet (Fusion): Reduce Copy Packet+Put Packet=Reduce Copy Send Packet (when dst buffer of Reduce Copy Packet matches src buffer of Put Packet) Reduce Copy Packet+Read Put Packet=Reduce Copy Send Packet (when dst pkt buffer of Reduce Copy Packet matches src buffer of Read Put Packet) Purpose: Combine reduction, copy, and send operations into one optimized pipeline. Fusion Diagram Reduce Packet + Copy Packet → Reduce Copy Packet Reduce Copy Packet + Put Packet → Reduce Copy Send Packet Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet Beyond this, this PR adjusts the AllReduce 2 Node algorithm: Message Size \| Latency (µs) 1K \| 15.34 2K \| 15.88 4K \| 15.71 8K \| 16.01 16K \| 15.88 32K \| 16.21 64K \| 16.90 128K \| 18.24 256K \| 20.39 512K \| 25.26 1M \| 32.74 2M \| 53.64	2025-11-13 14:08:44 -08:00
Caio Rocha	eb202780f5	Support Synchronous Initialization for Proxy Service (#679 )	2025-11-12 18:35:57 -08:00

897 Commits