mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-24 14:54:51 +00:00

Author	SHA1	Message	Date
Qinghua Zhou	935cc70534	fix: resolve illegal memory access and kernel correctness issues in alltoallv 1. Fix pinned buffer race condition (alltoallv_single.py): - The shared pinned CPU buffer was reused for 4 sequential non_blocking H2D copies. GPU DMA read stale data after CPU overwrote the buffer with the next field, corrupting sendCounts/recvCounts and causing the kernel to write to wrong addresses. Fixed by using 5 dedicated pinned buffers, one per field (send_counts, send_displs, recv_counts, recv_displs, remote_recv_displs). 2. Remove C++ periodic reset (alltoallv_fullmesh.cu): - A hardcoded static counter reset destroyed MemoryChannels and semaphores every 1000 kernel calls while inter-GPU signaling was still in progress, causing semaphore epoch mismatch and illegal memory access. 3. Fix semaphore wait (alltoallv_kernel.hpp): - Make wait() unconditional after signal(). Skipping wait() when recvCounts==0 desynced the semaphore epoch counter, so subsequent wait() calls returned immediately, before the peer finished writing. 4. Add memory fence (alltoallv_kernel.hpp): - Add __threadfence_system() after wait() outside the primary-block guard so ALL thread blocks execute it before kernel exit. Ensures NVLink remote writes from put() are globally visible to subsequent kernels on the receiving GPU.	2026-04-20 17:18:05 +00:00
Qinghua Zhou	02fdcffde8	Add algo->reset after every 1000 calls of alltoallv	2026-04-09 04:49:46 +00:00
Qinghua Zhou	1d271f4cc7	Merge latest multinode branch	2026-04-08 23:03:12 +00:00
Qinghua Zhou	9be576578d	Merge multinode branch	2026-03-25 02:51:24 +00:00
Qinghua Zhou	ec011f14ea	Add detection of torch.baseline and debug info	2026-03-25 01:52:24 +00:00
Qinghua Zhou	8e22010560	Add tri-modal multi-node support for alltoallv: SingleNode, NVSwitch (GpuBuffer staging), and IB (PortChannel)	2026-03-23 08:54:08 +00:00
Qinghua Zhou	7e1cb7b8cf	Support cross-node CudaIPC	2026-03-21 10:41:32 +00:00
Qinghua Zhou	9ef1fb7cee	Pass the multinode test	2026-03-18 17:08:22 +00:00
Qinghua Zhou	bdb30b56a5	Broadcast UniqueId via TCP; Detect whether torch comparison is possible	2026-03-16 10:01:35 +00:00
Qinghua Zhou	f47e97659d	Update the benchmark to improve the rank mapping, communicator creation, backend selection	2026-03-16 09:25:34 +00:00
Qinghua Zhou	b7b180df24	Exchange recv displacement arrays between all ranks via bootstrap allGather	2026-03-05 15:19:20 +00:00
Qinghua Zhou	acfcca7f87	Support hybrid connections for single and multi node	2026-03-04 15:20:15 +00:00
Qinghua Zhou	d5743e2d6c	Integrate with MoE training flow	2026-03-03 15:17:20 +00:00
Qinghua Zhou	d00713d3c2	Add more real moe workloads for alltoallv	2026-03-02 12:51:21 +00:00
Qinghua Zhou	ee843d445f	Add test of real MoE workloads	2026-02-25 12:39:48 +00:00
Qinghua Zhou	ae59eab6a2	Add unified benchmarking function to test all_to_all_single of mscclpp and torch	2026-02-24 07:17:17 +00:00
Qinghua Zhou	715ecd91cf	Add baseline test of torch.distributed.all_to_all_single	2026-02-24 06:51:10 +00:00
Qinghua Zhou	98be0def08	Use variable sizes in the performance test	2026-02-24 06:29:46 +00:00
Qinghua Zhou	6292b6ab33	Report unidirectional bandwidth	2026-02-24 06:02:33 +00:00
Qinghua Zhou	f803eff8b9	Use multiple thread blocks; Add peer-parallel kernels	2026-02-24 04:05:01 +00:00
Qinghua Zhou	21e3f1ebb3	Get correct remote receive displacements for peers	2026-02-23 14:22:30 +00:00
Qinghua Zhou	7ba83e20dd	PyTorch-compatible all_to_all_single API using mscclpp kernels	2026-02-23 09:51:51 +00:00
Qinghua Zhou	b04df484b6	Add size-adaptive algorithm selection based on message size and world size	2026-02-18 14:24:00 +00:00
Qinghua Zhou	97426b3483	Use same channel for both signal and wait for ring kernel; Add pipelined kernel for imbalanced workloads	2026-02-18 12:13:30 +00:00
Qinghua Zhou	43980da455	Use maximum threads (1024) for best bandwidth utilization	2026-02-18 03:00:29 +00:00
Qinghua Zhou	b7485762a5	Improve with memory channels for intra-node communication	2026-02-17 13:55:15 +00:00
Qinghua Zhou	c42579e900	Move the alltoallv kernel to the src directory; Utilize the kernel in mscclpp-test	2026-02-06 02:57:34 +00:00
Qinghua Zhou	ac3e770c42	Add alltoallv kernel and test	2026-02-05 07:41:35 +00:00
Qinghua Zhou	f0441ee4ea	Update document versioning for PR #724 (#735 ) This PR fixes the issue of generating docs when we take https://github.com/microsoft/mscclpp/pull/724 into main branch. Build docs for main branch separately. Use HEAD request instead of GET to check if a page exists. Filter out versions before v0.4.0 in generate_versions.py. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-01 19:52:01 -08:00
mahdiehghazim	08589bf332	Use native GPU architecture when NVIDIA GPU is detected; otherwise fall back to multi-arch build. (#732 ) This change makes MSCCL++ automatically select CUDA architectures based on the build environment. If an NVIDIA GPU is detected, the build targets the native GPU architecture for optimal performance; otherwise, it falls back to building for multiple architectures for portability. When building for the native architecture, FP8 support is automatically enabled for “a-series” GPUs (e.g., sm_100a), allowing the appropriate optimized code paths to be picked up.	2026-01-26 15:53:36 -05:00
Qinghua Zhou	cc797abc87	Revert "Support versioning for mscclpp document (#724 )" (#734 ) This PR reverts commit 69d3b7 to avoid the github page issue.	2026-01-23 16:42:54 -08:00
Qinghua Zhou	69d3b79ecd	Support versioning for mscclpp document (#724 ) Show all the versions of mscclpp document on the webpage https://microsoft.github.io/mscclpp/ Add sphinx-multiversion to generate documents for different versions. Add version selector on document webpage. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-23 09:45:41 -08:00
mahdiehghazim	071dc92d38	fp8 nvls support (e5m2 and e4m3) (#730 ) This PR adds FP8 support to the nvls code. For compilation, we need to add this flag to the cmake command: -DMSCCLPP_GPU_ARCHS=100a	2026-01-23 10:38:38 -05:00
Binyang Li	a707273701	Torch integration (#692 ) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo. Provide interface for pytorch integration with native API and DSL. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	78ce9fac8d	Fix ci pipeline failure (#729 )	2026-01-21 13:28:14 -05:00
Binyang Li	abbdb7f630	Fix ci issue (#727 ) Solve the CI failure when the CUDA version is newer than the driver version	2026-01-15 22:21:02 -08:00
Changho Hwang	105239fc6c	Use `GpuIpcMem` for NVLS connections (#719 ) * Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast memory handling. * Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API handles this automatically). * Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a shared pointer with custom deleter for unmapping, which prevents misuse of raw pointers and reduces states to be stored in the `GpuIpcMem` instance. * Now for `RuntimeIpc` type handles, for consistency with other types, `cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of the ctor of `GpuIpcMem`. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-15 13:16:04 +08:00
Changho Hwang	c2a87302bd	Reduce CI build time (#723 ) Specify GPU architecture during CI build to reduce build time	2026-01-15 10:45:40 +08:00
Changho Hwang	a02ba3b1bd	Add `GpuIpcMemHandle` (#704 ) Add `GpuIpcMemHandle` that is a generic GPU memory handle that covers all existing methods for GPU memory mapping. This PR fixes issues that fail to properly fallback to a feasible type of memory handle on the importing environment. It also consolidates code for creating or destroying various memory handles into a single RAII wrapper. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-14 10:49:31 +08:00
Changho Hwang	4dd075602c	Bypassing SSCA alerts (#721 ) Remove default image tags to bypass SSCA alerts	2026-01-12 23:46:27 +08:00
Changho Hwang	b8a1b0a134	Add CUDA 13.0 Docker images (#720 ) * Updated Dockerfiles and the build script to support CUDA 13.0 * Added Python3 venv which is required since Python 3.12 * Updated the default MLNX-OFED version to the LTS version * Added docker push instruction for multi-arch manifest	2026-01-09 19:03:33 +08:00
Binyang Li	eab2afb8b9	Update container images for pipeline (#717 ) - Remove cuda11 support for nccl-test pipeline, since nccl build failed for cuda11. - Update to cuda12.9 for CI pipeline. Will consider dropping cuda11 support and adding cuda13 support in the near future	2026-01-07 14:10:49 +08:00
Qinghua Zhou	168a6c7037	Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 (#694 ) Tune the nThreadsPerBlock for message size in 32KB to 256KB range for FP8 and Half datatype on MI300. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-01-06 08:59:59 +08:00
Changho Hwang	fc221e234d	Remove UB `std::` declarations (#709 ) Remove custom declarations inside `std::` whose behavior is undefined by the standard	2026-01-05 11:11:46 +08:00
Changho Hwang	2cf14ff723	Minor fixes (#715 )	2026-01-05 11:09:48 +08:00
Changho Hwang	bb555277ad	Rename `P2P` log subsys into `GPU` (#716 )	2026-01-05 11:08:43 +08:00
Binyang Li	ca6a4a3274	Replace `__HIP_PLATFORM_AMD__` to use internal macro (#712 ) Replacing most of checks for `__HIP_PLATFORM_AMD__` with `MSCCLPP_DEVICE_HIP` for device and `MSCCLPP_USE_ROCM` for host source file.	2026-01-04 04:47:58 -08:00
qishilu	b2d96e8ba5	Use uncached memory on Rocm platform to avoid hang (#711 ) MSCCLPP_DEVICE_HIP is undefined because it is defined in device.hpp. Use __HIP_PLATFORM_AMD__ here.	2025-12-24 10:49:36 +08:00
Changho Hwang	7b18a42274	Add copilot-instructions.md (#602 )	2025-12-22 22:15:40 -08:00
Binyang Li	eda74a7f29	Add handle cache for AMD platform (#698 ) Introduce handle cache for AMD platform. Avoid reaching the handle limit if we open too many IPC handles. For NVIDIA, we don't need this feature since NVIDIA counts handle references internally and reuses the same handle if it is already open. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-21 18:39:12 -08:00


913 Commits