Commit Graph

910 Commits

Author SHA1 Message Date
Qinghua Zhou
9be576578d Merge multinode branch 2026-03-25 02:51:24 +00:00
Qinghua Zhou
ec011f14ea Add detection of torch.baseline and debug info 2026-03-25 01:52:24 +00:00
Qinghua Zhou
8e22010560 Add tri-modal multi-node support for alltoallv: SingleNode, NVSwitch (GpuBuffer staging), and IB (PortChannel) 2026-03-23 08:54:08 +00:00
Qinghua Zhou
7e1cb7b8cf Support cross-node CudaIPC 2026-03-21 10:41:32 +00:00
Qinghua Zhou
9ef1fb7cee Run and pass the multinode test 2026-03-18 17:08:22 +00:00
Qinghua Zhou
bdb30b56a5 Broadcast UniqueId via TCP; Detect whether torch comparison is possible 2026-03-16 10:01:35 +00:00
Qinghua Zhou
f47e97659d Update the benchmark to improve rank mapping, communicator creation, and backend selection 2026-03-16 09:25:34 +00:00
Qinghua Zhou
b7b180df24 Exchange recv displacement arrays between all ranks via bootstrap allGather 2026-03-05 15:19:20 +00:00
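The bookkeeping behind this commit can be sketched in plain Python (an illustrative sketch, not the actual MSCCL++ code): each rank's receive displacement for a peer is the exclusive prefix sum of its receive counts, and it is these arrays that the ranks exchange via bootstrap allGather.

```python
# Illustrative sketch only: in a real alltoallv, every rank computes where each
# peer's data lands in its receive buffer as the exclusive prefix sum of the
# per-peer receive counts, then shares these arrays with all other ranks.

def recv_displacements(recv_counts):
    """Exclusive prefix sum: the offset where each peer's data starts."""
    displs = []
    offset = 0
    for count in recv_counts:
        displs.append(offset)
        offset += count
    return displs

# A rank receiving 4, 2, and 5 elements from peers 0, 1, and 2:
print(recv_displacements([4, 2, 5]))  # -> [0, 4, 6]
```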
Qinghua Zhou
acfcca7f87 Support hybrid connections for single and multi node 2026-03-04 15:20:15 +00:00
Qinghua Zhou
d5743e2d6c Integrate with MoE training flow 2026-03-03 15:17:20 +00:00
Qinghua Zhou
d00713d3c2 Add more real MoE workloads for alltoallv 2026-03-02 12:51:21 +00:00
Qinghua Zhou
ee843d445f Add test of real MoE workloads 2026-02-25 12:39:48 +00:00
Qinghua Zhou
ae59eab6a2 Add unified benchmarking function to test all_to_all_single of mscclpp and torch 2026-02-24 07:17:17 +00:00
Qinghua Zhou
715ecd91cf Add baseline test of torch.distributed.all_to_all_single 2026-02-24 06:51:10 +00:00
Qinghua Zhou
98be0def08 Use variable sizes in the performance test 2026-02-24 06:29:46 +00:00
Qinghua Zhou
6292b6ab33 Report unidirectional bandwidth 2026-02-24 06:02:33 +00:00
Qinghua Zhou
f803eff8b9 Use multiple thread blocks; Add peer-parallel kernels 2026-02-24 04:05:01 +00:00
Qinghua Zhou
21e3f1ebb3 Get correct remote receive displacements for peers 2026-02-23 14:22:30 +00:00
Qinghua Zhou
7ba83e20dd PyTorch-compatible all_to_all_single API using mscclpp kernels 2026-02-23 09:51:51 +00:00
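The split-size semantics that a PyTorch-compatible `all_to_all_single` must honor can be simulated in plain Python (no torch, no real communication; the function name and data shapes below are illustrative, not the mscclpp kernel API):

```python
# Illustrative simulation: rank r sends the slice of its flat input buffer
# given by input_splits[r][p] to each peer p, in peer order. This mirrors the
# split-size contract of torch.distributed.all_to_all_single.

def all_to_all_single_sim(inputs, input_splits):
    """inputs[r]: flat send buffer of rank r;
    input_splits[r][p]: number of elements rank r sends to peer p."""
    world = len(inputs)
    outputs = [[] for _ in range(world)]
    for r in range(world):
        offset = 0
        for p in range(world):
            n = input_splits[r][p]
            outputs[p].extend(inputs[r][offset:offset + n])
            offset += n
    return outputs

# Two ranks with uneven splits (variable-size alltoallv):
ins = [[1, 2, 3], [4, 5, 6, 7]]
splits = [[1, 2], [3, 1]]  # rank 0 sends 1 elem to rank 0 and 2 to rank 1; etc.
print(all_to_all_single_sim(ins, splits))  # -> [[1, 4, 5, 6], [2, 3, 7]]
```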
Qinghua Zhou
b04df484b6 Add size-adaptive algorithm selection based on message size and world size 2026-02-18 14:24:00 +00:00
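A size-adaptive dispatch like the one this commit describes might look as follows; the variant names and byte thresholds are made up for illustration and are not the values used in the repository.

```python
# Hypothetical sketch of size-adaptive algorithm selection. The kernel names
# and thresholds are illustrative assumptions, not the commit's actual logic.

def select_algorithm(message_bytes, world_size):
    """Pick an alltoallv kernel variant from message size and world size."""
    if world_size <= 8 and message_bytes < 32 * 1024:
        return "peer-parallel"  # small messages on one node: latency-bound
    if message_bytes < 1024 * 1024:
        return "ring"           # medium messages: pipeline-friendly
    return "pipelined"          # large or imbalanced workloads

print(select_algorithm(16 * 1024, 8))        # -> peer-parallel
print(select_algorithm(4 * 1024 * 1024, 16)) # -> pipelined
```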
Qinghua Zhou
97426b3483 Use the same channel for both signal and wait in the ring kernel; Add pipelined kernel for imbalanced workloads 2026-02-18 12:13:30 +00:00
Qinghua Zhou
43980da455 Use maximum threads (1024) for best bandwidth utilization 2026-02-18 03:00:29 +00:00
Qinghua Zhou
b7485762a5 Improve with memory channels for intra-node communication 2026-02-17 13:55:15 +00:00
Qinghua Zhou
c42579e900 Move the alltoallv kernel to the src directory; Utilize the kernel in mscclpp-test 2026-02-06 02:57:34 +00:00
Qinghua Zhou
ac3e770c42 Add alltoallv kernel and test 2026-02-05 07:41:35 +00:00
Qinghua Zhou
f0441ee4ea Update document versioning for PR #724 (#735)
This PR fixes the docs-generation issue introduced when merging
https://github.com/microsoft/mscclpp/pull/724 into the main branch.
Build docs for the main branch separately.
Use a HEAD request instead of GET to check whether a page exists.
Filter out versions before v0.4.0 in generate_versions.py.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-02-01 19:52:01 -08:00
mahdiehghazim
08589bf332 Use native GPU architecture when NVIDIA GPU is detected; otherwise fall back to multi-arch build. (#732)
This change makes MSCCL++ automatically select CUDA architectures based
on the build environment. If an NVIDIA GPU is detected, the build
targets the native GPU architecture for optimal performance; otherwise,
it falls back to building for multiple architectures for portability.
When building for the native architecture, FP8 support is automatically
enabled for “a-series” GPUs (e.g., sm_100a), allowing the appropriate
optimized code paths to be picked up.
2026-01-26 15:53:36 -05:00
Qinghua Zhou
cc797abc87 Revert "Support versioning for mscclpp document (#724)" (#734)
This PR reverts commit 69d3b7 to avoid the GitHub Pages issue.
2026-01-23 16:42:54 -08:00
Qinghua Zhou
69d3b79ecd Support versioning for mscclpp document (#724)
Show all versions of the mscclpp documentation on the webpage
https://microsoft.github.io/mscclpp/
Add sphinx-multiversion to generate documents for different versions.
Add a version selector to the documentation webpage.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-01-23 09:45:41 -08:00
mahdiehghazim
071dc92d38 fp8 nvls support (e5m2 and e4m3) (#730)
This PR adds FP8 support to the nvls code. For compilation, we need to
add this flag to the cmake command:

-DMSCCLPP_GPU_ARCHS=100a
2026-01-23 10:38:38 -05:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize the current native algorithm implementation and the DSL
algorithm implementation.
Provide a unified API for DSL and native algorithms, plus an interface to
tune them.
Provide an interface for PyTorch integration with both the native API and
the DSL.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Binyang Li
78ce9fac8d Fix ci pipeline failure (#729) 2026-01-21 13:28:14 -05:00
Binyang Li
abbdb7f630 Fix ci issue (#727)
Solve the CI failure that occurs when the CUDA version is newer than the driver version
2026-01-15 22:21:02 -08:00
Changho Hwang
105239fc6c Use GpuIpcMem for NVLS connections (#719)
* Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast
memory handling.
* Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API
handles this automatically).
* Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a
shared pointer with a custom deleter for unmapping, which prevents misuse
of raw pointers and reduces the state stored in the `GpuIpcMem`
instance.
* For `RuntimeIpc` type handles, `cudaIpcOpenMemHandle` is now called in
`GpuIpcMem::map()` instead of the `GpuIpcMem` constructor, for consistency
with the other handle types.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-01-15 13:16:04 +08:00
Changho Hwang
c2a87302bd Reduce CI build time (#723)
Specify GPU architecture during CI build to reduce build time
2026-01-15 10:45:40 +08:00
Changho Hwang
a02ba3b1bd Add GpuIpcMemHandle (#704)
Add `GpuIpcMemHandle`, a generic GPU memory handle that covers all
existing methods for GPU memory mapping. This PR fixes failures to
properly fall back to a feasible type of memory handle in the importing
environment. It also consolidates the code for creating and destroying
the various memory handles into a single RAII wrapper.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-01-14 10:49:31 +08:00
Changho Hwang
4dd075602c Bypassing SSCA alerts (#721)
Remove default image tags to bypass SSCA alerts
2026-01-12 23:46:27 +08:00
Changho Hwang
b8a1b0a134 Add CUDA 13.0 Docker images (#720)
* Updated Dockerfiles and the build script to support CUDA 13.0
* Added Python3 venv which is required since Python 3.12
* Updated the default MLNX-OFED version to the LTS version
* Added docker push instruction for multi-arch manifest
2026-01-09 19:03:33 +08:00
Binyang Li
eab2afb8b9 Update container images for pipeline (#717)
- Remove cuda11 support from the nccl-test pipeline, since the nccl build
failed for cuda11.
- Update the CI pipeline to cuda12.9. We will consider dropping cuda11
support and adding cuda13 support in the near future
2026-01-07 14:10:49 +08:00
Qinghua Zhou
168a6c7037 Tune the nThreadsPerBlock for FP8 and Half datatype on MI300 (#694)
Tune the nThreadsPerBlock for message sizes in the 32KB to 256KB range for the FP8 and Half datatypes on MI300.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-01-06 08:59:59 +08:00
Changho Hwang
fc221e234d Remove UB std:: declarations (#709)
Remove custom declarations inside `std::` whose behavior is undefined by
the standard
2026-01-05 11:11:46 +08:00
Changho Hwang
2cf14ff723 Minor fixes (#715) 2026-01-05 11:09:48 +08:00
Changho Hwang
bb555277ad Rename P2P log subsys into GPU (#716) 2026-01-05 11:08:43 +08:00
Binyang Li
ca6a4a3274 Replace __HIP_PLATFORM_AMD__ to use internal macro (#712)
Replace most checks for `__HIP_PLATFORM_AMD__` with
`MSCCLPP_DEVICE_HIP` for device code and `MSCCLPP_USE_ROCM` for host
source files.
2026-01-04 04:47:58 -08:00
qishilu
b2d96e8ba5 Use uncached memory on Rocm platform to avoid hang (#711)
MSCCLPP_DEVICE_HIP is undefined here because it is only defined in device.hpp; use __HIP_PLATFORM_AMD__ instead.
2025-12-24 10:49:36 +08:00
Changho Hwang
7b18a42274 Add copilot-instructions.md (#602) 2025-12-22 22:15:40 -08:00
Binyang Li
eda74a7f29 Add handle cache for AMD platform (#698)
Introduce a handle cache for the AMD platform.
Avoid hitting the handle limit when too many IPC handles are opened.

For NVIDIA, this feature is not needed since the driver counts handle
references internally and reuses the same handle if it is already
open.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-21 18:39:12 -08:00
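The caching idea above can be sketched as follows; `HandleCache` and `open_ipc_handle` are hypothetical stand-ins for illustration, not mscclpp or HIP APIs.

```python
# Illustrative sketch: open each IPC handle at most once and reuse the mapped
# pointer, mirroring the reference counting NVIDIA's driver does internally.
# `open_ipc_handle` stands in for the platform call; it is not a real API.

class HandleCache:
    def __init__(self, open_fn):
        self._open = open_fn   # platform call that maps a handle
        self._cache = {}       # handle key -> mapped pointer

    def get(self, handle):
        if handle not in self._cache:
            self._cache[handle] = self._open(handle)
        return self._cache[handle]

opened = []
def open_ipc_handle(handle):
    opened.append(handle)          # record each real "open" for demonstration
    return f"ptr-{handle}"

cache = HandleCache(open_ipc_handle)
cache.get("h1"); cache.get("h1"); cache.get("h2")
print(len(opened))  # the underlying open runs once per distinct handle
```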
Caio Rocha
8d998820a3 Improve DSL Documentation (#707)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-19 15:17:08 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` field in the IB endpoint config (and
`deviceIndex` field for future usages)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
2025-12-18 13:21:07 -08:00
Caio Rocha
11b7b35832 Creating Documentation Section for MSCCL++ DSL (#706) 2025-12-15 15:07:01 -08:00