mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 02:47:23 +00:00

Author	SHA1	Message	Date
Binyang Li	9aab9cacc0	support rocm7.2 (#819 ) This pull request introduces support for ROCm 7.2 across the build system, CI pipelines, Docker images, and documentation, while also improving ROCm FP8 type selection and CUDA IPC memory handle management. It updates dependencies and configurations to ensure compatibility with ROCm 7.2, adds new options for native FP8 variants, and refines some benchmarking and internal memory handling logic. Please note: there is an issue in rocm7.2 (rocm7.2 user lib + rocm6.2 driver) when code executes in this order: allocate memory -> IPC communication -> allocate new memory -> free old memory. --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-06-24 16:09:34 -07:00
Binyang Li	c9f8be64bb	Add collective benchmark and correctness check (#814 ) - Add unit-test for float8_e4m3b15 data type. - Add tuner and benchmark for allreduce/allgather algo to verify correctness and performance.	2026-06-04 09:22:10 -07:00
Binyang Li	a707273701	Torch integration (#692 ) Reorganize the current native and DSL algorithm implementations. Provide a unified API for DSL and native algorithms, plus an interface for tuning them. Provide an interface for PyTorch integration with the native API and the DSL. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	78ce9fac8d	Fix ci pipeline failure (#729 )	2026-01-21 13:28:14 -05:00
Changho Hwang	8b8593ba51	Fix Python bindings and tests (#690 ) Minimal fix to make things work. We need a more careful look at preventing silent fallback of nanobind when it fails to (properly) construct a C++ STL object with mscclpp instances.	2025-11-21 12:53:12 -08:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585 ) Change to use smart pointers for IB structures. Registered memory will own ibMr; ibCtx will not hold the reference. - Use smart pointer for IbQp and IbMr - Update memoryChannel API, keep localRegisteredMemory - Close fd when registered memory is released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Binyang Li	658411ccc4	update pytest and python API to fix ut failure (#598 ) update pytest and python API to fix ut failure	2025-08-05 15:17:33 -07:00
Binyang Li	604c345921	Fix #458 (#568 ) Fix mscclpp_benchmark allreduce test	2025-07-13 18:02:56 -07:00
Changho Hwang	199468bc47	Revise NVLS interface (#458 ) * Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel` * Minor interface improvements	2025-07-12 00:33:03 +00:00
Binyang Li	06f31994dc	Fix performance issue introduced in PR: 499 (#505 ) 1. Use `fence+relaxed` to replace `release` for fifo. `fence+relaxed` is more efficient on A100. 2. Update the deviceSyncer. The previous one could not handle a change in the number of thread blocks correctly. Use three counters to solve this issue: reset the previous counter before syncing on the current counter. 3. Introduce relaxedWait, which can be used with relaxedSignal for cases that do not need a memory-visibility guarantee.	2025-04-22 14:03:37 -07:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	1b8d020650	Fix mscclpp_benchmark (#392 ) Enable 1GB message size for NVLS transport in mscclpp_benchmark	2024-11-25 19:59:51 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda. Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Roshan Dathathri	7ed13ec4b5	Auto-tune vector sizes for NVLS allreduce6 (#338 ) Also fixes bugs in MscclppAllReduce6. Below is the performance when the algorithm is fixed to MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.

Float16:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
|     2.0 KiB |     11.15 |         0.18 |        PASS |          13.82 |              0.15 |             PASS |     1.24 |
|     4.0 KiB |     11.15 |         0.37 |        PASS |          14.74 |              0.28 |             PASS |     1.32 |
|     8.0 KiB |     11.14 |         0.74 |        PASS |          15.17 |              0.54 |             PASS |     1.36 |
|    16.0 KiB |     11.16 |         1.47 |        PASS |          15.77 |              1.04 |             PASS |     1.41 |
|    32.0 KiB |     11.15 |         2.94 |        PASS |          17.50 |              1.87 |             PASS |     1.57 |
|    64.0 KiB |     11.18 |         5.86 |        PASS |          17.64 |              3.71 |             PASS |     1.58 |
|   128.0 KiB |     11.16 |        11.74 |        PASS |          17.83 |              7.35 |             PASS |     1.60 |
|   256.0 KiB |     11.21 |        23.38 |        PASS |          18.00 |             14.57 |             PASS |     1.60 |
|   512.0 KiB |     11.70 |        44.81 |        PASS |          18.42 |             28.46 |             PASS |     1.57 |
|     1.0 MiB |     13.64 |        76.87 |        PASS |          20.23 |             51.83 |             PASS |     1.48 |
|     2.0 MiB |     17.29 |       121.27 |        PASS |          31.60 |             66.36 |             PASS |     1.83 |
|     4.0 MiB |     25.26 |       166.02 |        PASS |          38.74 |            108.26 |             PASS |     1.53 |
|     8.0 MiB |     40.17 |       208.83 |        PASS |          62.86 |            133.45 |             PASS |     1.56 |
|    16.0 MiB |     70.92 |       236.56 |        PASS |         113.36 |            147.99 |             PASS |     1.60 |
|    32.0 MiB |    131.38 |       255.41 |        PASS |         203.21 |            165.13 |             PASS |     1.55 |
|    64.0 MiB |    253.39 |       264.84 |        PASS |         342.12 |            196.15 |             PASS |     1.35 |
|   128.0 MiB |    496.74 |       270.20 |        PASS |         670.62 |            200.14 |             PASS |     1.35 |
|   256.0 MiB |    982.42 |       273.24 |        PASS |        1318.36 |            203.61 |             PASS |     1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+

Float32:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
|     4.0 KiB |     11.04 |         0.37 |        PASS |          14.79 |              0.28 |             PASS |     1.34 |
|     8.0 KiB |     11.15 |         0.73 |        PASS |          15.25 |              0.54 |             PASS |     1.37 |
|    16.0 KiB |     11.12 |         1.47 |        PASS |          15.87 |              1.03 |             PASS |     1.43 |
|    32.0 KiB |     11.13 |         2.95 |        PASS |          17.21 |              1.90 |             PASS |     1.55 |
|    64.0 KiB |     11.11 |         5.90 |        PASS |          17.37 |              3.77 |             PASS |     1.56 |
|   128.0 KiB |     11.08 |        11.83 |        PASS |          17.54 |              7.47 |             PASS |     1.58 |
|   256.0 KiB |     11.15 |        23.50 |        PASS |          17.71 |             14.80 |             PASS |     1.59 |
|   512.0 KiB |     11.56 |        45.34 |        PASS |          18.21 |             28.79 |             PASS |     1.57 |
|     1.0 MiB |     13.64 |        76.90 |        PASS |          19.87 |             52.77 |             PASS |     1.46 |
|     2.0 MiB |     17.24 |       121.67 |        PASS |          31.63 |             66.30 |             PASS |     1.84 |
|     4.0 MiB |     25.19 |       166.47 |        PASS |          38.63 |            108.57 |             PASS |     1.53 |
|     8.0 MiB |     40.38 |       207.72 |        PASS |          62.65 |            133.89 |             PASS |     1.55 |
|    16.0 MiB |     70.72 |       237.23 |        PASS |         114.57 |            146.44 |             PASS |     1.62 |
|    32.0 MiB |    131.49 |       255.18 |        PASS |         200.79 |            167.11 |             PASS |     1.53 |
|    64.0 MiB |    253.98 |       264.23 |        PASS |         342.58 |            195.89 |             PASS |     1.35 |
|   128.0 MiB |    496.96 |       270.08 |        PASS |         670.64 |            200.13 |             PASS |     1.35 |
|   256.0 MiB |    982.83 |       273.12 |        PASS |        1318.90 |            203.53 |             PASS |     1.34 |
|   512.0 MiB |   1954.07 |       274.75 |        PASS |        2609.04 |            205.77 |             PASS |     1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+	2024-08-16 11:11:54 +08:00
Angelica Moreira	0f796bbdf7	Update allreduce_bench.py (#318 ) Replacing hardcoded network interface name for generic discovery strategy. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-06-29 03:41:13 +00:00
Roshan Dathathri	91550dab4c	Simplify/improve barrier in AllReduce6 (#317 ) Drop superfluous __threadfence_system()	2024-06-23 21:08:59 +00:00
Roshan Dathathri	93ed8e1e58	Add support for multicast reduce instruction (#316 )	2024-06-19 13:28:12 -07:00
Roshan Dathathri	41e0964d93	Allow binding allocated memory to NVLS multicast pointer (#290 ) And change NVLS multimem instructions to static functions	2024-04-18 17:11:31 -07:00
Binyang Li	5971508eed	Remove cuda-python from project (#245 ) Remove cuda-python and use CuPy APIs instead --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-13 21:44:11 +08:00
Binyang Li	7c229fbdd8	Fix multi-nodes test failure (#262 ) fix multi-nodes CI pipeline Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-07 18:21:05 -08:00
Saeed Maleki	91d592dcc0	NVLS support. (#250 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Changho Hwang	544ff0c21d	ROCm support (#213 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223 ) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	15f6dcca49	Update documentation (#217 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 12:58:04 -08:00
Changho Hwang	7bd66a938c	Robust correctness test (#221 ) Co-authored-by: Aashaka Shah <aashaka96@gmail.com>	2023-11-22 12:06:50 +08:00

27 Commits