mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Caio Rocha	feda338595	Adjusting Torch Integration Example (#779 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-04-10 13:57:14 -07:00
Caio Rocha	a7273047e9	Fix TBG on DSL Get Operation (#778 )	2026-04-08 17:02:07 -07:00
Caio Rocha	3e5c41c98a	Adding Channel Type in ReduceSend Operation on DSL (#777 ) The reduce send operation in DSL essentially combines the reduce and put operations. The put operation carry the information about the channel type, whereas previously, we were using the channel type from the reduce operation.	2026-04-08 16:59:08 -07:00
Binyang Li	8896cd909a	Add ROCm FP8 E4M3B15 support (#774 ) ## Summary Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using inline assembly. Extends the allpair packet and fullmesh allreduce kernels to support higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data, improving numerical accuracy. Adds Python tests to verify that higher-precision accumulation is at least as accurate as native FP8 accumulation across all algorithm variants. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 09:53:45 -07:00
Binyang Li	96a72bbd3e	Support E4M3B15 datatype (#765 ) ## Summary - Add `fp8_e4m3b15` datatype: A software-defined FP8 type with 4 exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375). Implemented entirely in software with no HW dependency, using Triton-style bit manipulation through fp16 as intermediate for efficient conversion. - Add mixed-precision accumulation for allreduce: All allreduce algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and others) now support a configurable `accumDtype` parameter, enabling FP8 inputs to be reduced in float16 or float32 for higher accuracy. - Propagate `accumDtype` through the full API: The new parameter is threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc` → dispatch → CUDA kernels, with `DataType::AUTO` as the default (resolves to input dtype at runtime). - Add FP8 accumulation correctness tests: New `test_fp8_accum.py` validates that higher-precision accumulation produces results at least as accurate as native FP8 accumulation across multiple algorithms and sizes. Skipped on CUDA SM < 89 (pre-Hopper); runs on HIP/ROCm. - Add `test_fp8_accum.py` to CI: Azure Pipeline `ut.yml` now runs FP8 accumulation tests alongside existing pytests. - NCCL shim logging cleanup: Migrated `printf`-style `WARN`/`INFO` calls to streaming-style logging. ## Key files \| Area \| Files \| \|------\|-------\| \| New datatype + vector ops \| `include/mscclpp/gpu_data_types.hpp` \| \| Accumulation reduce helpers \| `src/core/include/reduce_kernel.hpp` \| \| Algorithm API (`accumDtype`) \| `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` \| \| Allreduce kernels \| `src/ext/collectives/allreduce/*.cu` \| \| Dispatch + common \| `src/ext/collectives/include/allreduce/common.hpp` \| \| Python bindings \| `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` \| \| Tests \| `python/test/test_fp8_accum.py` \| \| CI \| `.azure-pipelines/templates/ut.yml` \| ## Test plan - [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15 accumulation tests - [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped - [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP - [x] Existing `test_mscclpp.py` tests continue to pass - [x] NCCL shim builds and runs correctly with new `accumDtype` defaults 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 13:37:02 -07:00
Binyang Li	4f3638b60d	Use PTX red for D2D semaphore signal (#768 ) ## Summary - Replace the two-step `signal()` implementation (`incOutbound()` + `atomicStore()`) with a single fire-and-forget PTX `red.release.sys.global.add.u64` instruction - This eliminates one local atomic fetch-add and replaces a remote store with a remote atomic add that has no return value — more efficient on both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to fire-and-forget `flat_atomic_add_x2`) - Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait ping-pong latency ### Performance (H100, 2 ranks, signal+wait round-trip) ``` SemaphorePerfTest.SignalPingPong: Store-based (old): 2.595 us/iter Red-based (new): 2.345 us/iter Speedup: 1.11x ``` ## Test plan - [x] Builds successfully (`make mp_unit_tests`) - [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter "SemaphorePerfTest"` — 1.11x speedup 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-31 15:34:43 -07:00
Ekow Wellington	fd76507e9a	Install default plans under MSCCLPP_CACHE_DIR/default (#769 ) ### Summary Update the installer to place bundled default execution plans under `<MSCCLPP_CACHE_DIR>/default`, which is where the runtime already looks for bundled plans. ### Background The C++ runtime treats `MSCCLPP_CACHE_DIR` as the cache root and loads bundled default plans from `<cache root>/default`. When `MSCCLPP_CACHE_DIR` was set, the installer instead wrote bundled plans directly into the cache root, causing the runtime to miss them. This surfaced while running benchmarking tests with a non-default `MSCCLPP_CACHE_DIR`, where the bundled plans were not being discovered. ### Change This PR updates the installer to always install bundled default plans into `<MSCCLPP_CACHE_DIR>/default`, preserving the existing runtime contract. ### Scope - Installer-only change - No runtime behavior changes ### Validation Manual inspection of the updated install path. Successful build --------- Co-authored-by: Ekow Wellington <t-ekoww@microsoft.com>	2026-03-31 14:27:33 -05:00
Caio Rocha	4bc1999001	Adding Support to Setting Message Size Range in Native Algorithm API (#758 )	2026-02-27 17:50:43 -08:00
Binyang Li	184dcbf9d7	Add CI pipeline for no-IB environment testing (#755 ) ## Summary Add CI pipeline support for testing in environments without InfiniBand (IB) hardware. ## Changes ### IB stubs for no-IB builds (`src/core/ib.cc`) - Added stub implementations for `IbMr` and `IbQp` classes in the `#else // !defined(USE_IBVERBS)` block so the library links successfully when built with `-DMSCCLPP_USE_IB=OFF`. ### Environment variable to disable IB tests (`MSCCLPP_DISABLE_IB_TESTS`) - Added `disableIbTests` field to the `Env` class (`include/mscclpp/env.hpp`, `src/core/env.cpp`), reading from `MSCCLPP_DISABLE_IB_TESTS` env var. - Exposed as `disable_ib_tests` in Python bindings (`python/csrc/env_py.cpp`). - Updated `python/test/test_mscclpp.py` to skip IB-dependent tests (`create_group_and_connection` with IB transport, `test_h2h_semaphores`, `test_h2h_semaphores_gil_release`) when `env().disable_ib_tests` is true. ### CI pipeline (`ut-no-ib-env.yaml`, `ut.yml`) The no-IB environment pipeline runs two phases: 1. No-IB build phase: Build with `-DMSCCLPP_USE_IB=OFF`, deploy, run unit tests, multi-process unit tests, and pytests (with `MSCCLPP_DISABLE_IB_TESTS=1`). 2. IB build phase: Rebuild with IB enabled (default), stop the existing container, redeploy, and run pytests (with `MSCCLPP_DISABLE_IB_TESTS=1`) — verifying that the full IB-enabled build works correctly in a non-IB environment when IB tests are skipped. Also increased the job timeout from 40 to 60 minutes to accommodate the two-phase pipeline.	2026-02-24 15:55:59 -08:00
Caio Rocha	7738603d63	Adjusting Communicator in Python API (#752 )	2026-02-23 16:33:52 -08:00
Caio Rocha	b5256032fe	Disabling Nanobind Memory Leak Warnings in Release Builds (#745 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-23 11:55:17 -08:00
Caio Rocha	e2acf7f1c8	Removing MPI Dependency (#743 )	2026-02-20 16:04:12 -08:00
Binyang Li	39865c218b	address flagBuffer ownership issue (#749 ) This pull request updates the handling of the default flag buffer in the C++ and Python bindings to ensure proper memory management when interfacing with Python. Make sure the buffer will not be deallocated when transfer ownership from cpp to python	2026-02-20 13:42:29 -08:00
Binyang Li	4701ae3a95	Update dtype name (#748 ) - Change FP8_E4M3/FP8_E5M2 to FLOAT8_E4M3/FLOAT8_E5M2 - Add torch.uint8 to DataType.uint8 mapping	2026-02-18 10:35:44 -08:00
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741 ) This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables. 1. Algorithm Selection Refactoring Use separate class for algo selection. Could introduce more complex logic for algo selection based on message size, arch, if cuda graph is enabled and memory allocation method 2. Symmetric Memory Support Introduced symmetricMemory parameter in algorithm context key generation. Remove disableChannelCache env as is ambiguous 3. Add new args for build_default_algorithms Add flag_buffer, and flag_buffer_size args to build default algorithm. Then we could use unified flag buffer for different algorithms, avoid application hanging when switch algo for different message size. --------- Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-12 19:06:18 -08:00
Caio Rocha	dff3bc7bbb	Support Fusion for ReadPutPacket Operation at DSL (#742 ) Support is being added for fusing the ReadPutPacket operation on DSL, which reduces the overhead caused by reading packet data multiple times in the scratch buffer. Fusion will occur when two rppkt operations are executed consecutively with the same src_buffer: rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1] Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-12 17:27:20 -08:00
Changho Hwang	42be3660e0	Add a new IB stack impl that doesn't use RDMA atomics (#728 ) * Added configurable InfiniBand (IB) signaling mode. `EndpointConfig::Ib::Mode` enum selects the mode (`Default`, `Host`, `HostNoAtomic`). `Default` is equivalent to `Host` unless specified different by envrionment `MSCCLPP_IBV_MODE`. `Host` corresponds to the previous implementation using RDMA atomics for signaling, while `HostNoAtomic` uses write-with-immediate instead. * Regarding updates in Python bindings and API.	2026-02-10 01:07:53 +00:00
Binyang Li	c12822a7af	create CI pipeline for rocm (#718 ) Create CI pipeline for AMD GPU.	2026-02-09 16:55:16 -08:00
Qinghua Zhou	620378b4fb	Fix cpplint error in main branch (#740 ) Fix the legacy cpplint error in main branch. --------- Co-authored-by: Qinghua Zhou <qinghuahzhou@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-05 09:25:12 -08:00
Binyang Li	e21513791a	Address comments for PR #692 (#733 ) Rename nanobind-exposed C++ types to Cpp* Replace MSCCLPP_EXECUTION_PLAN_DIR / MSCCLPP_NATIVE_CACHE_DIR with MSCCLPP_CACHE_DIR across C++ and Python.	2026-02-03 10:13:20 -08:00
Binyang Li	a707273701	Torch integration (#692 ) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo Provide interface for pytorch integration with native API and DSL --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	78ce9fac8d	Fix ci pipeline failure (#729 )	2026-01-21 13:28:14 -05:00
Changho Hwang	105239fc6c	Use `GpuIpcMem` for NVLS connections (#719 ) * Now `NvlsConnection` internally reuses `GpuIpcMem` for multicast memory handling. * Removed unnecessary barriers from `connectNvlsCollective()` (CUDA API handles this automatically). * Updated `GpuIpcMem::map()` and `GpuIpcMem::mapMulticast()` to return a shared pointer with custom deleter for unmapping, which prevents misuse of raw pointers and reduces states to be stored in the `GpuIpcMem` instance. * Now for `RuntimeIpc` type handles, for consistency with other types, `cudaIpcOpenMemHandle` will be called in `GpuIpcMem::map()` instead of the ctor of `GpuIpcMem`. --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-01-15 13:16:04 +08:00
Changho Hwang	b8a1b0a134	Add CUDA 13.0 Docker images (#720 ) * Updated Dockerfiles and the build script to support CUDA 13.0 * Added Python3 venv which is required since Python 3.12 * Updated the default MLNX-OFED version to the LTS version * Added docker push instruction for multi-arch manifest	2026-01-09 19:03:33 +08:00
Changho Hwang	fc221e234d	Remove UB `std::` declarations (#709 ) Remove custom delcarations inside `std::` of which behaviors are undefined by the standard	2026-01-05 11:11:46 +08:00
Binyang Li	ca6a4a3274	Replace `__HIP_PLATFORM_AMD__` to use internal macro (#712 ) Replacing most of checks for `__HIP_PLATFORM_AMD__` with `MSCCLPP_DEVICE_HIP` for device and `MSCCLPP_USE_ROCM` for host source file.	2026-01-04 04:47:58 -08:00
Binyang Li	eda74a7f29	Add handle cache for AMD platform (#698 ) Introduce handle cache for AMD platform. Avoid reaching handle limitation if we open too much IPC handles For nvidia, we don't need this feature since nvidia will count the handle reference internally and reuse the same handle if already be opened --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-12-21 18:39:12 -08:00
Changho Hwang	9e076da3d4	Make IB more configurable (#703 ) * Added `port` and `gidIndex` field in the IB endpoint config (and `deviceIndex` field for future usages) * Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so * Added `--ib_gid_index` CLI option to `mp_unit_tests` * Other minor fixes	2025-12-18 13:21:07 -08:00
Changho Hwang	8b8593ba51	Fix Python bindings and tests (#690 ) Minimal fix to make things work. We need a more careful look at preventing silent fallback of nanobind when it fails to (properly) construct a C++ STL object with mscclpp instances.	2025-11-21 12:53:12 -08:00
Caio Rocha	a19bca9738	Fix Minor Issue Proxy Python Interface (#685 )	2025-11-17 09:03:00 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00
Caio Rocha	7eb3ff701a	Supporting New Packet Kernel Operation at Executor (#677 ) This PR introduces three new operations to enhance flexibility and performance at executor. One operation can be invoked directly via the DSL API and two operations are created through fusion of existing operations, reducing overhead and improving efficiency. 1. Port Channel Put Packet (Direct DSL API Call): Sends data from pkt format to the remote side in pkt format via the port channel. Both source and destination buffers must be scratch. 2. Reduce Copy Packet (Fusion): Reduce Packet+Copy Packet=Reduce Copy Packet Triggered when the destination buffer of Reduce Packet matches the source buffer of Copy Packet. Purpose: Combine reduction and copy into a single step for better performance. 3. Reduce Copy Send Packet (Fusion): Reduce Copy Packet+Put Packet=Reduce Copy Send Packet (when dst buffer of Reduce Copy Packet matches src buffer of Put Packet) Reduce Copy Packet+Read Put Packet=Reduce Copy Send Packet (when dst pkt buffer of Reduce Copy Packet matches src buffer of Read Put Packet) Purpose: Combine reduction, copy, and send operations into one optimized pipeline. Fusion Diagram Reduce Packet + Copy Packet → Reduce Copy Packet Reduce Copy Packet + Put Packet → Reduce Copy Send Packet Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet Beyond this, this PR adjust the AllReduce 2 Node algorithm: Message Size \| Latency (µs) 1K \| 15.34 2K \| 15.88 4K \| 15.71 8K \| 16.01 16K \| 15.88 32K \| 16.21 64K \| 16.90 128K \| 18.24 256K \| 20.39 512K \| 25.26 1M \| 32.74 2M \| 53.64	2025-11-13 14:08:44 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two integration ways for MSCCL++ DSL. 1. Integrate with customized communication group 2. Integrate with NCCL API Introduce new Python APIs to make it work: ```python mscclpp.compile # compile dsl to json based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Changho Hwang	09219c1f6a	Fix #651 (#662 ) * Python cannot distinguish `Communicator::connect(const Endpoint&, ...)` from `Communicator::connect(const EndpointConfig&, ...)`. Temporarily removed the former one. * A few other fixes in Python bindings.	2025-10-24 14:25:51 -07:00
Changho Hwang	68b1f151f0	Rename nvls* files (#660 ) Rename nvls* files to switch_channel*	2025-10-24 11:34:26 -07:00
Changho Hwang	200cdf946e	Update `EndpointConfig` interfaces (#651 ) * Separate IB-specific options into a nested struct * Enable `connect()` by an `Endpoint`, not only by `EndpointConfig` * Other minor changes	2025-10-22 10:39:39 -07:00
Binyang Li	b1a88d755e	Pipeline fix (#645 ) Co-authored-by: github-actions <github-actions@github.com>	2025-10-10 11:26:33 -07:00
Binyang Li	ddca185add	Address corner case when generating version file (#641 ) Address corner case for version file generation --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com>	2025-10-07 14:32:33 -07:00
Binyang Li	3d94383696	Add MSCCLPP_GIT_COMMIT micro (#640 ) - Add MSCCLPP_GIT_COMMIT micro - Update docs	2025-10-06 15:57:28 -07:00
Caio Rocha	b76f3ebf39	Add 2 Node AllReduce DSL Algorithm (#636 ) This PR creates two allreduce algorithms designed for a 2-node environment. These algorithms are in-place and non-zero copy.	2025-10-01 17:00:17 -07:00
Qinghua Zhou	16a96ea77b	Support detailed version tracking that captures git repository information (#639 ) #### Version Format The package version includes the git commit hash directly in the version string for development builds: - Release version: `0.7.0` - Development version: `0.7.0.dev36+g6e2360d69` (includes short commit hash) - Development with uncommitted changes: `0.7.0.dev36+g6e2360d69.dirty` #### Checking Version Information After installation, you can check the version information in several ways: From Python: ```python import mscclpp # Access individual attributes print(f"Version: {mscclpp.__version__}") # Full version with commit Version: 0.7.0.dev36+g6e2360d69 # Get as dictionary mscclpp.version() {'version': '0.7.0.dev46+gb0d27c58f', 'base_version': '0.7.0', 'git_commit': 'b0d27c58f'} ``` #### Version Information Details The version tracking captures: - Package Version (`mscclpp.__version__`): Full version string including git commit (e.g., `0.7.0.dev36+g6e2360d69`) This information is embedded during the package build process and remains accessible even after distribution, making it easier to debug issues and ensure reproducibility. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-09-30 09:00:33 -07:00
Caio Rocha	c3473b1794	Thread Block Group DSL (#621 ) Supporting the creation of a group of thread block to perform some operation.	2025-09-03 14:58:40 -07:00
Caio Rocha	4d9bb9f015	Adding Channel Id Field DSL Port Channel Operations (#615 )	2025-08-15 16:10:52 -07:00
Caio Rocha	9261b1d278	AlltoAll Test Support (#606 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-15 16:00:41 -07:00
Binyang Li	699cc45eed	Fix ut (#613 ) Fix pytest	2025-08-14 17:15:28 -07:00
Changho Hwang	2eadbaf86f	python doc auto generation (#605 ) Add Python API references	2025-08-11 10:34:29 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains following changes: Python side: - Channel based DSL implementation: decouple channel with chunk. - Users create channel explicitly, only need local_rank, remote_rank and channel_type - Adjust executor json file, add remote_buffer fields, different op can use different channel and remote buffers combination. - Reimplement operation fusion, data dependency check mechanism - Add new op such as semaphore, pipeline - Clean code and enhance document C++ side: - Support new execution file json format - Support semaphore and pipeline operation - code clean, support non-zero copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Changho Hwang	542a10f69e	Merge ChannelTrigger with ProxyTrigger (#601 )	2025-08-08 19:07:50 +00:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585 ) Change to use smart pointer for IB structure. Registered memory will own ibMr, ibCtx will not held the reference - Use smart pointer for IbQp and IbMr - Update memoryChannel API, keep localRegisteredMemory - Close fd when registedMemory released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Binyang Li	658411ccc4	update pytest and python API to fix ut failure (#598 ) update pytest and python API to fix ut failure	2025-08-05 15:17:33 -07:00

1 2 3 4

180 Commits