Major enhancements to the IB signal forwarding mechanism
(`host-no-atomic` mode): adds support for GDRCopy and MLX5 Direct Verbs
and refactors the signal forwarding path for IB HostNoAtomic mode. The
changes fix memory consistency issues and reduce signaling latency.
- GDRCopy and MLX5 Direct Verbs MR integration
- Signal forwarding path redesign
- Semaphore and connection API updates
- Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates
The reduce-send operation in the DSL essentially combines the reduce and
put operations. The put operation carries the information about the
channel type, whereas previously the channel type was taken from the
reduce operation.
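As a minimal illustration of the fix (the class and function names here are hypothetical stand-ins, not the actual DSL API), the fused reduce-send takes its channel type from the put operand rather than from the reduce operand:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for DSL operations; names are illustrative only.
@dataclass
class Op:
    kind: str          # e.g. "reduce", "put", "reduce_send"
    channel_type: str  # e.g. "memory" or "port"

def fuse_reduce_send(reduce_op: Op, put_op: Op) -> Op:
    # The fused op sends data over the put's channel, so its channel type
    # must come from the put operation (previously taken from reduce).
    return Op(kind="reduce_send", channel_type=put_op.channel_type)

fused = fuse_reduce_send(Op("reduce", "memory"), Op("put", "port"))
assert fused.channel_type == "port"
```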
The v0.9.0 conf.py (introduced in #775) dynamically loads the version
from python/mscclpp/_version.py.
This file is generated at build time by setuptools_scm and is listed in
.gitignore — it is never committed to the repo. Earlier tags (v0.8.0 and
below) used a hardcoded release (e.g., "v0.8.0") in conf.py, so they had
no dependency on generated files.
sphinx-multiversion checks out each tag using git archive, which only
extracts committed files.
Since _version.py is not committed, the v0.9.0 checkout is missing it,
and conf.py crashes on import. All future tags will have this same
problem.
**Three changes:**
1. docs/build_multiversion.py (new): A wrapper around
sphinx-multiversion that monkey-patches copy_tree to generate
_version.py in each tag checkout after extraction. The version string is
parsed from the tag name (e.g., v0.9.0 → __version__ = "0.9.0").
2. Makefile: The multiversion target now calls build_multiversion.py
instead of sphinx-multiversion directly.
3. conf.py: Added a fallback so that if _version.py doesn't exist, it
reads the version from the VERSION file instead. This makes conf.py
resilient for any future scenario where _version.py is missing.
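A sketch of the two mechanisms described above, i.e. the tag-name parsing done by `build_multiversion.py` and the `VERSION`-file fallback in `conf.py`. The helper names are hypothetical, and the `copy_tree` monkey-patching itself is omitted:

```python
import os
import re

def version_from_tag(tag: str) -> str:
    # "v0.9.0" -> "0.9.0"; used to generate _version.py in each tag checkout.
    m = re.fullmatch(r"v?(\d+(?:\.\d+)*)", tag)
    if not m:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return m.group(1)

def write_version_py(checkout_dir: str, tag: str) -> None:
    # Recreate the setuptools_scm-generated file that git archive omits.
    path = os.path.join(checkout_dir, "python", "mscclpp", "_version.py")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(f'__version__ = "{version_from_tag(tag)}"\n')

def load_release(repo_root: str) -> str:
    # conf.py fallback: prefer the generated _version.py, else VERSION.
    version_py = os.path.join(repo_root, "python", "mscclpp", "_version.py")
    if os.path.exists(version_py):
        ns = {}
        with open(version_py) as f:
            exec(f.read(), ns)
        return ns["__version__"]
    with open(os.path.join(repo_root, "VERSION")) as f:
        return f.read().strip()
```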
**Testing**
Verified locally:
- `make multiversion` now successfully builds all 11 versions (v0.4.0
through v0.9.0)
- v0.9.0 docs are correctly generated under `_build/html/v0.9.0/`
- Version selector shows v0.9.0 as latest
## Summary
Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including
optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using
inline assembly.
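For reference, the bit layout can be decoded in plain Python. This is an illustrative sketch only (not the in-tree conversion path, which goes through fp16 intermediates and inline assembly), and it assumes an IEEE-style layout where the all-ones exponent is reserved for inf/NaN; that assumption reproduces the type's documented maximum finite value of 0.9375:

```python
import math

def fp8_e4m3b15_to_float(byte: int) -> float:
    """Decode one FP8 E4M3B15 byte: 1 sign, 4 exponent, 3 mantissa, bias 15."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF:
        # Assumed reserved, IEEE-style: all-ones exponent encodes inf/NaN.
        return sign * math.inf if man == 0 else math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * (man / 8.0) * 2.0 ** (1 - 15)
    # Normal: implicit leading 1, biased exponent.
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 15)

# Max finite value: exp=14 (0b1110), man=7 (0b111) -> (1 + 7/8) * 2^-1 = 0.9375
assert fp8_e4m3b15_to_float(0b0_1110_111) == 0.9375
```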
Extends the allpair packet and fullmesh allreduce kernels to support
higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data,
improving numerical accuracy.
Adds Python tests to verify that higher-precision accumulation is at
least as accurate as native FP8 accumulation across all algorithm
variants.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- **Add `fp8_e4m3b15` datatype**: A software-defined FP8 type with 4
exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375).
Implemented entirely in software with no HW dependency, using
Triton-style bit manipulation through fp16 as intermediate for efficient
conversion.
- **Add mixed-precision accumulation for allreduce**: All allreduce
algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and
others) now support a configurable `accumDtype` parameter, enabling FP8
inputs to be reduced in float16 or float32 for higher accuracy.
- **Propagate `accumDtype` through the full API**: The new parameter is
threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc`
→ dispatch → CUDA kernels, with `DataType::AUTO` as the default
(resolves to input dtype at runtime).
- **Add FP8 accumulation correctness tests**: New `test_fp8_accum.py`
validates that higher-precision accumulation produces results at least
as accurate as native FP8 accumulation across multiple algorithms and
sizes. Skipped on CUDA SM < 89 (pre-Ada); runs on HIP/ROCm.
- **Add `test_fp8_accum.py` to CI**: Azure Pipeline `ut.yml` now runs
FP8 accumulation tests alongside existing pytests.
- **NCCL shim logging cleanup**: Migrated `printf`-style `WARN`/`INFO`
calls to streaming-style logging.
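The `DataType::AUTO` resolution described above can be sketched as follows (a hypothetical Python stand-in for the C++ dispatch logic; the enum members shown are illustrative):

```python
from enum import Enum

class DataType(Enum):
    AUTO = 0
    FLOAT16 = 1
    FLOAT32 = 2
    FP8_E4M3B15 = 3

def resolve_accum_dtype(accum: DataType, input_dtype: DataType) -> DataType:
    # AUTO falls back to the input dtype at runtime; anything else requests
    # reduction in that (typically higher) precision.
    return input_dtype if accum is DataType.AUTO else accum

# Default: reduce FP8 in FP8, exactly as before.
assert resolve_accum_dtype(DataType.AUTO, DataType.FP8_E4M3B15) is DataType.FP8_E4M3B15
# Opt-in: reduce FP8 inputs in float32 for higher accuracy.
assert resolve_accum_dtype(DataType.FLOAT32, DataType.FP8_E4M3B15) is DataType.FLOAT32
```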
## Key files
| Area | Files |
|------|-------|
| New datatype + vector ops | `include/mscclpp/gpu_data_types.hpp` |
| Accumulation reduce helpers | `src/core/include/reduce_kernel.hpp` |
| Algorithm API (`accumDtype`) | `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` |
| Allreduce kernels | `src/ext/collectives/allreduce/*.cu` |
| Dispatch + common | `src/ext/collectives/include/allreduce/common.hpp` |
| Python bindings | `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` |
| Tests | `python/test/test_fp8_accum.py` |
| CI | `.azure-pipelines/templates/ut.yml` |
## Test plan
- [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15
accumulation tests
- [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
- [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP
- [x] Existing `test_mscclpp.py` tests continue to pass
- [x] NCCL shim builds and runs correctly with new `accumDtype` defaults
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Change the executor to create one connection (unique QP) per channel
entry instead of sharing connections per peer. This is required for the
HostNoAtomic IB mode, where each connection can forward signals to only
one semaphore via setSignalForwardingDst.
- Add MSCCLPP_IB_GID_INDEX environment variable to override the default
GID index (3) used for IB transport. Set to the desired GID index value,
or leave unset/-1 to use the default.
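The lookup semantics can be sketched like this (an illustrative helper, not the actual C++ implementation, which reads the variable natively):

```python
import os

def ib_gid_index(default: int = 3) -> int:
    # MSCCLPP_IB_GID_INDEX unset or set to -1 -> use the default GID
    # index (3); any other value overrides it.
    raw = os.environ.get("MSCCLPP_IB_GID_INDEX")
    if raw is None:
        return default
    value = int(raw)
    return default if value == -1 else value
```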
## Summary
- Fix `run-remote.sh` to correctly execute multi-command scripts (e.g.,
multiple `mpirun` calls)
- The old approach piped the decoded script through `base64 -d | bash`,
which feeds the script via bash's **stdin**. When `mpirun` (or its child
processes) runs, it can consume the remaining stdin, so bash never sees
the subsequent commands and only the first command executes.
- The fix decodes the script to a **temp file** and runs `bash -euxo
pipefail "$TMP"` instead, so bash reads commands from the file and
`mpirun` consuming stdin has no effect.
- Applied to both the docker path (pssh + docker exec) and the
non-docker path (pssh only).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
- Replace the two-step `signal()` implementation (`incOutbound()` +
`atomicStore()`) with a single fire-and-forget PTX
`red.release.sys.global.add.u64` instruction
- This eliminates one local atomic fetch-add and replaces a remote store
with a remote atomic add that has no return value — more efficient on
both NVIDIA (PTX `red`) and AMD (compiler optimizes `(void)fetch_add` to
fire-and-forget `flat_atomic_add_x2`)
- Add a C++ perf test (`PERF_TEST`) in `mp_unit` for signal+wait
ping-pong latency
### Performance (H100, 2 ranks, signal+wait round-trip)
```
SemaphorePerfTest.SignalPingPong:
Store-based (old): 2.595 us/iter
Red-based (new): 2.345 us/iter
Speedup: 1.11x
```
## Test plan
- [x] Builds successfully (`make mp_unit_tests`)
- [x] `mpirun -np 2 ./build/bin/mp_unit_tests --filter
"SemaphorePerfTest"` — 1.11x speedup
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### Summary
Update the installer to place bundled default execution plans under
`<MSCCLPP_CACHE_DIR>/default`, which is where the runtime already looks
for bundled plans.
### Background
The C++ runtime treats `MSCCLPP_CACHE_DIR` as the cache *root* and loads
bundled default plans from `<cache root>/default`.
When `MSCCLPP_CACHE_DIR` was set, the installer instead wrote bundled
plans directly into the cache root, causing the runtime to miss them.
This surfaced while running benchmarking tests with a non-default
`MSCCLPP_CACHE_DIR`, where the bundled plans were not being discovered.
### Change
This PR updates the installer to always install bundled default plans
into `<MSCCLPP_CACHE_DIR>/default`, preserving the existing runtime
contract.
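A minimal sketch of the corrected destination logic (a hypothetical helper for illustration; the actual installer code differs):

```python
import os
import shutil

def install_default_plans(plans_dir: str, cache_dir: str) -> str:
    # Bundled default plans must land under <cache root>/default,
    # which is where the C++ runtime looks them up. Writing them
    # directly into cache_dir (the old behavior) hides them.
    dest = os.path.join(cache_dir, "default")
    os.makedirs(dest, exist_ok=True)
    for name in os.listdir(plans_dir):
        shutil.copy(os.path.join(plans_dir, name), dest)
    return dest
```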
### Scope
- Installer-only change
- No runtime behavior changes
### Validation
Manual inspection of the updated install path.
Successful build
---------
Co-authored-by: Ekow Wellington <t-ekoww@microsoft.com>
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- Adds an `--exclude-perf-tests` flag and substring-based negative
filtering
- Adds an `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI
uploads coverage reports to Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancy and improve readability
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>