mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Ekow Wellington	fa52c565e6	updates to expand worldsize	2026-04-27 23:45:08 -05:00
Caio Rocha	719e9124af	wip	2026-04-14 23:18:03 +00:00
Caio Rocha	17774b5f83	wip	2026-04-14 22:52:27 +00:00
Caio Rocha	e6602b4a8b	wip	2026-04-14 20:47:02 +00:00
Ubuntu	1fd5ed8f18	update the script	2026-04-13 21:20:04 +00:00
binyli	a2a1b89181	for 4 nodes	2026-04-13 20:56:15 +00:00
Ubuntu	36abcbedd3	WIP	2026-04-11 06:40:19 +00:00
Ubuntu	456ef7e5ba	fix	2026-04-11 06:33:36 +00:00
Ubuntu	65139d6f6d	WIP	2026-04-11 06:12:46 +00:00
Ubuntu	57f7be6260	WIP	2026-04-11 05:28:29 +00:00
Ubuntu	76fdd1db7a	WIP	2026-04-11 04:53:49 +00:00
Ubuntu	f83a5571b8	Add sendrecv support with double-buffer to executor_test - Add TEST_DATA_SEND_RECV verifier kernel that replays fill_data PRNG with peer_rank seed to validate received data - Add double-buffer support for sendrecv in executor_test.py: allocate 2 input/result/test buffers, alternate per iteration - Create two executor funcs for sendrecv, one per buffer pair - Update bench_correctness and bench_time to handle double-buffer - Add bandwidth reporting to output Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-11 04:47:33 +00:00
Ubuntu	54c2f5098e	merge main	2026-04-10 23:19:15 +00:00
Caio Rocha	feda338595	Adjusting Torch Integration Example (#779 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-04-10 13:57:14 -07:00
Ubuntu	68690ecdcd	revert dsl	2026-04-10 17:21:50 +00:00
Ubuntu	96defbd8a8	add executor for testing	2026-04-10 15:39:03 +00:00
Ubuntu	6d8fb00a91	add extra signal/wait and avoid local flush	2026-04-09 15:58:07 +00:00
Changho Hwang	d63f9403c0	IB `host-no-atomic`: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753 ) Major enhancements to the IB signal forwarding mechanisms (`host-no-atomic` mode), primarily adding support for GDRCopy and MLX5 Direct Verbs, and refactoring the signal forwarding path for IB HostNoAtomic mode. The changes fix memory consistency issues and reduce signaling latency. - GDRCopy and MLX5 Direct Verbs MR integration - Signal forwarding path redesign - Semaphore and connection API updates - Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates	2026-04-09 09:24:30 +00:00
Caio Rocha	a7273047e9	Fix TBG on DSL Get Operation (#778 )	2026-04-08 17:02:07 -07:00
Caio Rocha	3e5c41c98a	Adding Channel Type in ReduceSend Operation on DSL (#777 ) The reduce send operation in DSL essentially combines the reduce and put operations. The put operation carries the information about the channel type, whereas previously we were using the channel type from the reduce operation.	2026-04-08 16:59:08 -07:00
Qinghua Zhou	ed565ceb33	Fix missing directory of document for new tag v0.9.0 (#776 ) The v0.9.0 conf.py (introduced in #775) dynamically loads the version from python/mscclpp/_version.py. This file is generated at build time by setuptools_scm and is listed in .gitignore — it is never committed to the repo. Earlier tags (v0.8.0 and below) used a hardcoded release (e.g., "v0.8.0") in conf.py, so they had no dependency on generated files. sphinx-multiversion checks out each tag using git archive, which only extracts committed files. Since _version.py is not committed, the v0.9.0 checkout is missing it, and conf.py crashes on import. All future tags will have this same problem. Three changes: 1. docs/build_multiversion.py (new): A wrapper around sphinx-multiversion that monkey-patches copy_tree to generate _version.py in each tag checkout after extraction. The version string is parsed from the tag name (e.g., v0.9.0 → __version__ = "0.9.0"). 2. Makefile: The multiversion target now calls build_multiversion.py instead of sphinx-multiversion directly. 3. conf.py: Added a fallback so that if _version.py doesn't exist, it reads the version from the VERSION file instead. This makes conf.py resilient for any future scenario where _version.py is missing. Testing Verified locally: • make multiversion now successfully builds all 11 versions (v0.4.0 through v0.9.0) • v0.9.0 docs are correctly generated under _build/html/v0.9.0/ Version selector shows v0.9.0 as latest v0.9.0	2026-04-08 17:59:05 -04:00
Binyang Li	8896cd909a	Add ROCm FP8 E4M3B15 support (#774 ) ## Summary Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using inline assembly. Extends the allpair packet and fullmesh allreduce kernels to support higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data, improving numerical accuracy. Adds Python tests to verify that higher-precision accumulation is at least as accurate as native FP8 accumulation across all algorithm variants. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 09:53:45 -07:00
Mahdieh Ghazi	e66ce39647	Mahdieh/update version number (#775 ) Update the version number for v0.9.0	2026-04-08 12:38:56 -04:00
Binyang Li	96a72bbd3e	Support E4M3B15 datatype (#765 ) ## Summary - Add `fp8_e4m3b15` datatype: A software-defined FP8 type with 4 exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375). Implemented entirely in software with no HW dependency, using Triton-style bit manipulation through fp16 as intermediate for efficient conversion. - Add mixed-precision accumulation for allreduce: All allreduce algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and others) now support a configurable `accumDtype` parameter, enabling FP8 inputs to be reduced in float16 or float32 for higher accuracy. - Propagate `accumDtype` through the full API: The new parameter is threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc` → dispatch → CUDA kernels, with `DataType::AUTO` as the default (resolves to input dtype at runtime). - Add FP8 accumulation correctness tests: New `test_fp8_accum.py` validates that higher-precision accumulation produces results at least as accurate as native FP8 accumulation across multiple algorithms and sizes. Skipped on CUDA SM < 89 (pre-Hopper); runs on HIP/ROCm. - Add `test_fp8_accum.py` to CI: Azure Pipeline `ut.yml` now runs FP8 accumulation tests alongside existing pytests. - NCCL shim logging cleanup: Migrated `printf`-style `WARN`/`INFO` calls to streaming-style logging. 
## Key files \| Area \| Files \| \|------\|-------\| \| New datatype + vector ops \| `include/mscclpp/gpu_data_types.hpp` \| \| Accumulation reduce helpers \| `src/core/include/reduce_kernel.hpp` \| \| Algorithm API (`accumDtype`) \| `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` \| \| Allreduce kernels \| `src/ext/collectives/allreduce/*.cu` \| \| Dispatch + common \| `src/ext/collectives/include/allreduce/common.hpp` \| \| Python bindings \| `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` \| \| Tests \| `python/test/test_fp8_accum.py` \| \| CI \| `.azure-pipelines/templates/ut.yml` \| ## Test plan - [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15 accumulation tests - [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped - [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP - [x] Existing `test_mscclpp.py` tests continue to pass - [x] NCCL shim builds and runs correctly with new `accumDtype` defaults 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 13:37:02 -07:00
Binyang Li	fa95e82e18	Fix CI/CD pipeline issues (#773 ) This pull request updates the deployment pipeline to allow custom CMake arguments to be passed to the pip install process on remote VMs. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 08:41:51 -07:00
Ubuntu	3f2ade22cb	add barrier	2026-04-07 01:40:15 +00:00
Ubuntu	812f6cfded	fix hang on 4 ranks and make send/recv test more like nccl-test	2026-04-07 01:33:48 +00:00
Ubuntu	1a065dd6ad	add help scripts	2026-04-06 20:06:21 +00:00
Ubuntu	2c3f125d4c	add changes from ib and connection	2026-04-06 03:29:54 +00:00
Ubuntu	e487f831e6	debug	2026-04-06 03:01:30 +00:00
Ubuntu	ad56728c6d	fix	2026-04-06 02:32:58 +00:00
Ubuntu	8cecfee270	debug	2026-04-06 02:24:23 +00:00
Ubuntu	07d97f6f17	Unique QP per channel and env-controlled GID index - Change executor to create one connection (unique QP) per channel entry instead of sharing connections per peer. This is required for HostNoAtomic IB mode where each connection can only forward signals to one semaphore via setSignalForwardingDst. - Add MSCCLPP_IB_GID_INDEX environment variable to override the default GID index (3) used for IB transport. Set to the desired GID index value, or leave unset/-1 to use the default.	2026-04-06 02:18:56 +00:00
Ubuntu	251873ca8e	update	2026-04-06 02:14:52 +00:00
Ubuntu	1e6d4939a8	update	2026-04-06 02:11:36 +00:00
Ubuntu	289f89ddfe	update	2026-04-06 02:07:05 +00:00
Ubuntu	a4118eae73	update the number of instances	2026-04-06 02:06:37 +00:00
Ubuntu	b1cc649470	re-format output	2026-04-06 02:05:53 +00:00
Ubuntu	a191f16b76	add scripts	2026-04-06 02:04:49 +00:00
Ubuntu	d07a1ba28c	show scale in output	2026-04-06 02:02:10 +00:00
Ubuntu	27fbddb707	update the executor so we have message size range	2026-04-06 02:00:04 +00:00
Ubuntu	49979e58ab	tune #instances and remove extra barriers	2026-04-06 01:55:43 +00:00
Ubuntu	194a79f772	add sendrecv correctness check	2026-04-06 01:46:55 +00:00
Ubuntu	a4bb8fb4bf	add debugging code	2026-04-06 01:37:18 +00:00
Changho Hwang	b04fa2daa7	lint	2026-04-04 06:22:04 +00:00
Changho Hwang	f62633ad41	mlx5dv bug fixes & enhanced unit tests perf reporting	2026-04-04 06:18:44 +00:00
Changho Hwang	53099a7cf9	Merge branch 'main' into chhwang/fix-ib-no-atomic	2026-04-01 22:45:58 -07:00
Binyang Li	be9126ca1b	Fix run-remote.sh to support multi-command scripts (#770 ) ## Summary - Fix `run-remote.sh` to correctly execute multi-command scripts (e.g., multiple `mpirun` calls) - The old approach piped decoded script through `base64 -d \| bash`, which feeds the script via bash's stdin. When `mpirun` (or its child processes) runs, it can consume the remaining stdin, causing bash to never see subsequent commands — only the first command would execute. - The fix decodes the script to a temp file and runs `bash -euxo pipefail "$TMP"` instead, so bash reads commands from the file and `mpirun` consuming stdin has no effect. - Applied to both the docker path (pssh + docker exec) and the non-docker path (pssh only). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-01 16:25:19 -07:00
Changho Hwang	553fd3b2d8	lint	2026-04-01 21:20:55 +00:00
Changho Hwang	94d0508ec2	prerequisites update	2026-04-01 21:18:47 +00:00
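
Commit 96a72bbd3e above describes `fp8_e4m3b15` as a software-defined FP8 type with 4 exponent bits, 3 mantissa bits, and bias 15, with a maximum finite value of 0.9375. A minimal Python sketch of decoding one byte under that layout is given here for orientation. It is not the repository's implementation (the commit says that one converts through fp16 bit manipulation). The function name `decode_e4m3b15`, the sign/exponent/mantissa bit order, the IEEE-style subnormal rule, and the treatment of exponent field 15 as non-finite are all assumptions, chosen because together they reproduce the stated maximum of 0.9375.

```python
import math

MAN_BITS = 3   # mantissa width, from the commit message
BIAS = 15      # exponent bias, from the commit message


def decode_e4m3b15(byte: int) -> float:
    """Decode one byte as sign(1) | exponent(4) | mantissa(3) with bias 15.

    Hypothetical helper: the bit order and special-value handling are assumed.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> MAN_BITS) & 0xF
    man = byte & 0x7
    if exp == 0xF:
        # Assumption: the top exponent field is reserved (non-finite).
        return math.nan
    if exp == 0:
        # Assumption: IEEE-style subnormals, no implicit leading 1.
        return sign * (man / 8) * 2.0 ** (1 - BIAS)
    return sign * (1 + man / 8) * 2.0 ** (exp - BIAS)


# Largest finite value over all 256 encodings under these assumptions.
max_finite = max(
    v for v in (decode_e4m3b15(b) for b in range(256)) if not math.isnan(v)
)
```

Under these assumptions the largest finite encoding is `0x77` (exponent field 14, mantissa 7), which decodes to 1.875 × 2⁻¹ = 0.9375, matching the figure in the commit message.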

1070 Commits
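
Commit ed565ceb33 above says the docs build wrapper generates `_version.py` in each tag checkout, with the version string parsed from the tag name (e.g., v0.9.0 → `__version__ = "0.9.0"`). The parsing step can be sketched as follows. This is an illustration only, not the contents of `docs/build_multiversion.py`; the function names `version_from_tag` and `version_file_content` and the strict `vMAJOR.MINOR.PATCH` tag pattern are assumptions.

```python
import re


def version_from_tag(tag: str) -> str:
    """Return '0.9.0' for a tag named 'v0.9.0' (hypothetical helper)."""
    match = re.fullmatch(r"v(\d+\.\d+\.\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return match.group(1)


def version_file_content(tag: str) -> str:
    """Build the text that a generated _version.py would contain."""
    return f'__version__ = "{version_from_tag(tag)}"\n'


content = version_file_content("v0.9.0")
```

Rejecting tags that do not match the pattern, rather than guessing, keeps a malformed tag from silently producing a wrong version string in the generated docs.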