1070 Commits

Author SHA1 Message Date
Ekow Wellington
fa52c565e6 updates to expand worldsize 2026-04-27 23:45:08 -05:00
Caio Rocha
719e9124af wip 2026-04-14 23:18:03 +00:00
Caio Rocha
17774b5f83 wip 2026-04-14 22:52:27 +00:00
Caio Rocha
e6602b4a8b wip 2026-04-14 20:47:02 +00:00
Ubuntu
1fd5ed8f18 update the script 2026-04-13 21:20:04 +00:00
binyli
a2a1b89181 for 4 nodes 2026-04-13 20:56:15 +00:00
Ubuntu
36abcbedd3 WIP 2026-04-11 06:40:19 +00:00
Ubuntu
456ef7e5ba fix 2026-04-11 06:33:36 +00:00
Ubuntu
65139d6f6d WIP 2026-04-11 06:12:46 +00:00
Ubuntu
57f7be6260 WIP 2026-04-11 05:28:29 +00:00
Ubuntu
76fdd1db7a WIP 2026-04-11 04:53:49 +00:00
Ubuntu
f83a5571b8 Add sendrecv support with double-buffer to executor_test
- Add TEST_DATA_SEND_RECV verifier kernel that replays fill_data PRNG
  with peer_rank seed to validate received data
- Add double-buffer support for sendrecv in executor_test.py:
  allocate 2 input/result/test buffers, alternate per iteration
- Create two executor funcs for sendrecv, one per buffer pair
- Update bench_correctness and bench_time to handle double-buffer
- Add bandwidth reporting to output
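
The verification idea in the bullets above can be sketched in plain Python. This is a minimal illustration, not the code in executor_test.py: the `fill_data` here is a hypothetical stand-in built on `random.Random`, the seed is assumed to be the rank alone, and the transfer is simulated with a list copy. It shows the receiver regenerating the sender's data from the peer's rank, with two buffer slots alternated per iteration.

```python
import random


def fill_data(seed, count):
    """Hypothetical stand-in for fill_data: deterministic bytes from a seed."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(count)]


def verify_recv(received, peer_rank):
    """Replay the generator with the peer's rank as seed and compare."""
    return received == fill_data(peer_rank, len(received))


def exchange(rank_a, rank_b, count, iterations):
    """Simulate a two-rank sendrecv with double buffering (slot = iteration % 2)."""
    send = {r: [None, None] for r in (rank_a, rank_b)}
    recv = {r: [None, None] for r in (rank_a, rank_b)}
    results = []
    for it in range(iterations):
        slot = it % 2  # alternate between the two buffer pairs
        for r in (rank_a, rank_b):
            send[r][slot] = fill_data(r, count)
        # Simulated transfer: each rank receives a copy of its peer's send buffer.
        recv[rank_a][slot] = list(send[rank_b][slot])
        recv[rank_b][slot] = list(send[rank_a][slot])
        results.append(
            verify_recv(recv[rank_a][slot], rank_b)
            and verify_recv(recv[rank_b][slot], rank_a)
        )
    return results
```

Because the expected data is recomputed from the seed, the receiver needs no reference copy of the sender's buffer.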

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-11 04:47:33 +00:00
Ubuntu
54c2f5098e merge main 2026-04-10 23:19:15 +00:00
Caio Rocha
feda338595 Adjusting Torch Integration Example (#779)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-04-10 13:57:14 -07:00
Ubuntu
68690ecdcd revert dsl 2026-04-10 17:21:50 +00:00
Ubuntu
96defbd8a8 add executor for testing 2026-04-10 15:39:03 +00:00
Ubuntu
6d8fb00a91 add extra signal/wait and avoid local flush 2026-04-09 15:58:07 +00:00
Changho Hwang
d63f9403c0 IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753)
Major enhancements to the IB signal forwarding mechanisms
(`host-no-atomic` mode), primarily adding support for GDRCopy and MLX5
Direct Verbs, and refactoring the signal forwarding path for IB
HostNoAtomic mode. The changes fix memory consistency issues and reduce
signaling latency.
- GDRCopy and MLX5 Direct Verbs MR integration
- Signal forwarding path redesign
- Semaphore and connection API updates
- Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates
2026-04-09 09:24:30 +00:00
Caio Rocha
a7273047e9 Fix TBG on DSL Get Operation (#778) 2026-04-08 17:02:07 -07:00
Caio Rocha
3e5c41c98a Adding Channel Type in ReduceSend Operation on DSL (#777)
The reduce send operation in DSL essentially combines the reduce and put
operations. The put operation carries the channel type information, but
previously we were using the channel type from the reduce operation.
2026-04-08 16:59:08 -07:00
Qinghua Zhou
ed565ceb33 Fix missing directory of document for new tag v0.9.0 (#776)
The v0.9.0 conf.py (introduced in #775) dynamically loads the version
from python/mscclpp/_version.py.

This file is generated at build time by setuptools_scm and is listed in
.gitignore — it is never committed to the repo. Earlier tags (v0.8.0 and
below) used a hardcoded release (e.g., "v0.8.0") in conf.py, so they had
no dependency on generated files.
sphinx-multiversion checks out each tag using git archive, which only
extracts committed files.
Since _version.py is not committed, the v0.9.0 checkout is missing it,
and conf.py crashes on import. All future tags will have this same
problem.

**Three changes:**
1. docs/build_multiversion.py (new): A wrapper around
sphinx-multiversion that monkey-patches copy_tree to generate
_version.py in each tag checkout after extraction. The version string is
parsed from the tag name (e.g., v0.9.0 → __version__ = "0.9.0").
2. Makefile: The multiversion target now calls build_multiversion.py
instead of sphinx-multiversion directly.
3. conf.py: Added a fallback so that if _version.py doesn't exist, it
reads the version from the VERSION file instead. This makes conf.py
resilient for any future scenario where _version.py is missing.
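
A minimal sketch of changes 1 and 3, assuming the VERSION file sits at the repository root and that tags look like `vX.Y.Z`. The function names are hypothetical and this is not the contents of build_multiversion.py or conf.py. It shows parsing the version from a tag name, generating `_version.py`, and falling back to VERSION when the generated file is absent.

```python
import re
import tempfile
from pathlib import Path


def version_from_tag(tag):
    """Parse 'v0.9.0' into '0.9.0'; reject anything that is not vX.Y.Z."""
    match = re.fullmatch(r"v(\d+\.\d+\.\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return match.group(1)


def write_version_file(checkout_dir, tag):
    """Generate python/mscclpp/_version.py inside an extracted tag checkout."""
    target = Path(checkout_dir) / "python" / "mscclpp" / "_version.py"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(f'__version__ = "{version_from_tag(tag)}"\n')
    return target


def load_version(repo_root):
    """Prefer the generated _version.py; otherwise read the VERSION file."""
    version_py = Path(repo_root) / "python" / "mscclpp" / "_version.py"
    if version_py.exists():
        namespace = {}
        exec(version_py.read_text(), namespace)
        return namespace["__version__"]
    # Assumption: VERSION lives at the repository root.
    return (Path(repo_root) / "VERSION").read_text().strip()


# Usage: the fallback path first, then the generated-file path.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "VERSION").write_text("0.9.0\n")
    fallback = load_version(root)
    write_version_file(root, "v0.9.0")
    generated = load_version(root)
```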

**Testing**
Verified locally:
• make multiversion now successfully builds all 11 versions (v0.4.0
through v0.9.0)
• v0.9.0 docs are correctly generated under _build/html/v0.9.0/
• Version selector shows v0.9.0 as latest
v0.9.0
2026-04-08 17:59:05 -04:00
Binyang Li
8896cd909a Add ROCm FP8 E4M3B15 support (#774)
## Summary

Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including
optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using
inline assembly.

Extends the allpair packet and fullmesh allreduce kernels to support
higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data,
improving numerical accuracy.

Adds Python tests to verify that higher-precision accumulation is at
least as accurate as native FP8 accumulation across all algorithm
variants.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 09:53:45 -07:00
Mahdieh Ghazi
e66ce39647 Mahdieh/update version number (#775)
Update the version number for v0.9.0
2026-04-08 12:38:56 -04:00
Binyang Li
96a72bbd3e Support E4M3B15 datatype (#765)
## Summary

- **Add `fp8_e4m3b15` datatype**: A software-defined FP8 type with 4
exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375).
Implemented entirely in software with no HW dependency, using
Triton-style bit manipulation through fp16 as intermediate for efficient
conversion.
- **Add mixed-precision accumulation for allreduce**: All allreduce
algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and
others) now support a configurable `accumDtype` parameter, enabling FP8
inputs to be reduced in float16 or float32 for higher accuracy.
- **Propagate `accumDtype` through the full API**: The new parameter is
threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc`
→ dispatch → CUDA kernels, with `DataType::AUTO` as the default
(resolves to input dtype at runtime).
- **Add FP8 accumulation correctness tests**: New `test_fp8_accum.py`
validates that higher-precision accumulation produces results at least
as accurate as native FP8 accumulation across multiple algorithms and
sizes. Skipped on CUDA SM < 89 (pre-Ada); runs on HIP/ROCm.
- **Add `test_fp8_accum.py` to CI**: Azure Pipeline `ut.yml` now runs
FP8 accumulation tests alongside existing pytests.
- **NCCL shim logging cleanup**: Migrated `printf`-style `WARN`/`INFO`
calls to streaming-style logging.
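
A rough sketch of why a wider accumulator helps, assuming exponent field 15 carries no finite values (which is what gives a maximum of 0.9375) and round-to-nearest quantization. It decodes with plain arithmetic rather than the fp16 bit-manipulation path described above, and all function names are hypothetical.

```python
BIAS = 15


def decode_e4m3b15(byte):
    """Decode 1 sign bit, 4 exponent bits, 3 mantissa bits with bias 15."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (man / 8.0) * 2.0 ** (1 - BIAS)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - BIAS)


# Assumption: exponent field 15 is excluded from the finite values.
FINITE = sorted({decode_e4m3b15(b) for b in range(256) if ((b >> 3) & 0xF) != 0xF})


def quantize(x):
    """Round to the nearest representable finite value."""
    return min(FINITE, key=lambda v: abs(v - x))


def reduce_native(values):
    """Accumulate in FP8: the running sum is re-quantized after every add."""
    acc = 0.0
    for v in values:
        acc = quantize(acc + v)
    return acc


def reduce_wide(values):
    """Accumulate in a wider type (Python float here) and quantize once."""
    return quantize(sum(values))


values = [0.5] + [2.0 ** -6] * 8  # exact sum is 0.625
native = reduce_native(values)
wide = reduce_wide(values)
```

Near 0.5 the representable values are spaced 0.0625 apart, so each small addend is rounded away in the native path, while the wide path keeps all eight of them.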

## Key files

| Area | Files |
|------|-------|
| New datatype + vector ops | `include/mscclpp/gpu_data_types.hpp` |
| Accumulation reduce helpers | `src/core/include/reduce_kernel.hpp` |
| Algorithm API (`accumDtype`) | `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` |
| Allreduce kernels | `src/ext/collectives/allreduce/*.cu` |
| Dispatch + common | `src/ext/collectives/include/allreduce/common.hpp` |
| Python bindings | `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` |
| Tests | `python/test/test_fp8_accum.py` |
| CI | `.azure-pipelines/templates/ut.yml` |

## Test plan

- [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15
accumulation tests
- [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
- [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP
- [x] Existing `test_mscclpp.py` tests continue to pass
- [x] NCCL shim builds and runs correctly with new `accumDtype` defaults

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 13:37:02 -07:00
Binyang Li
fa95e82e18 Fix CI/CD pipeline issues (#773)
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 08:41:51 -07:00
Ubuntu
3f2ade22cb add barrier 2026-04-07 01:40:15 +00:00
Ubuntu
812f6cfded fix hang on 4 ranks and make send/recv test more like nccl-test 2026-04-07 01:33:48 +00:00
Ubuntu
1a065dd6ad add help scripts 2026-04-06 20:06:21 +00:00
Ubuntu
2c3f125d4c add changes from ib and connection 2026-04-06 03:29:54 +00:00
Ubuntu
e487f831e6 debug 2026-04-06 03:01:30 +00:00
Ubuntu
ad56728c6d fix 2026-04-06 02:32:58 +00:00
Ubuntu
8cecfee270 debug 2026-04-06 02:24:23 +00:00
Ubuntu
07d97f6f17 Unique QP per channel and env-controlled GID index
- Change executor to create one connection (unique QP) per channel entry
  instead of sharing connections per peer. This is required for HostNoAtomic
  IB mode where each connection can only forward signals to one semaphore
  via setSignalForwardingDst.

- Add MSCCLPP_IB_GID_INDEX environment variable to override the default
  GID index (3) used for IB transport. Set to the desired GID index value,
  or leave unset/-1 to use the default.
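
The override rule can be sketched as follows. This is an illustrative Python rendering with a hypothetical function name, not the library's actual environment parsing: unset or -1 yields the default index 3, and any other integer is used as given.

```python
import os

DEFAULT_IB_GID_INDEX = 3  # default stated in the commit message


def resolve_gid_index(environ=None):
    """Return the GID index: the env override if set, otherwise the default."""
    env = os.environ if environ is None else environ
    raw = env.get("MSCCLPP_IB_GID_INDEX")
    if raw is None:
        return DEFAULT_IB_GID_INDEX
    value = int(raw)
    # -1 is treated the same as "unset".
    return DEFAULT_IB_GID_INDEX if value == -1 else value
```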
2026-04-06 02:18:56 +00:00
Ubuntu
251873ca8e update 2026-04-06 02:14:52 +00:00
Ubuntu
1e6d4939a8 update 2026-04-06 02:11:36 +00:00
Ubuntu
289f89ddfe update 2026-04-06 02:07:05 +00:00
Ubuntu
a4118eae73 update the number of instances 2026-04-06 02:06:37 +00:00
Ubuntu
b1cc649470 re-format output 2026-04-06 02:05:53 +00:00
Ubuntu
a191f16b76 add scripts 2026-04-06 02:04:49 +00:00
Ubuntu
d07a1ba28c show scale in output 2026-04-06 02:02:10 +00:00
Ubuntu
27fbddb707 update the executor so we have message size range 2026-04-06 02:00:04 +00:00
Ubuntu
49979e58ab tune #instances and remove extra barriers 2026-04-06 01:55:43 +00:00
Ubuntu
194a79f772 add sendrecv correctness check 2026-04-06 01:46:55 +00:00
Ubuntu
a4bb8fb4bf add debugging code 2026-04-06 01:37:18 +00:00
Changho Hwang
b04fa2daa7 lint 2026-04-04 06:22:04 +00:00
Changho Hwang
f62633ad41 mlx5dv bug fixes & enhanced unit tests perf reporting 2026-04-04 06:18:44 +00:00
Changho Hwang
53099a7cf9 Merge branch 'main' into chhwang/fix-ib-no-atomic 2026-04-01 22:45:58 -07:00
Binyang Li
be9126ca1b Fix run-remote.sh to support multi-command scripts (#770)
## Summary
- Fix `run-remote.sh` to correctly execute multi-command scripts (e.g.,
multiple `mpirun` calls)
- The old approach piped decoded script through `base64 -d | bash`,
which feeds the script via bash's **stdin**. When `mpirun` (or its child
processes) runs, it can consume the remaining stdin, causing bash to
never see subsequent commands — only the first command would execute.
- The fix decodes the script to a **temp file** and runs `bash -euxo
pipefail "$TMP"` instead, so bash reads commands from the file and
`mpirun` consuming stdin has no effect.
- Applied to both the docker path (pssh + docker exec) and the
non-docker path (pssh only).
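
The stdin problem can be reproduced without `mpirun`. The sketch below is an assumption-laden stand-in, not `run-remote.sh` itself: `cat > /dev/null` plays the role of a child that drains stdin, and the helper names are hypothetical. It requires `bash` on the PATH.

```python
import os
import subprocess
import tempfile

# `cat > /dev/null` stands in for mpirun: it reads whatever is left on stdin.
SCRIPT = "echo first\ncat > /dev/null\necho second\n"


def run_via_stdin(script):
    """Old approach: bash reads its commands from stdin, which the child shares."""
    proc = subprocess.run(["bash"], input=script, capture_output=True, text=True)
    return proc.stdout


def run_via_tempfile(script):
    """Fixed approach: bash reads commands from a file, so stdin is irrelevant."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
        handle.write(script)
        path = handle.name
    try:
        proc = subprocess.run(
            ["bash", "-euo", "pipefail", path],
            input="unrelated stdin data\n",  # the child may consume this freely
            capture_output=True,
            text=True,
        )
        return proc.stdout
    finally:
        os.unlink(path)
```

With the stdin variant, the child swallows the `echo second` line before bash can read it, so only the first command runs. With the temp-file variant both commands run.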


🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-01 16:25:19 -07:00
Changho Hwang
553fd3b2d8 lint 2026-04-01 21:20:55 +00:00
Changho Hwang
94d0508ec2 prerequisites update 2026-04-01 21:18:47 +00:00