Commit Graph

1029 Commits

Author SHA1 Message Date
empyreus
812d43d406 return failed result for new test 2026-05-04 20:40:12 +00:00
empyreus
f6637cc458 attempt to print nvidia-smi for cuda drivers
Co-authored-by: Copilot <copilot@github.com>
2026-05-04 20:11:21 +00:00
empyreus
21197f7c0a change directory 2026-05-04 20:02:34 +00:00
empyreus
dfdc9f701e update pool 2026-05-04 18:29:54 +00:00
empyreus
eaa611f220 split multi node test 2026-05-04 18:10:42 +00:00
empyreus
de244e528b update sglang bench 2026-05-04 18:04:30 +00:00
empyreus
97a4b1aa69 remove duplicate stop 2026-05-04 17:23:01 +00:00
empyreus
cb430b35d4 clean up deploy 2026-05-04 17:21:56 +00:00
empyreus
a8b959946a Initial new test 2026-05-04 17:18:02 +00:00
empyreus
e091f65143 Merge branch 'main' into rjsouza/sglang-tests 2026-05-04 17:06:18 +00:00
Changho Hwang
c97be492d5 GDRCopy status message to string (#793) 2026-04-27 10:32:20 -07:00
Copilot
e874bf1666 fix: isCuMemMapAllocated crashes on non-NVLS systems even with MSCCLPP_FORCE_DISABLE_NVLS=true (#790)
- [x] Fix `isCuMemMapAllocated()` to just return `true/false` without
throwing when NVLS is not supported
- [x] Fix `isNvlsSupported()` caching bug where `result`/`isChecked`
were never updated
- [x] Restore `[[maybe_unused]]` on `result` and `isChecked` statics —
needed in HIP/ROCm env where `CUDA_NVLS_API_AVAILABLE` is not defined
and the variables would otherwise be unused
- [x] Run linter (`./tools/lint.sh`)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-04-22 10:12:40 -07:00
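For context on #790 above, here is a minimal sketch of the cached-capability-check pattern being fixed: the function-local statics must actually be assigned, otherwise the cache never takes effect, and on builds without NVLS the check should report `false` rather than throw. All names below are illustrative, not the actual mscclpp source.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical probe standing in for the real NVLS capability check.
[[maybe_unused]] static bool probeNvlsSupport() {
  const char* disable = std::getenv("MSCCLPP_FORCE_DISABLE_NVLS");
  return !(disable != nullptr && std::strcmp(disable, "true") == 0);
}

// Sketch of the cached-check pattern from #790 (illustrative only).
// Bug class: the statics were never updated, so the cache never took effect.
// [[maybe_unused]] keeps warning-as-error builds happy when the guarded branch
// is compiled out (e.g. HIP/ROCm, where CUDA_NVLS_API_AVAILABLE is not defined).
static bool isNvlsSupportedSketch() {
  [[maybe_unused]] static bool result = false;
  [[maybe_unused]] static bool isChecked = false;
#if defined(CUDA_NVLS_API_AVAILABLE)
  if (!isChecked) {
    result = probeNvlsSupport();  // run the probe once, remember the outcome
    isChecked = true;
  }
  return result;
#else
  return false;  // NVLS never available on this build; report false instead of throwing
#endif
}
```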
Binyang Li
eeea00b298 Support python wheel build (#787)
## Support Python wheel build

This PR modernizes the Python packaging for MSCCL++ by defining
dependencies and optional extras in `pyproject.toml`, enabling proper
wheel builds with `pip install ".[cuda12]"`.

### Changes

**`pyproject.toml`**
- Add `dependencies` (numpy, blake3, pybind11, sortedcontainers)
- Add `optional-dependencies` for platform-specific CuPy (`cuda11`,
`cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras
- Bump minimum Python version from 3.8 to 3.10

**`test/deploy/setup.sh`**
- Use `pip install ".[<platform>,benchmark,test]"` instead of separate
`pip install -r requirements_*.txt` + `pip install .` steps
- Add missing CUDA 13 case

**`docs/quickstart.md`**
- Update install instructions to use extras (e.g., `pip install
".[cuda12]"`)
- Document all available extras and clarify that `rocm6` builds CuPy
from source
- Update Python version references to 3.10

**`python/csrc/CMakeLists.txt`**, **`python/test/CMakeLists.txt`**
- Update `find_package(Python)` from 3.8 to 3.10

### Notes
- The `requirements_*.txt` files are kept for Docker base image builds
where only dependencies (not the project itself) should be installed.
- CuPy is intentionally not in base dependencies — users must specify a
platform extra to get the correct pre-built wheel (or source build for
ROCm).

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 21:24:45 -07:00
Binyang Li
572028ea3d Fix nccl-test CI building for all GPU architectures (#786)
## Problem

`nccl-test.yml` was the only CI template calling `deploy.yml` without
passing `gpuArch`. Since the CI build machine has no GPU, CMake fell
back to building for **all** supported architectures (`80;90;100;120`),
unnecessarily slowing down CI builds.

## Fix

- Add `gpuArch` parameter to `nccl-test.yml` and forward it to
`deploy.yml`
- Pass `gpuArch: '80'` (A100) and `gpuArch: '90'` (H100) from
`nccl-api-test.yml`

All other templates were already passing `gpuArch` correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-15 12:55:40 -07:00
Binyang Li
ecd33722d4 Fix multi-node H100 CI: CUDA compat, deploy improvements (#781)
## Summary

- **Multi-node H100 CI setup**: Improve architecture detection and GPU
configuration
- **Remove hardcoded VMSS hostnames** from deploy files
- **Fix CUDA compat library issue**: Remove stale compat paths from
Docker image for CUDA 12+. Instead, `peer_access_test` now returns a
distinct exit code (2) for CUDA init failure, and `setup.sh`
conditionally adds compat libs only when needed. This fixes
`cudaErrorSystemNotReady` (error 803) when the host driver is newer than
the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with
tar+scp+untar to avoid per-file SSH overhead.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 21:51:29 -07:00
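As a rough illustration of the "distinct exit code for CUDA init failure" idea in #781: a small probe binary can separate "CUDA could not initialize" from "peer access unavailable", so a wrapper script such as setup.sh can add compat libraries only when they are actually needed. This is a sketch under assumed conventions (only exit code 2 is taken from the PR text), not the mscclpp `peer_access_test`.

```cpp
// peer_access_probe.cu -- illustrative sketch, not the mscclpp test binary.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int deviceCount = 0;
  cudaError_t err = cudaGetDeviceCount(&deviceCount);
  if (err != cudaSuccess) {
    // e.g. cudaErrorSystemNotReady (803) when the driver/compat setup is broken
    std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
    return 2;  // distinct exit code: caller may retry after adding compat libs
  }
  if (deviceCount < 2) return 0;  // nothing to check with a single GPU
  int canAccess = 0;
  if (cudaDeviceCanAccessPeer(&canAccess, 0, 1) != cudaSuccess || canAccess == 0) {
    std::fprintf(stderr, "peer access between GPU 0 and GPU 1 unavailable\n");
    return 1;
  }
  return 0;
}
```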
Caio Rocha
b6d0ca13ca Adding CI Test to DSL Executor (#782) 2026-04-13 13:55:45 -07:00
Caio Rocha
b59e6d7f00 Updating NpKit (#785) 2026-04-13 13:36:42 -07:00
Binyang Li
5380a4ac6e Add MSCCLPP_IB_GID_INDEX env (#780)
Use MSCCLPP_IB_GID_INDEX to control the IB GID index

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 09:59:42 -07:00
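A minimal sketch of the kind of environment override #780 describes: read MSCCLPP_IB_GID_INDEX and fall back to a default when it is unset or malformed. The helper name and fallback behavior are assumptions for illustration, not the actual mscclpp code.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: resolve the IB GID index from MSCCLPP_IB_GID_INDEX,
// falling back to `defaultIndex` when the variable is unset or not a number.
static int resolveIbGidIndex(int defaultIndex) {
  const char* env = std::getenv("MSCCLPP_IB_GID_INDEX");
  if (env == nullptr || *env == '\0') return defaultIndex;
  try {
    return std::stoi(env);
  } catch (const std::exception&) {
    return defaultIndex;  // malformed value: keep the default rather than abort
  }
}
```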
Caio Rocha
feda338595 Adjusting Torch Integration Example (#779)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-04-10 13:57:14 -07:00
Changho Hwang
d63f9403c0 IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753)
Major enhancements to the IB signal forwarding mechanisms
(`host-no-atomic` mode), primarily adding support for GDRCopy and MLX5
Direct Verbs and refactoring the signal forwarding path for that mode.
The changes fix memory consistency issues and reduce signaling latency.
- GDRCopy and MLX5 Direct Verbs MR integration
- Signal forwarding path redesign
- Semaphore and connection API updates
- Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates
2026-04-09 09:24:30 +00:00
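For readers unfamiliar with GDRCopy, the core idea behind the low-latency host-side signaling in #753 is that the host maps GPU memory through the gdrcopy driver and writes signal values with plain CPU stores instead of `cudaMemcpy`. The sketch below is generic GDRCopy usage, not the mscclpp signal-forwarding path; error handling, buffer alignment requirements, and the mlx5dv Data Direct parts are all glossed over.

```cpp
// Generic GDRCopy usage sketch (not the mscclpp implementation); link with -lgdrapi -lcudart.
#include <cstdint>
#include <cuda_runtime.h>
#include <gdrapi.h>

int main() {
  const size_t size = GPU_PAGE_SIZE;  // gdrcopy pins page-sized, page-aligned regions
  void* dptr = nullptr;
  cudaMalloc(&dptr, size);

  gdr_t g = gdr_open();  // open the gdrcopy driver
  gdr_mh_t mh;
  gdr_pin_buffer(g, reinterpret_cast<unsigned long>(dptr), size, 0, 0, &mh);
  void* map = nullptr;
  gdr_map(g, mh, &map, size);  // CPU-visible mapping of the GPU buffer

  uint64_t signal = 1;
  gdr_copy_to_mapping(mh, map, &signal, sizeof(signal));  // low-latency store from the host

  gdr_unmap(g, mh, map, size);
  gdr_unpin_buffer(g, mh);
  gdr_close(g);
  cudaFree(dptr);
  return 0;
}
```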
Caio Rocha
a7273047e9 Fix TBG on DSL Get Operation (#778) 2026-04-08 17:02:07 -07:00
Caio Rocha
3e5c41c98a Adding Channel Type in ReduceSend Operation on DSL (#777)
The reduce send operation in the DSL essentially combines the reduce and put
operations. The put operation carries the channel-type information, whereas
previously we were using the channel type from the reduce operation.
2026-04-08 16:59:08 -07:00
Qinghua Zhou
ed565ceb33 Fix missing directory of document for new tag v0.9.0 (#776)
The v0.9.0 conf.py (introduced in #775) dynamically loads the version
from python/mscclpp/_version.py.

This file is generated at build time by setuptools_scm and is listed in
.gitignore — it is never committed to the repo. Earlier tags (v0.8.0 and
below) used a hardcoded release (e.g., "v0.8.0") in conf.py, so they had
no dependency on generated files.
sphinx-multiversion checks out each tag using git archive, which only
extracts committed files.
Since _version.py is not committed, the v0.9.0 checkout is missing it,
and conf.py crashes on import. All future tags will have this same
problem.

**Three changes:**
1. docs/build_multiversion.py (new): A wrapper around
sphinx-multiversion that monkey-patches copy_tree to generate
_version.py in each tag checkout after extraction. The version string is
parsed from the tag name (e.g., v0.9.0 → __version__ = "0.9.0").
2. Makefile: The multiversion target now calls build_multiversion.py
instead of sphinx-multiversion directly.
3. conf.py: Added a fallback so that if _version.py doesn't exist, it
reads the version from the VERSION file instead. This makes conf.py
resilient for any future scenario where _version.py is missing.

**Testing**
Verified locally:
• make multiversion now successfully builds all 11 versions (v0.4.0 through v0.9.0)
• v0.9.0 docs are correctly generated under _build/html/v0.9.0/
• Version selector shows v0.9.0 as latest
2026-04-08 17:59:05 -04:00
empyreus
14f75d8e76 fix pool 2026-04-08 19:00:16 +00:00
empyreus
e1687e885f update to h100 multinode 2026-04-08 18:55:38 +00:00
Binyang Li
8896cd909a Add ROCm FP8 E4M3B15 support (#774)
## Summary

Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including
optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using
inline assembly.

Extends the allpair packet and fullmesh allreduce kernels to support
higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data,
improving numerical accuracy.

Adds Python tests to verify that higher-precision accumulation is at
least as accurate as native FP8 accumulation across all algorithm
variants.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 09:53:45 -07:00
Mahdieh Ghazi
e66ce39647 Mahdieh/update version number (#775)
Update the version number for v0.9.0
2026-04-08 12:38:56 -04:00
Binyang Li
96a72bbd3e Support E4M3B15 datatype (#765)
## Summary

- **Add `fp8_e4m3b15` datatype**: A software-defined FP8 type with 4
exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375).
Implemented entirely in software with no HW dependency, using
Triton-style bit manipulation through fp16 as intermediate for efficient
conversion.
- **Add mixed-precision accumulation for allreduce**: All allreduce
algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and
others) now support a configurable `accumDtype` parameter, enabling FP8
inputs to be reduced in float16 or float32 for higher accuracy.
- **Propagate `accumDtype` through the full API**: The new parameter is
threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc`
→ dispatch → CUDA kernels, with `DataType::AUTO` as the default
(resolves to input dtype at runtime).
- **Add FP8 accumulation correctness tests**: New `test_fp8_accum.py`
validates that higher-precision accumulation produces results at least
as accurate as native FP8 accumulation across multiple algorithms and
sizes. Skipped on CUDA SM < 89 (pre-Hopper); runs on HIP/ROCm.
- **Add `test_fp8_accum.py` to CI**: Azure Pipeline `ut.yml` now runs
FP8 accumulation tests alongside existing pytests.
- **NCCL shim logging cleanup**: Migrated `printf`-style `WARN`/`INFO`
calls to streaming-style logging.

## Key files

| Area | Files |
|------|-------|
| New datatype + vector ops | `include/mscclpp/gpu_data_types.hpp` |
| Accumulation reduce helpers | `src/core/include/reduce_kernel.hpp` |
| Algorithm API (`accumDtype`) | `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` |
| Allreduce kernels | `src/ext/collectives/allreduce/*.cu` |
| Dispatch + common | `src/ext/collectives/include/allreduce/common.hpp` |
| Python bindings | `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` |
| Tests | `python/test/test_fp8_accum.py` |
| CI | `.azure-pipelines/templates/ut.yml` |

## Test plan

- [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15
accumulation tests
- [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
- [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP
- [x] Existing `test_mscclpp.py` tests continue to pass
- [x] NCCL shim builds and runs correctly with new `accumDtype` defaults

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 13:37:02 -07:00
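To make the "bit manipulation through fp16" idea in #765 concrete: because the e4m3b15 format and IEEE fp16 share an exponent bias of 15, an fp8 value widens to fp16 by shifting the sign, exponent, and mantissa fields into place. The sketch below is illustrative only; it ignores the reserved top exponent code and saturation on the narrowing path, and it is not the mscclpp implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative e4m3b15 -> fp16 widening (not the mscclpp code).
// e4m3b15 layout: 1 sign | 4 exponent (bias 15) | 3 mantissa.
// fp16 layout:    1 sign | 5 exponent (bias 15) | 10 mantissa.
// With matching biases, widening is a pure field shift; special values are ignored here.
static uint16_t e4m3b15_to_half_bits(uint8_t x) {
  uint16_t sign = static_cast<uint16_t>(x & 0x80) << 8;     // sign bit to position 15
  uint16_t expMant = static_cast<uint16_t>(x & 0x7F) << 7;  // exponent + mantissa widened in place
  return sign | expMant;
}

int main() {
  // 0b0'1110'111: exponent 14 (unbiased -1), mantissa 1.875 -> 0.9375,
  // the max finite value quoted for this format.
  uint8_t maxFinite = 0x77;
  std::printf("fp16 bit pattern: 0x%04x\n", e4m3b15_to_half_bits(maxFinite));
  return 0;
}
```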
empyreus
1ad0f1c9d5 hostentries 2026-04-07 19:16:09 +00:00
empyreus
4f677b64c9 host entries 2026-04-07 19:15:11 +00:00
empyreus
512416edc2 add resourcegroup 2026-04-07 19:13:55 +00:00
empyreus
a1bc727e51 fix deploy step 2026-04-07 19:11:22 +00:00
empyreus
1fbcbfdec7 fix formatting 2026-04-07 19:06:56 +00:00
empyreus
d8e1de7a7f fix formatting 2026-04-07 17:31:14 +00:00
empyreus
0bf599837d try multi-pipeline 2026-04-07 17:29:47 +00:00
empyreus
8fb751470b add multi node 2026-04-07 17:15:05 +00:00
Binyang Li
fa95e82e18 Fix CI/CD pipeline issues (#773)
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 08:41:51 -07:00
empyreus
88e1ac71c7 fix paths 2026-04-06 20:53:39 +00:00
empyreus
ea97444a8d change path 2026-04-06 19:44:14 +00:00
empyreus
68cf67d24e unit test 2026-04-06 18:12:46 +00:00
empyreus
58c5234243 ignore version mismatch 2026-04-03 23:20:11 +00:00
empyreus
e68125f270 change to h100 machine 2026-04-03 22:20:22 +00:00
empyreus
e8266a1794 running on a100 2026-04-03 15:08:12 +00:00
empyreus
53d6f76a24 simplify container 2026-04-03 14:07:46 +00:00
empyreus
7b03ece609 add prints 2026-04-03 14:02:44 +00:00
empyreus
10648a42c5 add --privileged 2026-04-02 20:35:15 +00:00
empyreus
149be8e828 fix - 2026-04-02 19:47:30 +00:00
empyreus
376a6a299d remove build 2026-04-02 19:27:09 +00:00
empyreus
61e0540cbc update for new cmake 2026-04-02 17:55:58 +00:00
empyreus
6fd8b18e83 change cmake version 2026-04-02 17:48:43 +00:00