mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Binyang Li	eeea00b298	Support python wheel build (#787) ## Support Python wheel build This PR modernizes the Python packaging for MSCCL++ by defining dependencies and optional extras in `pyproject.toml`, enabling proper wheel builds with `pip install ".[cuda12]"`. ### Changes `pyproject.toml` - Add `dependencies` (numpy, blake3, pybind11, sortedcontainers) - Add `optional-dependencies` for platform-specific CuPy (`cuda11`, `cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras - Bump minimum Python version from 3.8 to 3.10 `test/deploy/setup.sh` - Use `pip install ".[<platform>,benchmark,test]"` instead of separate `pip install -r requirements_*.txt` + `pip install .` steps - Add missing CUDA 13 case `docs/quickstart.md` - Update install instructions to use extras (e.g., `pip install ".[cuda12]"`) - Document all available extras and clarify that `rocm6` builds CuPy from source - Update Python version references to 3.10 `python/csrc/CMakeLists.txt`, `python/test/CMakeLists.txt` - Update `find_package(Python)` from 3.8 to 3.10 ### Notes - The `requirements_*.txt` files are kept for Docker base image builds where only dependencies (not the project itself) should be installed. - CuPy is intentionally not in base dependencies; users must specify a platform extra to get the correct pre-built wheel (or source build for ROCm). --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 21:24:45 -07:00
Binyang Li	8896cd909a	Add ROCm FP8 E4M3B15 support (#774) ## Summary Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using inline assembly. Extends the allpair packet and fullmesh allreduce kernels to support higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data, improving numerical accuracy. Adds Python tests to verify that higher-precision accumulation is at least as accurate as native FP8 accumulation across all algorithm variants. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-08 09:53:45 -07:00
Binyang Li	c12822a7af	create CI pipeline for rocm (#718) Create CI pipeline for AMD GPU.	2026-02-09 16:55:16 -08:00
Binyang Li	0c7311e83f	Add CI for rocm (#346)	2024-09-15 22:30:54 +00:00

4 Commits