mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 17:26:04 +00:00

Author	SHA1	Message	Date
Binyang Li	eeea00b298	Support python wheel build (#787) ## Support Python wheel build This PR modernizes the Python packaging for MSCCL++ by defining dependencies and optional extras in `pyproject.toml`, enabling proper wheel builds with `pip install ".[cuda12]"`. ### Changes `pyproject.toml` - Add `dependencies` (numpy, blake3, pybind11, sortedcontainers) - Add `optional-dependencies` for platform-specific CuPy (`cuda11`, `cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras - Bump minimum Python version from 3.8 to 3.10 `test/deploy/setup.sh` - Use `pip install ".[<platform>,benchmark,test]"` instead of separate `pip install -r requirements_*.txt` + `pip install .` steps - Add missing CUDA 13 case `docs/quickstart.md` - Update install instructions to use extras (e.g., `pip install ".[cuda12]"`) - Document all available extras and clarify that `rocm6` builds CuPy from source - Update Python version references to 3.10 `python/csrc/CMakeLists.txt`, `python/test/CMakeLists.txt` - Update `find_package(Python)` from 3.8 to 3.10 ### Notes - The `requirements_*.txt` files are kept for Docker base image builds where only dependencies (not the project itself) should be installed. - CuPy is intentionally not in base dependencies; users must specify a platform extra to get the correct pre-built wheel (or source build for ROCm). --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 21:24:45 -07:00
Binyang Li	572028ea3d	Fix nccl-test CI building for all GPU architectures (#786) ## Problem `nccl-test.yml` was the only CI template calling `deploy.yml` without passing `gpuArch`. Since the CI build machine has no GPU, CMake fell back to building for all supported architectures (`80;90;100;120`), unnecessarily slowing down CI builds. ## Fix - Add `gpuArch` parameter to `nccl-test.yml` and forward it to `deploy.yml` - Pass `gpuArch: '80'` (A100) and `gpuArch: '90'` (H100) from `nccl-api-test.yml` All other templates were already passing `gpuArch` correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-15 12:55:40 -07:00
Binyang Li	ecd33722d4	Fix multi-node H100 CI: CUDA compat, deploy improvements (#781) ## Summary - Multi-node H100 CI setup: Improve architecture detection and GPU configuration - Remove hardcoded VMSS hostnames from deploy files - Fix CUDA compat library issue: Remove stale compat paths from Docker image for CUDA 12+. Instead, `peer_access_test` now returns a distinct exit code (2) for CUDA init failure, and `setup.sh` conditionally adds compat libs only when needed. This fixes `cudaErrorSystemNotReady` (error 803) when the host driver is newer than the container's compat libs. - Speed up deploy: Replace recursive `parallel-scp` with tar+scp+untar to avoid per-file SSH overhead. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-13 21:51:29 -07:00
Binyang Li	fa95e82e18	Fix CI/CD pipeline issues (#773) This pull request updates the deployment pipeline to allow custom CMake arguments to be passed to the pip install process on remote VMs. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 08:41:51 -07:00
Binyang Li	be9126ca1b	Fix run-remote.sh to support multi-command scripts (#770) ## Summary - Fix `run-remote.sh` to correctly execute multi-command scripts (e.g., multiple `mpirun` calls) - The old approach piped decoded script through `base64 -d | bash`, which feeds the script via bash's stdin. When `mpirun` (or its child processes) runs, it can consume the remaining stdin, causing bash to never see subsequent commands, so only the first command would execute. - The fix decodes the script to a temp file and runs `bash -euxo pipefail "$TMP"` instead, so bash reads commands from the file and `mpirun` consuming stdin has no effect. - Applied to both the docker path (pssh + docker exec) and the non-docker path (pssh only). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-01 16:25:19 -07:00
Copilot	93f6eeaa6b	Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744) - Removes the GTest dependency, replacing it with a minimal custom framework (`test/framework.*`) that covers only what the tests actually use: a unified `TEST()` macro with SFINAE-based fixture auto-detection, `EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown. - `--exclude-perf-tests` flag and substring-based negative filtering - `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to Codecov - Merges standalone `test/perf/` into main test targets - Refactors Azure pipelines to reduce redundancies & make more readable --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-03-24 23:34:38 -04:00
Binyang Li	c12822a7af	create CI pipeline for rocm (#718) Create CI pipeline for AMD GPU.	2026-02-09 16:55:16 -08:00
Binyang Li	a707273701	Torch integration (#692) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo. Provide interface for pytorch integration with native API and DSL. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Caio Rocha	060c35fec6	No IB Env CI Test (#687)	2025-11-19 11:13:03 -08:00
Changho Hwang	a2f1279c60	Test peer accessibility after deployment (#661) Test GPUs' peer accessibility before integration testing to distinguish VM issues.	2025-10-24 11:09:36 -07:00
Binyang Li	b1a88d755e	Pipeline fix (#645) Co-authored-by: github-actions <github-actions@github.com>	2025-10-10 11:26:33 -07:00
Changho Hwang	20eca28942	Fix a FIFO correctness bug (#549) * Add a FIFO test code that reproduced a correctness issue * Fix the correctness issue by using pinned memory instead of cudaMemcpy --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-07-11 23:53:59 +00:00
Binyang Li	a18e91cee4	Set Up a CI Pipeline for H100 (#526) Set Up a CI Pipeline for H100	2025-05-15 14:50:23 -07:00
Binyang Li	06f31994dc	Fix performance issue introduced in PR: 499 (#505) 1. Use `fence+relaxed` to replace `release` for fifo. `fence+relaxed` is more efficient on A100. 2. Update the deviceSyncer. The previous one could not handle a change in the threadBlock number correctly. Use three counters to solve this issue: reset the previous counter before syncing on the current counter. 3. Introduce relaxedWait, which can be used with relaxedSignal for cases that do not need to guarantee memory visibility.	2025-04-22 14:03:37 -07:00
Changho Hwang	def68ced64	Add CUDA 12.8 images (#488)	2025-03-29 00:31:26 +00:00
Binyang Li	c65f19ad1a	Move pipeline to official org (#406) Move pipeline to official org. Unify all pipelines	2024-12-16 09:43:00 -08:00
Binyang Li	7a3dcb0627	Setup pipeline for mscclpp over nccl (#401) Setup pipeline for mscclpp over nccl Run `all_reduce_perf` via nccl API	2024-12-07 08:57:45 -08:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279)	2024-03-27 11:53:09 -07:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273)	2024-03-25 15:55:37 -07:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Binyang2014	8a938de9c5	fix pipeline (#209) fix pipeline for multi-node test	2023-11-03 05:18:32 +00:00
Binyang2014	952f2da9cc	Improve single node allreduce performance (#169) Improve all reduce performance for single node. New number: | n_ctx | size | target latency (us) | allreduce5 | allreduce6 | |---------|---------|----------------|------------|------------| | 1 | 24.0kB | 7.7 | | 7.23| | 2 | 48.0kB | 7.7 | | 7.69| | 4 | 96.0kB | 8 | | 8.34| | 8 | 192.0kB | 12.6 | | 9.75| | 12 | 288.0kB | 13 | | 11.34| | 16 | 384.0kB | 13.3 | | 12.99| | 768 | 18.0MB | 158.7 | 160.3| | | 896 | 21.0MB | 184.5 | 183.8| | | 1024 | 24.0MB | 209.5 | 207.5| | | 1152 | 27.0MB | 234.3 | 231.9| | | 1280 | 30.0MB | 260 | 255.6| | | 1408 | 33.0MB | 284.9 | 278.7| | | 1536 | 36.0MB | 310.3 | 302.0| | | 1664 | 39.0MB | 336.2 | 325.3| | | 1792 | 42.0MB | 361.4 | 348.8| | | 1920 | 45.0MB | 384.6 | 372.2| | | 2048 | 48.0MB | 409.1 | 395.4| | --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-09-13 14:30:08 +00:00
Binyang2014	097aa8843a	Fix pytest unstable issue. (#170) - Remove `#include <cstdint>` from `poll.hpp` so that it contains only device-side code - Fix a compilation issue that caused pytest to fail randomly. Reuse the compiled result for the same kernel with different arguments	2023-09-06 17:09:04 -07:00
Binyang2014	858e381829	Pytest (#162) Port python tests to mscclpp. Please run `mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest --------- Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>	2023-09-01 21:22:11 +08:00
Binyang2014	56bdbc2f32	Enable test for both cuda11 and cuda12 (#124) Update pipeline: enable test for both cuda11 and cuda12	2023-07-10 13:19:14 +08:00
Changho Hwang	bb7b85a810	2-node AllReduce improvements (#118) * Added `get()` interfaces to `SmChannel` * Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel 3) and 99GB/s for 48MB (kernel 4) * Fixed a FIFO perf bug * Several fixes & validations in mscclpp-test --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-07-07 07:05:46 +00:00
Binyang2014	2640578b22	Add performance check for mscclpp-test (#110) - Add ndmv4 perf baseline - change mscclpp-test to output perf number into a json file - add python script to check the perf result with the baseline	2023-06-21 07:42:53 +00:00
Binyang2014	8efacae332	update pipeline (#103) Update Azure pipeline: - Using mscclpp:base-cuda12.1 image for building and testing - Add mp-ut tests for multi-nodes	2023-06-14 20:14:57 +08:00
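
Note on be9126ca1b (#770): the stdin problem described in that commit can be reproduced outside the pipeline. The sketch below is an illustration, not code from `run-remote.sh`. It assumes `bash` and `cat` are on `PATH`, uses `cat > /dev/null` as a hypothetical stand-in for an `mpirun` call that drains stdin, and the function names are made up for this example.

```python
import os
import subprocess
import tempfile

# Three commands; the middle one drains whatever is left on stdin,
# standing in (as an assumption) for an mpirun call that reads stdin.
SCRIPT = "echo first\ncat > /dev/null\necho second\n"


def run_via_stdin(script: str) -> str:
    """Feed the script to bash on stdin, like `base64 -d | bash`."""
    result = subprocess.run(
        ["bash"], input=script, capture_output=True, text=True, check=True
    )
    return result.stdout


def run_via_temp_file(script: str) -> str:
    """Write the script to a temp file and run `bash -euxo pipefail FILE`."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as tmp:
        tmp.write(script)
        path = tmp.name
    try:
        result = subprocess.run(
            ["bash", "-euxo", "pipefail", path],
            stdin=subprocess.DEVNULL,  # children get an empty stdin
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout  # the -x trace goes to stderr, not stdout
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print("stdin:", repr(run_via_stdin(SCRIPT)))
    print("file: ", repr(run_via_temp_file(SCRIPT)))
```

In the stdin variant, bash and its child share one pipe, so `cat` swallows the `echo second` line before bash can read it. In the file variant, bash reads commands from the file and the child's stdin is independent of the script text.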

29 Commits