mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 09:17:06 +00:00

Author	SHA1	Message	Date
Binyang Li	ecd33722d4	Fix multi-node H100 CI: CUDA compat, deploy improvements (#781) ## Summary - Multi-node H100 CI setup: Improve architecture detection and GPU configuration - Remove hardcoded VMSS hostnames from deploy files - Fix CUDA compat library issue: Remove stale compat paths from Docker image for CUDA 12+. Instead, `peer_access_test` now returns a distinct exit code (2) for CUDA init failure, and `setup.sh` conditionally adds compat libs only when needed. This fixes `cudaErrorSystemNotReady` (error 803) when the host driver is newer than the container's compat libs. - Speed up deploy: Replace recursive `parallel-scp` with tar+scp+untar to avoid per-file SSH overhead. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-13 21:51:29 -07:00
Copilot	93f6eeaa6b	Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744) - Removes the GTest dependency, replacing it with a minimal custom framework (`test/framework.`) that covers only what the tests actually use: a unified `TEST()` macro with SFINAE-based fixture auto-detection, `EXPECT_`/`ASSERT_*` assertions, environments, and setup/teardown. - `--exclude-perf-tests` flag and substring-based negative filtering - `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to Codecov - Merges standalone `test/perf/` into main test targets - Refactors Azure pipelines to reduce redundancies & make more readable --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-03-24 23:34:38 -04:00
Binyang Li	c12822a7af	create CI pipeline for rocm (#718) Create CI pipeline for AMD GPU.	2026-02-09 16:55:16 -08:00
Caio Rocha	060c35fec6	No IB Env CI Test (#687)	2025-11-19 11:13:03 -08:00
Binyang Li	c65f19ad1a	Move pipeline to official org (#406) Move pipeline to official org. Unify all pipelines	2024-12-16 09:43:00 -08:00
Binyang Li	7a3dcb0627	Setup pipeline for mscclpp over nccl (#401) Setup pipeline for mscclpp over nccl. Run `all_reduce_perf` via nccl API	2024-12-07 08:57:45 -08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Binyang2014	858e381829	Pytest (#162) Port python tests to mscclpp. Please run `mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest --------- Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>	2023-09-01 21:22:11 +08:00
Binyang2014	56bdbc2f32	Enable test for both cuda11 and cuda12 (#124) Update pipeline: enable test for both cuda11 and cuda12	2023-07-10 13:19:14 +08:00
Binyang2014	2640578b22	Add performance check for mscclpp-test (#110) - Add ndmv4 perf baseline - change mscclpp-test to output perf number into a json file - add python script to check the perf result with the baseline	2023-06-21 07:42:53 +00:00
Binyang2014	8efacae332	update pipeline (#103) Update Azure pipeline: - Using mscclpp:base-cuda12.1 image for building and testing - Add mp-ut tests for multi-nodes	2023-06-14 20:14:57 +08:00

11 Commits