Commit Graph

26 Commits

Author SHA1 Message Date
Binyang Li
fa95e82e18 Fix CI/CD pipeline issues (#773)
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 08:41:51 -07:00
Binyang Li
be9126ca1b Fix run-remote.sh to support multi-command scripts (#770)
## Summary
- Fix `run-remote.sh` to correctly execute multi-command scripts (e.g.,
multiple `mpirun` calls)
- The old approach piped decoded script through `base64 -d | bash`,
which feeds the script via bash's **stdin**. When `mpirun` (or its child
processes) runs, it can consume the remaining stdin, causing bash to
never see subsequent commands — only the first command would execute.
- The fix decodes the script to a **temp file** and runs `bash -euxo
pipefail "$TMP"` instead, so bash reads commands from the file and
`mpirun` consuming stdin has no effect.
- Applied to both the docker path (pssh + docker exec) and the
non-docker path (pssh only).


🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-04-01 16:25:19 -07:00
Copilot
93f6eeaa6b Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744)
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- `--exclude-perf-tests` flag and substring-based negative filtering
- `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to
Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancies & make more readable

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2026-03-24 23:34:38 -04:00
Binyang Li
c12822a7af create CI pipeline for rocm (#718)
Create CI pipeline for AMD GPU.
2026-02-09 16:55:16 -08:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize current native algorithm implementation and DSL algorithm
implementation.
Provide unified API for DSL algo and native algo and provide interface
to tune the algo
Provide interface for pytorch integration with native API and DSL

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Caio Rocha
060c35fec6 No IB Env CI Test (#687) 2025-11-19 11:13:03 -08:00
Changho Hwang
a2f1279c60 Test peer accessibility after deployment (#661)
Test GPUs' peer accessibility before integration testing to distinguish
VM issues.
2025-10-24 11:09:36 -07:00
Binyang Li
b1a88d755e Pipeline fix (#645)
Co-authored-by: github-actions <github-actions@github.com>
2025-10-10 11:26:33 -07:00
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Binyang Li
a18e91cee4 Set Up a CI Pipeline for H100 (#526)
Set Up a CI Pipeline for H100
2025-05-15 14:50:23 -07:00
Binyang Li
06f31994dc Fix performance issue introduced in PR: 499 (#505)
1. use `fence+relaxed` to replace `release` for fifo. `fence+relax` is
more efficient on A100
2. Update the deviceSyncer. Previous one cannot handle threadBlock
number change correctly. Use three counters to solve this issue. Reset
previous counter before sync on current counter.
3. Introduce relaxedWait which can be used with relaxedSignal for case
doesn't need guarantee the memory visibility
2025-04-22 14:03:37 -07:00
Changho Hwang
def68ced64 Add CUDA 12.8 images (#488) 2025-03-29 00:31:26 +00:00
Binyang Li
c65f19ad1a Move pipeline to official org (#406)
Move pipeline to official org. Unify all pipelines
2024-12-16 09:43:00 -08:00
Binyang Li
7a3dcb0627 Setup pipeline for mscclpp over nccl (#401)
Setup pipeline for mscclpp over nccl
Run `all_reduce_perf` via nccl API
2024-12-07 08:57:45 -08:00
Changho Hwang
1a7cb98e3a v0.4.3 (#279) 2024-03-27 11:53:09 -07:00
Binyang Li
bc465aefcd Add __launch_bounds__ for mscclpp-test (#273) 2024-03-25 15:55:37 -07:00
Changho Hwang
dab19e00c1 Templatize Dockerfiles & update workflows (#223)
Now build images by a script with a shared Dockerfile template

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 13:29:12 -08:00
Changho Hwang
060fda12e6 mscclpp-test in Python (#204)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Binyang2014
8a938de9c5 fix pipeline (#209)
fix pipeline for multi-node test
2023-11-03 05:18:32 +00:00
Binyang2014
952f2da9cc Improve single node allreduce performance (#169)
Improve all reduce performance for single node.
New number:
|   n_ctx | size    |  target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
|       1 | 24.0kB  |            7.7 |            |        7.23|
|       2 | 48.0kB  |            7.7 |            |        7.69|
|       4 | 96.0kB  |            8   |            |        8.34|
|       8 | 192.0kB |           12.6 |            |        9.75|
|      12 | 288.0kB |           13   |            |       11.34|
|      16 | 384.0kB |           13.3 |            |       12.99|
|     768 | 18.0MB  |          158.7 |       160.3|            |
|     896 | 21.0MB  |          184.5 |       183.8|            |
|    1024 | 24.0MB  |          209.5 |       207.5|            |
|    1152 | 27.0MB  |          234.3 |       231.9|            |
|    1280 | 30.0MB  |          260   |       255.6|            |
|    1408 | 33.0MB  |          284.9 |       278.7|            |
|    1536 | 36.0MB  |          310.3 |       302.0|            |
|    1664 | 39.0MB  |          336.2 |       325.3|            |
|    1792 | 42.0MB  |          361.4 |       348.8|            |
|    1920 | 45.0MB  |          384.6 |       372.2|            |
|    2048 | 48.0MB  |          409.1 |       395.4|            |

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-09-13 14:30:08 +00:00
Binyang2014
097aa8843a Fix pytest unstable issue. (#170)
- remove `#include <cstdint>` from `poll.hpp`. To make it only contains
device-side code
- Fix compilation issue, which will cause pytest fail randomly. Reuse
the compiled result for same kernel with different arguments
2023-09-06 17:09:04 -07:00
Binyang2014
858e381829 Pytest (#162)
Port python tests to mscclpp.
Please run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest

---------

Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>
2023-09-01 21:22:11 +08:00
Binyang2014
56bdbc2f32 Enable test for both cuda11 and cuda12 (#124)
Update pipeline: enable test for both cuda11 and cuda12
2023-07-10 13:19:14 +08:00
Changho Hwang
bb7b85a810 2-node AllReduce improvements (#118)
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel
3) and 99GB/s for 48MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
Binyang2014
2640578b22 Add performance check for mscclpp-test (#110)
- Add ndmv4 perf baseline
- change mscclpp-test to output perf number into a json file
- add python script to check the perf result with the baseline
2023-06-21 07:42:53 +00:00
Binyang2014
8efacae332 update pipeline (#103)
Update Azure pipeline:
- Using mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-nodes
2023-06-14 20:14:57 +08:00