Changho Hwang
def68ced64
Add CUDA 12.8 images (#488)
2025-03-29 00:31:26 +00:00
Binyang Li
c65f19ad1a
Move pipeline to official org (#406)
Move pipeline to official org. Unify all pipelines
2024-12-16 09:43:00 -08:00
Binyang Li
7a3dcb0627
Setup pipeline for mscclpp over nccl (#401)
Set up a pipeline for mscclpp over NCCL.
Run `all_reduce_perf` via the NCCL API.
2024-12-07 08:57:45 -08:00
Changho Hwang
1a7cb98e3a
v0.4.3 (#279)
2024-03-27 11:53:09 -07:00
Binyang Li
bc465aefcd
Add __launch_bounds__ for mscclpp-test (#273)
2024-03-25 15:55:37 -07:00
Changho Hwang
dab19e00c1
Templatize Dockerfiles & update workflows (#223)
Images are now built by a script from a shared Dockerfile template.
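
The idea of one shared template rendered per target can be sketched roughly as below. This is only an illustration: the placeholder names (`base_image`, `cuda_version`), the target tags, and the image names are all hypothetical and are not taken from the repository's actual script or template.

```python
from string import Template

# Hypothetical shared template; the real Dockerfile template may look different.
DOCKERFILE_TEMPLATE = Template(
    "FROM ${base_image}\n"
    "ENV CUDA_VERSION=${cuda_version}\n"
)

# Hypothetical build targets, one entry per image to produce.
TARGETS = {
    "cuda11.8": {"base_image": "example.invalid/base:11.8", "cuda_version": "11.8"},
    "cuda12.1": {"base_image": "example.invalid/base:12.1", "cuda_version": "12.1"},
}


def render_all(template, targets):
    """Render one Dockerfile text per target from the shared template."""
    # substitute() raises KeyError if a placeholder has no value,
    # which catches incomplete target definitions early.
    return {tag: template.substitute(params) for tag, params in targets.items()}


rendered = render_all(DOCKERFILE_TEMPLATE, TARGETS)
```

Keeping the per-image differences in a small table means a new CUDA version only needs a new entry rather than a new Dockerfile.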
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 13:29:12 -08:00
Changho Hwang
060fda12e6
mscclpp-test in Python (#204)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Binyang2014
8a938de9c5
fix pipeline (#209)
fix pipeline for multi-node test
2023-11-03 05:18:32 +00:00
Binyang2014
952f2da9cc
Improve single node allreduce performance (#169)
Improve AllReduce performance on a single node.
New numbers:
| n_ctx | size | target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
| 1 | 24.0kB | 7.7 | | 7.23|
| 2 | 48.0kB | 7.7 | | 7.69|
| 4 | 96.0kB | 8 | | 8.34|
| 8 | 192.0kB | 12.6 | | 9.75|
| 12 | 288.0kB | 13 | | 11.34|
| 16 | 384.0kB | 13.3 | | 12.99|
| 768 | 18.0MB | 158.7 | 160.3| |
| 896 | 21.0MB | 184.5 | 183.8| |
| 1024 | 24.0MB | 209.5 | 207.5| |
| 1152 | 27.0MB | 234.3 | 231.9| |
| 1280 | 30.0MB | 260 | 255.6| |
| 1408 | 33.0MB | 284.9 | 278.7| |
| 1536 | 36.0MB | 310.3 | 302.0| |
| 1664 | 39.0MB | 336.2 | 325.3| |
| 1792 | 42.0MB | 361.4 | 348.8| |
| 1920 | 45.0MB | 384.6 | 372.2| |
| 2048 | 48.0MB | 409.1 | 395.4| |
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-09-13 14:30:08 +00:00
Binyang2014
097aa8843a
Fix pytest unstable issue. (#170)
- Remove `#include <cstdint>` from `poll.hpp` so that it contains only
device-side code
- Fix a compilation issue that caused pytest to fail randomly: reuse
the compiled result when the same kernel is launched with different arguments
2023-09-06 17:09:04 -07:00
Binyang2014
858e381829
Pytest (#162)
Port Python tests to mscclpp.
To start pytest, run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x`
---------
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>
2023-09-01 21:22:11 +08:00
Binyang2014
56bdbc2f32
Enable test for both cuda11 and cuda12 (#124)
Update pipeline: enable tests for both CUDA 11 and CUDA 12
2023-07-10 13:19:14 +08:00
Changho Hwang
bb7b85a810
2-node AllReduce improvements (#118)
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 GPUs/node) AllReduce: algbw of 139 GB/s for 1 GB (kernel
3) and 99 GB/s for 48 MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test
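
To put the bandwidth figures above in terms of time per call, the sketch below assumes `algbw` follows the nccl-tests convention (message size divided by elapsed time, reported in units of 1e9 bytes per second) and that the sizes are binary (1 GB = 2**30 bytes). Both are assumptions; the commit message does not define them.

```python
def algbw_gb_per_s(size_bytes, elapsed_s):
    """Algorithm bandwidth: bytes moved per second, in units of 1e9 B/s."""
    return size_bytes / elapsed_s / 1e9


def elapsed_us(size_bytes, algbw):
    """Invert the definition: time in microseconds implied by a given algbw."""
    return size_bytes / (algbw * 1e9) * 1e6


# Assumed binary sizes for the two data points quoted above.
one_gb = 2**30
forty_eight_mb = 48 * 2**20

t_kernel3_us = elapsed_us(one_gb, 139.0)
t_kernel4_us = elapsed_us(forty_eight_mb, 99.0)
```

Under these assumptions the 1 GB case corresponds to several milliseconds per AllReduce and the 48 MB case to roughly half a millisecond.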
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
Binyang2014
2640578b22
Add performance check for mscclpp-test (#110)
- Add an ndmv4 perf baseline
- Change mscclpp-test to write perf numbers to a JSON file
- Add a Python script that checks the perf results against the baseline
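
A baseline check of this kind can be sketched as follows. The record fields (`name`, `size`, `time_us`), the sample values, and the 5% tolerance are assumptions made for illustration; they are not the format mscclpp-test emits or the threshold the actual script uses.

```python
import json


def index_by_key(records):
    """Map (test name, message size) to the measured time in microseconds."""
    return {(r["name"], r["size"]): r["time_us"] for r in records}


def find_regressions(results, baseline, tolerance=0.05):
    """Return entries whose time exceeds the baseline by more than `tolerance`."""
    base = index_by_key(baseline)
    failures = []
    for key, measured in index_by_key(results).items():
        if key not in base:
            continue  # no baseline for this configuration, nothing to compare
        if measured > base[key] * (1.0 + tolerance):
            failures.append((key, measured, base[key]))
    return failures


# Made-up sample data, only to show the expected shape of the inputs.
baseline = json.loads('[{"name": "allreduce", "size": 24576, "time_us": 10.0}]')
results = json.loads('[{"name": "allreduce", "size": 24576, "time_us": 10.4}]')
regressions = find_regressions(results, baseline)
```

A CI step could then fail the run whenever the returned list is non-empty.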
2023-06-21 07:42:53 +00:00
Binyang2014
8efacae332
update pipeline (#103)
Update the Azure pipeline:
- Use the `mscclpp:base-cuda12.1` image for building and testing
- Add mp-ut tests for multi-node runs
2023-06-14 20:14:57 +08:00