15 Commits

Author SHA1 Message Date
Changho Hwang
def68ced64 Add CUDA 12.8 images (#488) 2025-03-29 00:31:26 +00:00
Binyang Li
c65f19ad1a Move pipeline to official org (#406)
Move pipeline to official org. Unify all pipelines
2024-12-16 09:43:00 -08:00
Binyang Li
7a3dcb0627 Setup pipeline for mscclpp over nccl (#401)
Setup pipeline for mscclpp over nccl
Run `all_reduce_perf` via nccl API
2024-12-07 08:57:45 -08:00
Changho Hwang
1a7cb98e3a v0.4.3 (#279) 2024-03-27 11:53:09 -07:00
Binyang Li
bc465aefcd Add __launch_bounds__ for mscclpp-test (#273) 2024-03-25 15:55:37 -07:00
Changho Hwang
dab19e00c1 Templatize Dockerfiles & update workflows (#223)
Now build images by a script with a shared Dockerfile template

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 13:29:12 -08:00
Changho Hwang
060fda12e6 mscclpp-test in Python (#204)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Binyang2014
8a938de9c5 fix pipeline (#209)
fix pipeline for multi-node test
2023-11-03 05:18:32 +00:00
Binyang2014
952f2da9cc Improve single node allreduce performance (#169)
Improve allreduce performance for a single node.
New numbers:
|   n_ctx | size    |  target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
|       1 | 24.0kB  |            7.7 |            |        7.23|
|       2 | 48.0kB  |            7.7 |            |        7.69|
|       4 | 96.0kB  |            8   |            |        8.34|
|       8 | 192.0kB |           12.6 |            |        9.75|
|      12 | 288.0kB |           13   |            |       11.34|
|      16 | 384.0kB |           13.3 |            |       12.99|
|     768 | 18.0MB  |          158.7 |       160.3|            |
|     896 | 21.0MB  |          184.5 |       183.8|            |
|    1024 | 24.0MB  |          209.5 |       207.5|            |
|    1152 | 27.0MB  |          234.3 |       231.9|            |
|    1280 | 30.0MB  |          260   |       255.6|            |
|    1408 | 33.0MB  |          284.9 |       278.7|            |
|    1536 | 36.0MB  |          310.3 |       302.0|            |
|    1664 | 39.0MB  |          336.2 |       325.3|            |
|    1792 | 42.0MB  |          361.4 |       348.8|            |
|    1920 | 45.0MB  |          384.6 |       372.2|            |
|    2048 | 48.0MB  |          409.1 |       395.4|            |

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-09-13 14:30:08 +00:00
Binyang2014
097aa8843a Fix pytest unstable issue. (#170)
- Remove `#include <cstdint>` from `poll.hpp` so that it contains only
device-side code
- Fix a compilation issue that caused pytest to fail randomly. Reuse
the compiled result for the same kernel with different arguments
2023-09-06 17:09:04 -07:00
Binyang2014
858e381829 Pytest (#162)
Port Python tests to mscclpp.
To start pytest, run:
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x`

---------

Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>
2023-09-01 21:22:11 +08:00
Binyang2014
56bdbc2f32 Enable test for both cuda11 and cuda12 (#124)
Update pipeline: enable test for both cuda11 and cuda12
2023-07-10 13:19:14 +08:00
Changho Hwang
bb7b85a810 2-node AllReduce improvements (#118)
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 GPUs/node) AllReduce: algbw 139 GB/s for 1 GB (kernel
3) and 99 GB/s for 48 MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
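The algbw figures quoted in bb7b85a810 can be related to message size and elapsed time with a small helper. This is a minimal sketch, assuming algbw follows the nccl-tests convention (data size divided by elapsed time) and that 1 GB means 10^9 bytes; the function name `algbw_gbps` and the sample numbers are hypothetical and not taken from mscclpp-test.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (assumed definition: size / elapsed time).

    Bytes per microsecond equals 10^6 bytes per second, so dividing by
    1e3 converts the ratio to units of 10^9 bytes per second.
    """
    if time_us <= 0:
        raise ValueError("time_us must be positive")
    return size_bytes / time_us / 1e3


# Hypothetical example: 1e9 bytes moved in 10,000 microseconds.
example_bw = algbw_gbps(1e9, 10_000.0)
print(f"{example_bw:.1f} GB/s")
```

Whether the reported sizes use decimal or binary units changes the result by a few percent, so the unit convention should be checked before comparing against the numbers in the commit message.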
Binyang2014
2640578b22 Add performance check for mscclpp-test (#110)
- Add ndmv4 perf baseline
- Change mscclpp-test to output perf numbers to a JSON file
- Add a Python script to check the perf results against the baseline
2023-06-21 07:42:53 +00:00
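The baseline check described in 2640578b22 could look roughly like the sketch below. It is not the script added by that commit: the JSON layout (test name mapped to latency in microseconds), the 5% tolerance, the function name `find_regressions`, and all sample values are assumptions made for illustration.

```python
import json


def find_regressions(results, baseline, tolerance=0.05):
    """Return (name, measured, target) for every baseline entry that is missed.

    Both arguments map a test name to a latency in microseconds (lower is
    better). A missing measurement is reported with measured=None.
    """
    regressions = []
    for name, target in baseline.items():
        measured = results.get(name)
        if measured is None or measured > target * (1.0 + tolerance):
            regressions.append((name, measured, target))
    return regressions


# Hypothetical data; a real run would load both dicts from files with json.load().
baseline = json.loads('{"allreduce_small": 10.0, "allreduce_large": 200.0}')
results = json.loads('{"allreduce_small": 9.5, "allreduce_large": 230.0}')

regressions = find_regressions(results, baseline)
for name, measured, target in regressions:
    print(f"{name}: measured {measured} us, target {target} us")
```

A pipeline step could fail the build whenever `regressions` is non-empty. A relative tolerance is used here because absolute latency varies by orders of magnitude across message sizes.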
Binyang2014
8efacae332 update pipeline (#103)
Update Azure pipeline:
- Use the mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-node runs
2023-06-14 20:14:57 +08:00