mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Changho Hwang	def68ced64	Add CUDA 12.8 images (#488)	2025-03-29 00:31:26 +00:00
Binyang Li	c65f19ad1a	Move pipeline to official org (#406) Move pipeline to official org. Unify all pipelines	2024-12-16 09:43:00 -08:00
Binyang Li	7a3dcb0627	Setup pipeline for mscclpp over nccl (#401) Setup pipeline for mscclpp over nccl Run `all_reduce_perf` via nccl API	2024-12-07 08:57:45 -08:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279)	2024-03-27 11:53:09 -07:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273)	2024-03-25 15:55:37 -07:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Binyang2014	8a938de9c5	fix pipeline (#209) fix pipeline for multi-node test	2023-11-03 05:18:32 +00:00
Binyang2014	952f2da9cc	Improve single node allreduce performance (#169) Improve all reduce performance for single node. New number: (table below) --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-09-13 14:30:08 +00:00

| n_ctx | size | target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
| 1 | 24.0kB | 7.7 | | 7.23 |
| 2 | 48.0kB | 7.7 | | 7.69 |
| 4 | 96.0kB | 8 | | 8.34 |
| 8 | 192.0kB | 12.6 | | 9.75 |
| 12 | 288.0kB | 13 | | 11.34 |
| 16 | 384.0kB | 13.3 | | 12.99 |
| 768 | 18.0MB | 158.7 | 160.3 | |
| 896 | 21.0MB | 184.5 | 183.8 | |
| 1024 | 24.0MB | 209.5 | 207.5 | |
| 1152 | 27.0MB | 234.3 | 231.9 | |
| 1280 | 30.0MB | 260 | 255.6 | |
| 1408 | 33.0MB | 284.9 | 278.7 | |
| 1536 | 36.0MB | 310.3 | 302.0 | |
| 1664 | 39.0MB | 336.2 | 325.3 | |
| 1792 | 42.0MB | 361.4 | 348.8 | |
| 1920 | 45.0MB | 384.6 | 372.2 | |
| 2048 | 48.0MB | 409.1 | 395.4 | |

Binyang2014	097aa8843a	Fix pytest unstable issue. (#170) - remove `#include <cstdint>` from `poll.hpp`. To make it only contains device-side code - Fix compilation issue, which will cause pytest fail randomly. Reuse the compiled result for same kernel with different arguments	2023-09-06 17:09:04 -07:00
Binyang2014	858e381829	Pytest (#162) Port python tests to mscclpp. Please run `mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest --------- Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>	2023-09-01 21:22:11 +08:00
Binyang2014	56bdbc2f32	Enable test for both cuda11 and cuda12 (#124) Update pipeline: enable test for both cuda11 and cuda12	2023-07-10 13:19:14 +08:00
Changho Hwang	bb7b85a810	2-node AllReduce improvements (#118) * Added `get()` interfaces to `SmChannel` * Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel 3) and 99GB/s for 48MB (kernel 4) * Fixed a FIFO perf bug * Several fixes & validations in mscclpp-test --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-07-07 07:05:46 +00:00
Binyang2014	2640578b22	Add performance check for mscclpp-test (#110) - Add ndmv4 perf baseline - change mscclpp-test to output perf number into a json file - add python script to check the perf result with the baseline	2023-06-21 07:42:53 +00:00
Binyang2014	8efacae332	update pipeline (#103) Update Azure pipeline: - Using mscclpp:base-cuda12.1 image for building and testing - Add mp-ut tests for multi-nodes	2023-06-14 20:14:57 +08:00

15 Commits