mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Binyang Li	06f31994dc	Fix performance issue introduced in PR #499 (#505 ) 1. Use `fence+relaxed` instead of `release` for the FIFO; `fence+relaxed` is more efficient on A100. 2. Update the deviceSyncer. The previous version could not handle a change in the number of thread blocks correctly. Use three counters to solve this: reset the previous counter before syncing on the current counter. 3. Introduce relaxedWait, which can be paired with relaxedSignal for cases that do not need to guarantee memory visibility.	2025-04-22 14:03:37 -07:00
Changho Hwang	def68ced64	Add CUDA 12.8 images (#488 )	2025-03-29 00:31:26 +00:00
Binyang Li	c65f19ad1a	Move pipeline to official org (#406 ) Move pipeline to official org. Unify all pipelines	2024-12-16 09:43:00 -08:00
Binyang Li	7a3dcb0627	Setup pipeline for mscclpp over nccl (#401 ) Setup pipeline for mscclpp over nccl Run `all_reduce_perf` via nccl API	2024-12-07 08:57:45 -08:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 )	2024-03-27 11:53:09 -07:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273 )	2024-03-25 15:55:37 -07:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223 ) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Binyang2014	8a938de9c5	fix pipeline (#209 ) fix pipeline for multi-node test	2023-11-03 05:18:32 +00:00
Binyang2014	952f2da9cc	Improve single node allreduce performance (#169 ) Improve all reduce performance for single node. New number:

| n_ctx | size | target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
| 1 | 24.0kB | 7.7 | | 7.23 |
| 2 | 48.0kB | 7.7 | | 7.69 |
| 4 | 96.0kB | 8 | | 8.34 |
| 8 | 192.0kB | 12.6 | | 9.75 |
| 12 | 288.0kB | 13 | | 11.34 |
| 16 | 384.0kB | 13.3 | | 12.99 |
| 768 | 18.0MB | 158.7 | 160.3 | |
| 896 | 21.0MB | 184.5 | 183.8 | |
| 1024 | 24.0MB | 209.5 | 207.5 | |
| 1152 | 27.0MB | 234.3 | 231.9 | |
| 1280 | 30.0MB | 260 | 255.6 | |
| 1408 | 33.0MB | 284.9 | 278.7 | |
| 1536 | 36.0MB | 310.3 | 302.0 | |
| 1664 | 39.0MB | 336.2 | 325.3 | |
| 1792 | 42.0MB | 361.4 | 348.8 | |
| 1920 | 45.0MB | 384.6 | 372.2 | |
| 2048 | 48.0MB | 409.1 | 395.4 | |

--------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-09-13 14:30:08 +00:00
Binyang2014	097aa8843a	Fix pytest unstable issue. (#170 ) - Remove `#include <cstdint>` from `poll.hpp` so that it contains only device-side code - Fix a compilation issue that caused pytest to fail randomly: reuse the compiled result for the same kernel with different arguments	2023-09-06 17:09:04 -07:00
Binyang2014	858e381829	Pytest (#162 ) Port python tests to mscclpp. Please run `mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest --------- Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>	2023-09-01 21:22:11 +08:00
Binyang2014	56bdbc2f32	Enable test for both cuda11 and cuda12 (#124 ) Update pipeline: enable test for both cuda11 and cuda12	2023-07-10 13:19:14 +08:00
Changho Hwang	bb7b85a810	2-node AllReduce improvements (#118 ) * Added `get()` interfaces to `SmChannel` * Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel 3) and 99GB/s for 48MB (kernel 4) * Fixed a FIFO perf bug * Several fixes & validations in mscclpp-test --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-07-07 07:05:46 +00:00
Binyang2014	2640578b22	Add performance check for mscclpp-test (#110 ) - Add ndmv4 perf baseline - change mscclpp-test to output perf number into a json file - add python script to check the perf result with the baseline	2023-06-21 07:42:53 +00:00
Binyang2014	8efacae332	update pipeline (#103 ) Update Azure pipeline: - Using mscclpp:base-cuda12.1 image for building and testing - Add mp-ut tests for multi-nodes	2023-06-14 20:14:57 +08:00

16 Commits
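
Commit 2640578b22 mentions writing mscclpp-test performance numbers to a JSON file and checking them against a baseline with a Python script. The following is only a minimal sketch of that kind of check, not the repository's script: the record fields (`name`, `size`, `time_us`), the one-JSON-object-per-line layout, and the 5% tolerance are all assumptions made for illustration.

```python
import json


def load_json_lines(text):
    """Parse one JSON object per non-empty line (assumed file layout)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_regressions(results, baseline, tolerance=0.05):
    """Return the (name, size) keys whose measured time exceeds the
    baseline time by more than `tolerance` (a hypothetical threshold)."""
    # Index the baseline by test name and message size for direct lookup.
    base = {(r["name"], r["size"]): r["time_us"] for r in baseline}
    regressions = []
    for r in results:
        key = (r["name"], r["size"])
        if key not in base:
            continue  # no baseline for this configuration, so nothing to compare
        if r["time_us"] > base[key] * (1 + tolerance):
            regressions.append(key)
    return regressions
```

A caller would load both files with `load_json_lines`, pass the two lists to `find_regressions`, and fail the pipeline step if the returned list is non-empty.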