mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Binyang Li	be6a941fba	New DSL implementation (#579) The PR contains the following changes. Python side: - Channel-based DSL implementation: decouple channels from chunks. - Users create channels explicitly and only need local_rank, remote_rank and channel_type. - Adjust the executor JSON file and add remote_buffer fields; different ops can use different combinations of channels and remote buffers. - Reimplement operation fusion and the data dependency check mechanism. - Add new ops such as semaphore and pipeline. - Clean up code and enhance documentation. C++ side: - Support the new execution file JSON format. - Support semaphore and pipeline operations. - Clean up code and support the non-zero-copy scenario. --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Changho Hwang	5b84c8a3d1	Separate linters from cmake (#587)	2025-07-28 09:59:20 +08:00
Binyang Li	adc9ee5684	Export mscclpp GpuBuffer to dlpack format (#492) To use nvls, mscclpp requires that the buffer be allocated by mscclpp::GpuBuffer. Because cupy doesn't support bfloat16 yet, we export the raw buffer in dlpack format. Users can use this feature to create a buffer with any type supported by pytorch ```python buffer = RawGpuBuffer(1024 * 2) # 2 for bfloat16 dl_pack = buffer.to_dlpack(str(torch.bfloat16)) tensor = torch.utils.dlpack.from_dlpack(dl_pack) ```	2025-04-03 12:59:32 -07:00
Changho Hwang	def68ced64	Add CUDA 12.8 images (#488)	2025-03-29 00:31:26 +00:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455) Documentation update: * `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder to reflect the correct path. New example script: * `python/examples/allgather_allpairs_multinodes_packets.py`: Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * `src/executor/executor.cc`: Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * `src/include/debug.h`: Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Binyang Li	af0bb86e07	Merge mscclpp-lang to mscclpp project (#442) First step to merge msccl-tools into the mscclpp repo. This step moves all msccl-related code, passes the current tests and does some necessary refactoring. - Add `mscclpp.language` module - Add `_InstructionOptimizer` and `DagOptimizer` classes to optimize the DAG - Add `DagLower` to lower the DAG to an intermediate representation - Add documents for mscclpp.language - Remove msccl-related code	2025-01-22 09:47:37 -08:00
Changho Hwang	2b54af7e27	Auto-update version numbers in CMakeLists.txt (#450)	2025-01-09 17:54:10 -08:00
Binyang Li	3d6bfed2cf	Update version number (#433) Co-authored-by: github-actions <github-actions@github.com>	2025-01-02 16:45:08 -08:00
Binyang Li	f18a440feb	trigger ci for release branches (#426)	2024-12-21 00:05:13 +00:00
Changho Hwang	2127a3ba29	Improve CMake options (#376) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	4136153a76	[Doc] mscclpp docs (#348) Generate docs for mscclpp. Set up a GitHub Action to auto-deploy the GitHub Pages doc. Link: https://microsoft.github.io/mscclpp --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-10-18 06:08:31 +00:00
Changho Hwang	8a330f9135	Update ROCm CI (#357) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-09-20 17:57:02 +00:00
caiomcbr	7493e2f075	Double buffering for NCCL APIs (#324) Using two scratch buffers in each peer to exchange data. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-15 22:18:53 +00:00
Binyang Li	422c81f0f8	remove make pylib-copy command (#249) Fix #216 Remove `make pylib-copy`	2024-01-19 12:29:15 -08:00
Changho Hwang	5fa5bd2706	Check `nvidia_peermem` during runtime (#234)	2023-12-25 12:02:10 +08:00
Changho Hwang	c15a166cf0	Add a documentation issue template (#230)	2023-12-05 01:01:45 +00:00
Changho Hwang	544ff0c21d	ROCm support (#213) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	f68820436c	Explicit build dependency on `nvidia_peermem` (#201)	2023-10-23 04:29:30 +00:00
Changho Hwang	8c0f9e84d0	v0.3.0 (#171)	2023-10-11 22:35:54 +08:00
Changho Hwang	11ac824cc7	Align interfaces of put/get/putPackets/getPackets (#185)	2023-10-07 22:18:26 +08:00
Changho Hwang	497a9e0c82	Add backup workflows (#189)	2023-10-07 15:13:49 +08:00
Changho Hwang	bb64f68d74	Update issue templates (#179)	2023-09-15 04:05:09 +00:00
Saeed Maleki	e7d5e652df	Python bindings (#125) Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-07-19 15:35:54 +08:00
Binyang2014	56bdbc2f32	Enable test for both cuda11 and cuda12 (#124) Update pipeline: enable test for both cuda11 and cuda12	2023-07-10 13:19:14 +08:00
Changho Hwang	4114d65c60	Documents & minor updates (#119) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-07-07 17:35:05 +08:00
Changho Hwang	bb7b85a810	2-node AllReduce improvements (#118) * Added `get()` interfaces to `SmChannel` * Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel 3) and 99GB/s for 48MB (kernel 4) * Fixed a FIFO perf bug * Several fixes & validations in mscclpp-test --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-07-07 07:05:46 +00:00
Binyang2014	2640578b22	Add performance check for mscclpp-test (#110) - Add ndmv4 perf baseline - change mscclpp-test to output perf number into a json file - add python script to check the perf result with the baseline	2023-06-21 07:42:53 +00:00
Changho Hwang	5a4885ccbb	Misc updates (#95)	2023-06-12 13:53:43 +08:00
Changho Hwang	798631bd52	Update unit tests (#81)	2023-06-08 09:58:05 +00:00
Changho Hwang	7346e70109	Use MSCCL++ Docker image for CodeQL (#94)	2023-06-06 18:42:22 +08:00
Changho Hwang	0581bfb431	Fix CodeQL workflow (#80)	2023-05-22 14:03:30 +08:00
Changho Hwang	8d54bf3301	Update CI (#79)	2023-05-21 11:45:41 -07:00
Binyang Li	5704fb7c6a	update	2023-05-11 08:55:51 +00:00
Binyang Li	1487596dc8	update cpplint	2023-05-11 08:34:57 +00:00
Binyang Li	669c67b3de	enable github action on all branches	2023-05-05 08:42:25 +00:00
Changho Hwang	72431957fd	Use clang-format-12	2023-03-27 14:00:03 +00:00
Binyang Li	7ec6ae9d6a	add cpplint and CI	2023-03-27 03:32:10 +00:00

38 Commits