mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Changho Hwang	def68ced64	Add CUDA 12.8 images (#488 )	2025-03-29 00:31:26 +00:00
Binyang Li	a3d8d6807b	Remove the requirement for `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support (#489 ) Fix #487	2025-03-28 16:46:54 -07:00
Qinghua Zhou	0f21ed44b8	Add CI test for fallback allgather, allreduce, broadcast, and reducescatter to NCCL operations (#485 ) Tests the following parameters: -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce, broadcast, reducescatter" or "all"	2025-03-27 21:13:07 +00:00
Caio Rocha	ac5cc647e0	Reduce Operation Support to the Executor (#484 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-25 13:58:12 -07:00
Binyang Li	b4062462fd	Fix reduceMin failure issue (#486 ) Remove the reduceOp check, as this is already done in the `getReduceOp` method	2025-03-25 10:15:24 -07:00
Qinghua Zhou	a7c364beb8	nccl/rccl integration (#469 ) Use dlopen to load nccl/rccl APIs from a shared library so that Allgather, Allreduce, Broadcast, and ReduceScatter can fall back to nccl/rccl operations. Add three related environment variables: -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all". By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all of these operations fall back to the nccl/rccl APIs. Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-20 11:31:37 -07:00
Binyang Li	89f7573adf	Fix correctness issue when mscclppDisableChannelCache is set to true (#483 ) If `mscclppDisableChannelCache` is set to true, we need to keep every channel's information so that the channel info on the GPU side is not released.	2025-03-19 14:55:37 -07:00
Caio Rocha	b6a179faff	NCCL API CI Test for ReduceScatter (#465 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-17 13:58:32 -07:00
Binyang Li	f124dc1df9	Add min operation for allreduce (#481 )	2025-03-16 20:47:36 -07:00
Binyang Li	0b840baa05	Update allgather fallback algo (#476 ) Enhancements to all-gather operation, a temporary solution to fix the memory overhead when integrating msccl++ with pytorch. This solution will not register input/output buffer to msccl++, so the temp output buffer for allgather could be reused by torch automatically. * Introduced a new `allgather8` kernel function in `apps/nccl/src/allgather.hpp` to handle larger data sizes more efficiently. This includes double buffering to hide synchronization overhead and support for both in-place and out-of-place operations. * Modified the `allgather` function to decide between `allgather6` and `allgather8` based on data size and platform, improving performance for large data sizes. Configuration and environment improvements: * Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to control whether the channel cache is disabled, enhancing configurability. This variable is now part of the `Env` class and is logged during environment initialization. * Removed the redundant global variable `mscclppDisableChannelCache` from `src/debug.cc` and updated its usage to refer to the new environment variable.	2025-03-14 11:18:03 -07:00
Changho Hwang	e4012ded48	Mark mscclpp-test as deprecated in the doc (#478 )	2025-03-11 22:44:38 +00:00
Binyang Li	79b5eefa6c	Fix memory OOM issue (#479 ) This pull request includes changes to improve memory management in GPU-related functions by ensuring proper release of memory handles. Fix #470	2025-03-10 11:15:17 -07:00
Caio Rocha	bac3c90f6a	Improving Get Operation at MSCCLLang (#475 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-05 09:41:38 -08:00
Yang Wang	1ff217d5f3	Fix minor typos and errors in documentation (#474 )	2025-02-28 17:46:24 -08:00
Caio Rocha	6074eeeac9	Adjust NPKit IB Event (#472 )	2025-02-28 10:16:47 -08:00
Caio Rocha	986c45b71a	NPKit Support to Read Put Packet Operation (#471 )	2025-02-27 12:02:16 -08:00
Caio Rocha	b3992a8b29	Adding Read Put Packet operation at Executor (#441 )	2025-02-26 09:29:43 -08:00
Caio Rocha	0222bb324d	Adjusting AllGather Collective in MSCCLLang (#466 ) Co-authored-by: Caio Rocha <aiorocha@microsoft.com>	2025-02-25 08:35:26 -08:00
Qinghua Zhou	591276f9d0	Disable channel cache (#463 ) Add a workaround that disables the channel cache. Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE (Default value: False). In this PR, some other features (e.g., ncclCommSplit) come from branch binyangli/nccl-api. Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-02-19 19:26:12 +00:00
Caio Rocha	8a564977e5	Updating MSCCLLang Examples (#462 ) Co-authored-by: Caio Rocha <aiorocha@microsoft.com>	2025-02-19 09:48:31 -08:00
Caio Rocha	55789bc551	Support ReduceScatter in the NCCL interface (#460 ) Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net> Co-authored-by: Caio Rocha <aiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-02-11 13:28:19 -08:00
Binyang Li	a6e00cc449	remove unnecessary sync (#461 ) `nop` instruction is only for synchronization within the same threadblock. Cross threadblock synchronization is handled by `barrier` instruction. So insert `nop` only if the dependency is within the same threadblock.	2025-02-10 15:31:49 +08:00
Caio Rocha	e7cff899ce	Adjusting BFS to seek circular dependencies in the msccl-tools DAG (#459 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-02-07 11:24:27 -08:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder to reflect the correct path. New example script: * `python/examples/allgather_allpairs_multinodes_packets.py`: Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * `src/executor/executor.cc`: Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * `src/include/debug.h`: Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Binyang Li	af0bb86e07	Merge mscclpp-lang to mscclpp project (#442 ) First step to merge msccl-tools into the mscclpp repo. This step moves all msccl-related code, passes the current tests, and does some necessary refactoring. * Add `mscclpp.language` module * Add `_InstructionOptimizer` and `DagOptimizer` classes to optimize the DAG * Add `DagLower` to lower the DAG to an intermediate representation * Add documents for mscclpp.language * Remove msccl-related code	2025-01-22 09:47:37 -08:00
Changho Hwang	4ee15b7ad0	Fix PR #449 (#453 )	2025-01-15 11:59:12 -08:00
Changho Hwang	d12247b54a	Lazily create streams for CudaIpcConnection (#449 )	2025-01-15 11:50:02 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Binyang Li	8ac50dc85d	Resolve cuMemMap error (#451 ) * Updated `RegisteredMemory::Impl::Impl(const std::vector<char>& serialization)` to use both minimum and recommended granularities for memory address reservation and mapping. This will resolve the cuMemMap error	2025-01-10 14:22:14 -08:00
Changho Hwang	2b54af7e27	Auto-update version numbers in CMakeLists.txt (#450 )	2025-01-09 17:54:10 -08:00
Changho Hwang	f2b52c6318	Fix Python binding of exceptions (#444 ) * Fixed errors to be catchable from Python code * Skip IB tests in Python unit tests when IB ports are down	2025-01-09 11:58:23 -08:00
Caio Rocha	80abce59ef	Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-01-08 09:02:36 -08:00
Changho Hwang	1989d4be9c	Fix CMake build messages (#443 )	2025-01-08 02:44:01 +00:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) * Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	6d26b92665	Fix azure pipeline (#437 )	2025-01-04 19:41:10 -08:00
Pedram Alizadeh	97eaca2bd2	[NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399 )	2025-01-03 20:38:57 +00:00
Qinghua Zhou	ba0d0d68b8	Enhance the nccl error message handling (#434 ) Add WARN or INFO before returning the nccl error message. Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.	2025-01-03 00:50:36 +00:00
Binyang Li	3d6bfed2cf	Update version number (#433 ) Co-authored-by: github-actions <github-actions@github.com>	2025-01-02 16:45:08 -08:00
Changho Hwang	dff21905e3	Fix typos in the pipeline (#420 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-01-02 09:49:50 -08:00
Binyang Li	3e7801b1a8	Fix CI trigger issue (#428 )	2024-12-20 16:27:52 -08:00
Binyang Li	f18a440feb	trigger ci for release branches (#426 )	2024-12-21 00:05:13 +00:00
Changho Hwang	e2230aab26	Tackle build warnings (#422 ) * Comply with [CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html) * Tackle other warnings during build	2024-12-19 16:51:50 -08:00
Binyang Li	6fedb7c0e8	Fix nccl-test failure issue (#421 )	2024-12-19 12:07:00 -08:00
Binyang Li	776f24e787	Update README (#414 )	2024-12-19 05:54:27 +00:00
SreevatsaAnantharamu	0c7ed2c674	Add ncclBcast / ncclBroadcast support (#419 ) A simple broadcast using scratch buffer and option to use an executor.	2024-12-19 01:16:30 +00:00
David Sidler	d8d0dfbffa	Fix synchronization in allreduce8 kernel (#407 ) Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a synchronization bug. The PR addresses it by ensuring that signals are only issued after all threads in the block have issued their writes to guarantee correct ordering between data writes and signal writes. Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 17:10:22 -08:00
Caio Rocha	774602d49c	Supporting Executor multi node in NCCL API (#412 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-12-18 15:50:58 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410 ) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	863a599360	Disable CuMemMap check for ROCm (#411 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-17 08:36:25 +00:00
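
Commits a7c364beb8 and 0f21ed44b8 above describe three environment variables that control the NCCL/RCCL fallback. The sketch below illustrates, in Python, how such settings might be interpreted. It is not the library's parsing code: the helper name `fallback_operations`, the case-insensitive matching, the whitespace trimming, and the treatment of an empty value as "unset" are all assumptions made for illustration. Only the variable names, the four operation names, the "all" keyword, and the default (all operations fall back when the list is not specified) come from the commit messages.

```python
# Hypothetical illustration of the fallback settings named in the commit
# messages; this is NOT code from the mscclpp repository.

FALLBACK_OPS = ("allgather", "allreduce", "broadcast", "reducescatter")


def fallback_operations(env):
    """Return the set of collectives that would fall back to NCCL/RCCL.

    `env` is any mapping of environment variable names to strings,
    for example `os.environ` or a plain dict.
    """
    # Assumption: the fallback is active only when the flag equals TRUE
    # (compared case-insensitively here).
    enabled = env.get("MSCCLPP_ENABLE_NCCL_FALLBACK", "")
    if enabled.strip().upper() != "TRUE":
        return set()

    raw = env.get("MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION")
    # Per commit a7c364beb8: when the variable is not specified, all four
    # operations fall back. Treating an empty string the same way is an
    # assumption of this sketch.
    if raw is None or not raw.strip() or raw.strip().lower() == "all":
        return set(FALLBACK_OPS)

    # Accept both "a,b" and "a, b" spellings, as both appear in the log.
    requested = {tok.strip().lower() for tok in raw.split(",") if tok.strip()}
    unknown = requested - set(FALLBACK_OPS)
    if unknown:
        raise ValueError(f"unknown fallback operation(s): {sorted(unknown)}")
    return requested
```

Passing `os.environ` to the helper would reflect the variables that the commit messages set with `-x` on the launcher command line; passing a plain dict makes the behaviour easy to check in isolation.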

731 Commits