mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 01:10:22 +00:00

Author	SHA1	Message	Date
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder to reflect the correct path. New example script: * `python/examples/allgather_allpairs_multinodes_packets.py`: Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * `src/executor/executor.cc`: Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * `src/include/debug.h`: Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) * Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions to `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410 ) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	ee75caf365	Reduce memory usage for scratch buffer (#403 ) In the executor, we allocate the scratch buffer based on `sendMemRange`. However, for certain execution plans, this allocation may be unsuitable, as the plan does not support messages of this size. To avoid allocating too much memory and causing an OOM error, set the scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan), scratchBufferSize(sendMemRange))`	2024-12-13 13:00:04 -08:00
Caio Rocha	01fd813f1b	Exception Max Number Operation per Tb (#405 )	2024-12-11 16:06:15 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operations - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Binyang Li	1baea89fa0	Fix light load bug (#379 ) Fix lightLoadExecutionPlan issue. An execution context may have multiple device execution plans. These plans share the channel connections that were constructed earlier. A deviceExecutionPlanKey is introduced to identify these plans. We can get the current device execution plan key via: `contexts.currentDevicePlan`	2024-11-13 07:58:43 +00:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	26a87535f9	Fix bug for constructing semaphore (#341 ) Current semaphore construction requires two-way communication, e.g., to construct a semaphore signaling from rank 0 to rank 1, both rank 0 and rank 1 need to send a message to each other. This PR fixes an executor bug that fails to conduct two-way communication for constructing such one-way semaphores, and instead hangs during the semaphore construction. In the future, we may need to change the implementation to construct semaphores via one-way communication. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-04 19:42:03 +08:00
Caio Rocha	1af62ea43d	ProxyChannel Support in Executor (#342 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-08-27 10:09:44 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
Ziyue Yang	b5a48f836c	Separate NPKit CPU timestamp access from different blocks for AMD platform (#321 ) Reference: https://github.com/ROCm/rccl/pull/1229	2024-07-02 19:36:48 +08:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Binyang Li	80aefe55bc	Cumulative Updates (#309 ) Bug fix: Unable to execute communication primitives with the same execution plan but varying message sizes. Add reduce_packets OP	2024-06-12 19:17:57 +08:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294 )	2024-05-03 11:54:50 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00

20 Commits
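
The scratch buffer sizing rule described in ee75caf365 (#403) can be sketched as follows. This is only an illustrative Python sketch, not the executor's actual code: `pick_scratch_buffer_size` and `double_input` are hypothetical names, and the real `scratchBufferSize` computation is not shown in the commit message, so it is passed in as a callable.

```python
from typing import Callable


def pick_scratch_buffer_size(
    send_mem_range: int,
    max_message_size_for_plan: int,
    scratch_buffer_size: Callable[[int], int],
) -> int:
    """Return the smaller of the two candidate scratch buffer sizes.

    Mirrors the rule quoted in the commit message:
    min(scratchBufferSize(maxMessageSizeSupportedForPlan),
        scratchBufferSize(sendMemRange)).
    """
    return min(
        scratch_buffer_size(max_message_size_for_plan),
        scratch_buffer_size(send_mem_range),
    )


def double_input(message_size: int) -> int:
    # Hypothetical stand-in for scratchBufferSize(); the real formula
    # depends on the execution plan and is an assumption here.
    return 2 * message_size


# A plan that supports at most 1 MiB messages caps the allocation even
# when the registered send range is 1 GiB.
capped = pick_scratch_buffer_size(1 << 30, 1 << 20, double_input)
```

With these assumed inputs the allocation follows the plan's limit rather than the full send range, which is the OOM-avoidance effect the commit describes.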