mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-02 20:51:26 +00:00

Author	SHA1	Message	Date
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00
Caio Rocha	eb202780f5	Support Synchronous Initialization for Proxy Service (#679 )	2025-11-12 18:35:57 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two integration ways for MSCCL++ DSL. 1. Integrate with customized communication group 2. Integrate with NCCL API Introduce new Python APIs to make it work: ```python mscclpp.compile # compile dsl to json based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Changho Hwang	68b1f151f0	Rename nvls* files (#660 ) Rename nvls* files to switch_channel*	2025-10-24 11:34:26 -07:00
Binyang Li	70b8297c56	Revise NCCL API implementation (#617 ) - Make nccl interface extensible. Customer can register their own algo to NCCL API. User can provide customized algo selection function. - Fallback to NCCL/RCCL if no algo is selected based on algo selection function - MSCCLPP interfaces now works for any scale	2025-09-26 10:08:12 -07:00
Caio Rocha	f839184a27	Fix deadlock in Executor channel setup (#616 ) In cases where we have circular channel creation, such as: creating channel 0 <-> 1 creating channel 1 <-> 2 creating channel 2 <-> 3 creating channel 3 <-> 0 creating channel 0 <-> 3 creating channel 1 <-> 0 creating channel 2 <-> 1 creating channel 3 <-> 2 This setup can result in a deadlock during the first channel creation for each rank. The current code requires sharing the semaphore for the first channel before moving on, which leads to the following sequence: creating channel 0 <-> 1 creating channel 1 <-> 2 creating channel 2 <-> 3 creating channel 3 <-> 0 <-- HANG ISSUE --> The process hangs because, for example, rank 0 will only share the semaphore with rank 3 after receiving it from rank 1. However, rank 1 is waiting for a semaphore from rank 2, rank 2 is waiting for one from rank 3, and rank 3 is waiting for one from rank 0. The solution is to make this creation asynchronous and only retrieve the semaphore after all semaphores have been requested.	2025-08-19 16:44:44 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains following changes: Python side: - Channel based DSL implementation: decouple channel with chunk. - Users create channel explicitly, only need local_rank, remote_rank and channel_type - Adjust executor json file, add remote_buffer fields, different op can use different channel and remote buffers combination. - Reimplement operation fusion, data dependency check mechanism - Add new op such as semaphore, pipeline - Clean code and enhance document C++ side: - Support new execution file json format - Support semaphore and pipeline operation - code clean, support non-zero copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585 ) Change to use smart pointer for IB structure. Registered memory will own ibMr, ibCtx will not held the reference - Use smart pointer for IbQp and IbMr - Update memoryChannel API, keep localRegisteredMemory - Close fd when registedMemory released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Changho Hwang	199468bc47	Revise NVLS interface (#458 ) * Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel` * Minor interface improvements	2025-07-12 00:33:03 +00:00
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Changho Hwang	de664ad200	Fix #514 (#521 ) * In cases when the same `tag` is used for receiving data from the same remote rank, #514 changed the behavior of `Communicator::connect` and `Communicator::recvMemory` to receive data in the order of `std::shared_future::get()` is called, instead of the original behvaior that receive data in the order of the method calls. Since the original behavior is more intuitive, we get that back. Now when `get()` is called on a future, the async function will first call `wait()` on the latest previously returned future. In a recursive manner, this will call `wait()` on all previous futures that are not yet ready. * Removed all deprecated API calls and replaced into the new ones.	2025-05-13 13:43:35 -07:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * [`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114): Updated the link to the examples folder to reflect the correct path. New example script: * [`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76): Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * [`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472): Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * [`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441): Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * [`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89): Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410 ) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	ee75caf365	Reduce memory usage for scratch buffer (#403 ) In the executor, we allocate the scratch buffer based on `sendMemRange`. However, for certain execution plans, this allocation may be unsuitable, as the plan does not support messages of this size. To avoid allocating to much data and cause OOM error, set scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan), scratchBufferSize(sendMemRange))`	2024-12-13 13:00:04 -08:00
Caio Rocha	01fd813f1b	Exception Max Number Operation per Tb (#405 )	2024-12-11 16:06:15 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support mote datatype for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which return std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Binyang Li	1baea89fa0	Fix light load bug (#379 ) Fix lightLoadExecutionPlan issue. An execution context many have multi device execution plans. These plans share the channel connections which are constructed before. A deviceExecutionPlanKey is introduced to identify these plans. We can get the current device execution plan key via: `contexts.currentDevicePlan`	2024-11-13 07:58:43 +00:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	26a87535f9	Fix bug for construct sempaphore (#341 ) Current semaphore construction requires two-way communication, e.g., to construct a semaphore signaling from rank 0 to rank 1, both rank 0 and rank 1 need to send a message to each other. This PR fixes an executor bug that fails to conduct two-way communication for constructing such one-way semaphores, and instead hangs during the semaphore construction. In the future, we may need to change the implementation to construct semaphore via one-way communication. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-04 19:42:03 +08:00
Caio Rocha	1af62ea43d	ProxyChannel Support in Executor (#342 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-08-27 10:09:44 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
Ziyue Yang	b5a48f836c	Separate NPKit CPU timestamp access from different blocks for AMD platform (#321 ) Reference: https://github.com/ROCm/rccl/pull/1229	2024-07-02 19:36:48 +08:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Binyang Li	80aefe55bc	Cumulative Updates (#309 ) Bug fix: Unable to execute communication primitives with the same execution plan but varying message sizes. Add reduce_packets OP	2024-06-12 19:17:57 +08:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294 )	2024-05-03 11:54:50 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00

31 Commits