mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-25 07:14:40 +00:00

Author	SHA1	Message	Date
Caio Rocha	ac5cc647e0	Reduce Operation Support to the Executor (#484 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-25 13:58:12 -07:00
Qinghua Zhou	a7c364beb8	nccl/rccl integration (#469 ) Use dlopen to load the NCCL/RCCL APIs from a shared library so that Allgather, Allreduce, Broadcast, and ReduceScatter can fall back to NCCL/RCCL operations. Adds three related environment variables: -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all". By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all of these operations fall back to the NCCL/RCCL APIs. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-20 11:31:37 -07:00
Binyang Li	0b840baa05	Update allgather fallback algo (#476 ) Enhancements to all-gather operation, a temporary solution to fix the memory overhead when integrating msccl++ with pytorch. This solution will not register input/output buffer to msccl++, so the temp output buffer for allgather could be reused by torch automatically. * Introduced a new `allgather8` kernel function in `apps/nccl/src/allgather.hpp` to handle larger data sizes more efficiently. This includes double buffering to hide synchronization overhead and support for both in-place and out-of-place operations. * Modified the `allgather` function to decide between `allgather6` and `allgather8` based on data size and platform, improving performance for large data sizes. Configuration and environment improvements: * Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to control whether the channel cache is disabled, enhancing configurability. This variable is now part of the `Env` class and is logged during environment initialization. * Removed the redundant global variable `mscclppDisableChannelCache` from `src/debug.cc` and updated its usage to refer to the new environment variable.	2025-03-14 11:18:03 -07:00
Binyang Li	79b5eefa6c	Fix memory OOM issue (#479 ) This pull request includes changes to improve memory management in GPU-related functions by ensuring proper release of memory handles. Fix #470	2025-03-10 11:15:17 -07:00
Caio Rocha	bac3c90f6a	Improving Get Operation at MSCCLLang (#475 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-05 09:41:38 -08:00
Caio Rocha	6074eeeac9	Adjust NPKit IB Event (#472 )	2025-02-28 10:16:47 -08:00
Caio Rocha	b3992a8b29	Adding Read Put Packet operation at Executor (#441 )	2025-02-26 09:29:43 -08:00
Caio Rocha	0222bb324d	Adjusting AllGather Collective in MSCCLLang (#466 ) Co-authored-by: Caio Rocha <aiorocha@microsoft.com>	2025-02-25 08:35:26 -08:00
Qinghua Zhou	591276f9d0	Disable channel cache (#463 ) Add workaround of disabling channel cache. Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE (Default value: False) In this PR, some other features (e.g., ncclCommSplit) come from branch binyangli/nccl-api --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-02-19 19:26:12 +00:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder to reflect the correct path. New example script: * `python/examples/allgather_allpairs_multinodes_packets.py`: Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * `src/executor/executor.cc`: Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * `src/include/debug.h`: Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	4ee15b7ad0	Fix PR #449 (#453 )	2025-01-15 11:59:12 -08:00
Changho Hwang	d12247b54a	Lazily create streams for CudaIpcConnection (#449 )	2025-01-15 11:50:02 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Binyang Li	8ac50dc85d	Resolve cuMemMap error (#451 ) * Updated `RegisteredMemory::Impl::Impl(const std::vector<char>& serialization)` to use both minimum and recommended granularities for memory address reservation and mapping. This will resolve the cuMemMap error	2025-01-10 14:22:14 -08:00
Changho Hwang	f2b52c6318	Fix Python binding of exceptions (#444 ) * Fixed errors to be catchable from Python code * Skip IB tests in Python unit tests when IB ports are down	2025-01-09 11:58:23 -08:00
Caio Rocha	80abce59ef	Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-01-08 09:02:36 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	6fedb7c0e8	Fix nccl-test failure issue (#421 )	2024-12-19 12:07:00 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410 ) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	863a599360	Disable CuMemMap check for ROCm (#411 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-17 08:36:25 +00:00
Binyang Li	ee75caf365	Reduce memory usage for scratch buffer (#403 ) In the executor, we allocate the scratch buffer based on `sendMemRange`. However, for certain execution plans, this allocation may be unsuitable, as the plan does not support messages of this size. To avoid allocating too much data and causing an OOM error, set the scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan), scratchBufferSize(sendMemRange))`	2024-12-13 13:00:04 -08:00
Caio Rocha	01fd813f1b	Exception Max Number Operation per Tb (#405 )	2024-12-11 16:06:15 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Ziyue Yang	f6305a3c1d	Add connection events for NPKit (#386 )	2024-12-05 00:06:37 +08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Binyang Li	593478e1b7	Add cross threadblock barrier (#383 )	2024-11-26 05:13:30 +00:00
Changho Hwang	2127a3ba29	Improve CMake options (#376 ) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Ziyue Yang	3e51e9b359	Fix missing packet parameter for executor (#385 )	2024-11-19 08:36:37 +08:00
Binyang Li	1baea89fa0	Fix light load bug (#379 ) Fix lightLoadExecutionPlan issue. An execution context may have multiple device execution plans. These plans share the channel connections that were constructed beforehand. A deviceExecutionPlanKey is introduced to identify these plans. We can get the current device execution plan key via: `contexts.currentDevicePlan`	2024-11-13 07:58:43 +00:00
Caio Rocha	d5d608abdc	Fixing Bug Const Offset in Execution Plan (#380 ) The offset was not differentiating between the buffer types, causing the offset to be incorrect when the buffer type was not `SCRATCH`. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-11-11 20:02:02 -08:00
Changho Hwang	85fdde7a73	Lazily create the context stream (#381 ) Create the context stream only when needed.	2024-11-11 10:39:32 +08:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Changho Hwang	0c150e5166	Fix copyright messages (#367 )	2024-10-17 21:25:46 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Ziyue Yang	5c4e105814	Fix NPKit exit event offset (#356 )	2024-09-19 13:35:44 +08:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	7bedb25054	Add proxy channel related operations (#351 ) Add Flush, PutWithSignal, PutWithFlushAndSignal operation	2024-09-15 13:24:57 -07:00
Binyang Li	26a87535f9	Fix bug for construct semaphore (#341 ) Current semaphore construction requires two-way communication, e.g., to construct a semaphore signaling from rank 0 to rank 1, both rank 0 and rank 1 need to send a message to each other. This PR fixes an executor bug that fails to conduct two-way communication for constructing such one-way semaphores, and instead hangs during the semaphore construction. In the future, we may need to change the implementation to construct semaphore via one-way communication. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-04 19:42:03 +08:00
Changho Hwang	72b99a4229	Fix for ROCm 6.0 (#347 )	2024-09-01 20:22:33 -07:00
Caio Rocha	4eca6f1e95	Support executors to send packets over ProxyChannel (#344 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-08-30 22:10:33 +00:00
Caio Rocha	1af62ea43d	ProxyChannel Support in Executor (#342 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-08-27 10:09:44 -07:00
Changho Hwang	1e82dd444f	Make ibverbs optional at compile time (#340 ) Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-08-21 12:47:05 -07:00
Caio Rocha	ead4efc315	Dynamically load libibverbs (#337 )	2024-08-13 23:48:39 -07:00
Changho Hwang	8c6fb429e9	bfloat16 support (#336 ) * Add bfloat16 support for executor and NCCL interface * Changed `gpu_data_types.hpp` into an internal header file	2024-08-12 15:41:58 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
Roshan Dathathri	f131fae3ec	Add support for different vector sizes in multimem instructions (#332 )	2024-07-25 10:14:02 -07:00
Ziyue Yang	b5a48f836c	Separate NPKit CPU timestamp access from different blocks for AMD platform (#321 ) Reference: https://github.com/ROCm/rccl/pull/1229	2024-07-02 19:36:48 +08:00
Ziyue Yang	f29095b3b1	Fix NPKit support for AMD (#312 )	2024-06-14 16:22:14 +08:00

413 Commits