mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-24 23:06:17 +00:00

Author	SHA1	Message	Date
Changho Hwang	83356957bd	Improved documentation & minor interface revision (#541 )	2025-06-03 14:26:27 -07:00
Changho Hwang	c184485808	DLPack fixes (#537 ) * Fix typos in type name * Make it work without current context set	2025-05-27 21:40:50 +00:00
Changho Hwang	7278b51e61	Rename `ChannelTrigger` fields and check field values in debug builds (#529 )	2025-05-27 14:36:22 -07:00
Caio Rocha	29c3af2ac6	Properly setting up the device in Ethernet Connection (#527 ) When we create the thread to receive messages in the Ethernet Connection, it resets the Device ID, causing faults in the Ethernet Connection unit tests. ![image](https://github.com/user-attachments/assets/ba609c16-0f52-4624-807a-5ad776a0c18d) This PR aims to properly set up the device when the thread is created. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-05-19 10:05:45 -07:00
Changho Hwang	de664ad200	Fix #514 (#521 ) * In cases when the same `tag` is used for receiving data from the same remote rank, #514 changed the behavior of `Communicator::connect` and `Communicator::recvMemory` to receive data in the order of `std::shared_future::get()` is called, instead of the original behvaior that receive data in the order of the method calls. Since the original behavior is more intuitive, we get that back. Now when `get()` is called on a future, the async function will first call `wait()` on the latest previously returned future. In a recursive manner, this will call `wait()` on all previous futures that are not yet ready. * Removed all deprecated API calls and replaced into the new ones.	2025-05-13 13:43:35 -07:00
Changho Hwang	d636093336	Asynchronous setup (#514 ) Cherry-picked a part of features from #167: now `Communicator::setup()` is unneeded. `Communicator::sendMemory()` conducts the task inline, and `Communicator::recvMemory()` and `Communicator::connect()` conducts the task asynchronously without explicit setup.	2025-05-08 22:01:51 +00:00
Qinghua Zhou	b4f0af8f9f	Support ibv_reg_dmabuf_mr for buffer allocated by cuMemMalloc (#513 ) Fix #496 For buffer allocated by cuMemMalloc, use ibv_reg_dmabuf_mr to register a dma-buf based memory region.	2025-05-07 17:26:14 -07:00
Binyang Li	affca7d9bc	Add NVLS based fallback algo (#507 ) Add two nvls based fallback algo. allreduce9 is for nvls with zero copy. allreduce10 is for nvls need to copy to scratch buffer, do reduce operation then copy result back to result buffer. Perf number for allreduce9 ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 256 float sum -1 5.45 0.19 0.33 0 5.35 0.19 0.33 0 2048 512 float sum -1 5.57 0.37 0.64 0 5.53 0.37 0.65 0 4096 1024 float sum -1 5.80 0.71 1.24 0 5.78 0.71 1.24 0 8192 2048 float sum -1 5.94 1.38 2.42 0 5.85 1.40 2.45 0 16384 4096 float sum -1 6.40 2.56 4.48 0 6.27 2.61 4.57 0 32768 8192 float sum -1 7.45 4.40 7.70 0 7.39 4.43 7.76 0 65536 16384 float sum -1 8.03 8.17 14.29 0 8.32 7.88 13.79 0 131072 32768 float sum -1 7.28 18.00 31.49 0 7.07 18.53 32.43 0 262144 65536 float sum -1 7.72 33.95 59.41 0 7.59 34.56 60.48 0 524288 131072 float sum -1 8.70 60.29 105.51 0 8.37 62.61 109.57 0 1048576 262144 float sum -1 10.56 99.26 173.70 0 10.32 101.64 177.87 0 2097152 524288 float sum -1 14.45 145.14 253.99 0 14.02 149.58 261.76 0 4194304 1048576 float sum -1 22.83 183.73 321.52 0 23.03 182.14 318.75 0 8388608 2097152 float sum -1 38.63 217.14 380.00 0 38.57 217.52 380.65 0 16777216 4194304 float sum -1 70.03 239.58 419.27 0 69.96 239.80 419.66 0 33554432 8388608 float sum -1 131.5 255.17 446.55 0 131.3 255.59 447.28 0 67108864 16777216 float sum -1 255.8 262.37 459.15 0 255.4 262.75 459.82 0 134217728 33554432 float sum -1 500.9 267.94 468.90 0 500.0 268.42 469.74 0 268435456 67108864 float sum -1 989.0 271.41 474.97 0 988.9 271.45 475.05 0 536870912 134217728 float sum -1 1967.4 272.88 477.54 0 1966.0 273.08 477.88 0 1073741824 268435456 float sum -1 3908.5 274.72 480.77 0 3904.6 274.99 481.24 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 218.734 ``` Perf number for allreduce10 ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 256 float sum -1 5.60 0.18 0.32 0 5.52 0.19 0.32 0 2048 512 float sum -1 5.79 0.35 0.62 0 5.64 0.36 0.64 0 4096 1024 float sum -1 5.92 0.69 1.21 0 5.82 0.70 1.23 0 8192 2048 float sum -1 6.03 1.36 2.38 0 5.95 1.38 2.41 0 16384 4096 float sum -1 6.58 2.49 4.35 0 6.39 2.56 4.49 0 32768 8192 float sum -1 7.54 4.34 7.60 0 7.41 4.42 7.74 0 65536 16384 float sum -1 7.95 8.24 14.42 0 8.10 8.09 14.16 0 131072 32768 float sum -1 9.56 13.72 24.00 0 9.47 13.84 24.23 0 262144 65536 float sum -1 11.49 22.81 39.92 0 11.41 22.97 40.20 0 524288 131072 float sum -1 14.19 36.94 64.64 0 13.88 37.76 66.09 0 1048576 262144 float sum -1 19.10 54.89 96.06 0 18.98 55.24 96.67 0 2097152 524288 float sum -1 31.12 67.38 117.91 0 31.34 66.92 117.10 0 4194304 1048576 float sum -1 44.88 93.46 163.56 0 44.76 93.70 163.97 0 8388608 2097152 float sum -1 63.23 132.68 232.18 0 62.53 134.14 234.75 0 16777216 4194304 float sum -1 106.8 157.03 274.80 0 105.9 158.46 277.30 0 33554432 8388608 float sum -1 172.2 194.91 341.09 0 172.0 195.05 341.35 0 67108864 16777216 float sum -1 299.8 223.83 391.70 0 300.8 223.12 390.46 0 134217728 33554432 float sum -1 553.1 242.66 424.66 0 553.8 242.38 424.16 0 268435456 67108864 float sum -1 1056.1 254.18 444.82 0 1057.4 253.86 444.26 0 536870912 134217728 float sum -1 2064.0 260.11 455.20 0 2063.8 260.14 455.25 0 1073741824 268435456 float sum -1 4074.4 263.53 461.18 0 4065.8 264.09 462.16 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 169.799 ``` --------- Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-04-27 14:09:31 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Binyang Li	7da11b35d5	Add flag to disable nvls (#500 ) Mitigate this issue: #496, for now `ibv_reg_dmabuf_mr` is not supported by Azure vm. Add this flag to force to use cudaMalloc for memory allocation and disable nvls feature	2025-04-22 17:09:19 -07:00
Qinghua Zhou	f1115210bf	Fix the virtual address mapping issue of cuMemMap in fallback code (#501 ) Fix the virtual address mapping issue of cuMemMap by using the size of device memory allocation instead of user buffer size	2025-04-16 14:16:52 -07:00
Binyang Li	adc9ee5684	Export mscclpp GpuBuffer to dlpack format (#492 ) For mscclpp, to use nvls we require the buffer is allocated by mscclpp::GpuBuffer. Due to cupy doesn't support bfloat16 yet, we export the raw buffer to dlpack format. User can use this feature to create buffer with type supported by pytorch ```python buffer = RawGpuBuffer(1024 * 2) # 2 for bfloat16 dl_pack = buffer.to_dlpack(str(torch.bfloat16)) tensor = torch.utils.dlpack.from_dlpack(dl_pack) ```	2025-04-03 12:59:32 -07:00
Binyang Li	a3d8d6807b	Remove the requirement for `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support (#489 ) Remove the requirement for `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support. Fix #487	2025-03-28 16:46:54 -07:00
Caio Rocha	ac5cc647e0	Reduce Operation Support to the Executor (#484 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-25 13:58:12 -07:00
Qinghua Zhou	a7c364beb8	nccl/rccl integration (#469 ) Use dlopen to load nccl/rccl Apis from shared library to enable Allgather, Allreduce, Broadcast, ReduceScatter fallback to nccl/rccl operations. Add three related environment variables -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all" By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all these operations will be fallback to nccl/rccl apis. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-20 11:31:37 -07:00
Binyang Li	0b840baa05	Update allgather fallback algo (#476 ) Enhancements to all-gather operation, a temporary solution to fix the memory overhead when integrating msccl++ with pytorch. This solution will not register input/output buffer to msccl++, so the temp output buffer for allgather could be reused by torch automatically. * Introduced a new `allgather8` kernel function in `apps/nccl/src/allgather.hpp` to handle larger data sizes more efficiently. This includes double buffering to hide synchronization overhead and support for both in-place and out-of-place operations. * Modified the `allgather` function to decide between `allgather6` and `allgather8` based on data size and platform, improving performance for large data sizes. Configuration and environment improvements: * Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to control whether the channel cache is disabled, enhancing configurability. This variable is now part of the `Env` class and is logged during environment initialization. * Removed the redundant global variable `mscclppDisableChannelCache` from `src/debug.cc` and updated its usage to refer to the new environment variable.	2025-03-14 11:18:03 -07:00
Binyang Li	79b5eefa6c	Fix memory OOM issue (#479 ) This pull request includes changes to improve memory management in GPU-related functions by ensuring proper release of memory handles. Fix #470	2025-03-10 11:15:17 -07:00
Caio Rocha	bac3c90f6a	Improving Get Operation at MSCCLLang (#475 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-05 09:41:38 -08:00
Caio Rocha	6074eeeac9	Adjust NPKit IB Event (#472 )	2025-02-28 10:16:47 -08:00
Caio Rocha	b3992a8b29	Adding Read Put Packet operation at Executor (#441 )	2025-02-26 09:29:43 -08:00
Caio Rocha	0222bb324d	Adjusting AllGather Collective in MSCCLLang (#466 ) Co-authored-by: Caio Rocha <aiorocha@microsoft.com>	2025-02-25 08:35:26 -08:00
Qinghua Zhou	591276f9d0	Disable channel cache (#463 ) Add workaround of disabling channel cache. Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE (Default value: False) In this PR, some other features (e.g., ncclCommSplit) come from branch binyangli/nccl-api --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-02-19 19:26:12 +00:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * [`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114): Updated the link to the examples folder to reflect the correct path. New example script: * [`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76): Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * [`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472): Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * [`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441): Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * [`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89): Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	4ee15b7ad0	Fix PR #449 (#453 )	2025-01-15 11:59:12 -08:00
Changho Hwang	d12247b54a	Lazily create streams for CudaIpcConnection (#449 )	2025-01-15 11:50:02 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Binyang Li	8ac50dc85d	Resolve cuMemMap error (#451 ) * Updated `RegisteredMemory::Impl::Impl(const std::vector<char>& serialization)` to use both minimum and recommended granularities for memory address reservation and mapping. This will resolve the cuMemMap error	2025-01-10 14:22:14 -08:00
Changho Hwang	f2b52c6318	Fix Python binding of exceptions (#444 ) * Fixed errors to be catchable from Python code * Skip IB tests in Python unit tests when IB ports are down	2025-01-09 11:58:23 -08:00
Caio Rocha	80abce59ef	Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-01-08 09:02:36 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	6fedb7c0e8	Fix nccl-test failure issue (#421 )	2024-12-19 12:07:00 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410 ) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	863a599360	Disable CuMemMap check for ROCm (#411 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-17 08:36:25 +00:00
Binyang Li	ee75caf365	Reduce memory usage for scratch buffer (#403 ) In the executor, we allocate the scratch buffer based on `sendMemRange`. However, for certain execution plans, this allocation may be unsuitable, as the plan does not support messages of this size. To avoid allocating to much data and cause OOM error, set scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan), scratchBufferSize(sendMemRange))`	2024-12-13 13:00:04 -08:00
Caio Rocha	01fd813f1b	Exception Max Number Operation per Tb (#405 )	2024-12-11 16:06:15 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Ziyue Yang	f6305a3c1d	Add connection events for NPKit (#386 )	2024-12-05 00:06:37 +08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Binyang Li	593478e1b7	Add cross threadblock barrier (#383 )	2024-11-26 05:13:30 +00:00
Changho Hwang	2127a3ba29	Improve CMake options (#376 ) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support mote datatype for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which return std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Ziyue Yang	3e51e9b359	Fix missing packet parameter for executor (#385 )	2024-11-19 08:36:37 +08:00
Binyang Li	1baea89fa0	Fix light load bug (#379 ) Fix lightLoadExecutionPlan issue. An execution context many have multi device execution plans. These plans share the channel connections which are constructed before. A deviceExecutionPlanKey is introduced to identify these plans. We can get the current device execution plan key via: `contexts.currentDevicePlan`	2024-11-13 07:58:43 +00:00
Caio Rocha	d5d608abdc	Fixing Bug Const Offset in Execution Plan (#380 ) The offset was not differentiating between the buffer types, causing the offset to be incorrect when the buffer type was not `SCRATCH`. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-11-11 20:02:02 -08:00
Changho Hwang	85fdde7a73	Lazily create the context stream (#381 ) Create the context stream only when needed.	2024-11-11 10:39:32 +08:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Changho Hwang	0c150e5166	Fix copyright messages (#367 )	2024-10-17 21:25:46 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Ziyue Yang	5c4e105814	Fix NPKit exit event offset (#356 )	2024-09-19 13:35:44 +08:00

1 2 3 4 5 ...

426 Commits