mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-04-20 06:49:29 +00:00

Author	SHA1	Message	Date
Changho Hwang	20eca28942	Fix a FIFO correctness bug (#549 ) * Add a FIFO test code that reproduced a correctness issue * Fix the correctness issue by using pinned memory instead of cudaMemcpy --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-07-11 23:53:59 +00:00
Binyang Li	9b71d524b3	Fix pytest failure (#567 ) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-07-11 16:49:28 -07:00
Binyang Li	65c10fa8ec	Support any GPUs per node for NCCL_API (#566 ) Support any GPUs per node for NCCL API	2025-07-11 13:42:39 -07:00
Binyang Li	1d85ac6f66	Fix multi-nodes CI pipeline (#564 ) Fix multi-nodes CI pipeline	2025-07-08 09:51:44 -07:00
Changho Hwang	22e8db4885	Support connection between local endpoints (#561 )	2025-07-02 13:02:44 -07:00
Changho Hwang	3de6d5b63a	Fix #557 (#560 ) * Page-locking `Host2DeviceSemaphore::outboundSemaphore_` caused unexpected performance issues so reverting it back. We may revisit this later. * Removed reference to connections from context as now connections refer to context.	2025-06-30 11:33:19 -07:00
Changho Hwang	b4dde38db8	FIFO improvements (#557 ) * Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default. * Optimize `FifoDeviceHandle`. * Do not use `cudaHostAllocWriteCombined` that increases latency. * Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`. * Fix proxy NUMA binding issues. * Prevent graph capture inside proxy threads. * Now `CudaIpcConnection` skips stream sync when unnecessary. * Now any type of connection needs to hold a shared pointer to the context for memory safety. * Now a context should be always managed by a shared pointer for memory safety. * Minor docs & interface improvements. * Minor fix in `mscclpp-test` correctness test.	2025-06-24 09:50:28 -07:00
Changho Hwang	2796cfa5ba	New FIFO test (#558 ) Comprehensive FIFO testing	2025-06-23 15:42:44 -07:00
Wenxuan Tan	2151790463	Fix some typos in docs (#555 )	2025-06-19 19:39:37 +00:00
Binyang Li	81699a5bdd	DeviceSemaphore fix (#553 ) Fix the bug: make sure a thread is woken up when the semaphore is released. This pull request includes a modification to the `DeviceSemaphore` struct in the `concurrency_device.hpp` file, specifically in the `acquire` method. The change refines the logic for acquiring a semaphore by adjusting the condition used to handle contention scenarios.	2025-06-19 12:30:01 -07:00
Changho Hwang	a36dcd56bf	Do not use tail replica by default (#544 ) Added `MSCCLPP_FIFO_USE_TAIL_REPLICA` environment variable to control whether to use a tail replica for the FIFO buffer. Default is false.	2025-06-12 14:07:10 -07:00
Changho Hwang	17d8e7c9e9	Fix build processes (#545 ) * Let CMake read version numbers from the `VERSION` file. * Upgrade dlpack and drop `CMAKE_POLICY_VERSION_MINIMUM`. * Do not install dlpack. * Add license files in the wheel and exclude `*.cpp` files.	2025-06-06 13:37:40 -07:00
Changho Hwang	f694f2e46b	Fix #509 (#546 ) Fix a destruction order issue	2025-06-05 19:36:02 -07:00
Changho Hwang	125d6f5809	Multi-stream CUDA IPC (#326 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>	2025-06-04 10:31:04 -07:00
Changho Hwang	253a1ba1a9	Use a stream pool for `gpuCalloc()` (#509 ) Previously, `gpuCalloc()` created a new stream for each allocation, which messed up the timeline in profiler traces. Now `GpuStreamPool` allows reusing the temporary streams.	2025-06-04 10:07:20 -07:00
Changho Hwang	83356957bd	Improved documentation & minor interface revision (#541 )	2025-06-03 14:26:27 -07:00
Changho Hwang	c184485808	DLPack fixes (#537 ) * Fix typos in type name * Make it work without current context set	2025-05-27 21:40:50 +00:00
Changho Hwang	7278b51e61	Rename `ChannelTrigger` fields and check field values in debug builds (#529 )	2025-05-27 14:36:22 -07:00
Changho Hwang	2b9b18d562	Address NVCC warning #20012-D (#528 )	2025-05-21 10:37:50 -07:00
Binyang Li	d1869011c2	Add device semaphore API (#523 ) Add deviceSemaphore structure, implement a new NVLS based algo to show how to use these APIs. Current perf for NVLS non-zero copy version is:
```
#
#                                                       out-of-place                        in-place
#       size       count    type redop root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                        (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024         512    half   sum   -1     6.10    0.17    0.29      0     5.65    0.18    0.32      0
        2048        1024    half   sum   -1     5.94    0.35    0.60      0     5.85    0.35    0.61      0
        4096        2048    half   sum   -1     6.11    0.67    1.17      0     5.97    0.69    1.20      0
        8192        4096    half   sum   -1     6.22    1.32    2.31      0     6.17    1.33    2.33      0
       16384        8192    half   sum   -1     6.68    2.45    4.29      0     6.52    2.51    4.39      0
       32768       16384    half   sum   -1     8.02    4.09    7.15      0     7.66    4.28    7.49      0
       65536       32768    half   sum   -1     8.09    8.10   14.18      0     7.91    8.29   14.51      0
      131072       65536    half   sum   -1     9.58   13.68   23.93      0     9.61   13.64   23.86      0
      262144      131072    half   sum   -1    12.60   20.81   36.42      0    12.28   21.35   37.37      0
      524288      262144    half   sum   -1    14.51   36.12   63.22      0    14.09   37.21   65.12      0
     1048576      524288    half   sum   -1    19.45   53.92   94.36      0    19.29   54.35   95.12      0
     2097152     1048576    half   sum   -1    31.00   67.66  118.40      0    30.80   68.08  119.14      0
     4194304     2097152    half   sum   -1    44.71   93.80  164.16      0    44.66   93.91  164.34      0
     8388608     4194304    half   sum   -1    62.96  133.24  233.17      0    62.49  134.24  234.91      0
    16777216     8388608    half   sum   -1    105.1  159.68  279.45      0    104.4  160.74  281.29      0
    33554432    16777216    half   sum   -1    169.9  197.55  345.71      0    169.8  197.64  345.87      0
    67108864    33554432    half   sum   -1    298.1  225.12  393.96      0    298.1  225.09  393.91      0
   134217728    67108864    half   sum   -1    552.9  242.77  424.84      0    553.7  242.39  424.18      0
   268435456   134217728    half   sum   -1   1055.8  254.24  444.91      0   1056.9  253.98  444.47      0
   536870912   268435456    half   sum   -1   2040.1  263.15  460.52      0   2045.1  262.52  459.40      0
  1073741824   536870912    half   sum   -1   3996.9  268.65  470.13      0   4007.7  267.92  468.86      0
```	2025-05-20 09:32:38 -07:00
Caio Rocha	29c3af2ac6	Properly setting up the device in Ethernet Connection (#527 ) When we create the thread to receive messages in the Ethernet Connection, it resets the Device ID, causing faults in the Ethernet Connection unit tests. ![image](https://github.com/user-attachments/assets/ba609c16-0f52-4624-807a-5ad776a0c18d) This PR aims to properly set up the device when the thread is created. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-05-19 10:05:45 -07:00
Binyang Li	a18e91cee4	Set Up a CI Pipeline for H100 (#526 ) Set Up a CI Pipeline for H100	2025-05-15 14:50:23 -07:00
Changho Hwang	908659318b	Update citations (#524 ) Co-authored-by: Aashaka Shah <aashaka96@gmail.com>	2025-05-13 17:52:04 -07:00
Changho Hwang	2c63059014	Add a CMake option `MSCCLPP_GPU_ARCHS` (#525 ) `MSCCLPP_GPU_ARCHS` allows specifying GPU architectures with delimiters (comma, space, or semicolon).	2025-05-13 20:51:23 +00:00
Changho Hwang	de664ad200	Fix #514 (#521 ) * In cases when the same `tag` is used for receiving data from the same remote rank, #514 changed the behavior of `Communicator::connect` and `Communicator::recvMemory` to receive data in the order in which `std::shared_future::get()` is called, instead of the original behavior, which received data in the order of the method calls. Since the original behavior is more intuitive, we restore it. Now when `get()` is called on a future, the async function will first call `wait()` on the latest previously returned future. In a recursive manner, this will call `wait()` on all previous futures that are not yet ready. * Removed all deprecated API calls and replaced them with the new ones.	2025-05-13 13:43:35 -07:00
Changho Hwang	5205618c4a	Fix device assert (#522 ) * Fixed a bug that external `assert()`s may not be compiled with mscclpp headers * Use a macro assert instead of a function	2025-05-12 13:38:11 -07:00
Binyang Li	a464b9f21e	Adding maxSpinCount to port channel flush (#518 ) Fix #482 --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-05-08 21:24:48 -07:00
Changho Hwang	d636093336	Asynchronous setup (#514 ) Cherry-picked some of the features from #167: now `Communicator::setup()` is unneeded. `Communicator::sendMemory()` conducts the task inline, and `Communicator::recvMemory()` and `Communicator::connect()` conduct the task asynchronously without explicit setup.	2025-05-08 22:01:51 +00:00
Qinghua Zhou	8bc369ceb4	Fix the issue of echo message for nccl fallback in CI test (#520 ) Update the echo message for nccl fallback in CI test	2025-05-08 10:37:54 -07:00
Qinghua Zhou	b4f0af8f9f	Support ibv_reg_dmabuf_mr for buffer allocated by cuMemMalloc (#513 ) Fix #496 For buffer allocated by cuMemMalloc, use ibv_reg_dmabuf_mr to register a dma-buf based memory region.	2025-05-07 17:26:14 -07:00
Caio Rocha	51eca89d20	Enhance Collective Check at MSCCLang (#511 )	2025-04-29 13:29:28 -07:00
Binyang Li	affca7d9bc	Add NVLS based fallback algo (#507 ) Add two NVLS-based fallback algorithms. allreduce9 is for NVLS with zero copy. allreduce10 is for NVLS when the data needs to be copied to a scratch buffer, reduced, and then copied back to the result buffer. Perf numbers for allreduce9:
```
#                                                       out-of-place                        in-place
#       size       count    type redop root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                        (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024         256   float   sum   -1     5.45    0.19    0.33      0     5.35    0.19    0.33      0
        2048         512   float   sum   -1     5.57    0.37    0.64      0     5.53    0.37    0.65      0
        4096        1024   float   sum   -1     5.80    0.71    1.24      0     5.78    0.71    1.24      0
        8192        2048   float   sum   -1     5.94    1.38    2.42      0     5.85    1.40    2.45      0
       16384        4096   float   sum   -1     6.40    2.56    4.48      0     6.27    2.61    4.57      0
       32768        8192   float   sum   -1     7.45    4.40    7.70      0     7.39    4.43    7.76      0
       65536       16384   float   sum   -1     8.03    8.17   14.29      0     8.32    7.88   13.79      0
      131072       32768   float   sum   -1     7.28   18.00   31.49      0     7.07   18.53   32.43      0
      262144       65536   float   sum   -1     7.72   33.95   59.41      0     7.59   34.56   60.48      0
      524288      131072   float   sum   -1     8.70   60.29  105.51      0     8.37   62.61  109.57      0
     1048576      262144   float   sum   -1    10.56   99.26  173.70      0    10.32  101.64  177.87      0
     2097152      524288   float   sum   -1    14.45  145.14  253.99      0    14.02  149.58  261.76      0
     4194304     1048576   float   sum   -1    22.83  183.73  321.52      0    23.03  182.14  318.75      0
     8388608     2097152   float   sum   -1    38.63  217.14  380.00      0    38.57  217.52  380.65      0
    16777216     4194304   float   sum   -1    70.03  239.58  419.27      0    69.96  239.80  419.66      0
    33554432     8388608   float   sum   -1    131.5  255.17  446.55      0    131.3  255.59  447.28      0
    67108864    16777216   float   sum   -1    255.8  262.37  459.15      0    255.4  262.75  459.82      0
   134217728    33554432   float   sum   -1    500.9  267.94  468.90      0    500.0  268.42  469.74      0
   268435456    67108864   float   sum   -1    989.0  271.41  474.97      0    988.9  271.45  475.05      0
   536870912   134217728   float   sum   -1   1967.4  272.88  477.54      0   1966.0  273.08  477.88      0
  1073741824   268435456   float   sum   -1   3908.5  274.72  480.77      0   3904.6  274.99  481.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 218.734
```
Perf numbers for allreduce10:
```
#                                                       out-of-place                        in-place
#       size       count    type redop root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                        (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        1024         256   float   sum   -1     5.60    0.18    0.32      0     5.52    0.19    0.32      0
        2048         512   float   sum   -1     5.79    0.35    0.62      0     5.64    0.36    0.64      0
        4096        1024   float   sum   -1     5.92    0.69    1.21      0     5.82    0.70    1.23      0
        8192        2048   float   sum   -1     6.03    1.36    2.38      0     5.95    1.38    2.41      0
       16384        4096   float   sum   -1     6.58    2.49    4.35      0     6.39    2.56    4.49      0
       32768        8192   float   sum   -1     7.54    4.34    7.60      0     7.41    4.42    7.74      0
       65536       16384   float   sum   -1     7.95    8.24   14.42      0     8.10    8.09   14.16      0
      131072       32768   float   sum   -1     9.56   13.72   24.00      0     9.47   13.84   24.23      0
      262144       65536   float   sum   -1    11.49   22.81   39.92      0    11.41   22.97   40.20      0
      524288      131072   float   sum   -1    14.19   36.94   64.64      0    13.88   37.76   66.09      0
     1048576      262144   float   sum   -1    19.10   54.89   96.06      0    18.98   55.24   96.67      0
     2097152      524288   float   sum   -1    31.12   67.38  117.91      0    31.34   66.92  117.10      0
     4194304     1048576   float   sum   -1    44.88   93.46  163.56      0    44.76   93.70  163.97      0
     8388608     2097152   float   sum   -1    63.23  132.68  232.18      0    62.53  134.14  234.75      0
    16777216     4194304   float   sum   -1    106.8  157.03  274.80      0    105.9  158.46  277.30      0
    33554432     8388608   float   sum   -1    172.2  194.91  341.09      0    172.0  195.05  341.35      0
    67108864    16777216   float   sum   -1    299.8  223.83  391.70      0    300.8  223.12  390.46      0
   134217728    33554432   float   sum   -1    553.1  242.66  424.66      0    553.8  242.38  424.16      0
   268435456    67108864   float   sum   -1   1056.1  254.18  444.82      0   1057.4  253.86  444.26      0
   536870912   134217728   float   sum   -1   2064.0  260.11  455.20      0   2063.8  260.14  455.25      0
  1073741824   268435456   float   sum   -1   4074.4  263.53  461.18      0   4065.8  264.09  462.16      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 169.799
```
--------- Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-04-27 14:09:31 -07:00
Changho Hwang	b310783603	Fix #508 (#515 ) * Wrong offsets in `unpackPackets()` * Added Python binding of `BaseMemoryChannel`	2025-04-25 09:52:05 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Nusrat Islam	9df2bdb2bf	apps/nccl: fix a bug in allreduce kernels for graph mode (#502 ) `allreduce7` and `allreduceAllpairs` kernels were updating the LL protocol flag on the host side. So, it was not properly captured in graph mode. This PR fixes the issue by updating the flag in the kernels.	2025-04-24 16:43:47 -07:00
Changho Hwang	cbdcf9064c	Use implicit ctors for default device ctors (#512 ) By using implicit constructors, the compiler doesn't need to dynamically initialize the instances.	2025-04-24 12:38:19 -07:00
Caio Rocha	7a25e51b07	Automatic creation of Scratch Buffer at MSCCLang (#510 )	2025-04-23 16:37:14 -07:00
Changho Hwang	474ef0b696	Optimized allreduce fallback for ~10KB sizes (#506 ) * Pass the op type as a template parameter * Use the all-pairs algorithm for ~10KB * Don't write channel handles on the shared memory for small sizes * A reduction bug fix & cleanup	2025-04-23 10:38:15 -07:00
Binyang Li	7da11b35d5	Add flag to disable nvls (#500 ) Mitigate issue #496: for now, `ibv_reg_dmabuf_mr` is not supported on Azure VMs. This flag forces the use of cudaMalloc for memory allocation and disables the NVLS feature.	2025-04-22 17:09:19 -07:00
Binyang Li	06f31994dc	Fix performance issue introduced in PR: 499 (#505 ) 1. Use `fence+relaxed` instead of `release` for the FIFO; `fence+relaxed` is more efficient on A100. 2. Update the deviceSyncer. The previous one could not handle a change in the number of thread blocks correctly. Use three counters to solve this issue, resetting the previous counter before syncing on the current counter. 3. Introduce relaxedWait, which can be used with relaxedSignal in cases that do not need a memory visibility guarantee.	2025-04-22 14:03:37 -07:00
Binyang Li	e412804eab	Improve signal/wait performance and fix barrier issue (#499 ) Remove `__assert_fail` for release builds. This reduces the number of PTX instructions inside the loop. Also try to resolve the issue reported in #497. Reduce the number of PTX instructions from 8 to 6. An 8-rank signal/wait drops from 3.2us to 2.8us on NDv5. The NDEBUG flag is also unreliable here, since it is sometimes not set, so use a customized flag for debug builds. Here is the current PTX:
```
ld.u64 %rd12, [%rd2+-24];
mov.u64 %rd13, %rd12;
mov.u64 %rd11, %rd13;
ld.acquire.sys.b64 %rd10,[%rd11];
setp.lt.u64 %p1, %rd10, %rd3;
@%p1 bra $L__BB2_1;
```
If we change to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));`, this reduces to 4 instructions, and we can get 2.1 us for an 8-rank signal/wait:
```
ld.u64 %rd9, [%rd1+-24];
ld.global.acquire.sys.b64 %rd8, [%rd9];
setp.lt.u64 %p1, %rd8, %rd2;
@%p1 bra $L__BB2_1;
```	2025-04-16 14:22:10 -07:00
Qinghua Zhou	f1115210bf	Fix the virtual address mapping issue of cuMemMap in fallback code (#501 ) Fix the virtual address mapping issue of cuMemMap by using the size of device memory allocation instead of user buffer size	2025-04-16 14:16:52 -07:00
Binyang Li	adc9ee5684	Export mscclpp GpuBuffer to dlpack format (#492 ) To use NVLS, mscclpp requires the buffer to be allocated by mscclpp::GpuBuffer. Because cupy doesn't support bfloat16 yet, we export the raw buffer to dlpack format. Users can use this feature to create a buffer with a type supported by pytorch:
```python
buffer = RawGpuBuffer(1024 * 2)  # 2 for bfloat16
dl_pack = buffer.to_dlpack(str(torch.bfloat16))
tensor = torch.utils.dlpack.from_dlpack(dl_pack)
```	2025-04-03 12:59:32 -07:00
Changho Hwang	5a7a59ff14	Fix CMake installation in Dockerfile for arm64 (#491 )	2025-03-31 17:38:47 +00:00
Changho Hwang	3aeb1cb9c6	Add a devcontainer configuration (#490 )	2025-03-31 10:34:11 -07:00
Changho Hwang	def68ced64	Add CUDA 12.8 images (#488 )	2025-03-29 00:31:26 +00:00
Binyang Li	a3d8d6807b	Remove the requirement for `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support (#489 ) Remove the requirement for `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support. Fix #487	2025-03-28 16:46:54 -07:00
Qinghua Zhou	0f21ed44b8	Add CI test for fallback allgather, allreduce, broadcast, and reducescatter to NCCL operations (#485 ) Add CI test for fallback allgather, allreduce, broadcast, and reducescatter to NCCL operations. Test the following parameters: -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce, broadcast, reducescatter" or "all"	2025-03-27 21:13:07 +00:00
Caio Rocha	ac5cc647e0	Reduce Operation Support to the Executor (#484 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-25 13:58:12 -07:00
Binyang Li	b4062462fd	Fix reduceMin failure issue (#486 ) Remove the reduceOp check, as this is already done in the `getReduceOp` method	2025-03-25 10:15:24 -07:00
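
The ordering rule described in the #521 entry (calling `get()` on a future first waits on the latest previously returned future, and so on recursively) can be illustrated with a small Python sketch. This is not mscclpp's C++ implementation; `LazyFuture`, `OrderedReceiver`, and `recv` are hypothetical names, and a plain in-memory queue stands in for data arriving from a remote rank.

```python
from collections import deque


class LazyFuture:
    """Deferred task that runs on the first wait()/get() call (hypothetical)."""

    def __init__(self, task, previous):
        self._task = task
        self._previous = previous  # latest future returned before this one
        self._done = False
        self._value = None

    def wait(self):
        if self._done:
            return
        # Resolve every earlier, not-yet-ready future first (recursively),
        # so data is consumed in the order of the method calls.
        if self._previous is not None:
            self._previous.wait()
            self._previous = None  # drop the link once it is ready
        self._value = self._task()
        self._done = True

    def get(self):
        self.wait()
        return self._value


class OrderedReceiver:
    """Hands out futures that read from one queue in call order (hypothetical)."""

    def __init__(self, incoming):
        self._incoming = deque(incoming)
        self._last = None

    def recv(self):
        future = LazyFuture(self._incoming.popleft, self._last)
        self._last = future
        return future


receiver = OrderedReceiver(["first", "second", "third"])
f1, f2, f3 = receiver.recv(), receiver.recv(), receiver.recv()
# get() is called out of order, yet each future still receives the
# message that matches the order in which recv() was called.
results = (f3.get(), f1.get(), f2.get())
```

Here `results` pairs each future with the message for its own `recv()` call, regardless of the order in which `get()` is invoked. The recursion depth grows with the number of unresolved futures, which is acceptable for a sketch.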

776 Commits
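
The #525 entry states that `MSCCLPP_GPU_ARCHS` accepts architectures separated by commas, spaces, or semicolons. The real handling lives in the project's CMake files; the Python sketch below only illustrates that kind of delimiter normalization. `split_gpu_archs` is a hypothetical helper and the architecture values are example inputs.

```python
import re


def split_gpu_archs(value):
    """Split an architecture list on commas, spaces, or semicolons (hypothetical helper)."""
    # Runs of delimiters collapse into one split point; empty tokens are dropped.
    return [token for token in re.split(r"[, ;]+", value) if token]


archs = split_gpu_archs("80, 90;gfx942")
```

Collapsing runs of delimiters means that mixed input such as `"80, 90"` and `"80;90"` normalize to the same list.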