Commit Graph

144 Commits

Author SHA1 Message Date
Qinghua Zhou
f803eff8b9 Use multiple thread blocks; Add peer-parallel kernels 2026-02-24 04:05:01 +00:00
Qinghua Zhou
21e3f1ebb3 Get correct remote receive displacements for peers 2026-02-23 14:22:30 +00:00
Qinghua Zhou
97426b3483 Use same channel for both signal and wait for ring kernel; Add pipelined kernel for imbalanced workloads 2026-02-18 12:13:30 +00:00
Qinghua Zhou
43980da455 Use maximum threads (1024) for best bandwidth utilization 2026-02-18 03:00:29 +00:00
Qinghua Zhou
b7485762a5 Improve with memory channels for intra-node communication 2026-02-17 13:55:15 +00:00
Qinghua Zhou
c42579e900 Move the alltoallv kernel to the src directory; Utilize the kernel in mscclpp-test 2026-02-06 02:57:34 +00:00
Qinghua Zhou
ac3e770c42 Add alltoallv kernel and test 2026-02-05 07:41:35 +00:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize the current native algorithm implementation and the DSL algorithm
implementation.
Provide a unified API for DSL and native algorithms, and provide an interface
to tune them.
Provide an interface for PyTorch integration with the native API and DSL.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00
Changho Hwang
2cf14ff723 Minor fixes (#715) 2026-01-05 11:09:48 +08:00
Binyang Li
eda74a7f29 Add handle cache for AMD platform (#698)
Introduce a handle cache for the AMD platform to
avoid hitting the handle limit when too many IPC handles are opened.

For NVIDIA, this feature is unnecessary since the driver reference-counts
handles internally and reuses a handle that has already been
opened

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-12-21 18:39:12 -08:00
Changho Hwang
9e076da3d4 Make IB more configurable (#703)
* Added `port` and `gidIndex` fields to the IB endpoint config (and a
`deviceIndex` field for future use)
* Added `MSCCLPP_IBV_SO` env variable to specify a custom libibverbs.so
* Added `--ib_gid_index` CLI option to `mp_unit_tests`
* Other minor fixes
2025-12-18 13:21:07 -08:00
Caio Rocha
060c35fec6 No IB Env CI Test (#687) 2025-11-19 11:13:03 -08:00
Changho Hwang
1bf4e8c90e connect() APIs changed to return an instance instead of a shared_ptr (#680)
The key purpose is handling all mscclpp objects' memory internally by
hiding shared pointers from user APIs.
* `Connection` class is now a wrapper of `BaseConnection` class that is
equivalent to the previous `Connection` class
* `connect()` methods now return `Connection` instead of
`std::shared_ptr<Connection>`
* Removed `connectOnSetup()` method
2025-11-15 11:40:40 -08:00
Caio Rocha
eb202780f5 Support Synchronous Initialization for Proxy Service (#679) 2025-11-12 18:35:57 -08:00
Changho Hwang
ffafcaf6d6 IB stack enhancements & bug fixes (#673)
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
2025-11-07 14:26:17 -08:00
Changho Hwang
9994f53cea Fixes for no-IB systems (#667)
* Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB
on/off
* Fix `nvidia-peermem` check; no need for DMABUF supported systems
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
2025-10-29 10:03:02 -07:00
Changho Hwang
68b1f151f0 Rename nvls* files (#660)
Rename nvls* files to switch_channel*
2025-10-24 11:34:26 -07:00
Changho Hwang
a2f1279c60 Test peer accessibility after deployment (#661)
Test GPUs' peer accessibility before integration testing to distinguish
VM issues.
2025-10-24 11:09:36 -07:00
Binyang Li
610db6f023 Fix test script (#655)
Fixes #654: address the correctness_test.py crash issue

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-10-21 19:57:17 +00:00
Changho Hwang
2f7d74b281 Fix lint.sh (#652)
Exit 1 upon any errors from clang-format or black
2025-10-20 17:23:01 -07:00
Binyang Li
b1a88d755e Pipeline fix (#645)
Co-authored-by: github-actions <github-actions@github.com>
2025-10-10 11:26:33 -07:00
Binyang Li
5ac427610d Address teardown issue (#638)
Ignore cuda/cu errors during teardown. Some pointers may be invalid at that point
2025-09-25 12:12:40 -07:00
Binyang Li
bf8d424ae3 use unix socket to share fd (#634)
Use a Unix socket to share fds with other processes; used for NVLS handle sharing.
Update the NCCL interface to support worldSize=1

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-25 11:40:54 -07:00
Abhinav Jangda
5d062b7038 Fix Illegal Memory Access in nvls_test for CUDA12.9 (#631)
Running NVLS test, `test/nvls_test.cu` in CUDA 12.9 leads to illegal
memory access at
571fee16fb/test/nvls_test.cu (L151)
.
This PR addresses this error by moving cudaMemset after memory mapping.
2025-09-10 09:46:18 -07:00
Binyang Li
ba4c4aaeb8 Integrate MSCCL++ with torch workload (#626)
Integrate MSCCL++ with torch
Introduce an NCCL audit shim library; users can use the following commands to
launch torch workloads. Also avoid breaking the build pipeline on CPU machines
```bash
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
```
2025-09-09 13:28:32 -07:00
Changho Hwang
547a9ae65c Fixed cpp linter (#619) 2025-08-25 12:15:45 -07:00
Binyang Li
2b40fe37b3 add torch test (#612)
Simple torch test
2025-08-15 10:27:21 -07:00
Binyang Li
03c0ff2a91 Fix for multi-nodes test (#614)
Fix multi-node test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-14 20:44:43 -07:00
Binyang Li
bb76d27553 all2all implementation (#609)
Implement single-node all2all via the MSCCL++ C++ API
perf kernel 3:
```
       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.41   44.78   39.19      0
     2097152         65536                                     23.95   87.56   76.61      0
     4194304        131072                                     27.50  152.51  133.45      0
     8388608        262144                                     35.14  238.73  208.89      0
    16777216        524288                                     57.54  291.55  255.11      0
    33554432       1048576                                     109.7  305.81  267.59      0
    67108864       2097152                                     212.3  316.07  276.56      0
   134217728       4194304                                     410.9  326.64  285.81      0
   268435456       8388608                                     784.9  341.99  299.24      0
```

kernel 2
```

#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.42   44.77   39.17      0
     2097152         65536                                     24.96   84.02   73.52      0
     4194304        131072                                     28.53  147.03  128.65      0
     8388608        262144                                     36.75  228.28  199.75      0
    16777216        524288                                     58.01  289.20  253.05      0
    33554432       1048576                                     110.4  303.83  265.85      0
    67108864       2097152                                     212.4  315.99  276.49      0
   134217728       4194304                                     407.8  329.12  287.98      0
   268435456       8388608                                     797.4  336.64  294.56      0
```

NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        524288      half    none      -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576      half    none      -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152      half    none      -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304      half    none      -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608      half    none      -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216      half    none      -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432      half    none      -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
2025-08-14 11:30:40 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains the following changes:
Python side:
- Channel-based DSL implementation: decouple channels from chunks.
- Users create channels explicitly, needing only local_rank, remote_rank, and
channel_type
- Adjust the executor json file, adding remote_buffer fields; different ops can
use different channel and remote-buffer combinations.
- Reimplement the operation fusion and data-dependency check mechanisms
- Add new ops such as semaphore and pipeline
- Clean up code and enhance documentation
C++ side:
- Support the new execution file json format
- Support the semaphore and pipeline operations
- Code cleanup; support the non-zero-copy scenario

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Binyang Li
4f6f23dae3 Use smart pointer for IB structure (#585)
Change to use smart pointers for the IB structures. Registered memory now owns
ibMr; ibCtx will not hold the reference
- Use smart pointers for IbQp and IbMr
- Update the memoryChannel API, keeping localRegisteredMemory
- Close the fd when registeredMemory is released

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-06 10:01:58 -07:00
Binyang Li
01e72f3aca Fix multinode test failure (#574)
Add a CPU-based connection to fix the multi-node test failure

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-23 10:33:44 -07:00
Binyang Li
5e991cf5c8 update readme & bump version (#550)
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-07-12 01:00:18 -07:00
Changho Hwang
199468bc47 Revise NVLS interface (#458)
* Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel`
* Minor interface improvements
2025-07-12 00:33:03 +00:00
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
20eca28942 Fix a FIFO correctness bug (#549)
* Add a FIFO test code that reproduced a correctness issue
* Fix the correctness issue by using pinned memory instead of cudaMemcpy

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-07-11 23:53:59 +00:00
Changho Hwang
22e8db4885 Support connection between local endpoints (#561) 2025-07-02 13:02:44 -07:00
Changho Hwang
b4dde38db8 FIFO improvements (#557)
* Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default.
* Optimize `FifoDeviceHandle`.
* Do not use `cudaHostAllocWriteCombined` that increases latency.
* Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`.
* Fix proxy NUMA binding issues.
* Prevent graph capture inside proxy threads.
* Now `CudaIpcConnection` skips stream sync when unnecessary.
* Now any type of connection needs to hold a shared pointer to the
context for memory safety.
* Now a context should be always managed by a shared pointer for memory
safety.
* Minor docs & interface improvements.
* Minor fix in `mscclpp-test` correctness test.
2025-06-24 09:50:28 -07:00
Changho Hwang
2796cfa5ba New FIFO test (#558)
Comprehensive FIFO testing
2025-06-23 15:42:44 -07:00
Changho Hwang
125d6f5809 Multi-stream CUDA IPC (#326)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
2025-06-04 10:31:04 -07:00
Changho Hwang
253a1ba1a9 Use a stream pool for gpuCalloc*() (#509)
Previous `gpuCalloc*()` created a new stream for each allocation, which
messes up the timeline in profiler traces. Now `GpuStreamPool` allows
reusing temporary streams.
2025-06-04 10:07:20 -07:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Binyang Li
a18e91cee4 Set Up a CI Pipeline for H100 (#526)
Set Up a CI Pipeline for H100
2025-05-15 14:50:23 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* In cases where the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior of
receiving data in the order of the method calls. Since the original
behavior is more intuitive, we restore it. Now when `get()` is called
on a future, the async function first calls `wait()` on the latest
previously returned future. In a recursive manner, this calls
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
2025-05-13 13:43:35 -07:00
Changho Hwang
d636093336 Asynchronous setup (#514)
Cherry-picked part of the features from #167: `Communicator::setup()` is now
unneeded. `Communicator::sendMemory()` conducts its task inline, while
`Communicator::recvMemory()` and `Communicator::connect()` conduct their
tasks asynchronously without explicit setup.
2025-05-08 22:01:51 +00:00
Changho Hwang
710f6686dc Revised MemoryChannel interfaces (#508)
* Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as
a standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency.
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method, which lacked use cases.
2025-04-25 00:02:56 +00:00
Binyang Li
7da11b35d5 Add flag to disable nvls (#500)
Mitigates issue #496: for now, `ibv_reg_dmabuf_mr` is not supported
on Azure VMs. Add this flag to force the use of cudaMalloc for memory
allocation and disable the NVLS feature
2025-04-22 17:09:19 -07:00
Binyang Li
06f31994dc Fix performance issue introduced in PR: 499 (#505)
1. Use `fence+relaxed` to replace `release` for the FIFO; `fence+relaxed` is
more efficient on A100
2. Update the deviceSyncer. The previous one could not handle threadBlock
count changes correctly. Use three counters to solve this issue: reset the
previous counter before syncing on the current counter.
3. Introduce relaxedWait, which can be used with relaxedSignal for cases that
do not need to guarantee memory visibility
2025-04-22 14:03:37 -07:00
Binyang Li
e412804eab Improve signal/wait performance and fix barrier issue (#499)
Remove __assert_fail for release builds. This reduces the number of
PTX instructions inside the loop, also attempting to resolve the issue
reported in #497. The instruction count drops from 8 to 6, and
8-rank signal/wait improves from 3.2us to 2.8us on NDv5.
The NDEBUG flag is also unreliable here, as it is sometimes not set; use a
customized flag for debug builds.

Here is current PTX:
```
      ld.u64  %rd12, [%rd2+-24];
      mov.u64         %rd13, %rd12;
      mov.u64         %rd11, %rd13;
      ld.acquire.sys.b64 %rd10,[%rd11];
      setp.lt.u64     %p1, %rd10, %rd3;
      @%p1 bra        $L__BB2_1;
```

Changing to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" :
"=l"(flag) : "l"(flag_addr));` reduces this to 4 instructions, yielding
2.1 us for 8-rank signal/wait
```
        ld.u64  %rd9, [%rd1+-24];
        ld.global.acquire.sys.b64 %rd8, [%rd9];
        setp.lt.u64     %p1, %rd8, %rd2;
        @%p1 bra        $L__BB2_1;
```
2025-04-16 14:22:10 -07:00
Changho Hwang
def68ced64 Add CUDA 12.8 images (#488) 2025-03-29 00:31:26 +00:00