mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-21 21:39:21 +00:00

Author	SHA1	Message	Date
Changho Hwang	960a8ddebd	Add a new logger (#668 ) * Add `logger.hpp` that will gradually replace `debug.h` * Minor fixes in `ib.cc` --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2025-11-04 10:32:46 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two integration ways for MSCCL++ DSL. 1. Integrate with customized communication group 2. Integrate with NCCL API Introduce new Python APIs to make it work: ```python mscclpp.compile # compile dsl to json based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to ExecutionPlanRegistry mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Changho Hwang	9994f53cea	Fixes for no-IB systems (#667 ) * Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB on/off * Fix `nvidia-peermem` check; no need for DMABUF supported systems * Fix `mp_unit_tests` to skip all IB tests when built with `-DMSCCLPP_USE_IB=OFF`	2025-10-29 10:03:02 -07:00
Caio Rocha	2b987cf8e8	Resolve IBVerbs Loading Issues (#648 ) Some systems do not include libibverbs.so when installing ibverbs; instead, they only provide libibverbs.so.1. This PR updates the CMake file to search for this library and modifies the wrapper to load it. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-28 18:14:53 -07:00
Qinghua Zhou	a38c2ee784	FP8 support for Allreduce (#646 ) Add FP8 support for Allreduce on both NVIDIA and AMD platform. Add new data type: fp8_e4m3 and fp8_e5m2 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-10-27 14:51:48 -07:00
Binyang Li	2b05908635	Add token pool for cuCreate API (#628 ) Create a tokenPool to allocate token. This feature is used to support inter node NVL and try to reduce the footprint caused by cuCreate --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-27 11:19:21 -07:00
Changho Hwang	68b1f151f0	Rename nvls* files (#660 ) Rename nvls* files to switch_channel*	2025-10-24 11:34:26 -07:00
Changho Hwang	200cdf946e	Update `EndpointConfig` interfaces (#651 ) * Separate IB-specific options into a nested struct * Enable `connect()` by an `Endpoint`, not only by `EndpointConfig` * Other minor changes	2025-10-22 10:39:39 -07:00
Changho Hwang	2f7d74b281	Fix lint.sh (#652 ) Exit 1 upon any errors from clang-format or black	2025-10-20 17:23:01 -07:00
Binyang Li	70b8297c56	Revise NCCL API implementation (#617 ) - Make the NCCL interface extensible. Customers can register their own algo to the NCCL API. Users can provide a customized algo selection function. - Fall back to NCCL/RCCL if no algo is selected by the algo selection function - MSCCLPP interfaces now work for any scale	2025-09-26 10:08:12 -07:00
Binyang Li	5ac427610d	Address teardown issue (#638 ) Ignore cuda/cu errors during teardown. Some pointers may be invalid at this point	2025-09-25 12:12:40 -07:00
Binyang Li	bf8d424ae3	use unix socket to share fd (#634 ) Use unix socket to share fd to other processes. Used for nvls handle sharing Update nccl interface to support worldSize=1 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-09-25 11:40:54 -07:00
Changho Hwang	43f160c8e6	Fix for safe process teardown (#633 ) * `gpuFree()` functions are usually called during process teardown, so we let them ignore regarding errors. `AvoidCudaGraphCaptureGuard` is constructed in `gpuFree*()` functions, so it needs the same fix.	2025-09-10 20:28:04 -07:00
Binyang Li	ba4c4aaeb8	Integrate MSCCL++ with torch workload (#626 ) Integrate MSCCL++ with torch Introduce `NCCL audit shim library`; users can use the following commands to launch torch library. Also avoid breaking the build pipeline on CPU machines ```bash export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH torchrun --nnodes=1 --nproc_per_node=8 your_script.py ```	2025-09-09 13:28:32 -07:00
Binyang Li	4bbe16b351	Fix hang issue in logging submodule (#625 ) Fix: #622, using std::recursive_mutex to allow acquiring `lock` recursively in the same thread	2025-09-05 09:18:28 -07:00
Changho Hwang	c4d8781390	Fix memory exchange within a single process (#624 )	2025-09-04 12:53:51 -07:00
Caio Rocha	c3473b1794	Thread Block Group DSL (#621 ) Support the creation of a group of thread blocks to perform some operation.	2025-09-03 14:58:40 -07:00
Changho Hwang	547a9ae65c	Fixed cpp linter (#619 )	2025-08-25 12:15:45 -07:00
Caio Rocha	f839184a27	Fix deadlock in Executor channel setup (#616 ) In cases where we have circular channel creation, such as: creating channel 0 <-> 1 creating channel 1 <-> 2 creating channel 2 <-> 3 creating channel 3 <-> 0 creating channel 0 <-> 3 creating channel 1 <-> 0 creating channel 2 <-> 1 creating channel 3 <-> 2 This setup can result in a deadlock during the first channel creation for each rank. The current code requires sharing the semaphore for the first channel before moving on, which leads to the following sequence: creating channel 0 <-> 1 creating channel 1 <-> 2 creating channel 2 <-> 3 creating channel 3 <-> 0 <-- HANG ISSUE --> The process hangs because, for example, rank 0 will only share the semaphore with rank 3 after receiving it from rank 1. However, rank 1 is waiting for a semaphore from rank 2, rank 2 is waiting for one from rank 3, and rank 3 is waiting for one from rank 0. The solution is to make this creation asynchronous and only retrieve the semaphore after all semaphores have been requested.	2025-08-19 16:44:44 -07:00
Caio Rocha	9261b1d278	AlltoAll Test Support (#606 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-15 16:00:41 -07:00
Binyang Li	03c0ff2a91	Fix for multi-nodes test (#614 ) Fix multi-node test --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-14 20:44:43 -07:00
Binyang Li	671b688bb3	Create ib mr for per ib transport (#611 )	2025-08-14 19:54:48 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains following changes: Python side: - Channel based DSL implementation: decouple channel with chunk. - Users create channel explicitly, only need local_rank, remote_rank and channel_type - Adjust executor json file, add remote_buffer fields, different op can use different channel and remote buffers combination. - Reimplement operation fusion, data dependency check mechanism - Add new op such as semaphore, pipeline - Clean code and enhance document C++ side: - Support new execution file json format - Support semaphore and pipeline operation - code clean, support non-zero copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Changho Hwang	1cc1b827f4	MNNVL fix (#604 )	2025-08-08 19:23:55 +00:00
Changho Hwang	542a10f69e	Merge ChannelTrigger with ProxyTrigger (#601 )	2025-08-08 19:07:50 +00:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585 ) Change to use smart pointer for IB structure. Registered memory will own ibMr, ibCtx will not hold the reference - Use smart pointer for IbQp and IbMr - Update memoryChannel API, keep localRegisteredMemory - Close fd when registeredMemory is released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Changho Hwang	d55ac96f5e	Fixed the local channel test (#597 )	2025-08-05 15:33:48 -07:00
Changho Hwang	334b232e36	Fix GpuStreamPool to be aware of the device ID of streams (#590 ) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-04 11:07:31 -07:00
Changho Hwang	c580e4c503	Support CudaIpc connection within a single process (#593 ) * Allow CudaIpc connection between GPUs in a single process * Added an example of connection in a single process * Minor interface updates --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-02 12:59:36 +08:00
Changho Hwang	199468bc47	Revise NVLS interface (#458 ) * Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel` * Minor interface improvements	2025-07-12 00:33:03 +00:00
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Changho Hwang	20eca28942	Fix a FIFO correctness bug (#549 ) * Add a FIFO test code that reproduced a correctness issue * Fix the correctness issue by using pinned memory instead of cudaMemcpy --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-07-11 23:53:59 +00:00
Binyang Li	9b71d524b3	Fix pytest failure (#567 ) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-07-11 16:49:28 -07:00
Changho Hwang	22e8db4885	Support connection between local endpoints (#561 )	2025-07-02 13:02:44 -07:00
Changho Hwang	3de6d5b63a	Fix #557 (#560 ) * Page-locking `Host2DeviceSemaphore::outboundSemaphore_` caused unexpected performance issues so reverting it back. We may revisit this later. * Removed reference to connections from context as now connections refer to context.	2025-06-30 11:33:19 -07:00
Changho Hwang	b4dde38db8	FIFO improvements (#557 ) * Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default. * Optimize `FifoDeviceHandle`. * Do not use `cudaHostAllocWriteCombined` that increases latency. * Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`. * Fix proxy NUMA binding issues. * Prevent graph capture inside proxy threads. * Now `CudaIpcConnection` skips stream sync when unnecessary. * Now any type of connection needs to hold a shared pointer to the context for memory safety. * Now a context should be always managed by a shared pointer for memory safety. * Minor docs & interface improvements. * Minor fix in `mscclpp-test` correctness test.	2025-06-24 09:50:28 -07:00
Changho Hwang	a36dcd56bf	Do not use tail replica by default (#544 ) Added `MSCCLPP_FIFO_USE_TAIL_REPLICA` environment variable to control whether to use a tail replica for the FIFO buffer. Default is false.	2025-06-12 14:07:10 -07:00
Changho Hwang	f694f2e46b	Fix #509 (#546 ) Fix a destruction order issue	2025-06-05 19:36:02 -07:00
Changho Hwang	125d6f5809	Multi-stream CUDA IPC (#326 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>	2025-06-04 10:31:04 -07:00
Changho Hwang	253a1ba1a9	Use a stream pool for `gpuCalloc()` (#509 ) Previously, `gpuCalloc()` created a new stream for each allocation, which messed up the timeline in profiler traces. Now `GpuStreamPool` allows reusing the temporary streams.	2025-06-04 10:07:20 -07:00
Changho Hwang	83356957bd	Improved documentation & minor interface revision (#541 )	2025-06-03 14:26:27 -07:00
Changho Hwang	c184485808	DLPack fixes (#537 ) * Fix typos in type name * Make it work without current context set	2025-05-27 21:40:50 +00:00
Changho Hwang	7278b51e61	Rename `ChannelTrigger` fields and check field values in debug builds (#529 )	2025-05-27 14:36:22 -07:00
Caio Rocha	29c3af2ac6	Properly setting up the device in Ethernet Connection (#527 ) When we create the thread to receive messages in the Ethernet Connection, it resets the Device ID, causing faults in the Ethernet Connection unit tests. ![image](https://github.com/user-attachments/assets/ba609c16-0f52-4624-807a-5ad776a0c18d) This PR aims to properly set up the device when the thread is created. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-05-19 10:05:45 -07:00
Changho Hwang	de664ad200	Fix #514 (#521 ) * In cases when the same `tag` is used for receiving data from the same remote rank, #514 changed the behavior of `Communicator::connect` and `Communicator::recvMemory` to receive data in the order in which `std::shared_future::get()` is called, instead of the original behavior that received data in the order of the method calls. Since the original behavior is more intuitive, we bring it back. Now when `get()` is called on a future, the async function will first call `wait()` on the latest previously returned future. In a recursive manner, this will call `wait()` on all previous futures that are not yet ready. * Removed all deprecated API calls and replaced them with the new ones.	2025-05-13 13:43:35 -07:00
Changho Hwang	d636093336	Asynchronous setup (#514 ) Cherry-picked a part of the features from #167: now `Communicator::setup()` is unneeded. `Communicator::sendMemory()` conducts the task inline, and `Communicator::recvMemory()` and `Communicator::connect()` conduct the task asynchronously without explicit setup.	2025-05-08 22:01:51 +00:00
Qinghua Zhou	b4f0af8f9f	Support ibv_reg_dmabuf_mr for buffer allocated by cuMemMalloc (#513 ) Fix #496 For buffer allocated by cuMemMalloc, use ibv_reg_dmabuf_mr to register a dma-buf based memory region.	2025-05-07 17:26:14 -07:00
Binyang Li	affca7d9bc	Add NVLS based fallback algo (#507 ) Add two nvls based fallback algo. allreduce9 is for nvls with zero copy. allreduce10 is for nvls need to copy to scratch buffer, do reduce operation then copy result back to result buffer. Perf number for allreduce9 ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 256 float sum -1 5.45 0.19 0.33 0 5.35 0.19 0.33 0 2048 512 float sum -1 5.57 0.37 0.64 0 5.53 0.37 0.65 0 4096 1024 float sum -1 5.80 0.71 1.24 0 5.78 0.71 1.24 0 8192 2048 float sum -1 5.94 1.38 2.42 0 5.85 1.40 2.45 0 16384 4096 float sum -1 6.40 2.56 4.48 0 6.27 2.61 4.57 0 32768 8192 float sum -1 7.45 4.40 7.70 0 7.39 4.43 7.76 0 65536 16384 float sum -1 8.03 8.17 14.29 0 8.32 7.88 13.79 0 131072 32768 float sum -1 7.28 18.00 31.49 0 7.07 18.53 32.43 0 262144 65536 float sum -1 7.72 33.95 59.41 0 7.59 34.56 60.48 0 524288 131072 float sum -1 8.70 60.29 105.51 0 8.37 62.61 109.57 0 1048576 262144 float sum -1 10.56 99.26 173.70 0 10.32 101.64 177.87 0 2097152 524288 float sum -1 14.45 145.14 253.99 0 14.02 149.58 261.76 0 4194304 1048576 float sum -1 22.83 183.73 321.52 0 23.03 182.14 318.75 0 8388608 2097152 float sum -1 38.63 217.14 380.00 0 38.57 217.52 380.65 0 16777216 4194304 float sum -1 70.03 239.58 419.27 0 69.96 239.80 419.66 0 33554432 8388608 float sum -1 131.5 255.17 446.55 0 131.3 255.59 447.28 0 67108864 16777216 float sum -1 255.8 262.37 459.15 0 255.4 262.75 459.82 0 134217728 33554432 float sum -1 500.9 267.94 468.90 0 500.0 268.42 469.74 0 268435456 67108864 float sum -1 989.0 271.41 474.97 0 988.9 271.45 475.05 0 536870912 134217728 float sum -1 1967.4 272.88 477.54 0 1966.0 273.08 477.88 0 1073741824 268435456 float sum -1 3908.5 274.72 480.77 0 3904.6 274.99 481.24 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 218.734 ``` Perf number for allreduce10 ``` # out-of-place in-place # size count type redop 
root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 256 float sum -1 5.60 0.18 0.32 0 5.52 0.19 0.32 0 2048 512 float sum -1 5.79 0.35 0.62 0 5.64 0.36 0.64 0 4096 1024 float sum -1 5.92 0.69 1.21 0 5.82 0.70 1.23 0 8192 2048 float sum -1 6.03 1.36 2.38 0 5.95 1.38 2.41 0 16384 4096 float sum -1 6.58 2.49 4.35 0 6.39 2.56 4.49 0 32768 8192 float sum -1 7.54 4.34 7.60 0 7.41 4.42 7.74 0 65536 16384 float sum -1 7.95 8.24 14.42 0 8.10 8.09 14.16 0 131072 32768 float sum -1 9.56 13.72 24.00 0 9.47 13.84 24.23 0 262144 65536 float sum -1 11.49 22.81 39.92 0 11.41 22.97 40.20 0 524288 131072 float sum -1 14.19 36.94 64.64 0 13.88 37.76 66.09 0 1048576 262144 float sum -1 19.10 54.89 96.06 0 18.98 55.24 96.67 0 2097152 524288 float sum -1 31.12 67.38 117.91 0 31.34 66.92 117.10 0 4194304 1048576 float sum -1 44.88 93.46 163.56 0 44.76 93.70 163.97 0 8388608 2097152 float sum -1 63.23 132.68 232.18 0 62.53 134.14 234.75 0 16777216 4194304 float sum -1 106.8 157.03 274.80 0 105.9 158.46 277.30 0 33554432 8388608 float sum -1 172.2 194.91 341.09 0 172.0 195.05 341.35 0 67108864 16777216 float sum -1 299.8 223.83 391.70 0 300.8 223.12 390.46 0 134217728 33554432 float sum -1 553.1 242.66 424.66 0 553.8 242.38 424.16 0 268435456 67108864 float sum -1 1056.1 254.18 444.82 0 1057.4 253.86 444.26 0 536870912 134217728 float sum -1 2064.0 260.11 455.20 0 2063.8 260.14 455.25 0 1073741824 268435456 float sum -1 4074.4 263.53 461.18 0 4065.8 264.09 462.16 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 169.799 ``` --------- Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-04-27 14:09:31 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Binyang Li	7da11b35d5	Add flag to disable nvls (#500 ) Mitigate this issue: #496, for now `ibv_reg_dmabuf_mr` is not supported by Azure vm. Add this flag to force to use cudaMalloc for memory allocation and disable nvls feature	2025-04-22 17:09:19 -07:00
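
The deadlock fix described in the #616 entry above (request every semaphore first, retrieve them only afterwards) can be sketched roughly as follows. This is a toy simulation under stated assumptions, not the MSCCL++ Executor code: the thread-per-rank model, the `mailbox` queues, and the names `peers_of`, `setup_channels`, and `run` are all hypothetical and chosen only for illustration.

```python
import queue
import threading

WORLD_SIZE = 4  # assumed ring of four ranks, as in the commit message


def peers_of(rank):
    # Hypothetical helper: each rank sets up a channel with its right
    # neighbor first, then with its left neighbor (circular order).
    return [(rank + 1) % WORLD_SIZE, (rank - 1) % WORLD_SIZE]


# mailbox[(src, dst)] stands in for the transport that carries the
# semaphore handle which rank `src` shares with rank `dst`.
mailbox = {
    (src, dst): queue.Queue()
    for src in range(WORLD_SIZE)
    for dst in peers_of(src)
}


def setup_channels(rank, received):
    # Phase 1: share a semaphore with every peer without waiting on anyone.
    for peer in peers_of(rank):
        mailbox[(rank, peer)].put(f"semaphore {rank}->{peer}")
    # Phase 2: only after all requests are posted, retrieve the peers' semaphores.
    received[rank] = [
        mailbox[(peer, rank)].get(timeout=5) for peer in peers_of(rank)
    ]


def run():
    received = {}
    threads = [
        threading.Thread(target=setup_channels, args=(rank, received))
        for rank in range(WORLD_SIZE)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    for rank, semaphores in sorted(run().items()):
        print(rank, semaphores)
```

Because no rank blocks on a receive before it has posted all of its own sends, the circular wait described in the commit message cannot form in this sketch.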

466 Commits