mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Binyang Li	03c0ff2a91	Fix for multi-nodes test (#614 ) Fix multi-node test --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-14 20:44:43 -07:00
Binyang Li	bb76d27553	all2all implementation (#609 ) Implement single node all2all via MSCCL++ C++API perf kernel 3: ``` size count time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 32768 23.41 44.78 39.19 0 2097152 65536 23.95 87.56 76.61 0 4194304 131072 27.50 152.51 133.45 0 8388608 262144 35.14 238.73 208.89 0 16777216 524288 57.54 291.55 255.11 0 33554432 1048576 109.7 305.81 267.59 0 67108864 2097152 212.3 316.07 276.56 0 134217728 4194304 410.9 326.64 285.81 0 268435456 8388608 784.9 341.99 299.24 0 ``` kernel 2 ``` # in-place out-of-place # size count time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 32768 23.42 44.77 39.17 0 2097152 65536 24.96 84.02 73.52 0 4194304 131072 28.53 147.03 128.65 0 8388608 262144 36.75 228.28 199.75 0 16777216 524288 58.01 289.20 253.05 0 33554432 1048576 110.4 303.83 265.85 0 67108864 2097152 212.4 315.99 276.49 0 134217728 4194304 407.8 329.12 287.98 0 268435456 8388608 797.4 336.64 294.56 0 ``` NCCL: ``` NCCL version 2.21.5+cuda12.4 # # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 8388608 524288 half none -1 38.70 216.75 189.66 0 39.25 213.72 187.00 N/A 16777216 1048576 half none -1 71.39 234.99 205.62 0 68.41 245.25 214.60 N/A 33554432 2097152 half none -1 119.7 280.22 245.20 0 119.8 280.17 245.15 N/A 67108864 4194304 half none -1 211.9 316.66 277.08 0 212.7 315.53 276.09 N/A 134217728 8388608 half none -1 408.4 328.61 287.53 0 393.8 340.87 298.26 N/A 268435456 16777216 half none -1 761.6 352.47 308.41 0 763.3 351.70 307.73 N/A 536870912 33554432 half none -1 1502.5 357.31 312.64 0 1467.3 365.89 320.16 N/A ```	2025-08-14 11:30:40 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains following changes: Python side: - Channel based DSL implementation: decouple channel with chunk. - Users create channel explicitly, only need local_rank, remote_rank and channel_type - Adjust executor json file, add remote_buffer fields, different op can use different channel and remote buffers combination. - Reimplement operation fusion, data dependency check mechanism - Add new op such as semaphore, pipeline - Clean code and enhance document C++ side: - Support new execution file json format - Support semaphore and pipeline operation - code clean, support non-zero copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585 ) Change to use smart pointer for IB structure. Registered memory will own ibMr, ibCtx will not held the reference - Use smart pointer for IbQp and IbMr - Update memoryChannel API, keep localRegisteredMemory - Close fd when registedMemory released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Binyang Li	01e72f3aca	Fix multinode test failure (#574 ) Add CPU based connection to fix multi-node test failure issue --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-07-23 10:33:44 -07:00
Binyang Li	5e991cf5c8	update readme & bump version (#550 ) Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-07-12 01:00:18 -07:00
Changho Hwang	199468bc47	Revise NVLS interface (#458 ) * Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel` * Minor interface improvements	2025-07-12 00:33:03 +00:00
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Changho Hwang	20eca28942	Fix a FIFO correctness bug (#549 ) * Add a FIFO test code that reproduced a correctness issue * Fix the correctness issue by using pinned memory instead of cudaMemcpy --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-07-11 23:53:59 +00:00
Changho Hwang	22e8db4885	Support connection between local endpoints (#561 )	2025-07-02 13:02:44 -07:00
Changho Hwang	b4dde38db8	FIFO improvements (#557 ) * Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default. * Optimize `FifoDeviceHandle`. * Do not use `cudaHostAllocWriteCombined` that increases latency. * Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`. * Fix proxy NUMA binding issues. * Prevent graph capture inside proxy threads. * Now `CudaIpcConnection` skips stream sync when unnecessary. * Now any type of connection needs to hold a shared pointer to the context for memory safety. * Now a context should be always managed by a shared pointer for memory safety. * Minor docs & interface improvements. * Minor fix in `mscclpp-test` correctness test.	2025-06-24 09:50:28 -07:00
Changho Hwang	2796cfa5ba	New FIFO test (#558 ) Comprehensive FIFO testing	2025-06-23 15:42:44 -07:00
Changho Hwang	125d6f5809	Multi-stream CUDA IPC (#326 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>	2025-06-04 10:31:04 -07:00
Changho Hwang	253a1ba1a9	Use a stream pool for `gpuCalloc()` (#509 ) Previous `gpuCalloc()` creates a new stream for each allocation, which messes the timeline up in profiler traces. Now `GpuStreamPool` allows reusing the temporal streams.	2025-06-04 10:07:20 -07:00
Changho Hwang	83356957bd	Improved documentation & minor interface revision (#541 )	2025-06-03 14:26:27 -07:00
Binyang Li	a18e91cee4	Set Up a CI Pipeline for H100 (#526 ) Set Up a CI Pipeline for H100	2025-05-15 14:50:23 -07:00
Changho Hwang	de664ad200	Fix #514 (#521 ) * In cases when the same `tag` is used for receiving data from the same remote rank, #514 changed the behavior of `Communicator::connect` and `Communicator::recvMemory` to receive data in the order of `std::shared_future::get()` is called, instead of the original behvaior that receive data in the order of the method calls. Since the original behavior is more intuitive, we get that back. Now when `get()` is called on a future, the async function will first call `wait()` on the latest previously returned future. In a recursive manner, this will call `wait()` on all previous futures that are not yet ready. * Removed all deprecated API calls and replaced into the new ones.	2025-05-13 13:43:35 -07:00
Changho Hwang	d636093336	Asynchronous setup (#514 ) Cherry-picked a part of features from #167: now `Communicator::setup()` is unneeded. `Communicator::sendMemory()` conducts the task inline, and `Communicator::recvMemory()` and `Communicator::connect()` conducts the task asynchronously without explicit setup.	2025-05-08 22:01:51 +00:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Binyang Li	7da11b35d5	Add flag to disable nvls (#500 ) Mitigate this issue: #496, for now `ibv_reg_dmabuf_mr` is not supported by Azure vm. Add this flag to force to use cudaMalloc for memory allocation and disable nvls feature	2025-04-22 17:09:19 -07:00
Binyang Li	06f31994dc	Fix performance issue introduced in PR: 499 (#505 ) 1. use `fence+relaxed` to replace `release` for fifo. `fence+relax` is more efficient on A100 2. Update the deviceSyncer. Previous one cannot handle threadBlock number change correctly. Use three counters to solve this issue. Reset previous counter before sync on current counter. 3. Introduce relaxedWait which can be used with relaxedSignal for case doesn't need guarantee the memory visibility	2025-04-22 14:03:37 -07:00
Binyang Li	e412804eab	Improve signal/wait performance and fix barrier issue (#499 ) Remove __assert_fail for release build. This will reduce the number of PTX instructions inside the loop. Also Trying to resolve this issue reported in #497. Reduce the number of PTX instructions from 8 to 6. 8 ranks signal/wait will reduce from 3.2us->2.8us on NDv5 Also NDEBUG flag is confused here, sometime it will not be set. Use customized flag for debug build. Here is current PTX: ``` ld.u64 %rd12, [%rd2+-24]; mov.u64 %rd13, %rd12; mov.u64 %rd11, %rd13; ld.acquire.sys.b64 %rd10,[%rd11]; setp.lt.u64 %p1, %rd10, %rd3; @%p1 bra $L__BB2_1; ``` If we change to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));` will reduce to 4 instructions. We can get 2.1 us for 8 ranks signal/wait ``` ld.u64 %rd9, [%rd1+-24]; ld.global.acquire.sys.b64 %rd8, [%rd9]; setp.lt.u64 %p1, %rd8, %rd2; @%p1 bra $L__BB2_1; ```	2025-04-16 14:22:10 -07:00
Changho Hwang	def68ced64	Add CUDA 12.8 images (#488 )	2025-03-29 00:31:26 +00:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	c65f19ad1a	Move pipeline to official org (#406 ) Move pipeline to official org. Unify all pipelines	2024-12-16 09:43:00 -08:00
Binyang Li	7a3dcb0627	Setup pipeline for mscclpp over nccl (#401 ) Setup pipeline for mscclpp over nccl Run `all_reduce_perf` via nccl API	2024-12-07 08:57:45 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Binyang Li	593478e1b7	Add cross threadblock barrier (#383 )	2024-11-26 05:13:30 +00:00
Changho Hwang	2127a3ba29	Improve CMake options (#376 ) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support mote datatype for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which return std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Changho Hwang	0c150e5166	Fix copyright messages (#367 )	2024-10-17 21:25:46 -07:00
Changho Hwang	74130c7c5e	Use IB transport flags only when an IB device exists (#355 )	2024-09-19 07:13:11 +00:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	7bedb25054	Add proxy channel related operations (#351 ) Add Flush, PutWithSignal, PutWithFlushAndSignal operation	2024-09-15 13:24:57 -07:00
Caio Rocha	4eca6f1e95	Support executors to send packets over ProxyChannel (#344 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-08-30 22:10:33 +00:00
Changho Hwang	1e82dd444f	Make ibverbs optional at compile time (#340 ) Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-08-21 12:47:05 -07:00
Binyang Li	1351f9f1c5	Add "packet type" option for executor test (#313 ) Add "packet type" option for executor test	2024-06-14 09:53:58 +00:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Changho Hwang	f76eae4dca	Fix assert declaration & add a compile test (#303 )	2024-05-20 02:39:30 +00:00
Changho Hwang	a3cd95bd42	Upgrade gtest (#300 ) The new gtest version resolves a type casting issue: `3044657e7a`	2024-05-07 20:49:26 -07:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294 )	2024-05-03 11:54:50 -07:00
Changho Hwang	d4ede480f4	Ethernet support (#284 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-04-25 11:06:43 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 )	2024-03-27 11:53:09 -07:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273 )	2024-03-25 15:55:37 -07:00

1 2 3

117 Commits