* Let CMake read version numbers from the `VERSION` file.
* Upgrade dlpack and drop `CMAKE_POLICY_VERSION_MINIMUM`.
* Do not install dlpack.
* Add license files in the wheel and exclude `*.cpp` files.
* In cases where the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior
of receiving data in the order of the method calls. Since the original
behavior is more intuitive, we restore it here. Now, when `get()` is
called on a future, the async function will first call `wait()` on the
latest previously returned future; recursively, this calls `wait()` on
all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
Cherry-picked part of the features from #167: `Communicator::setup()` is
no longer needed. `Communicator::sendMemory()` conducts its task inline,
while `Communicator::recvMemory()` and `Communicator::connect()` conduct
their tasks asynchronously without an explicit setup, as sketched below.
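
A minimal host-side sketch of the new flow, assuming
`recvMemory(remoteRank, tag)` returns a
`std::shared_future<mscclpp::RegisteredMemory>` and that `sendMemory`
takes `(memory, remoteRank, tag)`; the exact signatures may differ:

```cpp
#include <mscclpp/core.hpp>

// Hypothetical helper illustrating the setup-free flow described above.
void exchangeMemory(mscclpp::Communicator& comm, mscclpp::RegisteredMemory localMem,
                    int remoteRank, int tag) {
  comm.sendMemory(localMem, remoteRank, tag);    // conducted inline, no setup() needed
  auto fut1 = comm.recvMemory(remoteRank, tag);  // asynchronous
  auto fut2 = comm.recvMemory(remoteRank, tag);  // same tag, issued second
  // Calling get() on the later future first still preserves call order:
  // fut2.get() first wait()s on fut1 (and, recursively, on any earlier
  // futures that are not yet ready).
  mscclpp::RegisteredMemory second = fut2.get();
  mscclpp::RegisteredMemory first = fut1.get();  // already ready at this point
  (void)first;
  (void)second;
}
```

Chaining `wait()` through earlier futures ties the receive order to the
order of the `recvMemory` calls, so callers need not invoke `get()` in
issue order.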
* Moved the `MemoryChannel::copy()` method out of `MemoryChannel` into a
standalone function.
* Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to
`mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively
for consistency (see the sketch after this list).
* Renamed `MemoryChannel::getPackets()` to
`MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer`
to `packetBuffer`.
* Added the `MemoryChannel::unpackPacket()` method that unpacks one
packet in the buffer.
* Added the `BaseMemoryChannel` class that only contains a semaphore
without memory addresses.
* Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()`
method, which lacked use cases.
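
A device-side sketch of the renamed packet helpers. The argument order
shown (`dst, dstOffset, src, srcOffset, bytes, threadId, numThreads,
flag`) mirrors the old `putPackets()`/`getPackets()` helpers and is an
assumption here, as is treating both as free device functions:

```cpp
#include <mscclpp/packet_device.hpp>

#include <cstdint>

// Hypothetical kernel: pack `bytes` of `src` into flag-tagged packets in
// `scratch`, then poll for the packets and strip the flags into `dst`.
__global__ void packetRoundTrip(void* scratch, const void* src, void* dst,
                                uint64_t bytes, uint32_t flag) {
  uint32_t tid = threadIdx.x + blockIdx.x * blockDim.x;
  uint32_t nThreads = blockDim.x * gridDim.x;
  // Writer side: each payload word is stored together with `flag`, so the
  // reader can detect arrival without a separate semaphore.
  mscclpp::copyToPackets(scratch, 0, src, 0, bytes, tid, nThreads, flag);
  // Reader side: spin until packets carrying `flag` arrive, then unpack.
  // (MemoryChannel::unpackPackets() wraps the same operation on a channel.)
  mscclpp::copyFromPackets(dst, 0, scratch, 0, bytes, tid, nThreads, flag);
}
```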
1. Use `fence+relaxed` instead of `release` for the FIFO; `fence+relaxed`
is more efficient on A100.
2. Update the deviceSyncer. The previous one could not handle a change in
the thread-block count correctly; use three counters to solve this issue,
resetting the previous counter before syncing on the current counter.
3. Introduce `relaxedWait`, which can be used with `relaxedSignal` for
cases that do not need to guarantee memory visibility (see the sketch
below).
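
A sketch of the relaxed pair for a control-only rendezvous, assuming the
semaphore device handle exposes `relaxedSignal()`/`relaxedWait()`
alongside `signal()`/`wait()`:

```cpp
#include <mscclpp/semaphore_device.hpp>

// Hypothetical kernel: the two sides only need to know the other has
// arrived; no data is published, so the release/acquire ordering of
// signal()/wait() would be wasted work and the relaxed variants suffice.
__global__ void rendezvous(mscclpp::MemoryDevice2DeviceSemaphoreDeviceHandle sem) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    sem.relaxedSignal();
    sem.relaxedWait();
  }
}
```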
For mscclpp, using NVLS requires that the buffer be allocated by
`mscclpp::GpuBuffer`. Because cupy does not yet support bfloat16, we
export the raw buffer in dlpack format. Users can use this feature to
create a buffer with a type supported by PyTorch:
```python
import torch
import torch.utils.dlpack

# RawGpuBuffer is exposed by mscclpp's Python bindings.
buffer = RawGpuBuffer(1024 * 2)  # 1024 elements * 2 bytes per bfloat16
dl_pack = buffer.to_dlpack(str(torch.bfloat16))  # export as a dlpack capsule
tensor = torch.utils.dlpack.from_dlpack(dl_pack)  # zero-copy bfloat16 tensor
```
The `nop` instruction only provides synchronization within the same
threadblock; cross-threadblock synchronization is handled by the
`barrier` instruction. Therefore, insert `nop` only if the dependency is
within the same threadblock.
Documentation update:
* `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder
to reflect the correct path.
New example script:
* `python/examples/allgather_allpairs_multinodes_packets.py`: Added a
new example script demonstrating the allgather all-pairs algorithm
across multiple nodes using packet communication.
IR module improvements:
* `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU
instance channels and thread block channels to include the channel type,
ensuring a more accurate order.
Debugging enhancements:
* `src/executor/executor.cc`: Added a debug log to indicate the start of
collective execution, with details about the execution plan and the
collective.
* `src/include/debug.h`: Introduced a new debug log subsystem identifier
`MSCCLPP_EXECUTOR` for logging executor-related information.
First step to merge msccl-tools into the mscclpp repo. This step moves
all msccl-related code, passes the current tests, and does some
necessary refactoring:
* Add the `mscclpp.language` module.
* Add the `_InstructionOptimizer` and `DagOptimizer` classes to optimize
the DAG.
* Add `DagLower` to lower the DAG to an intermediate representation.
* Add documentation for `mscclpp.language`.
* Remove msccl-related code.
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>` class that should generally be
used for allocating communication buffers (see the sketch below)
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits from
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed the `mscclpp::memcpyCuda*<T>()` functions to
`mscclpp::gpuMemcpy*<T>()` for naming consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
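
A sketch of the reorganized allocation API; the `GpuBuffer<T>` accessors
and the `gpuMemcpy` signature used here are assumptions:

```cpp
#include <mscclpp/gpu_utils.hpp>

#include <cuda_runtime.h>
#include <vector>

void allocateAndFill() {
  // Preferred way to allocate a communication buffer (handles the
  // NVLS-capable physical allocation when required).
  mscclpp::GpuBuffer<int> buf(1024);
  std::vector<int> host(1024, 42);
  // Renamed from mscclpp::memcpyCuda*; the element-count and kind
  // arguments shown here are assumed.
  mscclpp::gpuMemcpy(buf.data(), host.data(), 1024, cudaMemcpyHostToDevice);
}
```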
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. This makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` is "sema + src/dst + fifo" and `SmChannel` is "sema +
src/dst", while `BaseProxyChannel` is not associated with any memory
regions and is just "sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member (see
the sketch below).
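
A sketch using the renamed classes, assuming the `ProxyChannel` device
handle keeps the `putWithSignal()`/`flush()` interface of the old
`SimpleProxyChannel`:

```cpp
#include <mscclpp/proxy_channel_device.hpp>

#include <cstdint>

// Hypothetical kernel: ProxyChannel is bound to src/dst memory regions,
// so the device side only passes offsets; BaseProxyChannel would carry
// just the semaphore and FIFO.
__global__ void proxyPut(mscclpp::DeviceHandle<mscclpp::ProxyChannel> chan,
                         uint64_t bytes) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    chan.putWithSignal(/*dstOffset*/ 0, /*srcOffset*/ 0, bytes);
    chan.flush();  // wait until the proxy has consumed the FIFO entries
  }
}
```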
- Support more datatypes for the multicast operation
- Add a new op `MULTI_LOAD_REDUCE_STORE` to support NVLS
- Modify `allocSharedPhysicalCuda` to return `std::shared_ptr<T>`
instead of `std::shared_ptr<PhysicalCudaMemory>` (see the sketch below)
- Add Python support for `allocSharedPhysicalCuda`
Test passed for `allreduce_nvls.json`
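
A sketch of the changed return type; whether the argument counts
elements or bytes is an assumption here:

```cpp
#include <mscclpp/gpu_utils.hpp>

#include <memory>

void allocMulticastBuffer() {
  // The shared_ptr now wraps T directly instead of PhysicalCudaMemory.
  std::shared_ptr<float> buf = mscclpp::allocSharedPhysicalCuda<float>(1 << 20);
  float* devPtr = buf.get();  // directly usable as a device pointer
  (void)devPtr;
}
```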
- Add a C++ executor test
- Fix executor bugs for packet operations
- Enhance `executor_test.py`