mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 09:17:06 +00:00

Author	SHA1	Message	Date
Changho Hwang	b310783603	Fix #508 (#515 ) * Wrong offsets in `unpackPackets()` * Added Python binding of `BaseMemoryChannel`	2025-04-25 09:52:05 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Caio Rocha	7a25e51b07	Automatic creation of Scratch Buffer at MSCCLLang (#510 )	2025-04-23 16:37:14 -07:00
Binyang Li	06f31994dc	Fix performance issue introduced in PR: 499 (#505 ) 1. Use `fence+relaxed` instead of `release` for the fifo; `fence+relaxed` is more efficient on A100. 2. Update the deviceSyncer. The previous one could not handle a change in the number of thread blocks correctly. Use three counters to solve this issue: reset the previous counter before syncing on the current counter. 3. Introduce relaxedWait, which can be used with relaxedSignal for cases that do not need to guarantee memory visibility.	2025-04-22 14:03:37 -07:00
Binyang Li	adc9ee5684	Export mscclpp GpuBuffer to dlpack format (#492 ) To use NVLS, mscclpp requires the buffer to be allocated by mscclpp::GpuBuffer. Because cupy does not support bfloat16 yet, we export the raw buffer in dlpack format. Users can use this feature to create a buffer of any type supported by PyTorch:
```python
import torch

# RawGpuBuffer comes from the mscclpp Python bindings (import path not shown in the commit message).
buffer = RawGpuBuffer(1024 * 2)  # 2 bytes per bfloat16 element
dl_pack = buffer.to_dlpack(str(torch.bfloat16))
tensor = torch.utils.dlpack.from_dlpack(dl_pack)
```
	2025-04-03 12:59:32 -07:00
Caio Rocha	ac5cc647e0	Reduce Operation Support to the Executor (#484 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-25 13:58:12 -07:00
Caio Rocha	bac3c90f6a	Improving Get Operation at MSCCLLang (#475 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-03-05 09:41:38 -08:00
Caio Rocha	b3992a8b29	Adding Read Put Packet operation at Executor (#441 )	2025-02-26 09:29:43 -08:00
Caio Rocha	0222bb324d	Adjusting AllGather Collective in MSCCLLang (#466 ) Co-authored-by: Caio Rocha <aiorocha@microsoft.com>	2025-02-25 08:35:26 -08:00
Caio Rocha	8a564977e5	Updating MSCCLLang Examples (#462 ) Co-authored-by: Caio Rocha <aiorocha@microsoft.com>	2025-02-19 09:48:31 -08:00
Caio Rocha	55789bc551	Support ReduceScatter in the NCCL interface (#460 ) Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net> Co-authored-by: Caio Rocha <aiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-02-11 13:28:19 -08:00
Binyang Li	a6e00cc449	remove unnecessary sync (#461 ) `nop` instruction is only for synchronization within the same threadblock. Cross threadblock synchronization is handled by `barrier` instruction. So insert `nop` only if the dependency is within the same threadblock.	2025-02-10 15:31:49 +08:00
Caio Rocha	e7cff899ce	Adjusting BFS to seek circular dependencies in the msccl-tools DAG (#459 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-02-07 11:24:27 -08:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * `docs/design/mscclpp-dsl.md`: Updated the link to the examples folder to reflect the correct path. New example script: * `python/examples/allgather_allpairs_multinodes_packets.py`: Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * `python/mscclpp/language/ir.py`: Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * `src/executor/executor.cc`: Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * `src/include/debug.h`: Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Binyang Li	af0bb86e07	Merge mscclpp-lang to mscclpp project (#442 ) First step to merge msccl-tools into the mscclpp repo. This step moves all msccl related code, passes the current tests, and does some necessary refactoring. Add `mscclpp.language` module. Add `_InstructionOptimizer` and `DagOptimizer` classes to optimize the DAG. Add `DagLower` to lower the DAG to an intermediate representation. Add documents for mscclpp.language. Remove msccl related code.	2025-01-22 09:47:37 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Changho Hwang	f2b52c6318	Fix Python binding of exceptions (#444 ) * Fixed errors to be catchable from Python code * Skip IB tests in Python unit tests when IB ports are down	2025-01-09 11:58:23 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Caio Rocha	ff18bb8d0b	Providing reduce-scatter test support (#390 )	2024-11-28 09:19:30 -08:00
Binyang Li	1b8d020650	Fix mscclpp_benchmark (#392 ) Enable 1GB message size for NVLS transport in mscclpp_benchmark	2024-11-25 19:59:51 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Caio Rocha	b3dc74c020	Small Adjust in Test Data AllGather at Executor Test (#384 )	2024-11-16 15:21:00 +08:00
Ziyue Yang	9526d76fc7	Add kernel-based verification for executor_test (#378 ) Add kernels to fill and test data for correctness test in executor_test.py.	2024-11-07 14:14:20 +08:00
Ziyue Yang	95ab1088ef	Fix in-place all-gather input buffer in executor_test (#372 )	2024-10-24 23:04:11 +08:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	0c7311e83f	Add CI for rocm (#346 )	2024-09-15 22:30:54 +00:00
Roshan Dathathri	7ed13ec4b5	Auto-tune vector sizes for NVLS allreduce6 (#338 ) Also fixes bugs in MscclppAllReduce6. Below is the performance when the algorithm is fixed to MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.
Float16:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
|     2.0 KiB |     11.15 |         0.18 |        PASS |          13.82 |              0.15 |             PASS |     1.24 |
|     4.0 KiB |     11.15 |         0.37 |        PASS |          14.74 |              0.28 |             PASS |     1.32 |
|     8.0 KiB |     11.14 |         0.74 |        PASS |          15.17 |              0.54 |             PASS |     1.36 |
|    16.0 KiB |     11.16 |         1.47 |        PASS |          15.77 |              1.04 |             PASS |     1.41 |
|    32.0 KiB |     11.15 |         2.94 |        PASS |          17.50 |              1.87 |             PASS |     1.57 |
|    64.0 KiB |     11.18 |         5.86 |        PASS |          17.64 |              3.71 |             PASS |     1.58 |
|   128.0 KiB |     11.16 |        11.74 |        PASS |          17.83 |              7.35 |             PASS |     1.60 |
|   256.0 KiB |     11.21 |        23.38 |        PASS |          18.00 |             14.57 |             PASS |     1.60 |
|   512.0 KiB |     11.70 |        44.81 |        PASS |          18.42 |             28.46 |             PASS |     1.57 |
|     1.0 MiB |     13.64 |        76.87 |        PASS |          20.23 |             51.83 |             PASS |     1.48 |
|     2.0 MiB |     17.29 |       121.27 |        PASS |          31.60 |             66.36 |             PASS |     1.83 |
|     4.0 MiB |     25.26 |       166.02 |        PASS |          38.74 |            108.26 |             PASS |     1.53 |
|     8.0 MiB |     40.17 |       208.83 |        PASS |          62.86 |            133.45 |             PASS |     1.56 |
|    16.0 MiB |     70.92 |       236.56 |        PASS |         113.36 |            147.99 |             PASS |     1.60 |
|    32.0 MiB |    131.38 |       255.41 |        PASS |         203.21 |            165.13 |             PASS |     1.55 |
|    64.0 MiB |    253.39 |       264.84 |        PASS |         342.12 |            196.15 |             PASS |     1.35 |
|   128.0 MiB |    496.74 |       270.20 |        PASS |         670.62 |            200.14 |             PASS |     1.35 |
|   256.0 MiB |    982.42 |       273.24 |        PASS |        1318.36 |            203.61 |             PASS |     1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
Float32:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
|     4.0 KiB |     11.04 |         0.37 |        PASS |          14.79 |              0.28 |             PASS |     1.34 |
|     8.0 KiB |     11.15 |         0.73 |        PASS |          15.25 |              0.54 |             PASS |     1.37 |
|    16.0 KiB |     11.12 |         1.47 |        PASS |          15.87 |              1.03 |             PASS |     1.43 |
|    32.0 KiB |     11.13 |         2.95 |        PASS |          17.21 |              1.90 |             PASS |     1.55 |
|    64.0 KiB |     11.11 |         5.90 |        PASS |          17.37 |              3.77 |             PASS |     1.56 |
|   128.0 KiB |     11.08 |        11.83 |        PASS |          17.54 |              7.47 |             PASS |     1.58 |
|   256.0 KiB |     11.15 |        23.50 |        PASS |          17.71 |             14.80 |             PASS |     1.59 |
|   512.0 KiB |     11.56 |        45.34 |        PASS |          18.21 |             28.79 |             PASS |     1.57 |
|     1.0 MiB |     13.64 |        76.90 |        PASS |          19.87 |             52.77 |             PASS |     1.46 |
|     2.0 MiB |     17.24 |       121.67 |        PASS |          31.63 |             66.30 |             PASS |     1.84 |
|     4.0 MiB |     25.19 |       166.47 |        PASS |          38.63 |            108.57 |             PASS |     1.53 |
|     8.0 MiB |     40.38 |       207.72 |        PASS |          62.65 |            133.89 |             PASS |     1.55 |
|    16.0 MiB |     70.72 |       237.23 |        PASS |         114.57 |            146.44 |             PASS |     1.62 |
|    32.0 MiB |    131.49 |       255.18 |        PASS |         200.79 |            167.11 |             PASS |     1.53 |
|    64.0 MiB |    253.98 |       264.23 |        PASS |         342.58 |            195.89 |             PASS |     1.35 |
|   128.0 MiB |    496.96 |       270.08 |        PASS |         670.64 |            200.13 |             PASS |     1.35 |
|   256.0 MiB |    982.83 |       273.12 |        PASS |        1318.90 |            203.53 |             PASS |     1.34 |
|   512.0 MiB |   1954.07 |       274.75 |        PASS |        2609.04 |            205.77 |             PASS |     1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+	2024-08-16 11:11:54 +08:00
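The derived columns in the benchmark output of commit 7ed13ec4b5 appear consistent with two simple formulas: algorithm bandwidth as message size divided by time (taking 1 GB as 10^9 bytes), and speed-up as NCCL time divided by MSCCL++ time. The sketch below checks that reading against the reported 256 MiB fp16 row. The function names are illustrative and are not taken from the benchmark script.

```python
def alg_bw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s, assuming 1 GB = 1e9 bytes."""
    return size_bytes / (time_us * 1e-6) / 1e9


def speed_up(nccl_time_us: float, mscclpp_time_us: float) -> float:
    """Ratio of NCCL time to MSCCL++ time (values above 1 favour MSCCL++)."""
    return nccl_time_us / mscclpp_time_us


# Values copied from the 256.0 MiB row of the Float16 table.
size = 256 * 1024**2
print(f"{alg_bw_gbps(size, 982.42):.2f}")   # MSCCL++ AlgBW, reported as 273.24
print(f"{alg_bw_gbps(size, 1318.36):.2f}")  # NCCL AlgBW, reported as 203.61
print(f"{speed_up(1318.36, 982.42):.2f}")   # Speed Up, reported as 1.34
```

The reported figures are rounded to two decimals, so small mismatches on other rows would not necessarily contradict this interpretation.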
Changho Hwang	8c6fb429e9	bfloat16 support (#336 ) * Add bfloat16 support for executor and NCCL interface * Changed `gpu_data_types.hpp` into an internal header file	2024-08-12 15:41:58 -07:00
Ziyue Yang	faadc75649	Fix missing import in executor test (#334 )	2024-08-06 14:24:50 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
Changho Hwang	c4ca2fbc8c	Resolve clang++ warnings (#325 )	2024-07-11 07:48:35 +00:00
Angelica Moreira	0f796bbdf7	Update allreduce_bench.py (#318 ) Replacing hardcoded network interface name for generic discovery strategy. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-06-29 03:41:13 +00:00
Roshan Dathathri	91550dab4c	Simplify/improve barrier in AllReduce6 (#317 ) Drop superfluous __threadfence_system()	2024-06-23 21:08:59 +00:00
Roshan Dathathri	93ed8e1e58	Add support for multicast reduce instruction (#316 )	2024-06-19 13:28:12 -07:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Changho Hwang	d35a2f2dc2	Rename executor.cpp to executor_py.cpp (#301 )	2024-05-17 13:31:27 -07:00
aashaka	0650371b54	Allow obtaining cuda stream handle from PyTorch stream when launching kernel (#297 ) Use `cuda_stream` attribute of a torch stream if the stream is not an instance of the cupy stream.	2024-05-04 04:57:07 +00:00
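The stream handling described in commit 0650371b54 can be sketched as follows. This is a minimal illustration, not the code from the commit: `cuda_stream_handle` and the two stand-in classes are hypothetical, and the only details assumed from the real libraries are that a CuPy stream exposes its raw handle as `ptr` and a PyTorch CUDA stream exposes it as `cuda_stream`.

```python
class FakeCupyStream:
    """Stand-in for cupy.cuda.Stream, which exposes the raw handle as `ptr`."""
    def __init__(self, ptr: int):
        self.ptr = ptr


class FakeTorchStream:
    """Stand-in for torch.cuda.Stream, which exposes the raw handle as `cuda_stream`."""
    def __init__(self, cuda_stream: int):
        self.cuda_stream = cuda_stream


def cuda_stream_handle(stream, cupy_stream_type=FakeCupyStream) -> int:
    """Return the raw CUDA stream handle as an integer.

    CuPy streams are recognised by type; anything else is assumed to be a
    torch-like stream carrying a `cuda_stream` attribute.
    """
    if isinstance(stream, cupy_stream_type):
        return stream.ptr
    return stream.cuda_stream


print(cuda_stream_handle(FakeCupyStream(0x1000)))   # handle taken from `ptr`
print(cuda_stream_handle(FakeTorchStream(0x2000)))  # handle taken from `cuda_stream`
```

In real use, `cupy_stream_type` would be `cupy.cuda.Stream` and the second argument to the kernel launcher would be whichever stream object the caller already holds.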
Changho Hwang	6c1fa5307c	Refactoring NVLS interfaces (#293 ) Move NVLS details from the core to a separate interface	2024-04-24 10:05:41 -07:00
Roshan Dathathri	41e0964d93	Allow binding allocated memory to NVLS multicast pointer (#290 ) And change NVLS multimem instructions to static functions	2024-04-18 17:11:31 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	9406123711	Fix a typo name (#286 )	2024-04-17 23:45:46 +00:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 )	2024-03-27 11:53:09 -07:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Saeed Maleki	a3d0799963	Fix the comm.py for nvls (#267 ) Fix the comm.py for nvls	2024-02-19 10:39:21 +08:00


165 Commits