mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 10:57:27 +00:00

Author	SHA1	Message	Date
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Changho Hwang	20eca28942	Fix a FIFO correctness bug (#549 ) * Add a FIFO test code that reproduced a correctness issue * Fix the correctness issue by using pinned memory instead of cudaMemcpy --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-07-11 23:53:59 +00:00
Changho Hwang	b4dde38db8	FIFO improvements (#557 ) * Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default. * Optimize `FifoDeviceHandle`. * Do not use `cudaHostAllocWriteCombined` that increases latency. * Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`. * Fix proxy NUMA binding issues. * Prevent graph capture inside proxy threads. * Now `CudaIpcConnection` skips stream sync when unnecessary. * Now any type of connection needs to hold a shared pointer to the context for memory safety. * Now a context should be always managed by a shared pointer for memory safety. * Minor docs & interface improvements. * Minor fix in `mscclpp-test` correctness test.	2025-06-24 09:50:28 -07:00
Changho Hwang	7278b51e61	Rename `ChannelTrigger` fields and check field values in debug builds (#529 )	2025-05-27 14:36:22 -07:00
Caio Rocha	29c3af2ac6	Properly setting up the device in Ethernet Connection (#527 ) When we create the thread to receive messages in the Ethernet Connection, it resets the Device ID, causing faults in the Ethernet Connection unit tests. ![image](https://github.com/user-attachments/assets/ba609c16-0f52-4624-807a-5ad776a0c18d) This PR aims to properly set up the device when the thread is created. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-05-19 10:05:45 -07:00
Changho Hwang	5205618c4a	Fix device assert (#522 ) * Fixed a bug that external `assert()`s may not be compiled with mscclpp headers * Use a macro assert instead of a function	2025-05-12 13:38:11 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Caio Rocha	55789bc551	Support ReduceScatter in the NCCL interface (#460 ) Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net> Co-authored-by: Caio Rocha <aiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-02-11 13:28:19 -08:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * [`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114): Updated the link to the examples folder to reflect the correct path. New example script: * [`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76): Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * [`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472): Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * [`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441): Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * [`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89): Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Binyang Li	af0bb86e07	Merge mscclpp-lang to mscclpp project (#442 ) First step to merge msccl-tools into mscclpp repo. In this step will move all msccl related code, pass the current tests and do some necessary refactor. Add `mscclpp.language` module Add `_InstructionOptimizer` and `DagOptimizer` class to optimize the dag Add `DagLower` to lower dag to intermediate representation Add documents for mscclpp.language Remove msccl related code	2025-01-22 09:47:37 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Changho Hwang	f2b52c6318	Fix Python binding of exceptions (#444 ) * Fixed errors to be catchable from Python code * Skip IB tests in Python unit tests when IB ports are down	2025-01-09 11:58:23 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Caio Rocha	ff18bb8d0b	Providing reduce-scatter test support (#390 )	2024-11-28 09:19:30 -08:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support mote datatype for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which return std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Caio Rocha	b3dc74c020	Small Adjust in Test Data AllGather at Executor Test (#384 )	2024-11-16 15:21:00 +08:00
Ziyue Yang	9526d76fc7	Add kernel-based verification for executor_test (#378 ) Add kernels to fill and test data for correctness test in executor_test.py.	2024-11-07 14:14:20 +08:00
Ziyue Yang	95ab1088ef	Fix in-place all-gather input buffer in executor_test (#372 )	2024-10-24 23:04:11 +08:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Ziyue Yang	faadc75649	Fix missing import in executor test (#334 )	2024-08-06 14:24:50 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
Roshan Dathathri	93ed8e1e58	Add support for multicast reduce insruction (#316 )	2024-06-19 13:28:12 -07:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Roshan Dathathri	41e0964d93	Allow binding allocated memory to NVLS multicast pointer (#290 ) And change NVLS multimem instructions to static functions	2024-04-18 17:11:31 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Saeed Maleki	91d592dcc0	NVLS support. (#250 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Changho Hwang	544ff0c21d	ROCm support (#213 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Changho Hwang	4cdb100265	Release GIL for Python APIs with wait (#190 )	2023-11-14 21:11:01 +08:00
Changho Hwang	3521fb0280	Clear minor warnings (#214 ) Clear warnings from the clang compiler.	2023-11-14 09:28:48 +08:00
Saeed Maleki	85e8017535	Atomic for semaphores instead of fences (#188 ) Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-10-13 18:57:08 +08:00
Saeed Maleki	148681b4bc	Fix a pytest bug (#196 )	2023-10-13 16:39:43 +08:00
Changho Hwang	8c0f9e84d0	v0.3.0 (#171 )	2023-10-11 22:35:54 +08:00
Changho Hwang	11ac824cc7	Align interfaces of put/get/putPackets/getPackets (#185 )	2023-10-07 22:18:26 +08:00
Binyang2014	097aa8843a	Fix pytest unstable issue. (#170 ) - remove `#include <cstdint>` from `poll.hpp`. To make it only contains device-side code - Fix compilation issue, which will cause pytest fail randomly. Reuse the compiled result for same kernel with different arguments	2023-09-06 17:09:04 -07:00
Olli Saarikivi	828be48b21	Add Context and Endpoint classes to enable non-Communicator use-cases (#166 ) This PR implements and closes #137. The new `Endpoint` and `Context` classes expose the connection establishing functionality from `Communicator`, which now is only responsible for tying together the bootstrapper with a context. The largest breaking change here is that `Communicator.connectOnSetup(...)` now returns the `Connection` wrapped inside a `NonblockingFuture`. This is because with the way `Context` is implemented a `Connection` is now fully initialized on construction. Some smaller breaking API changes from this change are that `RegisteredMemory` no longer has a `rank()` function (as there maybe no concept of rank), and similarly `Connection` has no `remoteRank()` and `tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf` functions in `Communicator`. A new `EndpointConfig` class is introduced to avoid duplication of the IB configuration parameters in the APIs of `Context` and `Communicator`. The usual usage pattern of just passing in a `Transport` still works due to an implicit conversion into `EndpointConfig`. Miscellaneous changes: -Cleans up how the PIMPL pattern is applied by making both the `Impl` struct and the `pimpl_` pointers private for all relevant classes in the core API. -Enables ctest to be run from the build root directory.	2023-09-06 13:10:04 +08:00
Binyang2014	858e381829	Pytest (#162 ) Port python tests to mscclpp. Please run `mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest --------- Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>	2023-09-01 21:22:11 +08:00

44 Commits