mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 09:17:06 +00:00

Author	SHA1	Message	Date
Caio Rocha	17247cd695	DSL Quick Start (#689 ) Fix #675 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-11-21 14:45:49 -08:00
Changho Hwang	1bf4e8c90e	`connect()` APIs changed to return an instance instead of a shared_ptr (#680 ) The key purpose is handling all mscclpp objects' memory internally by hiding shared pointers from user APIs. * `Connection` class is now a wrapper of `BaseConnection` class that is equivalent to the previous `Connection` class * `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>` * Removed `connectOnSetup()` method	2025-11-15 11:40:40 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two ways to integrate the MSCCL++ DSL: 1. Integrate with a customized communication group 2. Integrate with the NCCL API. Introduces new Python APIs to make it work: ```python mscclpp.compile # compile DSL to a JSON-based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan with ExecutionPlanRegistry mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, which returns the best execution plan based on collective, message size, world size, ... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Changho Hwang	9994f53cea	Fixes for no-IB systems (#667 ) * Add a compile flag `MSCCLPP_USE_IB` that explicitly specifies IB on/off * Fix `nvidia-peermem` check; no need for DMABUF supported systems * Fix `mp_unit_tests` to skip all IB tests when built with `-DMSCCLPP_USE_IB=OFF`	2025-10-29 10:03:02 -07:00
Caio Rocha	d7b99e9c9d	Improving DSL documentation (#650 )	2025-10-23 17:50:33 -07:00
Changho Hwang	a48421872e	Fix docs (#656 ) * Fix Python doc generation * Remove `ChannelTrigger` and fix `ProxyTrigger` * Fixed package versions for consistency	2025-10-23 00:34:53 +00:00
Changho Hwang	b8f61cb761	Update the port channel tutorial doc (#653 )	2025-10-21 11:52:15 -07:00
Binyang Li	ddca185add	Address corner case when generating version file (#641 ) Address corner case for version file generation --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com>	2025-10-07 14:32:33 -07:00
Binyang Li	3d94383696	Add MSCCLPP_GIT_COMMIT macro (#640 ) - Add MSCCLPP_GIT_COMMIT macro - Update docs	2025-10-06 15:57:28 -07:00
Qinghua Zhou	16a96ea77b	Support detailed version tracking that captures git repository information (#639 ) #### Version Format The package version includes the git commit hash directly in the version string for development builds: - Release version: `0.7.0` - Development version: `0.7.0.dev36+g6e2360d69` (includes short commit hash) - Development with uncommitted changes: `0.7.0.dev36+g6e2360d69.dirty` #### Checking Version Information After installation, you can check the version information in several ways: From Python: ```python import mscclpp # Access individual attributes print(f"Version: {mscclpp.__version__}") # Full version with commit Version: 0.7.0.dev36+g6e2360d69 # Get as dictionary mscclpp.version() {'version': '0.7.0.dev46+gb0d27c58f', 'base_version': '0.7.0', 'git_commit': 'b0d27c58f'} ``` #### Version Information Details The version tracking captures: - Package Version (`mscclpp.__version__`): Full version string including git commit (e.g., `0.7.0.dev36+g6e2360d69`) This information is embedded during the package build process and remains accessible even after distribution, making it easier to debug issues and ensure reproducibility. --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-09-30 09:00:33 -07:00
Binyang Li	70b8297c56	Revise NCCL API implementation (#617 ) - Make the NCCL interface extensible: users can register their own algorithms with the NCCL API and provide a customized algorithm selection function. - Fall back to NCCL/RCCL if the algorithm selection function selects no algorithm - MSCCLPP interfaces now work at any scale	2025-09-26 10:08:12 -07:00
Binyang Li	ba4c4aaeb8	Integrate MSCCL++ with torch workload (#626 ) Integrate MSCCL++ with torch. Introduce an `NCCL audit shim library`; users can use the following commands to launch torch. Also avoid breaking the build pipeline on CPU machines. ```bash export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH torchrun --nnodes=1 --nproc_per_node=8 your_script.py ```	2025-09-09 13:28:32 -07:00
Caio Rocha	c3473b1794	Thread Block Group DSL (#621 ) Support creating a group of thread blocks to perform an operation.	2025-09-03 14:58:40 -07:00
Changho Hwang	2eadbaf86f	python doc auto generation (#605 ) Add Python API references	2025-08-11 10:34:29 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains the following changes: Python side: - Channel-based DSL implementation: decouple channels from chunks. - Users create channels explicitly and only need local_rank, remote_rank, and channel_type - Adjust the executor JSON file and add remote_buffer fields; different ops can use different combinations of channels and remote buffers. - Reimplement operation fusion and the data dependency check mechanism - Add new ops such as semaphore and pipeline - Clean up code and improve documentation C++ side: - Support the new execution file JSON format - Support semaphore and pipeline operations - Code cleanup; support the non-zero-copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Changho Hwang	9650e5c37e	Update documentation (#576 ) Documentation overhaul	2025-08-07 15:37:37 -07:00
Binyang Li	5e991cf5c8	update readme & bump version (#550 ) Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-07-12 01:00:18 -07:00
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Wenxuan Tan	2151790463	Fix some typos in docs (#555 )	2025-06-19 19:39:37 +00:00
Changho Hwang	de664ad200	Fix #514 (#521 ) * In cases when the same `tag` is used for receiving data from the same remote rank, #514 changed the behavior of `Communicator::connect` and `Communicator::recvMemory` to receive data in the order in which `std::shared_future::get()` is called, instead of the original behavior of receiving data in the order of the method calls. Since the original behavior is more intuitive, we restore it. Now when `get()` is called on a future, the async function will first call `wait()` on the latest previously returned future. In a recursive manner, this will call `wait()` on all previous futures that are not yet ready. * Removed all deprecated API calls and replaced them with the new ones.	2025-05-13 13:43:35 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Binyang Li	a3d8d6807b	Remove the requirement for `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support (#489 ) Remove the requirement for `CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED` for NVLS support. Fix #487	2025-03-28 16:46:54 -07:00
Qinghua Zhou	0f21ed44b8	Add CI test for fallback allgather, allreduce, broadcast, and reducescatter to NCCL operations (#485 ) Add CI test for fallback allgather, allreduce, broadcast, and reducescatter to NCCL operations. Test the following parameters: -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl/nccl/build/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allgather, allreduce, broadcast, reducescatter" or "all"	2025-03-27 21:13:07 +00:00
Changho Hwang	e4012ded48	Mark mscclpp-test as deprecated in the doc (#478 )	2025-03-11 22:44:38 +00:00
Yang Wang	1ff217d5f3	Fix minor typos and errors in documentation (#474 )	2025-02-28 17:46:24 -08:00
Binyang Li	7f3b088744	Add multi-nodes example & update doc (#455 ) Documentation update: * [`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114): Updated the link to the examples folder to reflect the correct path. New example script: * [`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76): Added a new example script demonstrating the allgather all-pairs algorithm across multiple nodes using packet communication. IR module improvements: * [`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472): Refined the sorting criteria for GPU instance channels and thread block channels to include the channel type, ensuring a more accurate order. Debugging enhancements: * [`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441): Added a debug log to indicate the start of communication collective execution with details about the execution plan and collective. * [`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89): Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for logging executor-related information.	2025-01-31 17:52:15 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436 ) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Binyang Li	af0bb86e07	Merge mscclpp-lang to mscclpp project (#442 ) First step to merge msccl-tools into the mscclpp repo. This step moves all msccl-related code, passes the current tests, and does some necessary refactoring. Add `mscclpp.language` module. Add `_InstructionOptimizer` and `DagOptimizer` classes to optimize the DAG. Add `DagLower` to lower the DAG to an intermediate representation. Add documents for mscclpp.language. Remove msccl-related code.	2025-01-22 09:47:37 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	3d6bfed2cf	Update version number (#433 ) Co-authored-by: github-actions <github-actions@github.com>	2025-01-02 16:45:08 -08:00
Binyang Li	863a599360	Disable CuMemMap check for ROCm (#411 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-17 08:36:25 +00:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Changho Hwang	2127a3ba29	Improve CMake options (#376 ) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Binyang Li	4136153a76	[Doc] mscclpp docs (#348 ) Generate docs for mscclpp. Set up a GitHub Action to auto-deploy the GitHub Pages doc; link here: https://microsoft.github.io/mscclpp --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-10-18 06:08:31 +00:00
Changho Hwang	40cb196553	v0.5.2 (#328 )	2024-07-16 00:35:18 +00:00
caiomcbr	b1b9d0626c	Support NCCL APIs (#319 ) Start supporting NCCL APIs with a few limitations. --------- Co-authored-by: Caio Rocha <caio.rocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-06-27 23:54:06 +00:00
Angelica Moreira	34f4d9d006	Update quickstart.md (#314 ) Updating the docker image name tag and the python benchmark path.	2024-06-19 22:26:13 +00:00
Changho Hwang	cddffbc8b6	v0.5.1 (#308 )	2024-05-26 14:31:29 -07:00
Changho Hwang	9c2a96060a	v0.5.0 (#298 )	2024-05-04 16:51:48 -07:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 )	2024-03-27 11:53:09 -07:00
Changho Hwang	cdaf3aea3d	New packet format & optimizations (#256 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-20 20:01:37 -08:00
Changho Hwang	f1605b73d6	v0.4.2 (#236 )	2023-12-18 11:42:58 +08:00
Binyang Li	f1b2c9df12	Fix performance downgrade issue & update doc (#229 ) For the push function, we only need to make sure the `st.global` instruction is executed after the while loop. Since there is a Write-After-Read hazard on `trigger.fst` (we first check `this->triggers[curFifoHead % size].fst != 0` and then write the value to `triggers[curFifoHead % size]`), we can expect the compiler and hardware to handle this situation correctly, so the `release.sys` there is removed. Note that `st.global.release.sys.v2.u64` causes a performance regression. Previously we used `st.global.release.cta.v2.u64`, but that seems unnecessary.	2023-12-04 10:20:10 -08:00
Changho Hwang	351b95b926	Update documents (#225 ) Add AMD support to the docs	2023-11-24 17:00:18 +08:00
Changho Hwang	15f6dcca49	Update documentation (#217 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 12:58:04 -08:00
Changho Hwang	f68820436c	Explicit build dependency on `nvidia_peermem` (#201 )	2023-10-23 04:29:30 +00:00
Changho Hwang	3df18d20a3	Update install guidelines (#159 )	2023-08-30 10:40:40 -07:00
Changho Hwang	4114d65c60	Documents & minor updates (#119 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-07-07 17:35:05 +08:00
Changho Hwang	6ec585f3d8	Packet copy for IB (#109 ) * Extend channels to support LL with IB * Rename classes and interfaces	2023-06-28 10:39:31 -07:00


53 Commits
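
The version format described in commit 16a96ea77b (#639) lists three string shapes: release (`0.7.0`), development (`0.7.0.dev36+g6e2360d69`), and development with uncommitted changes (`0.7.0.dev36+g6e2360d69.dirty`). A minimal sketch of splitting such a string into its parts might look like the following. The `parse_version` helper, its regular expression, and the `dev_number` and `dirty` keys are hypothetical illustrations and are not part of the mscclpp package; only the three example strings come from the commit message.

```python
import re

# Hypothetical helper, NOT part of mscclpp: the pattern is inferred only
# from the three example strings quoted in the commit message for #639.
_VERSION_RE = re.compile(
    r"^(?P<base>\d+\.\d+\.\d+)"            # base version, e.g. 0.7.0
    r"(?:\.dev(?P<dev>\d+)"                # optional development counter
    r"\+g(?P<commit>[0-9a-f]+)"            # short git commit hash after "+g"
    r"(?P<dirty>\.dirty)?)?$"              # optional uncommitted-changes marker
)


def parse_version(version: str) -> dict:
    """Split a version string of the shapes shown above into its parts."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    dev = match.group("dev")
    return {
        "base_version": match.group("base"),
        "dev_number": int(dev) if dev is not None else None,
        "git_commit": match.group("commit"),  # None for release versions
        "dirty": match.group("dirty") is not None,
    }


if __name__ == "__main__":
    for example in (
        "0.7.0",
        "0.7.0.dev36+g6e2360d69",
        "0.7.0.dev36+g6e2360d69.dirty",
    ):
        print(example, "->", parse_version(example))
```

With the package installed, the string to pass in would be `mscclpp.__version__`, which the same commit message names as the attribute holding the full version.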