A minimal fix to make things work. We still need a more careful look at
preventing nanobind from silently falling back when it fails to (properly)
construct a C++ STL object from mscclpp instances.
Use `mscclpp::DataType` to replace the following types in the API interface:
* `int dtype;`
* `ncclDataType_t dtype;`

Also add a data type conversion from `ncclDataType_t` to `mscclpp::DataType`, as sketched below.
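A minimal sketch of such a conversion; the helper's name and the exact `mscclpp::DataType` member names are assumptions here, not confirmed API.

```cpp
// Assumes the NCCL and mscclpp headers are already included.
#include <stdexcept>

// Hypothetical helper: map the NCCL enum onto mscclpp's own data type enum.
mscclpp::DataType ncclTypeToMscclppType(ncclDataType_t dtype) {
  switch (dtype) {
    case ncclInt32:    return mscclpp::DataType::INT32;
    case ncclUint32:   return mscclpp::DataType::UINT32;
    case ncclFloat16:  return mscclpp::DataType::FLOAT16;
    case ncclFloat32:  return mscclpp::DataType::FLOAT32;
    case ncclBfloat16: return mscclpp::DataType::BFLOAT16;
    default: throw std::invalid_argument("unsupported ncclDataType_t");
  }
}
```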
The key purpose is to manage all mscclpp objects' memory internally by
hiding shared pointers from user-facing APIs.
* The `Connection` class is now a wrapper around a `BaseConnection` class that is equivalent to the previous `Connection` class (see the sketch after this list)
* `connect()` methods now return `Connection` instead of `std::shared_ptr<Connection>`
* Removed the `connectOnSetup()` method
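A minimal sketch of the wrapper idiom, assuming hypothetical member names; the real class layout may differ:

```cpp
#include <memory>
#include <utility>

class BaseConnection { /* formerly Connection: transport state, write/flush, ... */ };

// Connection is now a cheap value type; the BaseConnection's lifetime is
// managed internally, so no shared_ptr appears in user-facing signatures.
class Connection {
 public:
  explicit Connection(std::shared_ptr<BaseConnection> impl) : impl_(std::move(impl)) {}
  BaseConnection* operator->() const { return impl_.get(); }

 private:
  std::shared_ptr<BaseConnection> impl_;  // hidden implementation handle
};

// connect() can then return Connection by value rather than
// std::shared_ptr<Connection>.
```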
This PR introduces three new operations to enhance flexibility and
performance in the executor.
One operation can be invoked directly via the DSL API, and two are
created by fusing existing operations, reducing overhead and improving
efficiency.
1. Port Channel Put Packet (Direct DSL API Call): Sends data in packet
format to the remote side, also in packet format, via the port channel.
Both source and destination buffers must be scratch buffers.
2. Reduce Copy Packet (Fusion):
Reduce Packet + Copy Packet = Reduce Copy Packet
Triggered when the destination buffer of Reduce Packet matches the
source buffer of Copy Packet.
Purpose: Combine reduction and copy into a single step for better
performance.
3. Reduce Copy Send Packet (Fusion):
Reduce Copy Packet + Put Packet = Reduce Copy Send Packet (when the dst
buffer of Reduce Copy Packet matches the src buffer of Put Packet)
Reduce Copy Packet + Read Put Packet = Reduce Copy Send Packet (when the
dst packet buffer of Reduce Copy Packet matches the src buffer of Read
Put Packet)
Purpose: Combine reduction, copy, and send into one optimized pipeline.
Fusion Diagram
Reduce Packet + Copy Packet → Reduce Copy Packet
Reduce Copy Packet + Put Packet → Reduce Copy Send Packet
Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet
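A minimal sketch of the first fusion rule, with hypothetical operation and field names; the executor's actual data structures differ:

```cpp
// Each operation reads from a source buffer and writes to a destination buffer.
enum class OpType { ReducePacket, CopyPacket, ReduceCopyPacket };

struct Op {
  OpType type;
  int srcBuf;  // buffer id
  int dstBuf;
};

// Fuse ReducePacket + CopyPacket into ReduceCopyPacket when the reduce's
// destination is the copy's source, eliminating the intermediate pass.
bool tryFuseReduceCopy(const Op& a, const Op& b, Op& fused) {
  if (a.type == OpType::ReducePacket && b.type == OpType::CopyPacket &&
      a.dstBuf == b.srcBuf) {
    fused = Op{OpType::ReduceCopyPacket, a.srcBuf, b.dstBuf};
    return true;
  }
  return false;
}
```

The other two rules follow the same pattern, with Put Packet or Read Put Packet as the trailing operation.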
Beyond this, this PR adjusts the two-node AllReduce algorithm:
| Message Size | Latency (µs) |
| --- | --- |
| 1K | 15.34 |
| 2K | 15.88 |
| 4K | 15.71 |
| 8K | 16.01 |
| 16K | 15.88 |
| 32K | 16.21 |
| 64K | 16.90 |
| 128K | 18.24 |
| 256K | 20.39 |
| 512K | 25.26 |
| 1M | 32.74 |
| 2M | 53.64 |
* Always use `ibv_reg_dmabuf_mr` when DMABUF is supported
* Do not check `nvidia-peermem` when unnecessary
* More rigorous check on IB port availability
* Fixed ibverbs wrappers
* Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test
Provides two ways to integrate the MSCCL++ DSL:
1. Integrate with a customized communication group
2. Integrate with the NCCL API

Introduces new Python APIs to support this:
```python
mscclpp.compile  # compile the DSL into a JSON-based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan)  # register the compiled plan with the ExecutionPlanRegistry
mscclpp.ExecutionPlanRegistry.set_selector(selector)  # set the selector, which returns the best execution plan based on collective, message size, world size, ...
```
Fix #556
---------
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
* Add a compile flag `MSCCLPP_USE_IB` that explicitly turns IB on or off
* Fix the `nvidia-peermem` check; it is not needed on DMABUF-supported
systems
* Fix `mp_unit_tests` to skip all IB tests when built with
`-DMSCCLPP_USE_IB=OFF`
Some systems do not include libibverbs.so when installing ibverbs;
instead, they only provide libibverbs.so.1. This PR updates the CMake
file to search for this library and modifies the wrapper to load it.
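A minimal sketch of the fallback load, assuming the wrapper uses `dlopen`; the actual wrapper code may differ:

```cpp
#include <dlfcn.h>

// Prefer the unversioned name, then fall back to the versioned SONAME that
// exists even when only the runtime (non -dev) ibverbs package is installed.
void* loadIbverbs() {
  void* handle = dlopen("libibverbs.so", RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) handle = dlopen("libibverbs.so.1", RTLD_NOW | RTLD_GLOBAL);
  return handle;  // nullptr if neither library is available
}
```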
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Add FP8 support for AllReduce on both NVIDIA and AMD platforms.
Add new data types: `fp8_e4m3` and `fp8_e5m2`.
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Create a token pool for allocating tokens. This feature supports
inter-node NVL and tries to reduce the footprint caused by `cuCreate`;
a sketch follows.
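A minimal sketch of the pooling idea, with hypothetical names: pre-create a fixed set of tokens once and recycle them, instead of paying the creation cost on every allocation.

```cpp
#include <mutex>
#include <utility>
#include <vector>

template <typename Token>
class TokenPool {
 public:
  explicit TokenPool(std::vector<Token> tokens) : free_(std::move(tokens)) {}

  // Returns false when the pool is exhausted.
  bool acquire(Token& out) {
    std::lock_guard<std::mutex> lock(mu_);
    if (free_.empty()) return false;
    out = std::move(free_.back());
    free_.pop_back();
    return true;
  }

  void release(Token token) {
    std::lock_guard<std::mutex> lock(mu_);
    free_.push_back(std::move(token));
  }

 private:
  std::mutex mu_;
  std::vector<Token> free_;
};
```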
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
* Python cannot distinguish `Communicator::connect(const Endpoint&,
...)` from `Communicator::connect(const EndpointConfig&, ...)`.
Temporarily removed the former.
* A few other fixes in Python bindings.
#### Version Format
The package version includes the git commit hash directly in the version
string for development builds:
- **Release version**: `0.7.0`
- **Development version**: `0.7.0.dev36+g6e2360d69` (includes short
commit hash)
- **Development with uncommitted changes**:
`0.7.0.dev36+g6e2360d69.dirty`
#### Checking Version Information
After installation, you can check the version information in several
ways:
**From Python:**
```python
import mscclpp

# Access individual attributes
print(f"Version: {mscclpp.__version__}")  # Full version with commit
# Version: 0.7.0.dev36+g6e2360d69

# Get as dictionary
mscclpp.version()
# {'version': '0.7.0.dev46+gb0d27c58f', 'base_version': '0.7.0', 'git_commit': 'b0d27c58f'}
```
#### Version Information Details
The version tracking captures:
- **Package Version** (`mscclpp.__version__`): Full version string
including git commit (e.g., `0.7.0.dev36+g6e2360d69`)
This information is embedded during the package build process and
remains accessible even after distribution, making it easier to debug
issues and ensure reproducibility.
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
- Make the NCCL interface extensible: customers can register their own
algorithms with the NCCL API and provide a customized algorithm
selection function (see the sketch after this list)
- Fall back to NCCL/RCCL if the selection function selects no algorithm
- MSCCL++ interfaces now work at any scale
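A hypothetical sketch of these extension points; the type names here (`CollectiveContext`, `Algo`, `AlgoSelector`) are illustrative only, not the confirmed interface:

```cpp
#include <functional>
#include <memory>

struct CollectiveContext { /* collective kind, message size, world size, ... */ };
struct Algo { /* a custom algorithm registered with the NCCL API */ };

// The selector inspects each collective call and returns a registered algo,
// or nullptr to fall back to the stock NCCL/RCCL implementation.
using AlgoSelector = std::function<std::shared_ptr<Algo>(const CollectiveContext&)>;
```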
Use a Unix domain socket to share file descriptors with other processes;
this is used for NVLS handle sharing (sketched below).
Update the NCCL interface to support `worldSize=1`.
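A minimal sketch of fd passing over a Unix domain socket via `SCM_RIGHTS`; this shows the standard POSIX mechanism, not mscclpp's exact code:

```cpp
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

// Send one open file descriptor to the peer process on a connected
// AF_UNIX socket; the kernel duplicates the fd into the receiver.
int sendFd(int sock, int fd) {
  char data = 0;
  struct iovec iov = {&data, sizeof(data)};
  char ctrl[CMSG_SPACE(sizeof(int))];
  memset(ctrl, 0, sizeof(ctrl));

  struct msghdr msg = {};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);

  struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;  // instruct the kernel to transfer the fd
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

  return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```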
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* `gpuFree*()` functions are usually called during process teardown, so
we let them ignore errors (see the sketch after this list).
* `AvoidCudaGraphCaptureGuard` is constructed in `gpuFree*()` functions,
so it needs the same fix.
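A minimal sketch of the resulting behavior, with a hypothetical name; the real `gpuFree*()` implementations differ:

```cpp
#include <cuda_runtime.h>

// At process teardown the CUDA context may already be destroyed, so any
// error returned here is deliberately discarded rather than thrown.
void gpuFreeNoThrow(void* ptr) {
  (void)cudaFree(ptr);  // ignore the cudaError_t result
}
```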
Running the NVLS test (`test/nvls_test.cu`) with CUDA 12.9 leads to an
illegal memory access at `test/nvls_test.cu` line 151 (commit
`571fee16fb`). This PR addresses the error by moving the `cudaMemset`
call to after the memory mapping.
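A minimal sketch of the reordering, with a hypothetical placeholder for the mapping step; the real test uses the CUDA VMM/multicast APIs:

```cpp
#include <cuda.h>
#include <cuda_runtime.h>

// Hypothetical placeholder for the test's mapping step (cuMemMap,
// cuMemSetAccess, multicast binding, ...).
void mapMemory(CUdeviceptr ptr, size_t size);

void setupBuffer(CUdeviceptr ptr, size_t size) {
  mapMemory(ptr, size);                               // map first ...
  cudaMemset(reinterpret_cast<void*>(ptr), 0, size);  // ... then zero-fill
}
```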
Integrate MSCCL++ with torch.
Introduces an NCCL audit shim library; users can use the following
commands to launch a torch workload. Also avoids breaking the build
pipeline on CPU-only machines.
```bash
export LD_AUDIT=$MSCCLPP_INSTALL_DIR/libmscclpp_audit_nccl.so
export LD_LIBRARY_PATH=$MSCCLPP_INSTALL_DIR:$LD_LIBRARY_PATH
torchrun --nnodes=1 --nproc_per_node=8 your_script.py
```