mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Binyang Li	ecd33722d4	Fix multi-node H100 CI: CUDA compat, deploy improvements (#781) Summary: (1) Multi-node H100 CI setup: improve architecture detection and GPU configuration. (2) Remove hardcoded VMSS hostnames from deploy files. (3) Fix CUDA compat library issue: remove stale compat paths from the Docker image for CUDA 12+; instead, `peer_access_test` now returns a distinct exit code (2) for CUDA init failure, and `setup.sh` conditionally adds compat libs only when needed. This fixes `cudaErrorSystemNotReady` (error 803) when the host driver is newer than the container's compat libs. (4) Speed up deploy: replace recursive `parallel-scp` with tar+scp+untar to avoid per-file SSH overhead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-13 21:51:29 -07:00
Binyang Li	fa95e82e18	Fix CI/CD pipeline issues (#773) This pull request updates the deployment pipeline to allow custom CMake arguments to be passed to the pip install process on remote VMs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-07 08:41:51 -07:00
Caio Rocha	7eb3ff701a	Supporting New Packet Kernel Operation at Executor (#677) This PR introduces three new operations to enhance flexibility and performance at the executor. One operation can be invoked directly via the DSL API, and two operations are created through fusion of existing operations, reducing overhead and improving efficiency. 1. Port Channel Put Packet (Direct DSL API Call): Sends data in pkt format to the remote side in pkt format via the port channel. Both source and destination buffers must be scratch. 2. Reduce Copy Packet (Fusion): Reduce Packet + Copy Packet = Reduce Copy Packet. Triggered when the destination buffer of Reduce Packet matches the source buffer of Copy Packet. Purpose: Combine reduction and copy into a single step for better performance. 3. Reduce Copy Send Packet (Fusion): Reduce Copy Packet + Put Packet = Reduce Copy Send Packet (when the dst buffer of Reduce Copy Packet matches the src buffer of Put Packet); Reduce Copy Packet + Read Put Packet = Reduce Copy Send Packet (when the dst pkt buffer of Reduce Copy Packet matches the src buffer of Read Put Packet). Purpose: Combine reduction, copy, and send operations into one optimized pipeline. Fusion Diagram: Reduce Packet + Copy Packet → Reduce Copy Packet; Reduce Copy Packet + Put Packet → Reduce Copy Send Packet; Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet. Beyond this, this PR adjusts the AllReduce 2 Node algorithm. Message size and latency (µs): 1K, 15.34; 2K, 15.88; 4K, 15.71; 8K, 16.01; 16K, 15.88; 32K, 16.21; 64K, 16.90; 128K, 18.24; 256K, 20.39; 512K, 25.26; 1M, 32.74; 2M, 53.64.	2025-11-13 14:08:44 -08:00
Changho Hwang	a2f1279c60	Test peer accessibility after deployment (#661) Test GPUs' peer accessibility before integration testing to distinguish VM issues.	2025-10-24 11:09:36 -07:00
Changho Hwang	2f7d74b281	Fix lint.sh (#652) Exit 1 upon any errors from clang-format or black	2025-10-20 17:23:01 -07:00
Changho Hwang	547a9ae65c	Fixed cpp linter (#619)	2025-08-25 12:15:45 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579) This PR contains the following changes. Python side: channel-based DSL implementation that decouples channels from chunks; users create channels explicitly and only need local_rank, remote_rank, and channel_type; adjust the executor JSON file and add remote_buffer fields so that different ops can use different combinations of channels and remote buffers; reimplement operation fusion and the data dependency check mechanism; add new ops such as semaphore and pipeline; clean up code and improve documentation. C++ side: support the new execution file JSON format; support semaphore and pipeline operations; clean up code and support the non-zero-copy scenario. Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Changho Hwang	c3b47c59fd	Updated Dev Container (#591) Added more features in Dev Container; made it runnable on AMD platforms. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-01 13:39:03 -07:00
Changho Hwang	5b84c8a3d1	Separate linters from cmake (#587)	2025-07-28 09:59:20 +08:00
Caio Rocha	986c45b71a	NPKit Support to Read Put Packet Operation (#471)	2025-02-27 12:02:16 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452) Add `Env` class that manages all runtime environments. Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Binyang Li	593478e1b7	Add cross threadblock barrier (#383)	2024-11-26 05:13:30 +00:00
Ziyue Yang	5c4e105814	Fix NPKit exit event offset (#356)	2024-09-19 13:35:44 +08:00
Binyang Li	7bedb25054	Add proxy channel related operations (#351) Add Flush, PutWithSignal, PutWithFlushAndSignal operation	2024-09-15 13:24:57 -07:00
Caio Rocha	4eca6f1e95	Support executors to send packets over ProxyChannel (#344) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-08-30 22:10:33 +00:00
Caio Rocha	1af62ea43d	ProxyChannel Support in Executor (#342) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-08-27 10:09:44 -07:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310)	2024-06-13 13:59:50 +08:00
Saeed Maleki	8d1b984bed	Change device handle interfaces & others (#142) Changed device handle interfaces; changed proxy service interfaces; moved device code into separate files; fixed FIFO polling issues; added configuration arguments in several interface functions. Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>	2023-08-16 20:00:56 +08:00
Binyang2014	2640578b22	Add performance check for mscclpp-test (#110) Add ndmv4 perf baseline; change mscclpp-test to output perf numbers into a JSON file; add a Python script to check the perf result against the baseline.	2023-06-21 07:42:53 +00:00
Ziyue Yang	b234cf5012	NPKit: add DMA events and fix bandwidth calculation (#33)	2023-03-28 09:58:32 +08:00
Ziyue Yang	f92b428cba	Port NPKit	2023-03-24 06:41:16 +00:00

21 Commits