20 Commits

Author SHA1 Message Date
Binyang Li
fa95e82e18 Fix CI/CD pipeline issues (#773)
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 08:41:51 -07:00
Caio Rocha
7eb3ff701a Supporting New Packet Kernel Operation at Executor (#677)
This PR introduces three new operations to enhance flexibility and
performance in the executor.

One operation can be invoked directly via the DSL API, and two operations
are created by fusing existing operations, which reduces overhead and
improves efficiency.

1. Port Channel Put Packet (Direct DSL API Call): Sends data that is
already in packet format to the remote side, also in packet format, via
the port channel. Both source and destination buffers must be scratch
buffers.

2. Reduce Copy Packet (Fusion):
Reduce Packet + Copy Packet = Reduce Copy Packet
Triggered when the destination buffer of Reduce Packet matches the
source buffer of Copy Packet.
Purpose: Combine reduction and copy into a single step for better
performance.

3. Reduce Copy Send Packet (Fusion):
Reduce Copy Packet + Put Packet = Reduce Copy Send Packet (when the
destination buffer of Reduce Copy Packet matches the source buffer of
Put Packet)
Reduce Copy Packet + Read Put Packet = Reduce Copy Send Packet (when the
destination packet buffer of Reduce Copy Packet matches the source buffer
of Read Put Packet)
Purpose: Combine reduction, copy, and send operations into one optimized
pipeline.


Fusion Diagram
Reduce Packet + Copy Packet → Reduce Copy Packet
Reduce Copy Packet + Put Packet → Reduce Copy Send Packet
Reduce Copy Packet + Read Put Packet → Reduce Copy Send Packet
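
The fusion rules above can be read as a small peephole pass over an operation list. The following is only a minimal sketch under stated assumptions: the `Op` record, the lowercase op names, and the `fuse` helper are hypothetical and are not the executor's or the DSL's actual types. It also collapses the plain and packet destination buffers into a single `dst` field, which the real operations distinguish.

```python
from dataclasses import dataclass


@dataclass
class Op:
    # Hypothetical record: an op name plus one source and one destination buffer.
    name: str
    src: str
    dst: str


# (first op, second op) -> fused op, mirroring the fusion diagram.
FUSION_RULES = {
    ("reduce_packet", "copy_packet"): "reduce_copy_packet",
    ("reduce_copy_packet", "put_packet"): "reduce_copy_send_packet",
    ("reduce_copy_packet", "read_put_packet"): "reduce_copy_send_packet",
}


def fuse(ops):
    """Fuse adjacent ops when a rule exists and the first op's dst is the second op's src."""
    fused = []
    for op in ops:
        if fused:
            prev = fused[-1]
            fused_name = FUSION_RULES.get((prev.name, op.name))
            if fused_name is not None and prev.dst == op.src:
                # Replace the previous op with the fused one and keep scanning,
                # so a fused op can be fused again with the next op.
                fused[-1] = Op(fused_name, prev.src, op.dst)
                continue
        fused.append(op)
    return fused
```

In this sketch, a Reduce Packet followed by a Copy Packet and a Put Packet with matching buffers collapses into a single Reduce Copy Send Packet, while ops whose buffers do not match are left untouched.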

Beyond this, this PR adjusts the AllReduce 2 Node algorithm:

Message Size | Latency (µs)
1K           | 15.34
2K           | 15.88
4K           | 15.71
8K           | 16.01
16K          | 15.88
32K          | 16.21
64K          | 16.90
128K         | 18.24
256K         | 20.39
512K         | 25.26
1M           | 32.74
2M           | 53.64
2025-11-13 14:08:44 -08:00
Changho Hwang
a2f1279c60 Test peer accessibility after deployment (#661)
Test GPUs' peer accessibility before integration testing to distinguish
VM issues.
2025-10-24 11:09:36 -07:00
Changho Hwang
2f7d74b281 Fix lint.sh (#652)
Exit 1 upon any errors from clang-format or black
2025-10-20 17:23:01 -07:00
Changho Hwang
547a9ae65c Fixed cpp linter (#619) 2025-08-25 12:15:45 -07:00
Binyang Li
be6a941fba New DSL implementation (#579)
The PR contains the following changes:
Python side:
- Channel-based DSL implementation: decouple channels from chunks.
- Users create channels explicitly and only need local_rank, remote_rank,
and channel_type.
- Adjust the executor JSON file: add remote_buffer fields so that different
ops can use different combinations of channels and remote buffers.
- Reimplement operation fusion and the data dependency check mechanism.
- Add new ops such as semaphore and pipeline.
- Clean up code and improve documentation.
C++ side:
- Support the new execution file JSON format.
- Support semaphore and pipeline operations.
- Clean up code and support the non-zero-copy scenario.

---------

Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-08-09 00:36:20 -07:00
Changho Hwang
c3b47c59fd Updated Dev Container (#591)
* Added more features in Dev Container
* Made it runnable on AMD platforms

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-01 13:39:03 -07:00
Changho Hwang
5b84c8a3d1 Separate linters from cmake (#587) 2025-07-28 09:59:20 +08:00
Caio Rocha
986c45b71a NPKit Support to Read Put Packet Operation (#471) 2025-02-27 12:02:16 -08:00
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
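
The idea of a single object that owns all runtime environment settings can be sketched as follows. This is an illustration only, written in Python rather than the project's C++: the `Env` dataclass, its `npkit_dump_dir` field, and the `from_environ` helper are hypothetical and do not reproduce the actual class. Only the variable name `MSCCLPP_NPKIT_DUMP_DIR` comes from the commit message.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Env:
    # Hypothetical field; the real class may expose different members.
    npkit_dump_dir: str

    @classmethod
    def from_environ(cls, environ=None):
        """Read every setting once from a mapping (defaults to os.environ)."""
        environ = os.environ if environ is None else environ
        return cls(npkit_dump_dir=environ.get("MSCCLPP_NPKIT_DUMP_DIR", ""))
```

Reading the variables in one place means the rest of the code queries the `Env` object instead of calling `os.environ` directly, so a rename like the one above touches a single line.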
2025-01-15 09:44:52 -08:00
Binyang Li
593478e1b7 Add cross threadblock barrier (#383) 2024-11-26 05:13:30 +00:00
Ziyue Yang
5c4e105814 Fix NPKit exit event offset (#356) 2024-09-19 13:35:44 +08:00
Binyang Li
7bedb25054 Add proxy channel related operations (#351)
Add Flush, PutWithSignal, and PutWithFlushAndSignal operations
2024-09-15 13:24:57 -07:00
Caio Rocha
4eca6f1e95 Support executors to send packets over ProxyChannel (#344)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-08-30 22:10:33 +00:00
Caio Rocha
1af62ea43d ProxyChannel Support in Executor (#342)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-08-27 10:09:44 -07:00
Ziyue Yang
76328fe623 Add NPKit GPU event support (#310) 2024-06-13 13:59:50 +08:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Binyang2014
2640578b22 Add performance check for mscclpp-test (#110)
- Add ndmv4 perf baseline
- Change mscclpp-test to write perf numbers to a JSON file
- Add a Python script to check the perf results against the baseline
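
A baseline check of this kind might look like the sketch below. It is not the script added by this commit: the JSON layout (a flat mapping from test name to time in microseconds), the `check_perf` name, and the 5% tolerance are all assumptions made for illustration.

```python
import json


def check_perf(results, baseline, tolerance=0.05):
    """Return names of tests that are missing or slower than baseline by more than tolerance."""
    failures = []
    for name, expected in baseline.items():
        measured = results.get(name)
        # A test fails if it produced no number or exceeds the allowed slowdown.
        if measured is None or measured > expected * (1 + tolerance):
            failures.append(name)
    return failures


def check_perf_json(results_text, baseline_text, tolerance=0.05):
    """Same check, starting from JSON text as a test run might write it to disk."""
    return check_perf(json.loads(results_text), json.loads(baseline_text), tolerance)
```

A CI step could call `check_perf_json` on the contents of the two files and fail the job when the returned list is non-empty.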
2023-06-21 07:42:53 +00:00
Ziyue Yang
b234cf5012 NPKit: add DMA events and fix bandwidth calculation (#33) 2023-03-28 09:58:32 +08:00
Ziyue Yang
f92b428cba Port NPKit 2023-03-24 06:41:16 +00:00