Binyang Li
25435acf5d
Add new algos for GB200 ( #747 )
...
- Add new algos (allreduce_rsag, allreduce_rsag_pipeline and
allreduce_rsag_zero_copy) for GB200.
- Add IB stub for non-IB env
- Provide an example of algorithm tuning with different nblocks/nthreads
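As a rough illustration of what the new algorithm names suggest, here is a minimal CPU-side sketch. It assumes the `rsag` suffix stands for reduce-scatter followed by all-gather, which the commit message does not spell out. The function name `allreduce_rsag` below is reused only for readability: this is a plain-Python simulation on lists, not the GPU kernel added by the PR.

```python
def allreduce_rsag(rank_buffers):
    """Sum-allreduce simulated on plain Python lists, one list per rank.

    Assumption: "rsag" means reduce-scatter + all-gather. This is a
    simulation for illustration, not the mscclpp implementation.
    """
    n = len(rank_buffers)
    length = len(rank_buffers[0])
    assert length % n == 0, "sketch assumes the buffer splits evenly across ranks"
    chunk = length // n

    # Reduce-scatter: rank r ends up owning the fully reduced chunk r.
    reduced_chunks = []
    for r in range(n):
        start, stop = r * chunk, (r + 1) * chunk
        reduced_chunks.append(
            [sum(buf[i] for buf in rank_buffers) for i in range(start, stop)]
        )

    # All-gather: every rank collects all reduced chunks in rank order.
    gathered = [value for piece in reduced_chunks for value in piece]
    return [list(gathered) for _ in range(n)]


result = allreduce_rsag([[1, 2, 3, 4], [10, 20, 30, 40]])
```

After the call, every simulated rank holds the same element-wise sum of all input buffers.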
Perf for allreduce_rsag
```
#                                                       out-of-place                      in-place
#       size       count    type   redop  root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                           (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576      262144   float     sum    -1    25.16   41.67   62.51      0    23.73   44.18   66.27      0
     2097152      524288   float     sum    -1    26.06   80.47  120.71      0    25.31   82.86  124.29      0
     4194304     1048576   float     sum    -1    31.09  134.93  202.39      0    30.75  136.39  204.58      0
     8388608     2097152   float     sum    -1    45.52  184.29  276.43      0    45.13  185.87  278.80      0
    16777216     4194304   float     sum    -1    75.73  221.53  332.30      0    75.51  222.18  333.27      0
    33554432     8388608   float     sum    -1   137.25  244.48  366.72      0   137.22  244.54  366.81      0
    67108864    16777216   float     sum    -1   271.34  247.32  370.99      0   270.86  247.76  371.65      0
   134217728    33554432   float     sum    -1   534.25  251.22  376.84      0   534.43  251.14  376.71      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 264.454
#
# Collective test concluded: all_reduce_perf
```
Perf for allreduce_rsag_pipeline
```
#                                                       out-of-place                      in-place
#       size       count    type   redop  root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                           (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576      262144   float     sum    -1    61.57   17.03   25.55      0    61.51   17.05   25.57      0
     2097152      524288   float     sum    -1    61.31   34.20   51.31      0    61.23   34.25   51.38      0
     4194304     1048576   float     sum    -1    61.62   68.06  102.10      0    61.84   67.83  101.74      0
     8388608     2097152   float     sum    -1    61.97  135.37  203.06      0    61.89  135.53  203.30      0
    16777216     4194304   float     sum    -1    63.15  265.65  398.48      0    62.89  266.76  400.15      0
    33554432     8388608   float     sum    -1   100.63  333.46  500.19      0    99.76  336.34  504.51      0
    67108864    16777216   float     sum    -1   180.04  372.75  559.13      0   179.75  373.34  560.01      0
   134217728    33554432   float     sum    -1   339.60  395.23  592.84      0   338.16  396.91  595.36      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 304.665
#
# Collective test concluded: all_reduce_perf
```
Perf for allreduce_rsag_zero_copy
```
#                                                       out-of-place                      in-place
#       size       count    type   redop  root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)  (elements)                           (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576      262144   float     sum    -1    14.99   69.93  104.90      0    14.44   72.61  108.92      0
     2097152      524288   float     sum    -1    16.19  129.56  194.33      0    15.85  132.32  198.48      0
     4194304     1048576   float     sum    -1    21.19  197.98  296.97      0    20.64  203.20  304.81      0
     8388608     2097152   float     sum    -1    31.04  270.27  405.41      0    30.68  273.44  410.16      0
    16777216     4194304   float     sum    -1    50.34  333.26  499.89      0    50.15  334.51  501.77      0
    33554432     8388608   float     sum    -1    89.58  374.56  561.84      0    88.65  378.48  567.73      0
    67108864    16777216   float     sum    -1   165.69  405.03  607.54      0   163.64  410.10  615.16      0
   134217728    33554432   float     sum    -1   323.19  415.28  622.93      0   318.01  422.05  633.07      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 414.619
#
# Collective test concluded: all_reduce_perf
```
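For readers comparing the three tables, the following sketch shows how the `algbw` and `busbw` columns relate, assuming the output follows the nccl-tests convention for allreduce (`busbw = algbw * 2 * (n - 1) / n`). The rank count of 4 is an inference from the constant busbw/algbw ratio of 1.5 in the tables; the commit message does not state it.

```python
def allreduce_bandwidth(size_bytes, time_us, num_ranks):
    """Return (algbw, busbw) in GB/s using the nccl-tests allreduce convention."""
    # Algorithm bandwidth: payload size divided by elapsed time.
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth: algbw scaled by the allreduce traffic factor 2(n-1)/n.
    busbw = algbw * 2 * (num_ranks - 1) / num_ranks
    return algbw, busbw


# Last out-of-place row of the allreduce_rsag_zero_copy table.
# num_ranks=4 is an assumption inferred from busbw / algbw = 1.5.
algbw, busbw = allreduce_bandwidth(134217728, 323.19, num_ranks=4)
print(f"algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")
```

The recomputed values can differ from the table in the last digit because the reported time is itself rounded.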
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com >
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2026-02-24 16:43:23 -08:00
Changho Hwang
d7925448f3
Update copilot-instructions.md ( #722 )
2026-02-06 11:27:01 -08:00
Qinghua Zhou
f0441ee4ea
Update document versioning for PR #724 ( #735 )
...
This PR fixes the docs generation issue that appeared when
https://github.com/microsoft/mscclpp/pull/724 was merged into the main branch.
Build docs for main branch separately.
Use a HEAD request instead of GET to check whether a page exists.
Filter out versions before v0.4.0 in generate_versions.py.
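The two checks described above can be sketched as follows. This is a minimal illustration, not the contents of `generate_versions.py`: the helper names `version_key`, `filter_versions`, and `page_exists`, and the injectable `opener` parameter, are hypothetical and exist only to keep the example testable without network access.

```python
import urllib.error
import urllib.request


def version_key(tag):
    """Turn a tag such as 'v0.4.0' into a comparable tuple (0, 4, 0)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def filter_versions(tags, minimum="v0.4.0"):
    """Keep only tags at or above `minimum`, preserving input order."""
    return [tag for tag in tags if version_key(tag) >= version_key(minimum)]


def page_exists(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if `url` answers a HEAD request with a 2xx/3xx status.

    HEAD asks the server for headers only, so no page body is downloaded.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except urllib.error.URLError:  # HTTPError (e.g. 404) is a subclass
        return False


kept = filter_versions(["v0.3.0", "v0.4.0", "v0.10.1"])
```

Comparing integer tuples rather than strings keeps `v0.10.1` above `v0.4.0`, which a plain string comparison would get wrong.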
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2026-02-01 19:52:01 -08:00
Qinghua Zhou
cc797abc87
Revert "Support versioning for mscclpp document ( #724 )" ( #734 )
...
This PR reverts commit 69d3b7 to avoid the GitHub Pages issue.
2026-01-23 16:42:54 -08:00
Qinghua Zhou
69d3b79ecd
Support versioning for mscclpp document ( #724 )
...
Show all versions of the mscclpp documentation on the webpage
https://microsoft.github.io/mscclpp/
Add sphinx-multiversion to generate documents for different versions.
Add version selector on document webpage.
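A Sphinx configuration fragment for this kind of setup might look like the sketch below. The option names are real sphinx-multiversion settings, but the chosen patterns are assumptions and are not taken from the repository's `conf.py`.

```python
# Sphinx conf.py fragment (illustrative values, not the repository's settings)
extensions = ["sphinx_multiversion"]

# Build docs for release tags such as v0.4.0 and for the main branch only.
smv_tag_whitelist = r"^v\d+\.\d+\.\d+$"
smv_branch_whitelist = r"^main$"

# Only consider local refs; each version is written to a directory named after its ref.
smv_remote_whitelist = None
smv_outputdir_format = "{ref.name}"
```

With a configuration like this, running `sphinx-multiversion <sourcedir> <outputdir>` builds one output tree per matching tag or branch.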
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2026-01-23 09:45:41 -08:00
Binyang Li
a707273701
Torch integration ( #692 )
...
Reorganize the current native and DSL algorithm implementations.
Provide a unified API for DSL and native algorithms, along with an
interface to tune them.
Provide an interface for PyTorch integration with the native API and the DSL.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com >
2026-01-21 20:32:24 -08:00
Changho Hwang
7b18a42274
Add copilot-instructions.md ( #602 )
2025-12-22 22:15:40 -08:00
Binyang Li
9eb958183c
upgrade codeql to v3 ( #676 )
2025-11-06 16:58:19 -08:00
Changho Hwang
f7d1fb4492
Exclude irrelevant files from workflow triggers ( #663 )
2025-10-23 15:52:19 -07:00
Changho Hwang
58996b5c51
Fix docs version ( #659 )
...
Fetch full history of the repo for accurate version info
2025-10-23 11:14:27 -07:00
Changho Hwang
a48421872e
Fix docs ( #656 )
...
* Fix Python doc generation
* Remove `ChannelTrigger` and fix `ProxyTrigger`
* Fixed package versions for consistency
2025-10-23 00:34:53 +00:00
Binyang Li
3d94383696
Add MSCCLPP_GIT_COMMIT macro ( #640 )
...
- Add MSCCLPP_GIT_COMMIT macro
- Update docs
2025-10-06 15:57:28 -07:00
Binyang Li
be6a941fba
New DSL implementation ( #579 )
...
The PR contains the following changes:
Python side:
- Channel-based DSL implementation: decouple channels from chunks.
- Users create channels explicitly and only need local_rank, remote_rank and
channel_type
- Adjust the executor JSON file and add remote_buffer fields, so different ops can
use different combinations of channels and remote buffers.
- Reimplement operation fusion and the data dependency check mechanism
- Add new ops such as semaphore and pipeline
- Clean up code and improve documentation
C++ side:
- Support the new execution file JSON format
- Support semaphore and pipeline operations
- Clean up code and support the non-zero-copy scenario
---------
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2025-08-09 00:36:20 -07:00
Changho Hwang
5b84c8a3d1
Separate linters from cmake ( #587 )
2025-07-28 09:59:20 +08:00
Binyang Li
adc9ee5684
Export mscclpp GpuBuffer to dlpack format ( #492 )
...
To use NVLS, mscclpp requires the buffer to be allocated by
mscclpp::GpuBuffer. Because CuPy does not support bfloat16 yet, we export
the raw buffer in DLPack format.
Users can use this feature to create buffers with any type supported by
PyTorch:
```python
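import torch  # RawGpuBuffer comes from the mscclpp Python bindings

buffer = RawGpuBuffer(1024 * 2)  # the factor 2 accounts for bfloat16 (2 bytes per element)
dl_pack = buffer.to_dlpack(str(torch.bfloat16))  # export the raw buffer as a DLPack capsule
tensor = torch.utils.dlpack.from_dlpack(dl_pack)  # wrap it as a torch tensor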
```
2025-04-03 12:59:32 -07:00
Changho Hwang
def68ced64
Add CUDA 12.8 images ( #488 )
2025-03-29 00:31:26 +00:00
Binyang Li
7f3b088744
Add multi-nodes example & update doc ( #455 )
...
Documentation update:
* [`docs/design/mscclpp-dsl.md`](diffhunk://#diff-02a69290fb3e02b8a069bf915fbf5266cfc2ac51c6e9ff8b5b19df51ed909b22L114-R114):
Updated the link to the examples folder to reflect the correct path.
New example script:
* [`python/examples/allgather_allpairs_multinodes_packets.py`](diffhunk://#diff-ab42c16ecca0680d55b60b82a6913138c5fba4069b9c4493fbe8c72217fe54bcR1-R76):
Added a new example script demonstrating the allgather all-pairs
algorithm across multiple nodes using packet communication.
IR module improvements:
* [`python/mscclpp/language/ir.py`](diffhunk://#diff-b025796b03fbbd9b2ca9aee2569547efa7a56101743bc4aa05661be0b52aeec9L470-R472):
Refined the sorting criteria for GPU instance channels and thread block
channels to include the channel type, ensuring a more accurate order.
Debugging enhancements:
* [`src/executor/executor.cc`](diffhunk://#diff-60f7806d111e5cc12ded06358b5d5b09b8521e3858f182d8be81ac05147c535dR439-R441):
Added a debug log to indicate the start of communication collective
execution with details about the execution plan and collective.
* [`src/include/debug.h`](diffhunk://#diff-24e5fda55e3712277be4bb99b3c348294a77ebd3046bfe716b74bdb32cd203dfR89):
Introduced a new debug log subsystem identifier `MSCCLPP_EXECUTOR` for
logging executor-related information.
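The sorting change to `ir.py` mentioned above can be illustrated with a small sketch. The `ChannelInfo` record and its fields are hypothetical and do not mirror the classes in `mscclpp.language`. The point is only that adding the channel type to the sort key gives channels that otherwise compare equal a deterministic order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelInfo:
    """Hypothetical channel record; field names are illustrative only."""
    src_buffer: str
    dst_buffer: str
    channel_type: str
    remote_rank: int


def channel_sort_key(channel):
    # Including channel_type breaks ties between channels that share
    # the same buffers, so the order no longer depends on insertion order.
    return (channel.src_buffer, channel.dst_buffer,
            channel.channel_type, channel.remote_rank)


channels = [
    ChannelInfo("input", "output", "sm", 1),
    ChannelInfo("input", "output", "proxy", 1),
]
ordered = sorted(channels, key=channel_sort_key)
```

Sorting the same two channels in the opposite input order yields the same result, which is what makes the order reproducible across ranks.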
2025-01-31 17:52:15 -08:00
Binyang Li
af0bb86e07
Merge mscclpp-lang to mscclpp project ( #442 )
...
First step to merge msccl-tools into the mscclpp repo. This step moves
all msccl-related code, passes the current tests, and does some
necessary refactoring.
Add `mscclpp.language` module
Add `_InstructionOptimizer` and `DagOptimizer` class to optimize the dag
Add `DagLower` to lower dag to intermediate representation
Add documents for mscclpp.language
Remove msccl related code
2025-01-22 09:47:37 -08:00
Changho Hwang
2b54af7e27
Auto-update version numbers in CMakeLists.txt ( #450 )
2025-01-09 17:54:10 -08:00
Binyang Li
3d6bfed2cf
Update version number ( #433 )
...
Co-authored-by: github-actions <github-actions@github.com >
2025-01-02 16:45:08 -08:00
Binyang Li
f18a440feb
trigger ci for release branches ( #426 )
2024-12-21 00:05:13 +00:00
Changho Hwang
2127a3ba29
Improve CMake options ( #376 )
...
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-11-22 01:54:11 +00:00
Binyang Li
4136153a76
[Doc] mscclpp docs ( #348 )
...
Generate docs for mscclpp.
Set up a GitHub Action to auto-deploy the GitHub Pages site.
Doc link: https://microsoft.github.io/mscclpp
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-10-18 06:08:31 +00:00
Changho Hwang
8a330f9135
Update ROCm CI ( #357 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-09-20 17:57:02 +00:00
caiomcbr
7493e2f075
Double buffering for NCCL APIs ( #324 )
...
Using two scratch buffers in each peer to exchange data.
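The general idea of double buffering can be sketched as below. This is a generic illustration of alternating between two scratch buffers by iteration parity, under the assumption that this is what the commit means. The `DoubleBuffer` class is hypothetical and is not the code used for the NCCL APIs.

```python
class DoubleBuffer:
    """Two scratch buffers used in alternation, selected by iteration parity."""

    def __init__(self, size):
        self.buffers = [bytearray(size), bytearray(size)]
        self.iteration = 0

    def current(self):
        # Even iterations use buffer 0, odd iterations use buffer 1.
        return self.buffers[self.iteration % 2]

    def advance(self):
        # Move to the other buffer for the next exchange.
        self.iteration += 1


scratch = DoubleBuffer(4)
first = scratch.current()
first[:] = b"aaaa"           # data for round 0
scratch.advance()
second = scratch.current()
second[:] = b"bbbb"          # round 1 does not overwrite round 0's data
```

Because consecutive rounds write to different buffers, a peer can start the next round while data from the previous round is still being read.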
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-07-15 22:18:53 +00:00
Binyang Li
422c81f0f8
remove make pylib-copy command ( #249 )
...
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Changho Hwang
5fa5bd2706
Check nvidia_peermem during runtime ( #234 )
2023-12-25 12:02:10 +08:00
Changho Hwang
c15a166cf0
Add a documentation issue template ( #230 )
2023-12-05 01:01:45 +00:00
Changho Hwang
544ff0c21d
ROCm support ( #213 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-11-24 16:41:56 +08:00
Changho Hwang
dab19e00c1
Templatize Dockerfiles & update workflows ( #223 )
...
Images are now built by a script from a shared Dockerfile template
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-11-22 13:29:12 -08:00
Changho Hwang
f68820436c
Explicit build dependency on nvidia_peermem ( #201 )
2023-10-23 04:29:30 +00:00
Changho Hwang
8c0f9e84d0
v0.3.0 ( #171 )
2023-10-11 22:35:54 +08:00
Changho Hwang
11ac824cc7
Align interfaces of put/get/putPackets/getPackets ( #185 )
2023-10-07 22:18:26 +08:00
Changho Hwang
497a9e0c82
Add backup workflows ( #189 )
2023-10-07 15:13:49 +08:00
Changho Hwang
bb64f68d74
Update issue templates ( #179 )
2023-09-15 04:05:09 +00:00
Saeed Maleki
e7d5e652df
Python bindings ( #125 )
...
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-07-19 15:35:54 +08:00
Binyang2014
56bdbc2f32
Enable test for both cuda11 and cuda12 ( #124 )
...
Update pipeline: enable test for both cuda11 and cuda12
2023-07-10 13:19:14 +08:00
Changho Hwang
4114d65c60
Documents & minor updates ( #119 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-07-07 17:35:05 +08:00
Changho Hwang
bb7b85a810
2-node AllReduce improvements ( #118 )
...
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel
3) and 99GB/s for 48MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-07-07 07:05:46 +00:00
Binyang2014
2640578b22
Add performance check for mscclpp-test ( #110 )
...
- Add ndmv4 perf baseline
- Change mscclpp-test to output perf numbers into a JSON file
- Add a Python script to check the perf results against the baseline
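A baseline check of this kind might look like the following sketch. The JSON layout, the test names, the 5% tolerance, and the `check_perf` helper are all assumptions for illustration. They are not taken from mscclpp-test or its checking script.

```python
import json


def check_perf(results, baseline, tolerance=0.05):
    """Return the names of tests whose measured value falls below baseline.

    A test fails if it is missing from `results` or if its value is more
    than `tolerance` (as a fraction) below the baseline value.
    """
    failures = []
    for name, expected in baseline.items():
        measured = results.get(name)
        if measured is None or measured < expected * (1 - tolerance):
            failures.append(name)
    return failures


# Made-up numbers in a made-up JSON layout.
results = json.loads('{"test_a": 99.0, "test_b": 40.0}')
baseline = {"test_a": 100.0, "test_b": 50.0}
print(check_perf(results, baseline))  # prints ['test_b']
```

Here `test_a` passes because 99.0 is within 5% of its baseline, while `test_b` is reported because 40.0 is below 47.5.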
2023-06-21 07:42:53 +00:00
Changho Hwang
5a4885ccbb
Misc updates ( #95 )
2023-06-12 13:53:43 +08:00
Changho Hwang
798631bd52
Update unit tests ( #81 )
2023-06-08 09:58:05 +00:00
Changho Hwang
7346e70109
Use MSCCL++ Docker image for CodeQL ( #94 )
2023-06-06 18:42:22 +08:00
Changho Hwang
0581bfb431
Fix CodeQL workflow ( #80 )
2023-05-22 14:03:30 +08:00
Changho Hwang
8d54bf3301
Update CI ( #79 )
2023-05-21 11:45:41 -07:00
Binyang Li
5704fb7c6a
update
2023-05-11 08:55:51 +00:00
Binyang Li
1487596dc8
update cpplint
2023-05-11 08:34:57 +00:00
Binyang Li
669c67b3de
enable github action on all branches
2023-05-05 08:42:25 +00:00
Changho Hwang
72431957fd
Use clang-format-12
2023-03-27 14:00:03 +00:00
Binyang Li
7ec6ae9d6a
add cpplint and CI
2023-03-27 03:32:10 +00:00