Ziyue Yang
9526d76fc7
Add kernel-based verification for executor_test ( #378 )
...
Add kernels to fill and test data for correctness test in
executor_test.py.
2024-11-07 14:14:20 +08:00
Jeff Rasley
449c274326
[docs] fix quickstart link ( #374 )
...
Small fix to update quickstart link
2024-10-30 13:13:33 +08:00
Ziyue Yang
95ab1088ef
Fix in-place all-gather input buffer in executor_test ( #372 )
2024-10-24 23:04:11 +08:00
Binyang Li
b72decbfeb
Update docker image for cuda12.4 ( #370 )
...
Update docker image for cuda12.4
Image pushed to registry
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-10-22 12:51:28 +08:00
Binyang Li
582d386b3b
Fix algo repo name ( #369 )
...
Change algo repo name from azure-mscclpp to msccl-users
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-10-22 10:59:15 +08:00
Caio Rocha
c6e06cfad7
Executor AllGather In-Place Support ( #365 )
2024-10-21 05:45:56 -07:00
Binyang Li
4136153a76
[Doc] mscclpp docs ( #348 )
...
Generate docs for mscclpp.
Set up a GitHub Action to auto-deploy the GitHub Pages site.
Doc link: https://microsoft.github.io/mscclpp
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-10-18 06:08:31 +00:00
Changho Hwang
0c150e5166
Fix copyright messages ( #367 )
2024-10-17 21:25:46 -07:00
Changho Hwang
f8c0bcca2b
Perf optimization & support clipping ( #364 )
...
Co-authored-by: Nusrat Islam <Nusrat.Islam@amd.com >
2024-10-16 14:35:08 -07:00
Changho Hwang
e9294357c5
Fix NCCL API bugs ( #363 )
2024-10-16 14:16:34 -07:00
Caio Rocha
08a0cec2eb
Fixing RegisterMemory Allocation for ProxyChannels ( #353 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-09-24 23:01:41 -07:00
Changho Hwang
8a330f9135
Update ROCm CI ( #357 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-09-20 17:57:02 +00:00
Changho Hwang
74130c7c5e
Use IB transport flags only when an IB device exists ( #355 )
2024-09-19 07:13:11 +00:00
Ziyue Yang
5c4e105814
Fix NPKit exit event offset ( #356 )
2024-09-19 13:35:44 +08:00
Binyang Li
b30bb260e3
Tune threads per block for mscclpp executor ( #345 )
2024-09-18 17:21:47 -07:00
Binyang Li
0c7311e83f
Add CI for rocm ( #346 )
2024-09-15 22:30:54 +00:00
Binyang Li
7bedb25054
Add proxy channel related operations ( #351 )
...
Add Flush, PutWithSignal, PutWithFlushAndSignal operation
2024-09-15 13:24:57 -07:00
Binyang Li
26a87535f9
Fix bug in semaphore construction ( #341 )
...
Current semaphore construction requires two-way communication, e.g., to
construct a semaphore signaling from rank 0 to rank 1, both rank 0 and
rank 1 need to send a message to each other. This PR fixes an executor
bug that fails to conduct two-way communication for constructing such
one-way semaphores, and instead hangs during the semaphore construction.
In the future, we may need to change the implementation to construct
semaphore via one-way communication.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-09-04 19:42:03 +08:00
Changho Hwang
72b99a4229
Fix for ROCm 6.0 ( #347 )
2024-09-01 20:22:33 -07:00
Caio Rocha
4eca6f1e95
Support executors to send packets over ProxyChannel ( #344 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-08-30 22:10:33 +00:00
Caio Rocha
1af62ea43d
ProxyChannel Support in Executor ( #342 )
...
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-08-27 10:09:44 -07:00
Changho Hwang
1e82dd444f
Make ibverbs optional at compile time ( #340 )
...
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-08-21 12:47:05 -07:00
Roshan Dathathri
7ed13ec4b5
Auto-tune vector sizes for NVLS allreduce6 ( #338 )
...
Also fixes bugs in MscclppAllReduce6
Below is the performance when the algorithm is fixed to
MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.
Float16:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 2.0 KiB | 11.15 | 0.18 | PASS | 13.82 | 0.15 | PASS | 1.24 |
| 4.0 KiB | 11.15 | 0.37 | PASS | 14.74 | 0.28 | PASS | 1.32 |
| 8.0 KiB | 11.14 | 0.74 | PASS | 15.17 | 0.54 | PASS | 1.36 |
| 16.0 KiB | 11.16 | 1.47 | PASS | 15.77 | 1.04 | PASS | 1.41 |
| 32.0 KiB | 11.15 | 2.94 | PASS | 17.50 | 1.87 | PASS | 1.57 |
| 64.0 KiB | 11.18 | 5.86 | PASS | 17.64 | 3.71 | PASS | 1.58 |
| 128.0 KiB | 11.16 | 11.74 | PASS | 17.83 | 7.35 | PASS | 1.60 |
| 256.0 KiB | 11.21 | 23.38 | PASS | 18.00 | 14.57 | PASS | 1.60 |
| 512.0 KiB | 11.70 | 44.81 | PASS | 18.42 | 28.46 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.87 | PASS | 20.23 | 51.83 | PASS | 1.48 |
| 2.0 MiB | 17.29 | 121.27 | PASS | 31.60 | 66.36 | PASS | 1.83 |
| 4.0 MiB | 25.26 | 166.02 | PASS | 38.74 | 108.26 | PASS | 1.53 |
| 8.0 MiB | 40.17 | 208.83 | PASS | 62.86 | 133.45 | PASS | 1.56 |
| 16.0 MiB | 70.92 | 236.56 | PASS | 113.36 | 147.99 | PASS | 1.60 |
| 32.0 MiB | 131.38 | 255.41 | PASS | 203.21 | 165.13 | PASS | 1.55 |
| 64.0 MiB | 253.39 | 264.84 | PASS | 342.12 | 196.15 | PASS | 1.35 |
| 128.0 MiB | 496.74 | 270.20 | PASS | 670.62 | 200.14 | PASS | 1.35 |
| 256.0 MiB | 982.42 | 273.24 | PASS | 1318.36 | 203.61 | PASS | 1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
Float32:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 4.0 KiB | 11.04 | 0.37 | PASS | 14.79 | 0.28 | PASS | 1.34 |
| 8.0 KiB | 11.15 | 0.73 | PASS | 15.25 | 0.54 | PASS | 1.37 |
| 16.0 KiB | 11.12 | 1.47 | PASS | 15.87 | 1.03 | PASS | 1.43 |
| 32.0 KiB | 11.13 | 2.95 | PASS | 17.21 | 1.90 | PASS | 1.55 |
| 64.0 KiB | 11.11 | 5.90 | PASS | 17.37 | 3.77 | PASS | 1.56 |
| 128.0 KiB | 11.08 | 11.83 | PASS | 17.54 | 7.47 | PASS | 1.58 |
| 256.0 KiB | 11.15 | 23.50 | PASS | 17.71 | 14.80 | PASS | 1.59 |
| 512.0 KiB | 11.56 | 45.34 | PASS | 18.21 | 28.79 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.90 | PASS | 19.87 | 52.77 | PASS | 1.46 |
| 2.0 MiB | 17.24 | 121.67 | PASS | 31.63 | 66.30 | PASS | 1.84 |
| 4.0 MiB | 25.19 | 166.47 | PASS | 38.63 | 108.57 | PASS | 1.53 |
| 8.0 MiB | 40.38 | 207.72 | PASS | 62.65 | 133.89 | PASS | 1.55 |
| 16.0 MiB | 70.72 | 237.23 | PASS | 114.57 | 146.44 | PASS | 1.62 |
| 32.0 MiB | 131.49 | 255.18 | PASS | 200.79 | 167.11 | PASS | 1.53 |
| 64.0 MiB | 253.98 | 264.23 | PASS | 342.58 | 195.89 | PASS | 1.35 |
| 128.0 MiB | 496.96 | 270.08 | PASS | 670.64 | 200.13 | PASS | 1.35 |
| 256.0 MiB | 982.83 | 273.12 | PASS | 1318.90 | 203.53 | PASS | 1.34 |
| 512.0 MiB | 1954.07 | 274.75 | PASS | 2609.04 | 205.77 | PASS | 1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
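The derived columns in the tables above can be reproduced from the raw timings. The sketch below is an assumption inferred from the reported numbers, not taken from the benchmark script: AlgBW appears to be message size in bytes divided by time, expressed in GB/s with 1 GB = 1e9 bytes, and Speed Up appears to be NCCL time divided by MSCCL++ time. The function names `alg_bw_gbps` and `speed_up` are hypothetical.

```python
def alg_bw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s (assumed: 1 GB = 1e9 bytes)."""
    return size_bytes / (time_us * 1e-6) / 1e9


def speed_up(nccl_time_us: float, time_us: float) -> float:
    """Speed-up over NCCL (assumed: ratio of the two latencies)."""
    return nccl_time_us / time_us


if __name__ == "__main__":
    # 256.0 MiB row of the Float16 table: 982.42 us vs. 1318.36 us for NCCL.
    size = 256 * 2**20
    print(round(alg_bw_gbps(size, 982.42), 2))   # AlgBW column
    print(round(speed_up(1318.36, 982.42), 2))   # Speed Up column
```

Under these assumptions, the 2.0 KiB and 256.0 MiB rows of the Float16 table round to the reported AlgBW and Speed Up values.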
2024-08-16 11:11:54 +08:00
Caio Rocha
ead4efc315
Dynamically load libibverbs ( #337 )
2024-08-13 23:48:39 -07:00
Changho Hwang
8c6fb429e9
bfloat16 support ( #336 )
...
* Add bfloat16 support for executor and NCCL interface
* Changed `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00
Ziyue Yang
faadc75649
Fix missing import in executor test ( #334 )
2024-08-06 14:24:50 -07:00
caiomcbr
67eb9b04cc
NCCL API Executor Integration ( #331 )
...
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-07-25 15:05:02 -07:00
Roshan Dathathri
f131fae3ec
Add support for different vector sizes in multimem instructions ( #332 )
2024-07-25 10:14:02 -07:00
Changho Hwang
40cb196553
v0.5.2 ( #328 )
v0.5.2
2024-07-16 00:35:18 +00:00
caiomcbr
7493e2f075
Double buffering for NCCL APIs ( #324 )
...
Using two scratch buffers in each peer to exchange data.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-07-15 22:18:53 +00:00
Binyang Li
5f9ee27aa8
Support to write packets via uint2 ( #327 )
2024-07-15 12:05:13 -07:00
Changho Hwang
c4ca2fbc8c
Resolve clang++ warnings ( #325 )
2024-07-11 07:48:35 +00:00
caiomcbr
f4c3c8f916
AllReduce Kernel for Small Messages ( #322 )
...
Add allreduce kernel code for message sizes smaller than 32 bytes,
when the number of elements is smaller than the number of ranks.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-07-05 21:08:43 +00:00
Ziyue Yang
b5a48f836c
Separate NPKit CPU timestamp access from different blocks for AMD platform ( #321 )
...
Reference: https://github.com/ROCm/rccl/pull/1229
2024-07-02 19:36:48 +08:00
Angelica Moreira
0f796bbdf7
Update allreduce_bench.py ( #318 )
...
Replace the hardcoded network interface name with a generic discovery
strategy.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-06-29 03:41:13 +00:00
caiomcbr
b1b9d0626c
Support NCCL APIs ( #319 )
...
Start supporting NCCL APIs with a few limitations.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-06-27 23:54:06 +00:00
Roshan Dathathri
91550dab4c
Simplify/improve barrier in AllReduce6 ( #317 )
...
Drop superfluous __threadfence_system()
2024-06-23 21:08:59 +00:00
Angelica Moreira
34f4d9d006
Update quickstart.md ( #314 )
...
Updating the docker image name tag and the python benchmark path.
2024-06-19 22:26:13 +00:00
Roshan Dathathri
93ed8e1e58
Add support for multicast reduce instruction ( #316 )
2024-06-19 13:28:12 -07:00
Binyang Li
1351f9f1c5
Add "packet type" option for executor test ( #313 )
...
Add "packet type" option for executor test
2024-06-14 09:53:58 +00:00
Ziyue Yang
f29095b3b1
Fix NPKit support for AMD ( #312 )
2024-06-14 16:22:14 +08:00
Ziyue Yang
76328fe623
Add NPKit GPU event support ( #310 )
2024-06-13 13:59:50 +08:00
Binyang Li
80aefe55bc
Cumulative Updates ( #309 )
...
Bug fix: Unable to execute communication primitives with the same
execution plan but varying message sizes.
Add reduce_packets OP
2024-06-12 19:17:57 +08:00
Changho Hwang
1f62dfd7cd
Add C++ executor test ( #304 )
...
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-05-29 10:54:36 +00:00
Changho Hwang
cddffbc8b6
v0.5.1 ( #308 )
v0.5.1
2024-05-26 14:31:29 -07:00
Binyang Li
3a18068cd4
Fix security issue ( #305 )
...
Change sprintf to snprintf to avoid a potential security issue
2024-05-25 23:12:57 -07:00
Changho Hwang
f76eae4dca
Fix assert declaration & add a compile test ( #303 )
2024-05-20 02:39:30 +00:00
Changho Hwang
d35a2f2dc2
Rename executor.cpp to executor_py.cpp ( #301 )
2024-05-17 13:31:27 -07:00
Changho Hwang
a3cd95bd42
Upgrade gtest ( #300 )
...
The new gtest version resolves a type casting issue:
3044657e7a
2024-05-07 20:49:26 -07:00
Changho Hwang
9c2a96060a
v0.5.0 ( #298 )
v0.5.0
2024-05-04 16:51:48 -07:00