SreevatsaAnantharamu
0c7ed2c674
Add ncclBcast / ncclBroadcast support ( #419 )
...
A simple broadcast using scratch buffer and option to use an executor.
2024-12-19 01:16:30 +00:00
David Sidler
d8d0dfbffa
Fix synchronization in allreduce8 kernel ( #407 )
...
Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a
synchronization bug. The PR addresses it by ensuring that signals are
only issued after all threads in the block have issued their writes to
guarantee correct ordering between data writes and signal writes.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-12-18 17:10:22 -08:00
Caio Rocha
774602d49c
Supporting Executor multi node in NCCL API ( #412 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-12-18 15:50:58 -08:00
Binyang Li
fcb2e46cb1
NVLS support for NCCL API ( #410 )
...
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-12-18 09:55:35 +00:00
Binyang Li
863a599360
Disable CuMemMap check for ROCm ( #411 )
...
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-12-17 08:36:25 +00:00
Binyang Li
c65f19ad1a
Move pipeline to official org ( #406 )
...
Move pipeline to official org. Unify all pipelines
2024-12-16 09:43:00 -08:00
Binyang Li
ee75caf365
Reduce memory usage for scratch buffer ( #403 )
...
In the executor, we allocate the scratch buffer based on `sendMemRange`.
However, for certain execution plans, this allocation may be unsuitable,
as the plan does not support messages of this size.
To avoid allocating too much memory and causing an OOM error, set the scratch
buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan),
scratchBufferSize(sendMemRange))`.
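The capping rule described above can be sketched as follows. This is only an illustration: `scratch_buffer_size` and its `scale` factor are hypothetical stand-ins for the executor's real `scratchBufferSize`, whose actual computation is not shown in this commit message.

```python
def scratch_buffer_size(message_size: int, scale: int = 2) -> int:
    """Hypothetical stand-in for the executor's scratchBufferSize().

    Assumption: the scratch requirement grows monotonically with the
    message size. The real function depends on the execution plan.
    """
    return message_size * scale


def capped_scratch_size(send_mem_range: int, max_message_size_for_plan: int) -> int:
    """Pick the smaller of the two candidate scratch sizes.

    Sizing by send_mem_range alone can over-allocate when the plan
    cannot handle messages that large, so the plan's maximum supported
    message size acts as an upper bound.
    """
    return min(
        scratch_buffer_size(max_message_size_for_plan),
        scratch_buffer_size(send_mem_range),
    )


# A 1 GiB send range with a plan that supports at most 1 MiB messages:
# the allocation follows the plan limit, not the send range.
plan_limited = capped_scratch_size(1 << 30, 1 << 20)

# A 64 KiB send range with the same plan: the send range is the smaller bound.
range_limited = capped_scratch_size(64 << 10, 1 << 20)
```

With these assumed values, the first call is bounded by the plan's maximum message size and the second by the send range, which is the behavior the commit describes.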
2024-12-13 13:00:04 -08:00
Caio Rocha
01fd813f1b
Exception Max Number Operation per Tb ( #405 )
2024-12-11 16:06:15 -08:00
Binyang Li
7a3dcb0627
Setup pipeline for mscclpp over nccl ( #401 )
...
Setup pipeline for mscclpp over nccl
Run `all_reduce_perf` via nccl API
2024-12-07 08:57:45 -08:00
Changho Hwang
756f24c697
Revised ProxyChannel interfaces ( #400 )
...
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Ziyue Yang
f6305a3c1d
Add connection events for NPKit ( #386 )
2024-12-05 00:06:37 +08:00
Binyang Li
88d28e07a7
Select algo according to json config ( #396 )
...
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root \
  -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so \
  -x NCCL_DEBUG=WARN \
  -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files \
  /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20
2024-12-03 22:39:20 +00:00
Caio Rocha
ff18bb8d0b
Providing reduce-scatter test support ( #390 )
2024-11-28 09:19:30 -08:00
Caio Rocha
d9c297ba14
AllGather Executor Support in NCCL Interface ( #393 )
...
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-11-27 17:05:51 -08:00
Binyang Li
593478e1b7
Add cross threadblock barrier ( #383 )
2024-11-26 05:13:30 +00:00
Binyang Li
1b8d020650
Fix mscclpp_benchmark ( #392 )
...
Enable 1GB message size for NVLS transport in mscclpp_benchmark
2024-11-25 19:59:51 +00:00
Caio Rocha
93628d2066
Fixing Message Boundary AllReduce Fallback Code ( #391 )
2024-11-23 12:15:56 -08:00
Changho Hwang
2127a3ba29
Improve CMake options ( #376 )
...
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-11-22 01:54:11 +00:00
Binyang Li
db8e187407
Fix typo ( #389 )
2024-11-21 22:45:50 +00:00
Binyang Li
28a57b0610
NVLS support for msccl++ executor ( #375 )
...
- Support more datatypes for multicast operations
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda to return std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda
Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Ziyue Yang
3e51e9b359
Fix missing packet parameter for executor ( #385 )
2024-11-19 08:36:37 +08:00
Caio Rocha
b3dc74c020
Small Adjust in Test Data AllGather at Executor Test ( #384 )
2024-11-16 15:21:00 +08:00
Binyang Li
1baea89fa0
Fix light load bug ( #379 )
...
Fix lightLoadExecutionPlan issue.
An execution context may have multiple device execution plans. These plans
share the channel connections that were constructed earlier.
A deviceExecutionPlanKey is introduced to identify these plans. We can
get the current device execution plan key via:
`contexts.currentDevicePlan`
2024-11-13 07:58:43 +00:00
Caio Rocha
d5d608abdc
Fixing Bug Const Offset in Execution Plan ( #380 )
...
The offset was not differentiating between the buffer types, causing the
offset to be incorrect when the buffer type was not `SCRATCH`.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-11-11 20:02:02 -08:00
Changho Hwang
85fdde7a73
Lazily create the context stream ( #381 )
...
Create the context stream only when needed.
2024-11-11 10:39:32 +08:00
Ziyue Yang
9526d76fc7
Add kernel-based verification for executor_test ( #378 )
...
Add kernels to fill and test data for correctness test in
executor_test.py.
2024-11-07 14:14:20 +08:00
Jeff Rasley
449c274326
[docs] fix quickstart link ( #374 )
...
Small fix to update quickstart link
2024-10-30 13:13:33 +08:00
Ziyue Yang
95ab1088ef
Fix in-place all-gather input buffer in executor_test ( #372 )
2024-10-24 23:04:11 +08:00
Binyang Li
b72decbfeb
Update docker image for cuda12.4 ( #370 )
...
Update docker image for cuda12.4
Image pushed to registry
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-10-22 12:51:28 +08:00
Binyang Li
582d386b3b
Fix algo repo name ( #369 )
...
Change algo repo name from azure-mscclpp to msccl-users
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-10-22 10:59:15 +08:00
Caio Rocha
c6e06cfad7
Executor AllGather In-Place Support ( #365 )
2024-10-21 05:45:56 -07:00
Binyang Li
4136153a76
[Doc] mscclpp docs ( #348 )
...
Generate docs for mscclpp.
Set up a GitHub Action to auto-deploy GitHub Pages.
Doc link: https://microsoft.github.io/mscclpp
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-10-18 06:08:31 +00:00
Changho Hwang
0c150e5166
Fix copyright messages ( #367 )
2024-10-17 21:25:46 -07:00
Changho Hwang
f8c0bcca2b
Perf optimization & support clipping ( #364 )
...
Co-authored-by: Nusrat Islam <Nusrat.Islam@amd.com >
2024-10-16 14:35:08 -07:00
Changho Hwang
e9294357c5
Fix NCCL API bugs ( #363 )
2024-10-16 14:16:34 -07:00
Caio Rocha
08a0cec2eb
Fixing RegisterMemory Allocation for ProxyChannels ( #353 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-09-24 23:01:41 -07:00
Changho Hwang
8a330f9135
Update ROCm CI ( #357 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-09-20 17:57:02 +00:00
Changho Hwang
74130c7c5e
Use IB transport flags only when an IB device exists ( #355 )
2024-09-19 07:13:11 +00:00
Ziyue Yang
5c4e105814
Fix NPKit exit event offset ( #356 )
2024-09-19 13:35:44 +08:00
Binyang Li
b30bb260e3
Tune threads per block for mscclpp executor ( #345 )
2024-09-18 17:21:47 -07:00
Binyang Li
0c7311e83f
Add CI for rocm ( #346 )
2024-09-15 22:30:54 +00:00
Binyang Li
7bedb25054
Add proxy channel related operations ( #351 )
...
Add Flush, PutWithSignal, and PutWithFlushAndSignal operations
2024-09-15 13:24:57 -07:00
Binyang Li
26a87535f9
Fix bug for construct semaphore ( #341 )
...
Current semaphore construction requires two-way communication, e.g., to
construct a semaphore signaling from rank 0 to rank 1, both rank 0 and
rank 1 need to send a message to each other. This PR fixes an executor
bug that fails to conduct two-way communication for constructing such
one-way semaphores, and instead hangs during the semaphore construction.
In the future, we may need to change the implementation to construct
semaphore via one-way communication.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-09-04 19:42:03 +08:00
Changho Hwang
72b99a4229
Fix for ROCm 6.0 ( #347 )
2024-09-01 20:22:33 -07:00
Caio Rocha
4eca6f1e95
Support executors to send packets over ProxyChannel ( #344 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-08-30 22:10:33 +00:00
Caio Rocha
1af62ea43d
ProxyChannel Support in Executor ( #342 )
...
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-08-27 10:09:44 -07:00
Changho Hwang
1e82dd444f
Make ibverbs optional at compile time ( #340 )
...
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-08-21 12:47:05 -07:00
Roshan Dathathri
7ed13ec4b5
Auto-tune vector sizes for NVLS allreduce6 ( #338 )
...
Also fixes bugs in MscclppAllReduce6
Below is the performance when the algorithm is fixed to
MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.
Float16:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 2.0 KiB | 11.15 | 0.18 | PASS | 13.82 | 0.15 | PASS | 1.24 |
| 4.0 KiB | 11.15 | 0.37 | PASS | 14.74 | 0.28 | PASS | 1.32 |
| 8.0 KiB | 11.14 | 0.74 | PASS | 15.17 | 0.54 | PASS | 1.36 |
| 16.0 KiB | 11.16 | 1.47 | PASS | 15.77 | 1.04 | PASS | 1.41 |
| 32.0 KiB | 11.15 | 2.94 | PASS | 17.50 | 1.87 | PASS | 1.57 |
| 64.0 KiB | 11.18 | 5.86 | PASS | 17.64 | 3.71 | PASS | 1.58 |
| 128.0 KiB | 11.16 | 11.74 | PASS | 17.83 | 7.35 | PASS | 1.60 |
| 256.0 KiB | 11.21 | 23.38 | PASS | 18.00 | 14.57 | PASS | 1.60 |
| 512.0 KiB | 11.70 | 44.81 | PASS | 18.42 | 28.46 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.87 | PASS | 20.23 | 51.83 | PASS | 1.48 |
| 2.0 MiB | 17.29 | 121.27 | PASS | 31.60 | 66.36 | PASS | 1.83 |
| 4.0 MiB | 25.26 | 166.02 | PASS | 38.74 | 108.26 | PASS | 1.53 |
| 8.0 MiB | 40.17 | 208.83 | PASS | 62.86 | 133.45 | PASS | 1.56 |
| 16.0 MiB | 70.92 | 236.56 | PASS | 113.36 | 147.99 | PASS | 1.60 |
| 32.0 MiB | 131.38 | 255.41 | PASS | 203.21 | 165.13 | PASS | 1.55 |
| 64.0 MiB | 253.39 | 264.84 | PASS | 342.12 | 196.15 | PASS | 1.35 |
| 128.0 MiB | 496.74 | 270.20 | PASS | 670.62 | 200.14 | PASS | 1.35 |
| 256.0 MiB | 982.42 | 273.24 | PASS | 1318.36 | 203.61 | PASS | 1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
Float32:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 4.0 KiB | 11.04 | 0.37 | PASS | 14.79 | 0.28 | PASS | 1.34 |
| 8.0 KiB | 11.15 | 0.73 | PASS | 15.25 | 0.54 | PASS | 1.37 |
| 16.0 KiB | 11.12 | 1.47 | PASS | 15.87 | 1.03 | PASS | 1.43 |
| 32.0 KiB | 11.13 | 2.95 | PASS | 17.21 | 1.90 | PASS | 1.55 |
| 64.0 KiB | 11.11 | 5.90 | PASS | 17.37 | 3.77 | PASS | 1.56 |
| 128.0 KiB | 11.08 | 11.83 | PASS | 17.54 | 7.47 | PASS | 1.58 |
| 256.0 KiB | 11.15 | 23.50 | PASS | 17.71 | 14.80 | PASS | 1.59 |
| 512.0 KiB | 11.56 | 45.34 | PASS | 18.21 | 28.79 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.90 | PASS | 19.87 | 52.77 | PASS | 1.46 |
| 2.0 MiB | 17.24 | 121.67 | PASS | 31.63 | 66.30 | PASS | 1.84 |
| 4.0 MiB | 25.19 | 166.47 | PASS | 38.63 | 108.57 | PASS | 1.53 |
| 8.0 MiB | 40.38 | 207.72 | PASS | 62.65 | 133.89 | PASS | 1.55 |
| 16.0 MiB | 70.72 | 237.23 | PASS | 114.57 | 146.44 | PASS | 1.62 |
| 32.0 MiB | 131.49 | 255.18 | PASS | 200.79 | 167.11 | PASS | 1.53 |
| 64.0 MiB | 253.98 | 264.23 | PASS | 342.58 | 195.89 | PASS | 1.35 |
| 128.0 MiB | 496.96 | 270.08 | PASS | 670.64 | 200.13 | PASS | 1.35 |
| 256.0 MiB | 982.83 | 273.12 | PASS | 1318.90 | 203.53 | PASS | 1.34 |
| 512.0 MiB | 1954.07 | 274.75 | PASS | 2609.04 | 205.77 | PASS | 1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
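The derived columns in the tables above can be cross-checked from the raw timings. The formulas below are inferred from the reported numbers, not stated in the commit: AlgBW is assumed to be the message size divided by the elapsed time (in decimal GB/s), and Speed Up is assumed to be the NCCL time divided by the MSCCL++ time. The function names are illustrative only.

```python
def alg_bw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes), assuming
    AlgBW = message size / elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def speed_up(nccl_time_us: float, time_us: float) -> float:
    """Speed-up over NCCL, assuming it is the ratio of the two times."""
    return nccl_time_us / time_us


# 256.0 MiB row of the Float16 table: 982.42 us vs. 1318.36 us for NCCL.
size_bytes = 256 * 1024 * 1024
bw = round(alg_bw_gbps(size_bytes, 982.42), 2)
ratio = round(speed_up(1318.36, 982.42), 2)
print(bw, ratio)
```

Under these assumptions the computed values agree with the 256.0 MiB row of the Float16 table (273.24 GB/s and a speed-up of 1.34).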
2024-08-16 11:11:54 +08:00
Caio Rocha
ead4efc315
Dynamically load libibverbs ( #337 )
2024-08-13 23:48:39 -07:00
Changho Hwang
8c6fb429e9
bfloat16 support ( #336 )
...
* Add bfloat16 support for executor and NCCL interface
* Changed `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00