Pedram Alizadeh
97eaca2bd2
[NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl ( #399 )
2025-01-03 20:38:57 +00:00
Qinghua Zhou
ba0d0d68b8
Enhance the nccl error message handling ( #434 )
...
Add WARN or INFO before returning the nccl error message.
Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.
2025-01-03 00:50:36 +00:00
Binyang Li
3d6bfed2cf
Update version number ( #433 )
...
Co-authored-by: github-actions <github-actions@github.com >
2025-01-02 16:45:08 -08:00
Changho Hwang
dff21905e3
Fix typos in the pipeline ( #420 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2025-01-02 09:49:50 -08:00
Binyang Li
3e7801b1a8
Fix CI trigger issue ( #428 )
2024-12-20 16:27:52 -08:00
Binyang Li
f18a440feb
trigger ci for release branches ( #426 )
2024-12-21 00:05:13 +00:00
Changho Hwang
e2230aab26
Tackle build warnings ( #422 )
...
* Comply with [CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html)
* Tackle other warnings during build
2024-12-19 16:51:50 -08:00
Binyang Li
6fedb7c0e8
Fix nccl-test failure issue ( #421 )
2024-12-19 12:07:00 -08:00
Binyang Li
776f24e787
Update README ( #414 )
2024-12-19 05:54:27 +00:00
SreevatsaAnantharamu
0c7ed2c674
Add ncclBcast / ncclBroadcast support ( #419 )
...
A simple broadcast using scratch buffer and option to use an executor.
2024-12-19 01:16:30 +00:00
David Sidler
d8d0dfbffa
Fix synchronization in allreduce8 kernel ( #407 )
...
Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a
synchronization bug. The PR addresses it by ensuring that signals are
only issued after all threads in the block have issued their writes to
guarantee correct ordering between data writes and signal writes.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-12-18 17:10:22 -08:00
Caio Rocha
774602d49c
Supporting Executor multi node in NCCL API ( #412 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-12-18 15:50:58 -08:00
Binyang Li
fcb2e46cb1
NVLS support for NCCL API ( #410 )
...
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-12-18 09:55:35 +00:00
Binyang Li
863a599360
Disable CuMemMap check for ROCm ( #411 )
...
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-12-17 08:36:25 +00:00
Binyang Li
c65f19ad1a
Move pipeline to official org ( #406 )
...
Move pipeline to official org. Unify all pipelines
2024-12-16 09:43:00 -08:00
Binyang Li
ee75caf365
Reduce memory usage for scratch buffer ( #403 )
...
In the executor, we allocate the scratch buffer based on `sendMemRange`.
However, for certain execution plans, this allocation may be unsuitable,
as the plan does not support messages of this size.
To avoid allocating too much memory and causing an OOM error, set the
scratch buffer size to
`min(scratchBufferSize(maxMessageSizeSupportedForPlan), scratchBufferSize(sendMemRange))`.
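The capping rule above can be sketched as follows. This is a minimal illustration, not the executor's actual code: the linear `scratchBufferSize` model and the `kScratchFactor` constant are assumptions made only so the example is self-contained, and `cappedScratchBufferSize` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>

// Assumption for illustration: scratch space grows linearly with message size.
// The real executor derives this from the execution plan.
constexpr std::size_t kScratchFactor = 2;

inline std::size_t scratchBufferSize(std::size_t messageSize) {
  // Guard against overflow in the multiplication below.
  assert(messageSize <= std::numeric_limits<std::size_t>::max() / kScratchFactor);
  return messageSize * kScratchFactor;
}

// Hypothetical helper: never allocate more scratch space than the largest
// message the plan supports would need, even if sendMemRange is larger.
inline std::size_t cappedScratchBufferSize(std::size_t sendMemRange,
                                           std::size_t maxMessageSizeSupportedForPlan) {
  return std::min(scratchBufferSize(maxMessageSizeSupportedForPlan),
                  scratchBufferSize(sendMemRange));
}
```

With a 1 GiB `sendMemRange` and a plan limited to 1 MiB messages, the result is the scratch size for 1 MiB rather than for 1 GiB.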
2024-12-13 13:00:04 -08:00
Caio Rocha
01fd813f1b
Exception Max Number Operation per Tb ( #405 )
2024-12-11 16:06:15 -08:00
Binyang Li
7a3dcb0627
Setup pipeline for mscclpp over nccl ( #401 )
...
Setup pipeline for mscclpp over nccl
Run `all_reduce_perf` via nccl API
2024-12-07 08:57:45 -08:00
Changho Hwang
756f24c697
Revised ProxyChannel interfaces ( #400 )
...
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
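The resulting layering can be sketched as below. All type names ending in `Sketch`, the `MemoryId` alias, and the member functions are hypothetical stand-ins chosen for this illustration; they only mirror the "sema + fifo" versus "sema + src/dst + fifo" description and are not the library's definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

using MemoryId = std::uint32_t;  // hypothetical identifier for a registered memory region

struct SemaphoreSketch { std::uint64_t signals = 0; };
struct FifoSketch { std::uint64_t requests = 0; };

// "sema + fifo": not associated with any memory region.
struct BaseProxyChannelSketch {
  SemaphoreSketch sema;
  FifoSketch fifo;
  // Queue a signal request; no src/dst is involved.
  void signal() { ++fifo.requests; }
};

// "sema + src/dst + fifo": inherits the base instead of holding it as a member.
struct ProxyChannelSketch : BaseProxyChannelSketch {
  MemoryId dst;
  MemoryId src;
  ProxyChannelSketch(MemoryId dstId, MemoryId srcId) : dst(dstId), src(srcId) {}
  // Queue a copy request between the bound regions.
  void put(std::uint64_t bytes) {
    assert(bytes > 0);
    ++fifo.requests;
  }
};

// "sema + src/dst": no FIFO.
struct SmChannelSketch {
  SemaphoreSketch sema;
  MemoryId dst;
  MemoryId src;
};

static_assert(std::is_base_of<BaseProxyChannelSketch, ProxyChannelSketch>::value,
              "the region-bound channel is-a base channel");
```

Because of the inheritance, base operations such as `signal()` are callable directly on the region-bound channel without going through a member.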
2024-12-06 10:53:34 -08:00
Ziyue Yang
f6305a3c1d
Add connection events for NPKit ( #386 )
2024-12-05 00:06:37 +08:00
Binyang Li
88d28e07a7
Select algo according to json config ( #396 )
...
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root \
  -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so \
  -x NCCL_DEBUG=WARN \
  -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files \
  /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20
2024-12-03 22:39:20 +00:00
Caio Rocha
ff18bb8d0b
Providing reduce-scatter test support ( #390 )
2024-11-28 09:19:30 -08:00
Caio Rocha
d9c297ba14
AllGather Executor Support in NCCL Interface ( #393 )
...
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-11-27 17:05:51 -08:00
Binyang Li
593478e1b7
Add cross threadblock barrier ( #383 )
2024-11-26 05:13:30 +00:00
Binyang Li
1b8d020650
Fix mscclpp_benchmark ( #392 )
...
Enable 1GB message size for NVLS transport in mscclpp_benchmark
2024-11-25 19:59:51 +00:00
Caio Rocha
93628d2066
Fixing Message Boundary AllReduce Fallback Code ( #391 )
2024-11-23 12:15:56 -08:00
Changho Hwang
2127a3ba29
Improve CMake options ( #376 )
...
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-11-22 01:54:11 +00:00
Binyang Li
db8e187407
Fix typo ( #389 )
2024-11-21 22:45:50 +00:00
Binyang Li
28a57b0610
NVLS support for msccl++ executor ( #375 )
...
- Support more datatypes for multicast operations
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify `allocSharedPhysicalCuda` to return `std::shared_ptr<T>`
instead of `std::shared_ptr<PhysicalCudaMemory>`
- Add Python support for allocSharedPhysicalCuda
Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Ziyue Yang
3e51e9b359
Fix missing packet parameter for executor ( #385 )
2024-11-19 08:36:37 +08:00
Caio Rocha
b3dc74c020
Small Adjust in Test Data AllGather at Executor Test ( #384 )
2024-11-16 15:21:00 +08:00
Binyang Li
1baea89fa0
Fix light load bug ( #379 )
...
Fix lightLoadExecutionPlan issue.
An execution context may have multiple device execution plans. These plans
share the previously constructed channel connections.
A deviceExecutionPlanKey is introduced to identify these plans. We can
get the current device execution plan key via:
`contexts.currentDevicePlan`
2024-11-13 07:58:43 +00:00
Caio Rocha
d5d608abdc
Fixing Bug Const Offset in Execution Plan ( #380 )
...
The offset was not differentiating between the buffer types, causing the
offset to be incorrect when the buffer type was not `SCRATCH`.
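One way to picture the fix is an offset computation that depends on the buffer type. The commit does not say how offsets are derived, so the chunk-size-based model, the `ChunkSizes` struct, and both helper functions below are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>

enum class BufferType { INPUT, OUTPUT, SCRATCH };

// Hypothetical per-buffer chunk sizes in bytes.
struct ChunkSizes {
  std::size_t input;
  std::size_t output;
  std::size_t scratch;
};

inline std::size_t chunkSizeFor(const ChunkSizes& sizes, BufferType type) {
  switch (type) {
    case BufferType::INPUT: return sizes.input;
    case BufferType::OUTPUT: return sizes.output;
    case BufferType::SCRATCH: return sizes.scratch;
  }
  assert(false && "unknown buffer type");
  return 0;
}

// The offset uses the chunk size of the buffer being addressed, rather than
// always using the scratch buffer's chunk size.
inline std::size_t byteOffset(const ChunkSizes& sizes, BufferType type, std::size_t chunkIndex) {
  return chunkIndex * chunkSizeFor(sizes, type);
}
```

In this model, applying the scratch chunk size to an input buffer would produce an offset that is wrong whenever the two sizes differ.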
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-11-11 20:02:02 -08:00
Changho Hwang
85fdde7a73
Lazily create the context stream ( #381 )
...
Create the context stream only when needed.
2024-11-11 10:39:32 +08:00
Ziyue Yang
9526d76fc7
Add kernel-based verification for executor_test ( #378 )
...
Add kernels to fill and test data for correctness test in
executor_test.py.
2024-11-07 14:14:20 +08:00
Jeff Rasley
449c274326
[docs] fix quickstart link ( #374 )
...
Small fix to update quickstart link
2024-10-30 13:13:33 +08:00
Ziyue Yang
95ab1088ef
Fix in-place all-gather input buffer in executor_test ( #372 )
2024-10-24 23:04:11 +08:00
Binyang Li
b72decbfeb
Update docker image for cuda12.4 ( #370 )
...
Update docker image for cuda12.4
Image pushed to registry
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-10-22 12:51:28 +08:00
Binyang Li
582d386b3b
Fix algo repo name ( #369 )
...
Change algo repo name from azure-mscclpp to msccl-users
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-10-22 10:59:15 +08:00
Caio Rocha
c6e06cfad7
Executor AllGather In-Place Support ( #365 )
2024-10-21 05:45:56 -07:00
Binyang Li
4136153a76
[Doc] mscclpp docs ( #348 )
...
Generate docs for mscclpp.
Set up a GitHub Action to auto-deploy to GitHub Pages.
Docs link: https://microsoft.github.io/mscclpp
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-10-18 06:08:31 +00:00
Changho Hwang
0c150e5166
Fix copyright messages ( #367 )
2024-10-17 21:25:46 -07:00
Changho Hwang
f8c0bcca2b
Perf optimization & support clipping ( #364 )
...
Co-authored-by: Nusrat Islam <Nusrat.Islam@amd.com >
2024-10-16 14:35:08 -07:00
Changho Hwang
e9294357c5
Fix NCCL API bugs ( #363 )
2024-10-16 14:16:34 -07:00
Caio Rocha
08a0cec2eb
Fixing RegisterMemory Allocation for ProxyChannels ( #353 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-09-24 23:01:41 -07:00
Changho Hwang
8a330f9135
Update ROCm CI ( #357 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-09-20 17:57:02 +00:00
Changho Hwang
74130c7c5e
Use IB transport flags only when an IB device exists ( #355 )
2024-09-19 07:13:11 +00:00
Ziyue Yang
5c4e105814
Fix NPKit exit event offset ( #356 )
2024-09-19 13:35:44 +08:00
Binyang Li
b30bb260e3
Tune threads per block for mscclpp executor ( #345 )
2024-09-18 17:21:47 -07:00
Binyang Li
0c7311e83f
Add CI for rocm ( #346 )
2024-09-15 22:30:54 +00:00