Pedram Alizadeh
97eaca2bd2
[NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399)
2025-01-03 20:38:57 +00:00
Qinghua Zhou
ba0d0d68b8
Enhance the nccl error message handling (#434)
Add WARN or INFO before returning the nccl error message.
Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.
2025-01-03 00:50:36 +00:00
Changho Hwang
e2230aab26
Tackle build warnings (#422)
* Comply with [CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html)
* Tackle other warnings during build
2024-12-19 16:51:50 -08:00
SreevatsaAnantharamu
0c7ed2c674
Add ncclBcast / ncclBroadcast support (#419)
A simple broadcast using a scratch buffer, with an option to use an executor.
2024-12-19 01:16:30 +00:00
David Sidler
d8d0dfbffa
Fix synchronization in allreduce8 kernel (#407)
Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a
synchronization bug. The PR addresses it by ensuring that signals are
only issued after all threads in the block have issued their writes to
guarantee correct ordering between data writes and signal writes.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 17:10:22 -08:00
Caio Rocha
774602d49c
Supporting Executor multi node in NCCL API (#412)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-12-18 15:50:58 -08:00
Binyang Li
fcb2e46cb1
NVLS support for NCCL API (#410)
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 09:55:35 +00:00
Binyang Li
88d28e07a7
Select algo according to json config (#396)
To run nccl-tests over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root \
  -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so \
  -x NCCL_DEBUG=WARN \
  -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files \
  /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20
2024-12-03 22:39:20 +00:00
Caio Rocha
d9c297ba14
AllGather Executor Support in NCCL Interface (#393)
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-27 17:05:51 -08:00
Caio Rocha
93628d2066
Fixing Message Boundary AllReduce Fallback Code (#391)
2024-11-23 12:15:56 -08:00
Changho Hwang
2127a3ba29
Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in the README
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Binyang Li
28a57b0610
NVLS support for msccl++ executor (#375)
- Support more datatypes for multicast operations
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda so that it returns std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda
Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Binyang Li
4136153a76
[Doc] mscclpp docs (#348)
Generate docs for mscclpp.
Set up a GitHub Action to auto-deploy GitHub Pages.
Docs link: https://microsoft.github.io/mscclpp
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-10-18 06:08:31 +00:00
Changho Hwang
f8c0bcca2b
Perf optimization & support clipping (#364)
Co-authored-by: Nusrat Islam <Nusrat.Islam@amd.com>
2024-10-16 14:35:08 -07:00
Changho Hwang
e9294357c5
Fix NCCL API bugs (#363)
2024-10-16 14:16:34 -07:00
Binyang Li
b30bb260e3
Tune threads per block for mscclpp executor (#345)
2024-09-18 17:21:47 -07:00
Changho Hwang
1e82dd444f
Make ibverbs optional at compile time (#340)
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-08-21 12:47:05 -07:00
Changho Hwang
8c6fb429e9
bfloat16 support (#336)
* Add bfloat16 support for executor and NCCL interface
* Change `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00
caiomcbr
67eb9b04cc
NCCL API Executor Integration (#331)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-25 15:05:02 -07:00
caiomcbr
7493e2f075
Double buffering for NCCL APIs (#324)
Use two scratch buffers on each peer to exchange data.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-15 22:18:53 +00:00
Changho Hwang
c4ca2fbc8c
Resolve clang++ warnings (#325)
2024-07-11 07:48:35 +00:00
caiomcbr
f4c3c8f916
AllReduce Kernel for Small Messages (#322)
Add allreduce kernel code for message sizes smaller than 32 bytes,
when the number of elements is smaller than the number of ranks.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-05 21:08:43 +00:00
caiomcbr
b1b9d0626c
Support NCCL APIs (#319)
Start supporting NCCL APIs with a few limitations.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-06-27 23:54:06 +00:00