Commit Graph

703 Commits

Author SHA1 Message Date
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
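As a rough illustration of what a class like the `Env` from #452 might do, the sketch below reads `MSCCLPP_NPKIT_DUMP_DIR` once at construction time. The field name `npkitDumpDir`, the helper `readString`, and the empty-string default are assumptions made for this example, not the actual mscclpp interface.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch of an environment holder. Only the variable name
// MSCCLPP_NPKIT_DUMP_DIR comes from the commit message; everything else
// here is illustrative.
struct Env {
  // Directory for NPKit dumps; empty when the variable is unset (assumed default).
  std::string npkitDumpDir;

  Env() : npkitDumpDir(readString("MSCCLPP_NPKIT_DUMP_DIR", "")) {}

 private:
  // Return the value of the environment variable `name`, or `fallback` if unset.
  static std::string readString(const char* name, const std::string& fallback) {
    const char* value = std::getenv(name);
    return value != nullptr ? std::string(value) : fallback;
  }
};
```

Reading every variable in one place means a rename such as `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR` only has to be made once.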
Binyang Li
8ac50dc85d Resolve cuMemMap error (#451)
* Updated `RegisteredMemory::Impl::Impl(const std::vector<char>&
serialization)` to use both minimum and recommended granularities for
memory address reservation and mapping. This will resolve the cuMemMap
error.
2025-01-10 14:22:14 -08:00
Changho Hwang
2b54af7e27 Auto-update version numbers in CMakeLists.txt (#450) 2025-01-09 17:54:10 -08:00
Changho Hwang
f2b52c6318 Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
2025-01-09 11:58:23 -08:00
Caio Rocha
80abce59ef Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-01-08 09:02:36 -08:00
Changho Hwang
1989d4be9c Fix CMake build messages (#443) 2025-01-08 02:44:01 +00:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
6d26b92665 Fix azure pipeline (#437) 2025-01-04 19:41:10 -08:00
Pedram Alizadeh
97eaca2bd2 [NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399) 2025-01-03 20:38:57 +00:00
Qinghua Zhou
ba0d0d68b8 Enhance the nccl error message handling (#434)
Add WARN or INFO before returning the nccl error message.
Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.
2025-01-03 00:50:36 +00:00
Binyang Li
3d6bfed2cf Update version number (#433)
Co-authored-by: github-actions <github-actions@github.com>
2025-01-02 16:45:08 -08:00
Changho Hwang
dff21905e3 Fix typos in the pipeline (#420)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-01-02 09:49:50 -08:00
Binyang Li
3e7801b1a8 Fix CI trigger issue (#428) 2024-12-20 16:27:52 -08:00
Binyang Li
f18a440feb trigger ci for release branches (#426) 2024-12-21 00:05:13 +00:00
Changho Hwang
e2230aab26 Tackle build warnings (#422)
* Comply with
[CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html)
* Tackle other warnings during build
2024-12-19 16:51:50 -08:00
Binyang Li
6fedb7c0e8 Fix nccl-test failure issue (#421) 2024-12-19 12:07:00 -08:00
Binyang Li
776f24e787 update README (#414) 2024-12-19 05:54:27 +00:00
SreevatsaAnantharamu
0c7ed2c674 Add ncclBcast / ncclBroadcast support (#419)
A simple broadcast using a scratch buffer, with an option to use an executor.
2024-12-19 01:16:30 +00:00
David Sidler
d8d0dfbffa Fix synchronization in allreduce8 kernel (#407)
Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a
synchronization bug. The PR addresses it by ensuring that signals are
only issued after all threads in the block have issued their writes to
guarantee correct ordering between data writes and signal writes.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 17:10:22 -08:00
Caio Rocha
774602d49c Supporting Executor multi node in NCCL API (#412)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-12-18 15:50:58 -08:00
Binyang Li
fcb2e46cb1 NVLS support for NCCL API (#410)
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 09:55:35 +00:00
Binyang Li
863a599360 Disable CuMemMap check for ROCm (#411)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-17 08:36:25 +00:00
Binyang Li
c65f19ad1a Move pipeline to official org (#406)
Move pipeline to official org. Unify all pipelines
2024-12-16 09:43:00 -08:00
Binyang Li
ee75caf365 Reduce memory usage for scratch buffer (#403)
In the executor, we allocate the scratch buffer based on `sendMemRange`.
However, for certain execution plans, this allocation may be unsuitable,
as the plan does not support messages of this size.

To avoid allocating too much memory and causing an OOM error, set the
scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan),
scratchBufferSize(sendMemRange))`
2024-12-13 13:00:04 -08:00
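The sizing rule described in #403 can be sketched as follows. The `min(...)` expression is taken from the commit message; the body of `scratchBufferSize` (a factor of 2) and the wrapper name `cappedScratchBufferSize` are placeholders assumed for this example, since the commit does not say how the executor derives the scratch size from a message size.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the executor's sizing function. The factor of 2
// is an arbitrary assumption used only to make the example concrete.
inline std::size_t scratchBufferSize(std::size_t messageSize) {
  return 2 * messageSize;
}

// Cap the scratch allocation by the largest message the plan supports, so a
// large sendMemRange does not force an oversized allocation.
inline std::size_t cappedScratchBufferSize(std::size_t sendMemRange,
                                           std::size_t maxMessageSizeSupportedForPlan) {
  return std::min(scratchBufferSize(maxMessageSizeSupportedForPlan),
                  scratchBufferSize(sendMemRange));
}
```

With this placeholder, a plan that supports at most 256 bytes gets a 512-byte scratch buffer even when `sendMemRange` is 1024 bytes.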
Caio Rocha
01fd813f1b Exception Max Number Operation per Tb (#405) 2024-12-11 16:06:15 -08:00
Binyang Li
7a3dcb0627 Setup pipeline for mscclpp over nccl (#401)
Setup pipeline for mscclpp over nccl
Run `all_reduce_perf` via nccl API
2024-12-07 08:57:45 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
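The structural change to the device handles in #400 (inheritance instead of a member) might look roughly like the sketch below. Only the two type names come from the commit message; the integer ID fields and the counting `signal()` method are simplified placeholders, not the real mscclpp definitions.

```cpp
#include <cstdint>
#include <type_traits>

// "sema + fifo": not tied to any memory region. Fields are hypothetical.
struct BaseProxyChannelDeviceHandle {
  std::uint32_t semaphoreId = 0;
  std::uint32_t fifoId = 0;
  std::uint32_t signalCount = 0;

  // Placeholder operation; here it only counts how often it was called.
  void signal() { ++signalCount; }
};

// "sema + src/dst + fifo": inherits the base operations instead of
// forwarding to a base-handle member.
struct ProxyChannelDeviceHandle : BaseProxyChannelDeviceHandle {
  std::uint32_t srcMemoryId = 0;  // hypothetical source region ID
  std::uint32_t dstMemoryId = 0;  // hypothetical destination region ID
};

static_assert(std::is_base_of<BaseProxyChannelDeviceHandle,
                              ProxyChannelDeviceHandle>::value,
              "ProxyChannelDeviceHandle derives from the base handle");
```

With inheritance, a call such as `handle.signal()` works directly on a `ProxyChannelDeviceHandle`, and the handle can be passed wherever a `BaseProxyChannelDeviceHandle&` is expected.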
Ziyue Yang
f6305a3c1d Add connection events for NPKit (#386) 2024-12-05 00:06:37 +08:00
Binyang Li
88d28e07a7 Select algo according to json config (#396)
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root \
  -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so \
  -x NCCL_DEBUG=WARN \
  -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files \
  /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20
2024-12-03 22:39:20 +00:00
Caio Rocha
ff18bb8d0b Providing reduce-scatter test support (#390) 2024-11-28 09:19:30 -08:00
Caio Rocha
d9c297ba14 AllGather Executor Support in NCCL Interface (#393)
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-27 17:05:51 -08:00
Binyang Li
593478e1b7 Add cross threadblock barrier (#383) 2024-11-26 05:13:30 +00:00
Binyang Li
1b8d020650 Fix mscclpp_benchmark (#392)
Enable 1GB message size for NVLS transport in mscclpp_benchmark
2024-11-25 19:59:51 +00:00
Caio Rocha
93628d2066 Fixing Message Boundary AllReduce Fallback Code (#391) 2024-11-23 12:15:56 -08:00
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Binyang Li
db8e187407 Fix typo (#389) 2024-11-21 22:45:50 +00:00
Binyang Li
28a57b0610 NVLS support for msccl++ executor (#375)
- Support more datatypes for multicast operations
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda

Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Ziyue Yang
3e51e9b359 Fix missing packet parameter for executor (#385) 2024-11-19 08:36:37 +08:00
Caio Rocha
b3dc74c020 Small Adjust in Test Data AllGather at Executor Test (#384) 2024-11-16 15:21:00 +08:00
Binyang Li
1baea89fa0 Fix light load bug (#379)
Fix the lightLoadExecutionPlan issue.
An execution context may have multiple device execution plans. These plans
share the channel connections that were constructed earlier.
A deviceExecutionPlanKey is introduced to identify these plans. We can
get the current device execution plan key via:
`contexts.currentDevicePlan`
2024-11-13 07:58:43 +00:00
Caio Rocha
d5d608abdc Fixing Bug Const Offset in Execution Plan (#380)
The offset was not differentiating between the buffer types, causing the
offset to be incorrect when the buffer type was not `SCRATCH`.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-11-11 20:02:02 -08:00
Changho Hwang
85fdde7a73 Lazily create the context stream (#381)
Create the context stream only when needed.
2024-11-11 10:39:32 +08:00
Ziyue Yang
9526d76fc7 Add kernel-based verification for executor_test (#378)
Add kernels to fill and test data for correctness test in
executor_test.py.
2024-11-07 14:14:20 +08:00
Jeff Rasley
449c274326 [docs] fix quickstart link (#374)
Small fix to update quickstart link
2024-10-30 13:13:33 +08:00
Ziyue Yang
95ab1088ef Fix in-place all-gather input buffer in executor_test (#372) 2024-10-24 23:04:11 +08:00
Binyang Li
b72decbfeb Update docker image for cuda12.4 (#370)
Update docker image for cuda12.4
Image pushed to registry

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-10-22 12:51:28 +08:00
Binyang Li
582d386b3b Fix algo repo name (#369)
Change algo repo name from azure-mscclpp to msccl-users

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-10-22 10:59:15 +08:00
Caio Rocha
c6e06cfad7 Executor AllGather In-Place Support (#365) 2024-10-21 05:45:56 -07:00
Binyang Li
4136153a76 [Doc] mscclpp docs (#348)
Generate docs for mscclpp.
Set up a GitHub Action to auto-deploy the GitHub Pages site.
Doc link: https://microsoft.github.io/mscclpp

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-10-18 06:08:31 +00:00
Changho Hwang
0c150e5166 Fix copyright messages (#367) 2024-10-17 21:25:46 -07:00