Commit Graph

396 Commits

Author SHA1 Message Date
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
6fedb7c0e8 Fix nccl-test failure issue (#421) 2024-12-19 12:07:00 -08:00
Binyang Li
fcb2e46cb1 NVLS support for NCCL API (#410)
Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-18 09:55:35 +00:00
Binyang Li
863a599360 Disable CuMemMap check for ROCm (#411)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-12-17 08:36:25 +00:00
Binyang Li
ee75caf365 Reduce memory usage for scratch buffer (#403)
In the executor, we allocate the scratch buffer based on `sendMemRange`.
However, for certain execution plans, this allocation may be unsuitable,
as the plan does not support messages of this size.

To avoid allocating too much memory and causing an OOM error, set the
scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan),
scratchBufferSize(sendMemRange))`
2024-12-13 13:00:04 -08:00
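The sizing rule described in #403 can be sketched in Python as follows. This is an illustration only, not the executor's code: `scratch_buffer_size` is a hypothetical stand-in for the plan-specific mapping from message size to scratch size (assumed here to be a simple linear factor), and the parameter names are assumptions.

```python
def scratch_buffer_size(message_size: int, factor: int = 2) -> int:
    """Hypothetical stand-in: scratch bytes needed for a message of this size."""
    return message_size * factor


def capped_scratch_size(send_mem_range: int, max_message_size_for_plan: int) -> int:
    """Take the smaller of the two candidate sizes, as the commit message describes."""
    return min(
        scratch_buffer_size(max_message_size_for_plan),
        scratch_buffer_size(send_mem_range),
    )


if __name__ == "__main__":
    # A 1 GiB send range, but the plan only supports messages up to 1 MiB:
    # the scratch buffer is sized for the plan limit, not the full range.
    print(capped_scratch_size(1 << 30, 1 << 20))
```

With a small send range the second term wins instead, so the cap never enlarges the allocation.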
Caio Rocha
01fd813f1b Exception Max Number Operation per Tb (#405) 2024-12-11 16:06:15 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
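The relationship between the three channel types after #400 can be pictured with a small Python model. This is a sketch of the composition only ("sema + fifo", "sema + src/dst + fifo", "sema + src/dst"); the field names and string placeholders are assumptions, and the real classes are C++ types with different members.

```python
from dataclasses import dataclass


@dataclass
class BaseProxyChannel:
    # sema + fifo: not associated with any memory region
    semaphore: str
    fifo: str


@dataclass
class ProxyChannel(BaseProxyChannel):
    # sema + src/dst + fifo: inherits the base rather than holding one as a member
    src: str
    dst: str


@dataclass
class SmChannel:
    # sema + src/dst: no fifo
    semaphore: str
    src: str
    dst: str


if __name__ == "__main__":
    ch = ProxyChannel(semaphore="sema", fifo="fifo", src="src", dst="dst")
    # Base-class fields are reachable directly on the derived object.
    print(isinstance(ch, BaseProxyChannel), ch.fifo)
```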
Ziyue Yang
f6305a3c1d Add connection events for NPKit (#386) 2024-12-05 00:06:37 +08:00
Binyang Li
88d28e07a7 Select algo according to json config (#396)
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root -x
LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN
-x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files
/root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w
10 -n 20
2024-12-03 22:39:20 +00:00
Binyang Li
593478e1b7 Add cross threadblock barrier (#383) 2024-11-26 05:13:30 +00:00
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Binyang Li
28a57b0610 NVLS support for msccl++ executor (#375)
- Support more datatypes for multicast operation
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda

Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Ziyue Yang
3e51e9b359 Fix missing packet parameter for executor (#385) 2024-11-19 08:36:37 +08:00
Binyang Li
1baea89fa0 Fix light load bug (#379)
Fix lightLoadExecutionPlan issue.
An execution context may have multiple device execution plans. These plans
share the channel connections that were constructed earlier.
A deviceExecutionPlanKey is introduced to identify these plans. We can
get the current device execution plan key via:
`contexts.currentDevicePlan`
2024-11-13 07:58:43 +00:00
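The idea in #379 (several device plans per context, identified by a key, all sharing one set of connections) can be sketched as below. This is a hypothetical Python model, not the executor's data structures: the class, the `select_plan` helper, and the use of a tuple as the key are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Hashable, List, Optional


@dataclass
class ExecutionContext:
    # Connections are built once and shared by every device plan.
    connections: List[Any]
    # Device plans, looked up by their key.
    device_plans: Dict[Hashable, Any] = field(default_factory=dict)
    # Key of the plan currently in use.
    current_device_plan: Optional[Hashable] = None

    def select_plan(self, key: Hashable, build_plan: Callable[[List[Any]], Any]) -> Any:
        """Reuse the plan stored under `key`, building it on first use."""
        if key not in self.device_plans:
            self.device_plans[key] = build_plan(self.connections)
        self.current_device_plan = key
        return self.device_plans[key]


if __name__ == "__main__":
    ctx = ExecutionContext(connections=["conn0", "conn1"])
    plan = ctx.select_plan((1024,), lambda conns: {"conns": conns})
    print(ctx.current_device_plan, plan["conns"] is ctx.connections)
```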
Caio Rocha
d5d608abdc Fixing Bug Const Offset in Execution Plan (#380)
The offset was not differentiating between the buffer types, causing the
offset to be incorrect when the buffer type was not `SCRATCH`.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-11-11 20:02:02 -08:00
Changho Hwang
85fdde7a73 Lazily create the context stream (#381)
Create the context stream only when needed.
2024-11-11 10:39:32 +08:00
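"Create the context stream only when needed" is the usual lazy-initialization pattern. A minimal Python sketch follows, assuming a hypothetical `create_stream` factory in place of the real GPU stream creation call; the class and attribute names are illustrative.

```python
class Context:
    def __init__(self, create_stream):
        # `create_stream` is a hypothetical factory standing in for stream creation.
        self._create_stream = create_stream
        self._stream = None  # nothing is created at construction time

    @property
    def stream(self):
        # Create the stream on first access and reuse it afterwards.
        if self._stream is None:
            self._stream = self._create_stream()
        return self._stream


if __name__ == "__main__":
    created = []
    ctx = Context(lambda: created.append("stream") or object())
    print(len(created))  # no stream exists yet
    first = ctx.stream
    print(len(created), ctx.stream is first)
```

Code paths that never touch `stream` pay no creation cost.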
Caio Rocha
c6e06cfad7 Executor AllGather In-Place Support (#365) 2024-10-21 05:45:56 -07:00
Changho Hwang
0c150e5166 Fix copyright messages (#367) 2024-10-17 21:25:46 -07:00
Caio Rocha
08a0cec2eb Fixing RegisterMemory Allocation for ProxyChannels (#353)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-09-24 23:01:41 -07:00
Ziyue Yang
5c4e105814 Fix NPKit exit event offset (#356) 2024-09-19 13:35:44 +08:00
Binyang Li
b30bb260e3 Tune threads per block for mscclpp executor (#345) 2024-09-18 17:21:47 -07:00
Binyang Li
7bedb25054 Add proxy channel related operations (#351)
Add Flush, PutWithSignal, and PutWithFlushAndSignal operations
2024-09-15 13:24:57 -07:00
Binyang Li
26a87535f9 Fix bug for constructing semaphore (#341)
Current semaphore construction requires two-way communication, e.g., to
construct a semaphore signaling from rank 0 to rank 1, both rank 0 and
rank 1 need to send a message to each other. This PR fixes an executor
bug that fails to conduct two-way communication for constructing such
one-way semaphores, and instead hangs during the semaphore construction.
In the future, we may need to change the implementation to construct
semaphore via one-way communication.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-09-04 19:42:03 +08:00
Changho Hwang
72b99a4229 Fix for ROCm 6.0 (#347) 2024-09-01 20:22:33 -07:00
Caio Rocha
4eca6f1e95 Support executors to send packets over ProxyChannel (#344)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-08-30 22:10:33 +00:00
Caio Rocha
1af62ea43d ProxyChannel Support in Executor (#342)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-08-27 10:09:44 -07:00
Changho Hwang
1e82dd444f Make ibverbs optional at compile time (#340)
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-08-21 12:47:05 -07:00
Caio Rocha
ead4efc315 Dynamically load libibverbs (#337) 2024-08-13 23:48:39 -07:00
Changho Hwang
8c6fb429e9 bfloat16 support (#336)
* Add bfloat16 support for executor and NCCL interface
* Changed `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00
caiomcbr
67eb9b04cc NCCL API Executor Integration (#331)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-25 15:05:02 -07:00
Roshan Dathathri
f131fae3ec Add support for different vector sizes in multimem instructions (#332) 2024-07-25 10:14:02 -07:00
Ziyue Yang
b5a48f836c Separate NPKit CPU timestamp access from different blocks for AMD platform (#321)
Reference: https://github.com/ROCm/rccl/pull/1229
2024-07-02 19:36:48 +08:00
Ziyue Yang
f29095b3b1 Fix NPKit support for AMD (#312) 2024-06-14 16:22:14 +08:00
Ziyue Yang
76328fe623 Add NPKit GPU event support (#310) 2024-06-13 13:59:50 +08:00
Binyang Li
80aefe55bc Cumulative Updates (#309)
Bug fix: Unable to execute communication primitives with the same
execution plan but varying message sizes.
Add reduce_packets OP
2024-06-12 19:17:57 +08:00
Changho Hwang
1f62dfd7cd Add C++ executor test (#304)
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-05-29 10:54:36 +00:00
Binyang Li
3a18068cd4 Fix security issue (#305)
Change sprintf to snprintf to avoid potential security issue
2024-05-25 23:12:57 -07:00
Binyang Li
6226556ce2 Optimized the execution kernel (#294) 2024-05-03 11:54:50 -07:00
Binyang Li
5628362715 Resolve multi-nodes test failure issue (#295)
Fix bug, resolve multi-nodes test failure issue.
2024-04-26 13:06:57 +08:00
Changho Hwang
d4ede480f4 Ethernet support (#284)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-04-25 11:06:43 -07:00
Changho Hwang
89896ff94f Include GPU data types only for kernel code (#292) 2024-04-24 20:55:02 -07:00
Changho Hwang
6c1fa5307c Refactoring NVLS interfaces (#293)
Move NVLS details from the core to a separate interface
2024-04-24 10:05:41 -07:00
Changho Hwang
9934c982a8 Separate headers for GPU data types (#291)
Prevent unnecessarily including data type headers everywhere.
2024-04-19 05:52:43 +00:00
Roshan Dathathri
41e0964d93 Allow binding allocated memory to NVLS multicast pointer (#290)
And change NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab Add executor to execute schedule-plan file (#283)
Add executor to execute the JSON schedule file generated by msccl-tools

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711 Fix a typo name (#286) 2024-04-17 23:45:46 +00:00
Changho Hwang
5ba6ce00c7 Fix bootstrapping mechanism (#278)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Changho Hwang
d34e097b40 Fix wrong offset calculation (#257) 2024-02-06 08:55:43 +08:00
Saeed Maleki
91d592dcc0 NVLS support. (#250)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
Binyang Li
163cba08c8 Update interface to let user change fifo size (#243)
Related to this issue:
https://github.com/microsoft/mscclpp/issues/242. The user may use more
threads than the number specified in `fifo_size` to interact with the
FIFO, which leads to unexpected behavior.
Update the interface to let users change the FIFO size as needed.
2024-01-09 22:14:36 -08:00