Commit Graph

96 Commits

Author SHA1 Message Date
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Changho Hwang
f2b52c6318 Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
2025-01-09 11:58:23 -08:00
Caio Rocha
80abce59ef Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-01-08 09:02:36 -08:00
Changho Hwang
34945fb107 Add GpuBuffer class (#423)
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Pedram Alizadeh
97eaca2bd2 [NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399) 2025-01-03 20:38:57 +00:00
Binyang Li
3d6bfed2cf Update version number (#433)
Co-authored-by: github-actions <github-actions@github.com>
2025-01-02 16:45:08 -08:00
Changho Hwang
e2230aab26 Tackle build warnings (#422)
* Comply with
[CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html)
* Tackle other warnings during build
2024-12-19 16:51:50 -08:00
Changho Hwang
756f24c697 Revised ProxyChannel interfaces (#400)
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. It makes the interface more consistent by defining
channels to be associated with a certain src/dst memory region:
`ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema +
src/dst". BaseProxyChannel is not associated with any memory regions, as
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Ziyue Yang
f6305a3c1d Add connection events for NPKit (#386) 2024-12-05 00:06:37 +08:00
Binyang Li
88d28e07a7 Select algo according to json config (#396)
The way to run nccl-test over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root -x
LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN
-x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files
/root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w
10 -n 20
2024-12-03 22:39:20 +00:00
Binyang Li
593478e1b7 Add cross threadblock barrier (#383) 2024-11-26 05:13:30 +00:00
Binyang Li
db8e187407 Fix typo (#389) 2024-11-21 22:45:50 +00:00
Binyang Li
28a57b0610 NVLS support for msccl++ executor (#375)
- Support mote datatype for multicast operation
- Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS
- Modify allocSharedPhysicalCuda, which return std::shared_ptr<T>
instead of std::shared_ptr<PhysicalCudaMemory>
- Add Python support for allocSharedPhysicalCuda

Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Changho Hwang
f8c0bcca2b Perf optimization & support clipping (#364)
Co-authored-by: Nusrat Islam <Nusrat.Islam@amd.com>
2024-10-16 14:35:08 -07:00
Ziyue Yang
5c4e105814 Fix NPKit exit event offset (#356) 2024-09-19 13:35:44 +08:00
Binyang Li
b30bb260e3 Tune threads per block for mscclpp executor (#345) 2024-09-18 17:21:47 -07:00
Roshan Dathathri
7ed13ec4b5 Auto-tune vector sizes for NVLS allreduce6 (#338)
Also fixes bugs in MscclppAllReduce6

Below is the performance when the algorithm is fixed to
MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.

Float16:

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us)
| NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 2.0 KiB | 11.15 | 0.18 | PASS | 13.82 | 0.15 | PASS | 1.24 |
| 4.0 KiB | 11.15 | 0.37 | PASS | 14.74 | 0.28 | PASS | 1.32 |
| 8.0 KiB | 11.14 | 0.74 | PASS | 15.17 | 0.54 | PASS | 1.36 |
| 16.0 KiB | 11.16 | 1.47 | PASS | 15.77 | 1.04 | PASS | 1.41 |
| 32.0 KiB | 11.15 | 2.94 | PASS | 17.50 | 1.87 | PASS | 1.57 |
| 64.0 KiB | 11.18 | 5.86 | PASS | 17.64 | 3.71 | PASS | 1.58 |
| 128.0 KiB | 11.16 | 11.74 | PASS | 17.83 | 7.35 | PASS | 1.60 |
| 256.0 KiB | 11.21 | 23.38 | PASS | 18.00 | 14.57 | PASS | 1.60 |
| 512.0 KiB | 11.70 | 44.81 | PASS | 18.42 | 28.46 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.87 | PASS | 20.23 | 51.83 | PASS | 1.48 |
| 2.0 MiB | 17.29 | 121.27 | PASS | 31.60 | 66.36 | PASS | 1.83 |
| 4.0 MiB | 25.26 | 166.02 | PASS | 38.74 | 108.26 | PASS | 1.53 |
| 8.0 MiB | 40.17 | 208.83 | PASS | 62.86 | 133.45 | PASS | 1.56 |
| 16.0 MiB | 70.92 | 236.56 | PASS | 113.36 | 147.99 | PASS | 1.60 |
| 32.0 MiB | 131.38 | 255.41 | PASS | 203.21 | 165.13 | PASS | 1.55 |
| 64.0 MiB | 253.39 | 264.84 | PASS | 342.12 | 196.15 | PASS | 1.35 |
| 128.0 MiB | 496.74 | 270.20 | PASS | 670.62 | 200.14 | PASS | 1.35 |
| 256.0 MiB | 982.42 | 273.24 | PASS | 1318.36 | 203.61 | PASS | 1.34 |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
 
 Float32:

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us)
| NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 4.0 KiB | 11.04 | 0.37 | PASS | 14.79 | 0.28 | PASS | 1.34 |
| 8.0 KiB | 11.15 | 0.73 | PASS | 15.25 | 0.54 | PASS | 1.37 |
| 16.0 KiB | 11.12 | 1.47 | PASS | 15.87 | 1.03 | PASS | 1.43 |
| 32.0 KiB | 11.13 | 2.95 | PASS | 17.21 | 1.90 | PASS | 1.55 |
| 64.0 KiB | 11.11 | 5.90 | PASS | 17.37 | 3.77 | PASS | 1.56 |
| 128.0 KiB | 11.08 | 11.83 | PASS | 17.54 | 7.47 | PASS | 1.58 |
| 256.0 KiB | 11.15 | 23.50 | PASS | 17.71 | 14.80 | PASS | 1.59 |
| 512.0 KiB | 11.56 | 45.34 | PASS | 18.21 | 28.79 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.90 | PASS | 19.87 | 52.77 | PASS | 1.46 |
| 2.0 MiB | 17.24 | 121.67 | PASS | 31.63 | 66.30 | PASS | 1.84 |
| 4.0 MiB | 25.19 | 166.47 | PASS | 38.63 | 108.57 | PASS | 1.53 |
| 8.0 MiB | 40.38 | 207.72 | PASS | 62.65 | 133.89 | PASS | 1.55 |
| 16.0 MiB | 70.72 | 237.23 | PASS | 114.57 | 146.44 | PASS | 1.62 |
| 32.0 MiB | 131.49 | 255.18 | PASS | 200.79 | 167.11 | PASS | 1.53 |
| 64.0 MiB | 253.98 | 264.23 | PASS | 342.58 | 195.89 | PASS | 1.35 |
| 128.0 MiB | 496.96 | 270.08 | PASS | 670.64 | 200.13 | PASS | 1.35 |
| 256.0 MiB | 982.83 | 273.12 | PASS | 1318.90 | 203.53 | PASS | 1.34 |
| 512.0 MiB | 1954.07 | 274.75 | PASS | 2609.04 | 205.77 | PASS | 1.34 |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
2024-08-16 11:11:54 +08:00
Changho Hwang
8c6fb429e9 bfloat16 support (#336)
* Add bfloat16 support for executor and NCCL interface
* Changed `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00
Roshan Dathathri
f131fae3ec Add support for different vector sizes in multimem instructions (#332) 2024-07-25 10:14:02 -07:00
Changho Hwang
40cb196553 v0.5.2 (#328) 2024-07-16 00:35:18 +00:00
Binyang Li
5f9ee27aa8 Support to write packets via uint2 (#327) 2024-07-15 12:05:13 -07:00
Ziyue Yang
b5a48f836c Separate NPKit CPU timestamp access from different blocks for AMD platform (#321)
Reference: https://github.com/ROCm/rccl/pull/1229
2024-07-02 19:36:48 +08:00
Roshan Dathathri
93ed8e1e58 Add support for multicast reduce insruction (#316) 2024-06-19 13:28:12 -07:00
Ziyue Yang
f29095b3b1 Fix NPKit support for AMD (#312) 2024-06-14 16:22:14 +08:00
Ziyue Yang
76328fe623 Add NPKit GPU event support (#310) 2024-06-13 13:59:50 +08:00
Changho Hwang
1f62dfd7cd Add C++ executor test (#304)
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-05-29 10:54:36 +00:00
Changho Hwang
cddffbc8b6 v0.5.1 (#308) 2024-05-26 14:31:29 -07:00
Changho Hwang
f76eae4dca Fix assert declaration & add a compile test (#303) 2024-05-20 02:39:30 +00:00
Changho Hwang
9c2a96060a v0.5.0 (#298) 2024-05-04 16:51:48 -07:00
Changho Hwang
d4ede480f4 Ethernet support (#284)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-04-25 11:06:43 -07:00
Changho Hwang
89896ff94f Include GPU data types only for kernel code (#292) 2024-04-24 20:55:02 -07:00
Changho Hwang
6c1fa5307c Refactoring NVLS interfaces (#293)
Move NVLS details from the core to a separate interface
2024-04-24 10:05:41 -07:00
Changho Hwang
9934c982a8 Seperate headers for GPU data types (#291)
Prevent unnecessarily including data type headers in everywhere.
2024-04-19 05:52:43 +00:00
Roshan Dathathri
41e0964d93 Allow binding allocated memory to NVLS multicast pointer (#290)
And change NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab Add executor to execute schedule-plan file (#283)
Add executor to execute the JSON schedule file generated by msccl-tools

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711 Fix a typo name (#286) 2024-04-17 23:45:46 +00:00
Changho Hwang
1a7cb98e3a v0.4.3 (#279) 2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7 Fix bootstrapping mechanism (#278)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Changho Hwang
cdaf3aea3d New packet format & optimizations (#256)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-20 20:01:37 -08:00
Changho Hwang
6a19b19ece Fix NVLS support (#258)
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Saeed Maleki
91d592dcc0 NVLS support. (#250)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
aashaka
6c9d159e85 Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 (#251)
MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number registered memories for a
ProxyService to 256. Many use cases, such as KV cache transfer, require
registering more tensors.

This change allows registering up-to 512 memories. Note that this change
uses up the slack bits remaining in the ChannelTrigger struct.
2024-01-23 13:38:33 -08:00
Changho Hwang
1178a9ee47 Minor improvement on device syncer (#231)
Saved 1~2us in some tests
2024-01-15 17:33:59 -08:00
Changho Hwang
c0fe31fa76 Mask each fields of the trigger (#244)
The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design
when each field value is given to exceed the bits limitation (such as
`MSCCLPP_BITS_SIZE`). Even so, we'd better trim exceeding bits of each
field value for safety, so that the invalid usage of a field does not
propagate to other fields.
2024-01-10 19:31:47 -08:00
Binyang Li
163cba08c8 Update interface to let user change fifo size (#243)
Related with this issue:
https://github.com/microsoft/mscclpp/issues/242. The user may use more
threads than the number specified in `fifo_size` to interact with the
FIFO. In this case, there will be unexpected behavior.
Update the interface to let user change fifo size on their demands.
2024-01-09 22:14:36 -08:00
Changho Hwang
70e28b3c76 Do not check value of __HIP_PLATFORM_AMD__ (#240)
According to the
[document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary),
`__HIP_PLATFORM_AMD__` is effective only by definition.
2023-12-25 13:51:18 +08:00
Changho Hwang
f1605b73d6 v0.4.2 (#236) 2023-12-18 11:42:58 +08:00
Changho Hwang
5ff8bc5ef2 Fix & improve perf for ROCm (#232)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-12-18 11:30:08 +08:00
Changho Hwang
5a9998bfba Include cstdint in packet_device.hpp (#233) 2023-12-07 09:44:35 -08:00
Binyang Li
f1b2c9df12 Fix performance downgrade issue & update doc (#229)
For push function, we only need to make sure the instruction `st.global`
will be executed after the while loop. Since there is a Write-After-Read
hazard for `trigger.fst` (Check `this->triggers[curFifoHead % size].fst
!= 0` first then write value to `triggers[curFifoHead % size]`), we can
expect the compiler and hardware can handle this situation correctly.
Remove the `release.sys` there.

BTW, `st.global.release.sys.v2.u64` will cause perf regression issue.
Previous we use `st.global.release.cta.v2.u64`, but seems not necessary.
2023-12-04 10:20:10 -08:00