Changho Hwang
def68ced64
Add CUDA 12.8 images ( #488 )
2025-03-29 00:31:26 +00:00
Changho Hwang
3565bfdf6d
Renaming channels ( #436 )
...
Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to
`MemoryChannel`
2025-01-24 14:25:31 -08:00
Changho Hwang
869cdba00c
Manage runtime environments ( #452 )
...
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
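The idea of one object owning all runtime environment lookups can be sketched as follows. This is a minimal illustration only, not the actual `mscclpp` `Env` class: the field name `npkit_dump_dir`, the `from_environ` helper, and the empty-string default are assumptions made for the example. Only the variable name `MSCCLPP_NPKIT_DUMP_DIR` comes from the commit message.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Env:
    """Hypothetical snapshot of runtime environment settings (illustrative only)."""

    npkit_dump_dir: str

    @classmethod
    def from_environ(cls, environ=None):
        # Read every setting in one place so the rest of the code never
        # touches os.environ directly. Only the new, prefixed name is read;
        # the old NPKIT_DUMP_DIR name is not consulted in this sketch.
        environ = os.environ if environ is None else environ
        return cls(npkit_dump_dir=environ.get("MSCCLPP_NPKIT_DUMP_DIR", ""))
```

Passing a plain dictionary instead of `os.environ` keeps the lookup easy to test without modifying the process environment.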
2025-01-15 09:44:52 -08:00
Changho Hwang
34945fb107
Add GpuBuffer class ( #423 )
...
* Renamed and moved mem alloc functions into the `mscclpp::detail::`
namespace (now `mscclpp::detail::gpuCalloc*<T>()`)
* Deprecated constructor-calling mem alloc functions
(`mscclpp::makeShared*<T>()` and `mscclpp::makeUnique*<T>()`)
* Added a new `mscclpp::GpuBuffer<T>()` class that should be used in
general for allocating communication buffers
* Added a new `mscclpp.utils.GpuBuffer` Python class that inherits
`cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc`
* Renamed `mscclpp::memcpyCuda*<T>()` functions into
`mscclpp::gpuMemcpy*<T>()` for name consistency
* A few fixes in NVLS memory allocation
* Tackled minor compiler warnings
2025-01-07 18:40:01 -08:00
Binyang Li
c65f19ad1a
Move pipeline to official org ( #406 )
...
Move pipeline to official org. Unify all pipelines
2024-12-16 09:43:00 -08:00
Binyang Li
7a3dcb0627
Setup pipeline for mscclpp over nccl ( #401 )
...
Setup pipeline for mscclpp over nccl
Run `all_reduce_perf` via nccl API
2024-12-07 08:57:45 -08:00
Changho Hwang
756f24c697
Revised ProxyChannel interfaces ( #400 )
...
* Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel`
-> `ProxyChannel`. This makes the interface more consistent: a channel is
now always associated with a src/dst memory region, with `ProxyChannel`
as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst".
`BaseProxyChannel` is not associated with any memory region and is
"sema + fifo".
* `ProxyChannelDeviceHandle` now inherits from
`BaseProxyChannelDeviceHandle`, instead of having one as a member.
2024-12-06 10:53:34 -08:00
Binyang Li
88d28e07a7
Select algo according to json config ( #396 )
...
To run nccl-tests over mscclpp:
mpirun -np 8 --bind-to numa --allow-run-as-root \
-x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so \
-x NCCL_DEBUG=WARN \
-x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files \
/root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20
2024-12-03 22:39:20 +00:00
Binyang Li
593478e1b7
Add cross threadblock barrier ( #383 )
2024-11-26 05:13:30 +00:00
Changho Hwang
2127a3ba29
Improve CMake options ( #376 )
...
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-11-22 01:54:11 +00:00
Binyang Li
28a57b0610
NVLS support for msccl++ executor ( #375 )
...
- Support more datatypes for multicast operations
- Add new OP `MULTI_LOAD_REDUCE_STORE` to support NVLS
- Modify `allocSharedPhysicalCuda` to return `std::shared_ptr<T>`
instead of `std::shared_ptr<PhysicalCudaMemory>`
- Add Python support for `allocSharedPhysicalCuda`
Test passed for `allreduce_nvls.json`
2024-11-20 06:43:28 +00:00
Changho Hwang
0c150e5166
Fix copyright messages ( #367 )
2024-10-17 21:25:46 -07:00
Changho Hwang
74130c7c5e
Use IB transport flags only when an IB device exists ( #355 )
2024-09-19 07:13:11 +00:00
Binyang Li
b30bb260e3
Tune threads per block for mscclpp executor ( #345 )
2024-09-18 17:21:47 -07:00
Binyang Li
7bedb25054
Add proxy channel related operations ( #351 )
...
Add Flush, PutWithSignal, and PutWithFlushAndSignal operations
2024-09-15 13:24:57 -07:00
Caio Rocha
4eca6f1e95
Support executors to send packets over ProxyChannel ( #344 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-08-30 22:10:33 +00:00
Changho Hwang
1e82dd444f
Make ibverbs optional at compile time ( #340 )
...
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-08-21 12:47:05 -07:00
Binyang Li
1351f9f1c5
Add "packet type" option for executor test ( #313 )
...
Add "packet type" option for executor test
2024-06-14 09:53:58 +00:00
Ziyue Yang
76328fe623
Add NPKit GPU event support ( #310 )
2024-06-13 13:59:50 +08:00
Changho Hwang
1f62dfd7cd
Add C++ executor test ( #304 )
...
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-05-29 10:54:36 +00:00
Changho Hwang
f76eae4dca
Fix assert declaration & add a compile test ( #303 )
2024-05-20 02:39:30 +00:00
Changho Hwang
a3cd95bd42
Upgrade gtest ( #300 )
...
The new gtest version resolves a type casting issue:
3044657e7a
2024-05-07 20:49:26 -07:00
Binyang Li
6226556ce2
Optimized the execution kernel ( #294 )
2024-05-03 11:54:50 -07:00
Changho Hwang
d4ede480f4
Ethernet support ( #284 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-04-25 11:06:43 -07:00
Binyang Li
64d837f9ab
Add executor to execute schedule-plan file ( #283 )
...
Add executor to execute the JSON schedule file generated by msccl-tools
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-04-18 19:10:41 +00:00
Changho Hwang
1a7cb98e3a
v0.4.3 ( #279 )
2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7
Fix bootstrapping mechanism ( #278 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com >
2024-03-27 10:24:24 +08:00
Binyang Li
bc465aefcd
Add __launch_bounds__ for mscclpp-test ( #273 )
2024-03-25 15:55:37 -07:00
Binyang Li
4734d8718f
Fix multi-node ci pipeline ( #272 )
...
Add `__launch_bounds__` to fix perf regression issue in CI pipeline
2024-03-12 09:39:00 -07:00
Changho Hwang
cdaf3aea3d
New packet format & optimizations ( #256 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-02-20 20:01:37 -08:00
Changho Hwang
6a19b19ece
Fix NVLS support ( #258 )
...
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Saeed Maleki
91d592dcc0
NVLS support. ( #250 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-04 20:46:10 -08:00
Changho Hwang
4eb0a08b8c
Add putWithSignal() latency tests ( #246 )
2024-01-24 01:10:35 +00:00
Changho Hwang
70e28b3c76
Do not check value of __HIP_PLATFORM_AMD__ ( #240 )
...
According to the
[HIP porting guide](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary),
`__HIP_PLATFORM_AMD__` takes effect simply by being defined, so its value should not be checked.
2023-12-25 13:51:18 +08:00
Changho Hwang
5ff8bc5ef2
Fix & improve perf for ROCm ( #232 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-12-18 11:30:08 +08:00
Changho Hwang
544ff0c21d
ROCm support ( #213 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-11-24 16:41:56 +08:00
Changho Hwang
dab19e00c1
Templatize Dockerfiles & update workflows ( #223 )
...
Images are now built by a script from a shared Dockerfile template
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-11-22 13:29:12 -08:00
Changho Hwang
060fda12e6
mscclpp-test in Python ( #204 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Esha Choukse <eschouks@microsoft.com >
2023-11-16 12:45:25 +08:00
Changho Hwang
3521fb0280
Clear minor warnings ( #214 )
...
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Binyang2014
8a938de9c5
fix pipeline ( #209 )
...
fix pipeline for multi-node test
2023-11-03 05:18:32 +00:00
Binyang2014
6f43282c1d
Fix allreduce bug ( #197 )
...
Fix allreduce correctness issue
2023-10-18 23:16:57 +08:00
Changho Hwang
8c0f9e84d0
v0.3.0 ( #171 )
2023-10-11 22:35:54 +08:00
Changho Hwang
11ac824cc7
Align interfaces of put/get/putPackets/getPackets ( #185 )
2023-10-07 22:18:26 +08:00
Changho Hwang
b3d0fdb8df
Add an atomic signal perf test ( #183 )
2023-09-18 08:12:14 +00:00
Changho Hwang
6c0ee72916
Construct ProxyChannel with shared pointers ( #184 )
2023-09-18 05:46:23 +00:00
Changho Hwang
a6b24dcbed
Fix #163 ( #182 )
...
The bug occurred because frequent calls to `initialize()` temporarily
exhausted all available ephemeral ports. Fixed by retrying `bind()` after a
short delay upon `EADDRINUSE`.
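The retry-on-`EADDRINUSE` pattern described above can be sketched as follows. This is an illustrative Python version, not the code from the fix itself. The function name `bind_with_retry`, the retry count, and the delay are assumptions chosen for the example.

```python
import errno
import socket
import time


def bind_with_retry(sock, address, retries=5, delay_s=0.1):
    """Bind `sock` to `address`, retrying while the OS reports EADDRINUSE.

    Returns the number of failed attempts that preceded the successful bind.
    Any other error, or EADDRINUSE on the final attempt, is re-raised.
    """
    for attempt in range(retries):
        try:
            sock.bind(address)
            return attempt
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE or attempt == retries - 1:
                raise
            # The port may be released shortly; wait before trying again.
            time.sleep(delay_s)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    bind_with_retry(s, ("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    print("bound to", s.getsockname())
    s.close()
```

Only `EADDRINUSE` is treated as transient here, so unrelated failures such as permission errors still surface immediately.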
2023-09-15 08:35:01 +00:00
Changho Hwang
3aa72098d9
Add poll() for semaphores ( #181 )
2023-09-15 07:40:44 +00:00
Changho Hwang
d2f13f1e54
Fix #174 ( #180 )
...
Added `extern "C"` based on another specification in
`/usr/local/cuda/include/crt/common_functions.h`.
2023-09-15 06:44:41 +00:00
Binyang2014
952f2da9cc
Improve single node allreduce performance ( #169 )
...
Improve allreduce performance for a single node.
New numbers:
| n_ctx | size | target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
| 1 | 24.0kB | 7.7 | | 7.23|
| 2 | 48.0kB | 7.7 | | 7.69|
| 4 | 96.0kB | 8 | | 8.34|
| 8 | 192.0kB | 12.6 | | 9.75|
| 12 | 288.0kB | 13 | | 11.34|
| 16 | 384.0kB | 13.3 | | 12.99|
| 768 | 18.0MB | 158.7 | 160.3| |
| 896 | 21.0MB | 184.5 | 183.8| |
| 1024 | 24.0MB | 209.5 | 207.5| |
| 1152 | 27.0MB | 234.3 | 231.9| |
| 1280 | 30.0MB | 260 | 255.6| |
| 1408 | 33.0MB | 284.9 | 278.7| |
| 1536 | 36.0MB | 310.3 | 302.0| |
| 1664 | 39.0MB | 336.2 | 325.3| |
| 1792 | 42.0MB | 361.4 | 348.8| |
| 1920 | 45.0MB | 384.6 | 372.2| |
| 2048 | 48.0MB | 409.1 | 395.4| |
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-09-13 14:30:08 +00:00
Saeed Maleki
015e29c138
adding signal for atomic op ( #178 )
...
This addresses [this issue](https://github.com/microsoft/mscclpp/issues/177).
2023-09-11 10:46:25 -07:00