Olli Saarikivi
6b39fd9a54
Remove Setuppable
2023-11-08 18:44:45 +00:00
Olli Saarikivi
8cb63a7d1a
Get rid of comm.setup()
2023-11-08 18:44:45 +00:00
Binyang2014
8a938de9c5
fix pipeline ( #209 )
...
fix pipeline for multi-node test
2023-11-03 05:18:32 +00:00
Binyang2014
6f43282c1d
Fix allreduce bug ( #197 )
...
Fix allreduce correctness issue
2023-10-18 23:16:57 +08:00
Changho Hwang
8c0f9e84d0
v0.3.0 ( #171 )
2023-10-11 22:35:54 +08:00
Changho Hwang
11ac824cc7
Align interfaces of put/get/putPackets/getPackets ( #185 )
2023-10-07 22:18:26 +08:00
Changho Hwang
b3d0fdb8df
Add an atomic signal perf test ( #183 )
2023-09-18 08:12:14 +00:00
Changho Hwang
6c0ee72916
Construct ProxyChannel with shared pointers ( #184 )
2023-09-18 05:46:23 +00:00
Changho Hwang
a6b24dcbed
Fix #163 ( #182 )
...
The bug occurred because frequent calls to `initialize()` temporarily exhaust
all available ephemeral ports. Fixed by retrying `bind()` after a short delay
upon `EADDRINUSE`.
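A minimal sketch of the retry idea described in this commit message, not the actual mscclpp code. The helper name `bindWithRetry`, the attempt limit, and the delay are assumptions made for illustration; the callable is expected to behave like POSIX `bind()` (return 0 on success, -1 with `errno` set on failure).

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of mscclpp.
// Calls `tryBind` up to `maxAttempts` times. A failure is retried only when
// errno is EADDRINUSE, sleeping `delay` between attempts.
// Returns 0 on success, or -1 with errno left as set by the last attempt.
inline int bindWithRetry(const std::function<int()>& tryBind, int maxAttempts,
                         std::chrono::milliseconds delay) {
  assert(maxAttempts > 0);
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (tryBind() == 0) return 0;
    if (errno != EADDRINUSE) return -1;  // unrelated error: do not retry
    if (attempt < maxAttempts) std::this_thread::sleep_for(delay);
  }
  return -1;  // still EADDRINUSE after the last attempt
}
```

With a real socket, the callable would wrap the system call, for example `[&] { return ::bind(fd, addr, addrLen); }`, where `fd`, `addr`, and `addrLen` are whatever the caller already prepared.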
2023-09-15 08:35:01 +00:00
Changho Hwang
3aa72098d9
Add poll() for semaphores ( #181 )
2023-09-15 07:40:44 +00:00
Changho Hwang
d2f13f1e54
Fix #174 ( #180 )
...
Added `extern "C"` based on another specification in
`/usr/local/cuda/include/crt/common_functions.h`.
2023-09-15 06:44:41 +00:00
Binyang2014
952f2da9cc
Improve single node allreduce performance ( #169 )
...
Improve allreduce performance on a single node.
New numbers:
| n_ctx | size | target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
| 1 | 24.0kB | 7.7 | | 7.23|
| 2 | 48.0kB | 7.7 | | 7.69|
| 4 | 96.0kB | 8 | | 8.34|
| 8 | 192.0kB | 12.6 | | 9.75|
| 12 | 288.0kB | 13 | | 11.34|
| 16 | 384.0kB | 13.3 | | 12.99|
| 768 | 18.0MB | 158.7 | 160.3| |
| 896 | 21.0MB | 184.5 | 183.8| |
| 1024 | 24.0MB | 209.5 | 207.5| |
| 1152 | 27.0MB | 234.3 | 231.9| |
| 1280 | 30.0MB | 260 | 255.6| |
| 1408 | 33.0MB | 284.9 | 278.7| |
| 1536 | 36.0MB | 310.3 | 302.0| |
| 1664 | 39.0MB | 336.2 | 325.3| |
| 1792 | 42.0MB | 361.4 | 348.8| |
| 1920 | 45.0MB | 384.6 | 372.2| |
| 2048 | 48.0MB | 409.1 | 395.4| |
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-09-13 14:30:08 +00:00
Saeed Maleki
015e29c138
adding signal for atomic op ( #178 )
...
This addresses [this issue](https://github.com/microsoft/mscclpp/issues/177).
2023-09-11 10:46:25 -07:00
Binyang2014
097aa8843a
Fix pytest unstable issue. ( #170 )
...
- Remove `#include <cstdint>` from `poll.hpp` so that it contains only
device-side code
- Fix a compilation issue that caused pytest to fail randomly. Reuse
the compiled result for the same kernel with different arguments
2023-09-06 17:09:04 -07:00
Olli Saarikivi
828be48b21
Add Context and Endpoint classes to enable non-Communicator use-cases ( #166 )
...
This PR implements and closes #137 . The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.
The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because with the way `Context` is
implemented a `Connection` is now fully initialized on construction.
Some smaller breaking API changes from this change are that
`RegisteredMemory` no longer has a `rank()` function (as there may be no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.
A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.
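As a rough illustration of how such an implicit conversion can work in C++, the sketch below uses a non-explicit converting constructor. Every name, enumerator, field, and default value here is a hypothetical stand-in and not the real mscclpp definition.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only.
enum class Transport { CudaIpc, IB0 };

struct EndpointConfig {
  Transport transport;
  int maxCqSize = 1024;  // made-up default for an IB-style parameter

  // Non-explicit constructor: a bare Transport converts implicitly.
  EndpointConfig(Transport t) : transport(t) {}

  // Full form for callers that want to override the extra parameter.
  EndpointConfig(Transport t, int cqSize) : transport(t), maxCqSize(cqSize) {
    assert(cqSize > 0);
  }
};

// An API that takes the config by value accepts either form.
inline int maxCqSizeFor(EndpointConfig config) { return config.maxCqSize; }
```

Calling `maxCqSizeFor(Transport::IB0)` compiles because the compiler builds a temporary `EndpointConfig` from the `Transport` argument, so call sites that only pass a transport need no changes.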
Miscellaneous changes:
- Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
- Enables ctest to be run from the build root directory.
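For readers unfamiliar with the pattern, here is a generic single-file sketch of PIMPL with both the `Impl` struct and the `pimpl_` pointer kept private. The `Session` class is hypothetical and only illustrates the layout; it is not taken from the mscclpp sources.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical class. In a real project this part would live in a header.
class Session {
 public:
  explicit Session(std::string name);
  ~Session();  // defined below, where Impl is a complete type
  const std::string& name() const;

 private:
  struct Impl;                   // private forward declaration
  std::unique_ptr<Impl> pimpl_;  // private pointer to the implementation
};

// This part would normally live in the .cc file.
struct Session::Impl {
  std::string name;
};

Session::Session(std::string name) : pimpl_(new Impl{std::move(name)}) {}
Session::~Session() = default;

const std::string& Session::name() const {
  assert(pimpl_ != nullptr);
  return pimpl_->name;
}
```

Keeping both members private means code outside the class cannot name `Session::Impl` or reach through `pimpl_`, so the implementation can change without touching the public interface.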
2023-09-06 13:10:04 +08:00
Binyang2014
858e381829
Pytest ( #162 )
...
Port python tests to mscclpp.
Please run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest
---------
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com >
2023-09-01 21:22:11 +08:00
Saeed Maleki
8d1b984bed
Change device handle interfaces & others ( #142 )
...
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net >
2023-08-16 20:00:56 +08:00
Binyang2014
a58e2e9623
Make sure the semaphore is not released during the lifetime of SmChannel ( #131 )
...
Fix #126
- Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into the `SmChannel`
- add a `DeviceHandle` struct in `SmChannel`
- add `DeviceHandle` template
To use a channel on the device side, users need to write code like this:
```
template <class T>
using DeviceHandle = mscclpp::DeviceHandle<T>;
__device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__device__ DeviceHandle<mscclpp::SmChannel> smChannel;
```
To convert a channel to a device handle, call this function:
`mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)`
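The lifetime fix described above can be illustrated with a host-only sketch. It assumes the channel owns its semaphore through a `std::shared_ptr` and hands out a plain handle struct; `Semaphore`, `Channel`, and `ChannelDeviceHandle` are simplified hypothetical types, not the mscclpp classes.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Simplified stand-in for a semaphore object (hypothetical).
struct Semaphore {
  int value = 0;
};

// Plain struct of raw pointers, the kind of object a kernel could receive.
struct ChannelDeviceHandle {
  int* semaphoreValue;
};

// The channel shares ownership of the semaphore, so the semaphore cannot be
// released while the channel is alive.
class Channel {
 public:
  explicit Channel(std::shared_ptr<Semaphore> semaphore)
      : semaphore_(std::move(semaphore)) {
    assert(semaphore_ != nullptr);
  }

  // Returns a handle whose pointers stay valid for the channel's lifetime.
  ChannelDeviceHandle deviceHandle() const { return {&semaphore_->value}; }

 private:
  std::shared_ptr<Semaphore> semaphore_;
};
```

If the creator of the semaphore drops its own `shared_ptr`, the copy held by the channel keeps the object alive, so a handle obtained from `deviceHandle()` does not dangle until the channel itself is destroyed.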
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-07-20 12:18:22 +08:00
Saeed Maleki
e7d5e652df
Python bindings ( #125 )
...
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-07-19 15:35:54 +08:00
Changho Hwang
1d71715d19
Separate mscclpp-test kernels ( #122 )
...
Separate different kernel implementations in mscclpp-test to reduce the
number of registers required by the kernels.
2023-07-10 10:11:20 -07:00
Binyang2014
56bdbc2f32
Enable test for both cuda11 and cuda12 ( #124 )
...
Update pipeline: enable test for both cuda11 and cuda12
2023-07-10 13:19:14 +08:00
Changho Hwang
4114d65c60
Documents & minor updates ( #119 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-07-07 17:35:05 +08:00
Changho Hwang
bb7b85a810
2-node AllReduce improvements ( #118 )
...
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 gpus/node) AllReduce: algbw 139GB/s for 1GB (kernel
3) and 99GB/s for 48MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-07-07 07:05:46 +00:00
Changho Hwang
6ec585f3d8
Packet copy for IB ( #109 )
...
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00
Saeed Maleki
df2f0c14ab
bootstrap now takes interface ( #113 )
...
This PR fixes the issue regarding taking the interface as an input.
2023-06-29 00:16:06 +08:00
Changho Hwang
21eed722af
Add license comments ( #106 )
2023-06-25 12:40:12 +08:00
Binyang2014
2640578b22
Add performance check for mscclpp-test ( #110 )
...
- Add ndmv4 perf baseline
- change mscclpp-test to output perf number into a json file
- add python script to check the perf result with the baseline
2023-06-21 07:42:53 +00:00
Saeed Maleki
cd7797fd5e
FIFO optimization ( #112 )
...
This saves 2us on IB latency
2023-06-19 05:36:56 +00:00
Changho Hwang
60b3dd5a61
Bug fixes & resolve warnings ( #107 )
...
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00
Binyang2014
8410fcd8fc
Fix allgather kernel 2 perf bug ( #108 )
...
Fix #105
2023-06-16 15:36:20 +08:00
Changho Hwang
6cd8960394
DirectChannel Unit Tests ( #102 )
...
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
2023-06-15 20:55:57 +08:00
Changho Hwang
c4a5958dfc
Fix hanging bootstrap issues ( #100 )
...
* Rewrite socket interfaces and error handling in C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-06-15 11:29:49 +08:00
Binyang2014
8efacae332
update pipeline ( #103 )
...
Update Azure pipeline:
- Using mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-nodes
2023-06-14 20:14:57 +08:00
Changho Hwang
4d0b0a650f
Remove vulnerable sscanf ( #101 )
2023-06-14 10:02:46 +08:00
Binyang2014
b1ce368656
Implement host offload algorithm for allgather ( #84 )
...
Implement host offload algorithm for allgather
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
# in-place out-of-place
# size count time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 32 73.02 0.01 0.01 0
# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
# in-place out-of-place
# size count time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 90.30 0.01 0.01 0
# Out of bounds values : 0 OK
#
```
2023-06-13 10:01:58 +00:00
Binyang2014
6ee4e80317
Create Azure pipeline for multi-node tests ( #97 )
...
Create Azure pipeline to run mscclpp-test on multi-nodes
2023-06-13 06:34:07 +00:00
Changho Hwang
76718e4015
Saemal/atomic signal ( #96 )
...
* code complete
* fix correctness issue
* Fix correctness issue
* fix lint
* pass compile
* Fix build issue
* Fix runtime error
* Fix correctness issue
* Fix crash issue
* minor change
* Fix memory leak
* Fix review comments
* Finish allgather
* address comments
* load element to register first then store to remote address
* Finish allGather
* init
* Build connections
* allreduce_test works
* Bug fix
* Add CUDA flags
* Add packet copy (LL)
* Lint
* Set tmpPtr from constructors
* Lint
* Multiple blocks per peer
* Beautify
* Temporal ring reduce
* Ring reduce works correctly
* Overlapping
* Fix overlapping
* Improve vector sum
* figuring out how to use atomics
* working now
* wip
* Enhance LL AllReduce
* Support multiple blocks per peer
* Fix a ring reduce bug
* Fix an AllReduce kernel 2 bug
* Bug fix
* wip
* Make it compilable
* Lint
* Lint
* Minor changes
* Unit test to reproduce memory consistency bugs
* Unit test bug fixes
* Fixes
* Typo
* wip
* done with core
* wip
* wip
* compiles
* only the atomic is failing
* almost working
* all tests pass now
* clang-12
* More jailbreaks
* bug fix for common.cu
* adding stdint to concurrency.hpp
* Out-of-place for AllReduce kernel 2
* Optimize `sync()`
* Fix mp_unit_tests
* Init TestEngine with TestArgs
* Change common.cu into common.cc
* Cleanup common.hpp
* Lint
* fixes to the mscclpp-tests
* fixed common.cc
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-06-12 21:38:06 -07:00
Changho Hwang
43de015f3f
Add packet copy (LL) for AllReduce ( #85 )
2023-06-12 21:53:50 +08:00
Olli Saarikivi
5d5e9a1805
Make bootstrap use persistent sockets ( #98 )
2023-06-12 15:13:30 +08:00
Changho Hwang
5a4885ccbb
Misc updates ( #95 )
2023-06-12 13:53:43 +08:00
Changho Hwang
798631bd52
Update unit tests ( #81 )
2023-06-08 09:58:05 +00:00
Changho Hwang
0c14a67ad2
[mscclpp-test] Add AllReduce and AllToAll tests ( #83 )
2023-06-07 10:58:47 +00:00
Changho Hwang
9cee6c4a74
Cleanup old files and functions ( #86 )
2023-06-01 17:34:57 +08:00
Olli Saarikivi
457c422791
Remove alloc.h and beef up cuda_utils.hpp ( #82 )
2023-05-24 08:34:18 +00:00
Binyang2014
216373eab2
Add allgather test to mscclpp-test ( #78 )
...
Finish allGather
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-05-23 00:37:25 -07:00
Binyang2014
a3cf48cc5d
Rewrite mscclpp-test with cpp style API ( #77 )
...
- Rewrite mscclpp-test with cpp style API
- Add SM copy
- add new sendRecv test
2023-05-19 14:14:19 +08:00
Olli Saarikivi
4e4d1972e3
Cuda smart pointers
2023-05-16 16:16:00 -07:00
Olli Saarikivi
00d4896c25
Rudimentary CTest support for test executables
2023-05-16 16:16:00 -07:00
Olli Saarikivi
d83343ef4e
Make getWc not return a void pointer
2023-05-16 22:52:38 +00:00
Saeed Maleki
5de083ad7e
freeing cudaMalloc'ed pointers
2023-05-15 23:53:30 +00:00