Commit Graph

17 Commits

Author SHA1 Message Date
Changho Hwang
6cd8960394 DirectChannel Unit Tests (#102)
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
2023-06-15 20:55:57 +08:00
Changho Hwang
c4a5958dfc Fix hanging bootstrap issues (#100)
* Renew socket interfaces and error handling into C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Binyang2014
b1ce368656 Implement host offload algorithm for allgather (#84)
Implement host offload algorithm for allgather
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            32    73.02    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16    90.30    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
2023-06-13 10:01:58 +00:00
Changho Hwang
76718e4015 Saemal/atomic signal (#96)
* code complelete

* fix correctness issue

* Fix correctness issuee

* fix lint

* ass compile

* Fix build issue

* Fix runtime error

* Fix correctness issue

* Fix crash issue

* minor change

* Fix memory leak

* Fix review comments

* Finish allgather

* address comments

* load element to register first then store to remote address

* Finish allGather

* init

* Build connections

* allreduce_test works

* Bug fix

* Add CUDA flags

* Add packet copy (LL)

* Lint

* Set tmpPtr from constructors

* Lint

* Multiple blocks per peer

* Beautify

* Temporal ring reduce

* Ring reduce works correctly

* Overlapping

* Fix overlapping

* Improve vector sum

* figuring out how to use atomics

* working now

* wip

* Enhance LL AllReduce

* Support multiple blocks per peer

* Fix a ring reduce bug

* Fix a AllReduce kernel 2 bug

* Bug fix

* wip

* Make it compilable

* Lint

* Lint

* Minor changes

* Unit test to reproduce memory consistency bugs

* Unit test bug fixes

* Fixes

* Typo

* wip

* done with core

* wip

* wip

* compiles

* only the atomic is failing

* almost working

* all tests pass now

* clang-12

* More jailbreaks

* bug fix for common.cu

* adding stdint to concurrency.hpp

* Out-of-place for AllReduce kernel 2

* Optimize `sync()`

* Fix mp_unit_tests

* Init TestEngine with TestArgs

* Change common.cu into common.cc

* Cleanup common.hpp

* Lint

* fixes to the mscclpp-tests

* fixed common.cc

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-12 21:38:06 -07:00
Changho Hwang
43de015f3f Add packet copy (LL) for AllReduce (#85) 2023-06-12 21:53:50 +08:00
Changho Hwang
5a4885ccbb Misc updates (#95) 2023-06-12 13:53:43 +08:00
Changho Hwang
798631bd52 Update unit tests (#81) 2023-06-08 09:58:05 +00:00
Changho Hwang
0c14a67ad2 [mscclpp-test] Add AllReduce and AllToAll tests (#83) 2023-06-07 10:58:47 +00:00
Changho Hwang
9cee6c4a74 Cleanup old files and functions (#86) 2023-06-01 17:34:57 +08:00
Olli Saarikivi
457c422791 Remove alloc.h and beef up cuda_utils.hpp (#82) 2023-05-24 08:34:18 +00:00
Binyang2014
a3cf48cc5d Rewrite mscclpp-test with cpp style API (#77)
- Rewrite mscclpp-test with cpp style API
- Add SM copy
- add new sendRecv test
2023-05-19 14:14:19 +08:00
Olli Saarikivi
4c0883bc91 Add a missing throw 2023-05-16 16:16:00 -07:00
Olli Saarikivi
4e4d1972e3 Cuda smart pointers 2023-05-16 16:16:00 -07:00
Olli Saarikivi
d58e698d51 Add headers to install and set default install dir 2023-05-12 21:23:01 +00:00
Binyang Li
785a973ace refine exception 2023-05-11 08:25:25 +00:00
Olli Saarikivi
9f6c48cbf9 Format all files 2023-05-11 00:23:14 +00:00
Olli Saarikivi
beaf2aea39 Move public headers under include/ 2023-05-10 20:46:49 +00:00