Commit Graph

515 Commits

Author SHA1 Message Date
Changho Hwang
6cd8960394 DirectChannel Unit Tests (#102)
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
2023-06-15 20:55:57 +08:00
Changho Hwang
c4a5958dfc Fix hanging bootstrap issues (#100)
* Renew socket interfaces and error handling into C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Binyang2014
8efacae332 update pipeline (#103)
Update Azure pipeline:
- Using mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-nodes
2023-06-14 20:14:57 +08:00
Changho Hwang
4d0b0a650f Remove vulnerable sscanf (#101) 2023-06-14 10:02:46 +08:00
Binyang2014
b1ce368656 Implement host offload algorithm for allgather (#84)
Implement host offload algorithm for allgather
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            32    73.02    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16    90.30    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
2023-06-13 10:01:58 +00:00
Binyang2014
6ee4e80317 Create Azure pipeline for multi-node tests (#97)
Create Azure pipeline to run mscclpp-test on multi-nodes
2023-06-13 06:34:07 +00:00
Changho Hwang
76718e4015 Saemal/atomic signal (#96)
* code complete

* fix correctness issue

* Fix correctness issue

* fix lint

* pass compile

* Fix build issue

* Fix runtime error

* Fix correctness issue

* Fix crash issue

* minor change

* Fix memory leak

* Fix review comments

* Finish allgather

* address comments

* load element to register first then store to remote address

* Finish allGather

* init

* Build connections

* allreduce_test works

* Bug fix

* Add CUDA flags

* Add packet copy (LL)

* Lint

* Set tmpPtr from constructors

* Lint

* Multiple blocks per peer

* Beautify

* Temporal ring reduce

* Ring reduce works correctly

* Overlapping

* Fix overlapping

* Improve vector sum

* figuring out how to use atomics

* working now

* wip

* Enhance LL AllReduce

* Support multiple blocks per peer

* Fix a ring reduce bug

* Fix an AllReduce kernel 2 bug

* Bug fix

* wip

* Make it compilable

* Lint

* Lint

* Minor changes

* Unit test to reproduce memory consistency bugs

* Unit test bug fixes

* Fixes

* Typo

* wip

* done with core

* wip

* wip

* compiles

* only the atomic is failing

* almost working

* all tests pass now

* clang-12

* More jailbreaks

* bug fix for common.cu

* adding stdint to concurrency.hpp

* Out-of-place for AllReduce kernel 2

* Optimize `sync()`

* Fix mp_unit_tests

* Init TestEngine with TestArgs

* Change common.cu into common.cc

* Cleanup common.hpp

* Lint

* fixes to the mscclpp-tests

* fixed common.cc

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-12 21:38:06 -07:00
Changho Hwang
43de015f3f Add packet copy (LL) for AllReduce (#85) 2023-06-12 21:53:50 +08:00
Olli Saarikivi
5d5e9a1805 Make bootstrap use persistent sockets (#98) 2023-06-12 15:13:30 +08:00
Changho Hwang
5a4885ccbb Misc updates (#95) 2023-06-12 13:53:43 +08:00
Changho Hwang
798631bd52 Update unit tests (#81) 2023-06-08 09:58:05 +00:00
Changho Hwang
0c14a67ad2 [mscclpp-test] Add AllReduce and AllToAll tests (#83) 2023-06-07 10:58:47 +00:00
Binyang2014
d9568a3235 Setup Azure pipeline to run mscclpp-test (#93) 2023-06-07 14:14:49 +08:00
Changho Hwang
7346e70109 Use MSCCL++ Docker image for CodeQL (#94) 2023-06-06 18:42:22 +08:00
Changho Hwang
85e664c2f7 Update docs (#88) 2023-06-05 13:13:10 +08:00
Changho Hwang
9cee6c4a74 Cleanup old files and functions (#86) 2023-06-01 17:34:57 +08:00
Olli Saarikivi
457c422791 Remove alloc.h and beef up cuda_utils.hpp (#82) 2023-05-24 08:34:18 +00:00
Binyang2014
216373eab2 Add allgather test to mscclpp-test (#78)
Finish allGather

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-05-23 00:37:25 -07:00
Changho Hwang
0581bfb431 Fix CodeQL workflow (#80) 2023-05-22 14:03:30 +08:00
Changho Hwang
8d54bf3301 Update CI (#79) 2023-05-21 11:45:41 -07:00
Binyang2014
a3cf48cc5d Rewrite mscclpp-test with cpp style API (#77)
- Rewrite mscclpp-test with cpp style API
- Add SM copy
- add new sendRecv test
2023-05-19 14:14:19 +08:00
Olli Saarikivi
4c0883bc91 Add a missing throw 2023-05-16 16:16:00 -07:00
Olli Saarikivi
4e4d1972e3 Cuda smart pointers 2023-05-16 16:16:00 -07:00
Olli Saarikivi
00d4896c25 Rudimentary CTest support for test executables 2023-05-16 16:16:00 -07:00
Olli Saarikivi
d83343ef4e Make getWc not return a void pointer 2023-05-16 22:52:38 +00:00
Olli Saarikivi
dee55997e9 Remove free and most reinterpret_casts in IB code 2023-05-16 22:48:16 +00:00
Saeed Maleki
5de083ad7e freeing cudaMalloc'ed pointers 2023-05-15 23:53:30 +00:00
Saeed Maleki
966402706c Merge pull request #72 from microsoft/ziyyang/doxygen
Add Doxygen-based document
2023-05-15 16:50:17 -07:00
Saeed Maleki
e21392e2c3 Merge branch 'main' into ziyyang/doxygen 2023-05-15 23:45:54 +00:00
Saeed Maleki
112d1eeb22 Merge pull request #75 from microsoft/api-extension
Merge api-extension branch to main
2023-05-15 16:35:50 -07:00
Saeed Maleki
c9ac615b20 Merge pull request #74 from microsoft/saemal/offloading
offloading allgather to CPU entirely
2023-05-15 16:27:00 -07:00
Saeed Maleki
6f7ca05305 Merge remote-tracking branch 'origin/api-extension' into saemal/offloading 2023-05-12 22:43:22 +00:00
Saeed Maleki
2a7b745972 fully working with double buffering 2023-05-12 22:42:22 +00:00
Olli Saarikivi
8f2d7922ed Change install dir 2023-05-12 21:25:29 +00:00
Olli Saarikivi
d58e698d51 Add headers to install and set default install dir 2023-05-12 21:23:01 +00:00
Saeed Maleki
2691784b88 working -- at least for single node 2023-05-12 20:21:58 +00:00
Saeed Maleki
113473a116 more progress 2023-05-12 07:01:21 +00:00
Saeed Maleki
31851ad82c host epoch removed 2023-05-12 06:11:12 +00:00
Saeed Maleki
ef558a42e8 wip 2023-05-12 05:54:32 +00:00
Saeed Maleki
260c3e35f0 Merge pull request #73 from microsoft/binyli/exception
Refine exception
2023-05-11 14:29:41 -07:00
Saeed Maleki
62f96f316c Merge branch 'api-extension' into binyli/exception 2023-05-11 21:24:18 +00:00
Binyang2014
643771bf93 Merge pull request #71 from microsoft/binyli/merge-main
Resolve conflict and merge main branch to api-extension
2023-05-11 17:39:06 +08:00
Binyang Li
e63aae7142 Merge api-extension 2023-05-11 09:20:41 +00:00
Binyang Li
5704fb7c6a update 2023-05-11 08:55:51 +00:00
Binyang Li
1487596dc8 update cpplint 2023-05-11 08:34:57 +00:00
Binyang Li
785a973ace refine exception 2023-05-11 08:25:25 +00:00
Ziyue Yang
e257f19cb8 add doc section in readme 2023-05-11 00:46:02 +00:00
Olli Saarikivi
96a0c45fb4 Remove makefile 2023-05-11 00:23:21 +00:00
Olli Saarikivi
9f6c48cbf9 Format all files 2023-05-11 00:23:14 +00:00
Olli Saarikivi
ccf45b33a2 Delete old init code and other C-style code 2023-05-10 22:03:42 +00:00