Commit Graph

335 Commits

Author SHA1 Message Date
Changho Hwang
8232ec731f Working 2023-09-06 12:15:12 +00:00
Changho Hwang
89cad56721 updates 2023-09-06 02:33:21 +00:00
Changho Hwang
ad13693fe8 IB gather WIP 2023-09-05 14:41:08 +00:00
Binyang2014
858e381829 Pytest (#162)
Port Python tests to mscclpp.
To start pytest, run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x`

---------

Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>
2023-09-01 21:22:11 +08:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Binyang2014
a58e2e9623 Make sure the semaphore is not released during the lifetime of SmChannel (#131)
Fix #126

 - Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into the `SmChannel`
 - Add a `DeviceHandle` struct in `SmChannel`
 - Add a `DeviceHandle` template

To use a channel on the device side, users need to write code like this:
```
template <class T>
using DeviceHandle = mscclpp::DeviceHandle<T>;
__device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__device__ DeviceHandle<mscclpp::SmChannel> smChannel;
```

To convert a channel to a device handle, call:
`mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)`
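
The ownership pattern described in this commit can be sketched in plain host-side C++. This is only an illustration under stated assumptions, not the mscclpp implementation: `Semaphore`, `Channel`, and `semaphoreOutlivesCallerReference` are hypothetical stand-ins, and only the names `DeviceHandle` and `deviceHandle` echo the commit message. The channel holds a `std::shared_ptr` to its semaphore, so the raw pointer placed in the device handle stays valid for as long as the channel exists.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a device-to-device semaphore.
struct Semaphore {
  int value = 0;
};

// Hypothetical channel type that mirrors the pattern described above.
class Channel {
 public:
  // Plain-data view of the channel, suitable for copying to a device.
  struct DeviceHandle {
    Semaphore* semaphore;
  };

  explicit Channel(std::shared_ptr<Semaphore> sem) : semaphore_(std::move(sem)) {}

  // Build the device-side view from the owning host-side object.
  DeviceHandle deviceHandle() const { return DeviceHandle{semaphore_.get()}; }

 private:
  // Shared ownership keeps the semaphore alive while the channel exists.
  std::shared_ptr<Semaphore> semaphore_;
};

// Alias template: DeviceHandle<Channel> names Channel::DeviceHandle.
template <class T>
using DeviceHandle = typename T::DeviceHandle;

// Free function that converts a host-side channel to its device handle.
template <class T>
DeviceHandle<T> deviceHandle(const T& channel) {
  return channel.deviceHandle();
}

// Returns true if the semaphore survives after the caller drops its reference.
bool semaphoreOutlivesCallerReference() {
  auto sem = std::make_shared<Semaphore>();
  std::weak_ptr<Semaphore> observer = sem;
  Channel channel(sem);
  sem.reset();  // The caller's reference is gone; the channel still owns one.
  DeviceHandle<Channel> handle = deviceHandle(channel);
  assert(handle.semaphore != nullptr);
  return !observer.expired() && handle.semaphore == observer.lock().get();
}
```

In this sketch, calling `semaphoreOutlivesCallerReference()` returns `true`, because the channel's own `shared_ptr` prevents the semaphore from being destroyed while a handle may still point at it.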

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-07-20 12:18:22 +08:00
Saeed Maleki
e7d5e652df Python bindings (#125)
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Changho Hwang
4114d65c60 Documents & minor updates (#119)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-07 17:35:05 +08:00
Changho Hwang
6ec585f3d8 Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00
Saeed Maleki
df2f0c14ab bootstrap now takes interface (#113)
This PR fixes the issue regarding taking the interface as an input.
2023-06-29 00:16:06 +08:00
Changho Hwang
21eed722af Add license comments (#106) 2023-06-25 12:40:12 +08:00
Saeed Maleki
cd69704c7d Minor IB bug fix (#111)
`wr_->next` for IB is set to `nullptr` always.
2023-06-19 12:28:38 +08:00
Changho Hwang
60b3dd5a61 Bug fixes & resolve warnings (#107)
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00
Changho Hwang
c4a5958dfc Fix hanging bootstrap issues (#100)
* Rewrite socket interfaces and error handling in C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Binyang2014
b1ce368656 Implement host offload algorithm for allgather (#84)
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            32    73.02    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                    in-place                       out-of-place          
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16    90.30    0.01    0.01      0

# Out of bounds values : 0 OK
#
```
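
The bandwidth columns in the two logs above can be cross-checked with a little arithmetic. The sketch below assumes the nccl-tests convention, which this log does not confirm: `size` is the total bytes gathered, algbw is size divided by time, and allgather busbw is algbw scaled by (ranks - 1) / ranks. It also assumes 8 and 16 ranks for the two runs. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Algorithm bandwidth in GB/s: bytes divided by elapsed seconds, over 1e9.
double algBwGBps(double sizeBytes, double timeUs) {
  return sizeBytes / (timeUs * 1e-6) / 1e9;
}

// Allgather bus bandwidth under the assumed nccl-tests convention.
double allGatherBusBwGBps(double algBw, int nRanks) {
  return algBw * (nRanks - 1) / nRanks;
}

// Round to two decimals, matching the precision printed in the logs.
double roundTo2(double x) { return std::round(x * 100.0) / 100.0; }

// Check the two reported rows: 1024 B in 73.02 us and in 90.30 us.
bool reportedRowsAreConsistent() {
  const double alg8 = algBwGBps(1024.0, 73.02);   // about 0.0140 GB/s
  const double alg16 = algBwGBps(1024.0, 90.30);  // about 0.0113 GB/s
  const double eps = 1e-9;
  return std::fabs(roundTo2(alg8) - 0.01) < eps &&
         std::fabs(roundTo2(allGatherBusBwGBps(alg8, 8)) - 0.01) < eps &&
         std::fabs(roundTo2(alg16) - 0.01) < eps &&
         std::fabs(roundTo2(allGatherBusBwGBps(alg16, 16)) - 0.01) < eps;
}
```

Under these assumptions both rows round to the 0.01 GB/s shown in the logs. At a 1024-byte message size the numbers reflect latency rather than achievable bandwidth.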
2023-06-13 10:01:58 +00:00
Changho Hwang
76718e4015 Saemal/atomic signal (#96)
* code complete

* fix correctness issue

* Fix correctness issue

* fix lint

* ass compile

* Fix build issue

* Fix runtime error

* Fix correctness issue

* Fix crash issue

* minor change

* Fix memory leak

* Fix review comments

* Finish allgather

* address comments

* load element to register first then store to remote address

* Finish allGather

* init

* Build connections

* allreduce_test works

* Bug fix

* Add CUDA flags

* Add packet copy (LL)

* Lint

* Set tmpPtr from constructors

* Lint

* Multiple blocks per peer

* Beautify

* Temporal ring reduce

* Ring reduce works correctly

* Overlapping

* Fix overlapping

* Improve vector sum

* figuring out how to use atomics

* working now

* wip

* Enhance LL AllReduce

* Support multiple blocks per peer

* Fix a ring reduce bug

* Fix an AllReduce kernel 2 bug

* Bug fix

* wip

* Make it compilable

* Lint

* Lint

* Minor changes

* Unit test to reproduce memory consistency bugs

* Unit test bug fixes

* Fixes

* Typo

* wip

* done with core

* wip

* wip

* compiles

* only the atomic is failing

* almost working

* all tests pass now

* clang-12

* More jailbreaks

* bug fix for common.cu

* adding stdint to concurrency.hpp

* Out-of-place for AllReduce kernel 2

* Optimize `sync()`

* Fix mp_unit_tests

* Init TestEngine with TestArgs

* Change common.cu into common.cc

* Cleanup common.hpp

* Lint

* fixes to the mscclpp-tests

* fixed common.cc

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-12 21:38:06 -07:00
Olli Saarikivi
5d5e9a1805 Make bootstrap use persistent sockets (#98) 2023-06-12 15:13:30 +08:00
Changho Hwang
5a4885ccbb Misc updates (#95) 2023-06-12 13:53:43 +08:00
Changho Hwang
798631bd52 Update unit tests (#81) 2023-06-08 09:58:05 +00:00
Changho Hwang
9cee6c4a74 Cleanup old files and functions (#86) 2023-06-01 17:34:57 +08:00
Olli Saarikivi
457c422791 Remove alloc.h and beef up cuda_utils.hpp (#82) 2023-05-24 08:34:18 +00:00
Changho Hwang
8d54bf3301 Update CI (#79) 2023-05-21 11:45:41 -07:00
Binyang2014
a3cf48cc5d Rewrite mscclpp-test with cpp style API (#77)
- Rewrite mscclpp-test with cpp style API
- Add SM copy
- add new sendRecv test
2023-05-19 14:14:19 +08:00
Olli Saarikivi
4e4d1972e3 Cuda smart pointers 2023-05-16 16:16:00 -07:00
Olli Saarikivi
d83343ef4e Make getWc not return a void pointer 2023-05-16 22:52:38 +00:00
Olli Saarikivi
dee55997e9 Remove free and most reinterpret_casts in IB code 2023-05-16 22:48:16 +00:00
Saeed Maleki
6f7ca05305 Merge remote-tracking branch 'origin/api-extension' into saemal/offloading 2023-05-12 22:43:22 +00:00
Saeed Maleki
2a7b745972 fully working with double buffering 2023-05-12 22:42:22 +00:00
Saeed Maleki
2691784b88 working -- at least for single node 2023-05-12 20:21:58 +00:00
Saeed Maleki
62f96f316c Merge branch 'api-extension' into binyli/exception 2023-05-11 21:24:18 +00:00
Binyang Li
e63aae7142 Merge api-extension 2023-05-11 09:20:41 +00:00
Binyang Li
785a973ace refine exception 2023-05-11 08:25:25 +00:00
Olli Saarikivi
9f6c48cbf9 Format all files 2023-05-11 00:23:14 +00:00
Olli Saarikivi
ccf45b33a2 Delete old init code and other C-style code 2023-05-10 22:03:42 +00:00
Olli Saarikivi
b2dfd8a8fe Merge branch 'api-extension' of https://github.com/microsoft/mscclpp into api-extension 2023-05-10 20:50:51 +00:00
Olli Saarikivi
beaf2aea39 Move public headers under include/ 2023-05-10 20:46:49 +00:00
Saeed Maleki
c05586f074 Merge branch 'api-extension' of https://github.com/microsoft/mscclpp into api-extension 2023-05-10 20:24:40 +00:00
Saeed Maleki
33eb4093ac timeout fix 2023-05-10 20:24:33 +00:00
Olli Saarikivi
75a2af8de2 Add GoogleTest with CTest integration + some tests
Also rename addSetup to onSetup to unify naming.
2023-05-10 18:46:55 +00:00
Saeed Maleki
1769138568 Host Epoch + Error code 2023-05-09 23:10:12 +00:00
Saeed Maleki
8b384600a9 host epoch works 2023-05-09 22:17:43 +00:00
Binyang Li
9c40d616d9 Merge main branch 2023-05-09 10:59:04 +00:00
Binyang2014
8650dbaff8 Add exception class for mscclpp (#67)
2023-05-06 16:27:25 +08:00
Saeed Maleki
9fb29f9dfc timeout for flush 2023-05-04 17:48:24 +00:00
Binyang Li
bb3239fd6b Fix IB write issue 2023-05-04 11:03:45 +00:00
Olli Saarikivi
bd2121a2ef CMake improvement 2023-05-04 00:53:50 +00:00
Olli Saarikivi
09d5f7c12e Fixes for cmake 2023-05-04 00:39:30 +00:00
Olli Saarikivi
503cdd5c7e CMake build system transition WIP 2023-05-03 23:52:13 +00:00
Saeed Maleki
7af687954c removing old mscclppComm_t comm from communicator 2023-05-03 20:23:51 +00:00
Olli Saarikivi
4a41c19e72 Fix performance bug and base pointer offset 2023-05-03 19:40:23 +00:00