Binyang Li
774a0104d7
fix
2023-08-18 07:11:22 +00:00
Binyang Li
05d35cdce5
merge main
2023-08-18 05:35:40 +00:00
Saeed Maleki
8d1b984bed
Change device handle interfaces & others ( #142 )
...
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Binyang Li
a54e6a7aa9
update
2023-07-28 03:04:19 +00:00
Binyang Li
59e15c80de
address comments
2023-07-27 09:13:43 +00:00
Binyang Li
72b87d9645
add doc string
2023-07-26 11:20:00 +00:00
Binyang Li
6d18f7b9f8
add UT
2023-07-26 11:03:12 +00:00
Binyang Li
4cda05c7c2
add test
2023-07-26 06:20:51 +00:00
Binyang2014
a58e2e9623
Make sure the semaphore is not released during the lifetime of SmChannel ( #131 )
...
Fix #126
- Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into the `SmChannel`
- add a `DeviceHandle` struct in `SmChannel`
- add `DeviceHandle` template
To use a channel on the device side, users need to write code like this:
```
template <class T>
using DeviceHandle = mscclpp::DeviceHandle<T>;
__device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__device__ DeviceHandle<mscclpp::SmChannel> smChannel;
```
To convert a channel to a device handle, call this function:
`mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)`
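The ownership pattern described above can be illustrated with a minimal host-only sketch. All types here (`Semaphore`, `Channel`) are hypothetical stand-ins and not the mscclpp classes; the sketch only assumes that the host-side channel keeps a `std::shared_ptr` to the semaphore while the device handle is a plain struct carrying a raw pointer:

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a semaphore; NOT the mscclpp type.
struct Semaphore {
  int value = 0;
};

// Hypothetical host-side channel. It holds a shared_ptr so the semaphore
// stays alive for as long as the channel object exists.
class Channel {
 public:
  // Plain struct that is trivially copyable: it carries a raw pointer only.
  struct DeviceHandle {
    Semaphore* semaphore;
  };

  explicit Channel(std::shared_ptr<Semaphore> semaphore) : semaphore_(std::move(semaphore)) {
    assert(semaphore_ != nullptr);
  }

  DeviceHandle deviceHandle() const { return DeviceHandle{semaphore_.get()}; }

 private:
  std::shared_ptr<Semaphore> semaphore_;
};

// Alias template mapping a host-side type to its nested handle type.
template <class T>
using DeviceHandle = typename T::DeviceHandle;

// Free function converting a host-side channel into its handle.
template <class T>
DeviceHandle<T> deviceHandle(const T& channel) {
  return channel.deviceHandle();
}
```

In this sketch the raw pointer inside the handle stays valid even after the caller drops its own `shared_ptr`, because the channel still owns a reference.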
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-07-20 12:18:22 +08:00
Saeed Maleki
e7d5e652df
Python bindings ( #125 )
...
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Changho Hwang
1d71715d19
Separate mscclpp-test kernels ( #122 )
...
Separate different kernel implementations in mscclpp-test to reduce the
number of registers required by the kernels.
2023-07-10 10:11:20 -07:00
Binyang2014
56bdbc2f32
Enable test for both cuda11 and cuda12 ( #124 )
...
Update pipeline: enable test for both cuda11 and cuda12
2023-07-10 13:19:14 +08:00
Changho Hwang
4114d65c60
Documents & minor updates ( #119 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-07 17:35:05 +08:00
Changho Hwang
bb7b85a810
2-node AllReduce improvements ( #118 )
...
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 GPUs/node) AllReduce: algbw 139 GB/s for 1 GB (kernel
3) and 99 GB/s for 48 MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
Changho Hwang
6ec585f3d8
Packet copy for IB ( #109 )
...
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00
Saeed Maleki
df2f0c14ab
bootstrap now takes interface ( #113 )
...
This PR fixes the issue regarding taking the interface as an input.
2023-06-29 00:16:06 +08:00
Changho Hwang
21eed722af
Add license comments ( #106 )
2023-06-25 12:40:12 +08:00
Binyang2014
2640578b22
Add performance check for mscclpp-test ( #110 )
...
- Add ndmv4 perf baseline
- Change mscclpp-test to write perf numbers to a JSON file
- Add a Python script to check the perf results against the baseline
2023-06-21 07:42:53 +00:00
Saeed Maleki
cd7797fd5e
FIFO optimization ( #112 )
...
This saves 2us on IB latency
2023-06-19 05:36:56 +00:00
Changho Hwang
60b3dd5a61
Bug fixes & resolve warnings ( #107 )
...
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00
Binyang2014
8410fcd8fc
Fix allgather kernel 2 perf bug ( #108 )
...
Fix #105
2023-06-16 15:36:20 +08:00
Changho Hwang
6cd8960394
DirectChannel Unit Tests ( #102 )
...
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
2023-06-15 20:55:57 +08:00
Changho Hwang
c4a5958dfc
Fix hanging bootstrap issues ( #100 )
...
* Rewrite socket interfaces and error handling in C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Binyang2014
8efacae332
update pipeline ( #103 )
...
Update Azure pipeline:
- Using mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-nodes
2023-06-14 20:14:57 +08:00
Changho Hwang
4d0b0a650f
Remove vulnerable sscanf ( #101 )
2023-06-14 10:02:46 +08:00
Binyang2014
b1ce368656
Implement host offload algorithm for allgather ( #84 )
...
Implement host offload algorithm for allgather
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
        1024            32    73.02    0.01    0.01       0
# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
        1024            16    90.30    0.01    0.01       0
# Out of bounds values : 0 OK
#
```
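The `algbw` column can be sanity-checked from the `size` and `time` columns. This is a minimal sketch, assuming algbw is bytes divided by elapsed time with 1 GB taken as 1e9 bytes (the convention used by nccl-tests); `algBwGBps` is a hypothetical helper and not part of mscclpp-test:

```cpp
#include <cstddef>

// Algorithm bandwidth in GB/s: bytes moved divided by elapsed time.
// Assumption: 1 GB = 1e9 bytes, and timeUs is the elapsed time in microseconds.
double algBwGBps(std::size_t bytes, double timeUs) {
  const double seconds = timeUs * 1e-6;
  return static_cast<double>(bytes) / seconds / 1e9;
}
```

Under that assumption, 1024 B in 73.02 us gives roughly 0.014 GB/s and 1024 B in 90.30 us gives roughly 0.011 GB/s, both of which round to the 0.01 shown in the tables.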
2023-06-13 10:01:58 +00:00
Binyang2014
6ee4e80317
Create Azure pipeline for multi-node tests ( #97 )
...
Create Azure pipeline to run mscclpp-test on multi-nodes
2023-06-13 06:34:07 +00:00
Changho Hwang
76718e4015
Saemal/atomic signal ( #96 )
...
* code complete
* fix correctness issue
* Fix correctness issue
* fix lint
* pass compile
* Fix build issue
* Fix runtime error
* Fix correctness issue
* Fix crash issue
* minor change
* Fix memory leak
* Fix review comments
* Finish allgather
* address comments
* load element to register first then store to remote address
* Finish allGather
* init
* Build connections
* allreduce_test works
* Bug fix
* Add CUDA flags
* Add packet copy (LL)
* Lint
* Set tmpPtr from constructors
* Lint
* Multiple blocks per peer
* Beautify
* Temporal ring reduce
* Ring reduce works correctly
* Overlapping
* Fix overlapping
* Improve vector sum
* figuring out how to use atomics
* working now
* wip
* Enhance LL AllReduce
* Support multiple blocks per peer
* Fix a ring reduce bug
* Fix an AllReduce kernel 2 bug
* Bug fix
* wip
* Make it compilable
* Lint
* Lint
* Minor changes
* Unit test to reproduce memory consistency bugs
* Unit test bug fixes
* Fixes
* Typo
* wip
* done with core
* wip
* wip
* compiles
* only the atomic is failing
* almost working
* all tests pass now
* clang-12
* More jailbreaks
* bug fix for common.cu
* adding stdint to concurrency.hpp
* Out-of-place for AllReduce kernel 2
* Optimize `sync()`
* Fix mp_unit_tests
* Init TestEngine with TestArgs
* Change common.cu into common.cc
* Cleanup common.hpp
* Lint
* fixes to the mscclpp-tests
* fixed common.cc
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-12 21:38:06 -07:00
Changho Hwang
43de015f3f
Add packet copy (LL) for AllReduce ( #85 )
2023-06-12 21:53:50 +08:00
Olli Saarikivi
5d5e9a1805
Make bootstrap use persistent sockets ( #98 )
2023-06-12 15:13:30 +08:00
Changho Hwang
5a4885ccbb
Misc updates ( #95 )
2023-06-12 13:53:43 +08:00
Changho Hwang
798631bd52
Update unit tests ( #81 )
2023-06-08 09:58:05 +00:00
Changho Hwang
0c14a67ad2
[mscclpp-test] Add AllReduce and AllToAll tests ( #83 )
2023-06-07 10:58:47 +00:00
Changho Hwang
9cee6c4a74
Cleanup old files and functions ( #86 )
2023-06-01 17:34:57 +08:00
Olli Saarikivi
457c422791
Remove alloc.h and beef up cuda_utils.hpp ( #82 )
2023-05-24 08:34:18 +00:00
Binyang2014
216373eab2
Add allgather test to mscclpp-test ( #78 )
...
Finish allGather
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-05-23 00:37:25 -07:00
Binyang2014
a3cf48cc5d
Rewrite mscclpp-test with cpp style API ( #77 )
...
- Rewrite mscclpp-test with cpp style API
- Add SM copy
- add new sendRecv test
2023-05-19 14:14:19 +08:00
Olli Saarikivi
4e4d1972e3
Cuda smart pointers
2023-05-16 16:16:00 -07:00
Olli Saarikivi
00d4896c25
Rudimentary CTest support for test executables
2023-05-16 16:16:00 -07:00
Olli Saarikivi
d83343ef4e
Make getWc not return a void pointer
2023-05-16 22:52:38 +00:00
Saeed Maleki
5de083ad7e
freeing cudaMalloc'ed pointers
2023-05-15 23:53:30 +00:00
Saeed Maleki
2a7b745972
fully working with double buffering
2023-05-12 22:42:22 +00:00
Saeed Maleki
2691784b88
working -- at least for single node
2023-05-12 20:21:58 +00:00
Saeed Maleki
113473a116
more progress
2023-05-12 07:01:21 +00:00
Saeed Maleki
31851ad82c
host epoch removed
2023-05-12 06:11:12 +00:00
Saeed Maleki
ef558a42e8
wip
2023-05-12 05:54:32 +00:00
Binyang Li
e63aae7142
Merge apt-extension
2023-05-11 09:20:41 +00:00
Olli Saarikivi
beaf2aea39
Move public headers under include/
2023-05-10 20:46:49 +00:00
Olli Saarikivi
f4ecae7c96
Rename tests/ to test/
2023-05-10 18:49:02 +00:00