Changho Hwang
|
6cd8960394
|
DirectChannel Unit Tests (#102)
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
|
2023-06-15 20:55:57 +08:00 |
|
Changho Hwang
|
c4a5958dfc
|
Fix hanging bootstrap issues (#100)
* Renew socket interfaces and error handling into C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
|
2023-06-15 11:29:49 +08:00 |
|
Binyang2014
|
8efacae332
|
update pipeline (#103)
Update Azure pipeline:
- Using mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-nodes
|
2023-06-14 20:14:57 +08:00 |
|
Changho Hwang
|
4d0b0a650f
|
Remove vulnerable sscanf (#101)
|
2023-06-14 10:02:46 +08:00 |
|
Binyang2014
|
b1ce368656
|
Implement host offload algorithm for allgather (#84)
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
# in-place out-of-place
# size count time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 32 73.02 0.01 0.01 0
# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
# in-place out-of-place
# size count time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 90.30 0.01 0.01 0
# Out of bounds values : 0 OK
#
```
|
2023-06-13 10:01:58 +00:00 |
|
Binyang2014
|
6ee4e80317
|
Create Azure pipeline for multi-node tests (#97)
Create Azure pipeline to run mscclpp-test on multi-nodes
|
2023-06-13 06:34:07 +00:00 |
|
Changho Hwang
|
76718e4015
|
Saemal/atomic signal (#96)
* code complete
* fix correctness issue
* Fix correctness issue
* fix lint
* pass compile
* Fix build issue
* Fix runtime error
* Fix correctness issue
* Fix crash issue
* minor change
* Fix memory leak
* Fix review comments
* Finish allgather
* address comments
* load element to register first then store to remote address
* Finish allGather
* init
* Build connections
* allreduce_test works
* Bug fix
* Add CUDA flags
* Add packet copy (LL)
* Lint
* Set tmpPtr from constructors
* Lint
* Multiple blocks per peer
* Beautify
* Temporal ring reduce
* Ring reduce works correctly
* Overlapping
* Fix overlapping
* Improve vector sum
* figuring out how to use atomics
* working now
* wip
* Enhance LL AllReduce
* Support multiple blocks per peer
* Fix a ring reduce bug
* Fix an AllReduce kernel 2 bug
* Bug fix
* wip
* Make it compilable
* Lint
* Lint
* Minor changes
* Unit test to reproduce memory consistency bugs
* Unit test bug fixes
* Fixes
* Typo
* wip
* done with core
* wip
* wip
* compiles
* only the atomic is failing
* almost working
* all tests pass now
* clang-12
* More jailbreaks
* bug fix for common.cu
* adding stdint to concurrency.hpp
* Out-of-place for AllReduce kernel 2
* Optimize `sync()`
* Fix mp_unit_tests
* Init TestEngine with TestArgs
* Change common.cu into common.cc
* Cleanup common.hpp
* Lint
* fixes to the mscclpp-tests
* fixed common.cc
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
|
2023-06-12 21:38:06 -07:00 |
|
Changho Hwang
|
43de015f3f
|
Add packet copy (LL) for AllReduce (#85)
|
2023-06-12 21:53:50 +08:00 |
|
Olli Saarikivi
|
5d5e9a1805
|
Make bootstrap use persistent sockets (#98)
|
2023-06-12 15:13:30 +08:00 |
|
Changho Hwang
|
5a4885ccbb
|
Misc updates (#95)
|
2023-06-12 13:53:43 +08:00 |
|
Changho Hwang
|
798631bd52
|
Update unit tests (#81)
|
2023-06-08 09:58:05 +00:00 |
|
Changho Hwang
|
0c14a67ad2
|
[mscclpp-test] Add AllReduce and AllToAll tests (#83)
|
2023-06-07 10:58:47 +00:00 |
|
Changho Hwang
|
9cee6c4a74
|
Cleanup old files and functions (#86)
|
2023-06-01 17:34:57 +08:00 |
|
Olli Saarikivi
|
457c422791
|
Remove alloc.h and beef up cuda_utils.hpp (#82)
|
2023-05-24 08:34:18 +00:00 |
|
Binyang2014
|
216373eab2
|
Add allgather test to mscclpp-test (#78)
Finish allGather
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
|
2023-05-23 00:37:25 -07:00 |
|
Binyang2014
|
a3cf48cc5d
|
Rewrite mscclpp-test with cpp style API (#77)
- Rewrite mscclpp-test with cpp style API
- Add SM copy
- add new sendRecv test
|
2023-05-19 14:14:19 +08:00 |
|
Olli Saarikivi
|
4e4d1972e3
|
Cuda smart pointers
|
2023-05-16 16:16:00 -07:00 |
|
Olli Saarikivi
|
00d4896c25
|
Rudimentary CTest support for test executables
|
2023-05-16 16:16:00 -07:00 |
|
Olli Saarikivi
|
d83343ef4e
|
Make getWc not return a void pointer
|
2023-05-16 22:52:38 +00:00 |
|
Saeed Maleki
|
5de083ad7e
|
freeing cudaMalloc'ed pointers
|
2023-05-15 23:53:30 +00:00 |
|
Saeed Maleki
|
2a7b745972
|
fully working with double buffering
|
2023-05-12 22:42:22 +00:00 |
|
Saeed Maleki
|
2691784b88
|
working -- at least for single node
|
2023-05-12 20:21:58 +00:00 |
|
Saeed Maleki
|
113473a116
|
more progress
|
2023-05-12 07:01:21 +00:00 |
|
Saeed Maleki
|
31851ad82c
|
host epoch removed
|
2023-05-12 06:11:12 +00:00 |
|
Saeed Maleki
|
ef558a42e8
|
wip
|
2023-05-12 05:54:32 +00:00 |
|
Binyang Li
|
e63aae7142
|
Merge apt-extension
|
2023-05-11 09:20:41 +00:00 |
|
Olli Saarikivi
|
beaf2aea39
|
Move public headers under include/
|
2023-05-10 20:46:49 +00:00 |
|
Olli Saarikivi
|
f4ecae7c96
|
Rename tests/ to test/
|
2023-05-10 18:49:02 +00:00 |
|