Changho Hwang
|
6cd8960394
|
DirectChannel Unit Tests (#102)
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
|
2023-06-15 20:55:57 +08:00 |
|
Changho Hwang
|
c4a5958dfc
|
Fix hanging bootstrap issues (#100)
* Renew socket interfaces and error handling into C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
|
2023-06-15 11:29:49 +08:00 |
|
Binyang2014
|
8efacae332
|
update pipeline (#103)
Update Azure pipeline:
- Using mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-nodes
|
2023-06-14 20:14:57 +08:00 |
|
Changho Hwang
|
4d0b0a650f
|
Remove vulnerable sscanf (#101)
|
2023-06-14 10:02:46 +08:00 |
|
Binyang2014
|
b1ce368656
|
Implement host offload algorithm for allgather (#84)
Implement host offload algorithm for allgather
For 1n-8p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
# in-place out-of-place
# size count time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 32 73.02 0.01 0.01 0
# Out of bounds values : 0 OK
#
```
For 2n-16p
```
# Initializing MSCCL++
# Setting up the connection in MSCCL++
#
# in-place out-of-place
# size count time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 90.30 0.01 0.01 0
# Out of bounds values : 0 OK
#
```
|
2023-06-13 10:01:58 +00:00 |
|
Binyang2014
|
6ee4e80317
|
Create Azure pipeline for multi-node tests (#97)
Create Azure pipeline to run mscclpp-test on multi-nodes
|
2023-06-13 06:34:07 +00:00 |
|
Changho Hwang
|
76718e4015
|
Saemal/atomic signal (#96)
* code complete
* fix correctness issue
* Fix correctness issue
* fix lint
* pass compile
* Fix build issue
* Fix runtime error
* Fix correctness issue
* Fix crash issue
* minor change
* Fix memory leak
* Fix review comments
* Finish allgather
* address comments
* load element to register first then store to remote address
* Finish allGather
* init
* Build connections
* allreduce_test works
* Bug fix
* Add CUDA flags
* Add packet copy (LL)
* Lint
* Set tmpPtr from constructors
* Lint
* Multiple blocks per peer
* Beautify
* Temporal ring reduce
* Ring reduce works correctly
* Overlapping
* Fix overlapping
* Improve vector sum
* figuring out how to use atomics
* working now
* wip
* Enhance LL AllReduce
* Support multiple blocks per peer
* Fix a ring reduce bug
* Fix an AllReduce kernel 2 bug
* Bug fix
* wip
* Make it compilable
* Lint
* Lint
* Minor changes
* Unit test to reproduce memory consistency bugs
* Unit test bug fixes
* Fixes
* Typo
* wip
* done with core
* wip
* wip
* compiles
* only the atomic is failing
* almost working
* all tests pass now
* clang-12
* More jailbreaks
* bug fix for common.cu
* adding stdint to concurrency.hpp
* Out-of-place for AllReduce kernel 2
* Optimize `sync()`
* Fix mp_unit_tests
* Init TestEngine with TestArgs
* Change common.cu into common.cc
* Cleanup common.hpp
* Lint
* fixes to the mscclpp-tests
* fixed common.cc
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
|
2023-06-12 21:38:06 -07:00 |
|
Changho Hwang
|
43de015f3f
|
Add packet copy (LL) for AllReduce (#85)
|
2023-06-12 21:53:50 +08:00 |
|
Olli Saarikivi
|
5d5e9a1805
|
Make bootstrap use persistent sockets (#98)
|
2023-06-12 15:13:30 +08:00 |
|
Changho Hwang
|
5a4885ccbb
|
Misc updates (#95)
|
2023-06-12 13:53:43 +08:00 |
|
Changho Hwang
|
798631bd52
|
Update unit tests (#81)
|
2023-06-08 09:58:05 +00:00 |
|
Changho Hwang
|
0c14a67ad2
|
[mscclpp-test] Add AllReduce and AllToAll tests (#83)
|
2023-06-07 10:58:47 +00:00 |
|
Binyang2014
|
d9568a3235
|
Setup Azure pipeline to run mscclpp-test (#93)
|
2023-06-07 14:14:49 +08:00 |
|
Changho Hwang
|
7346e70109
|
Use MSCCL++ Docker image for CodeQL (#94)
|
2023-06-06 18:42:22 +08:00 |
|
Changho Hwang
|
85e664c2f7
|
Update docs (#88)
|
2023-06-05 13:13:10 +08:00 |
|
Changho Hwang
|
9cee6c4a74
|
Cleanup old files and functions (#86)
|
2023-06-01 17:34:57 +08:00 |
|
Olli Saarikivi
|
457c422791
|
Remove alloc.h and beef up cuda_utils.hpp (#82)
|
2023-05-24 08:34:18 +00:00 |
|
Binyang2014
|
216373eab2
|
Add allgather test to mscclpp-test (#78)
Finish allGather
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
|
2023-05-23 00:37:25 -07:00 |
|
Changho Hwang
|
0581bfb431
|
Fix CodeQL workflow (#80)
|
2023-05-22 14:03:30 +08:00 |
|
Changho Hwang
|
8d54bf3301
|
Update CI (#79)
|
2023-05-21 11:45:41 -07:00 |
|
Binyang2014
|
a3cf48cc5d
|
Rewrite mscclpp-test with cpp style API (#77)
- Rewrite mscclpp-test with cpp style API
- Add SM copy
- Add new sendRecv test
|
2023-05-19 14:14:19 +08:00 |
|
Olli Saarikivi
|
4c0883bc91
|
Add a missing throw
|
2023-05-16 16:16:00 -07:00 |
|
Olli Saarikivi
|
4e4d1972e3
|
Cuda smart pointers
|
2023-05-16 16:16:00 -07:00 |
|
Olli Saarikivi
|
00d4896c25
|
Rudimentary CTest support for test executables
|
2023-05-16 16:16:00 -07:00 |
|
Olli Saarikivi
|
d83343ef4e
|
Make getWc not return a void pointer
|
2023-05-16 22:52:38 +00:00 |
|
Olli Saarikivi
|
dee55997e9
|
Remove free and most reinterpret_casts in IB code
|
2023-05-16 22:48:16 +00:00 |
|
Saeed Maleki
|
5de083ad7e
|
freeing cudaMalloc'ed pointers
|
2023-05-15 23:53:30 +00:00 |
|
Saeed Maleki
|
966402706c
|
Merge pull request #72 from microsoft/ziyyang/doxygen
Add Doxygen-based document
|
2023-05-15 16:50:17 -07:00 |
|
Saeed Maleki
|
e21392e2c3
|
Merge branch 'main' into ziyyang/doxygen
|
2023-05-15 23:45:54 +00:00 |
|
Saeed Maleki
|
112d1eeb22
|
Merge pull request #75 from microsoft/api-extension
Merge api-extension branch to main
|
2023-05-15 16:35:50 -07:00 |
|
Saeed Maleki
|
c9ac615b20
|
Merge pull request #74 from microsoft/saemal/offloading
offloading allgather to CPU entirely
|
2023-05-15 16:27:00 -07:00 |
|
Saeed Maleki
|
6f7ca05305
|
Merge remote-tracking branch 'origin/api-extension' into saemal/offloading
|
2023-05-12 22:43:22 +00:00 |
|
Saeed Maleki
|
2a7b745972
|
fully working with double buffering
|
2023-05-12 22:42:22 +00:00 |
|
Olli Saarikivi
|
8f2d7922ed
|
Change install dir
|
2023-05-12 21:25:29 +00:00 |
|
Olli Saarikivi
|
d58e698d51
|
Add headers to install and set default install dir
|
2023-05-12 21:23:01 +00:00 |
|
Saeed Maleki
|
2691784b88
|
working -- at least for single node
|
2023-05-12 20:21:58 +00:00 |
|
Saeed Maleki
|
113473a116
|
more progress
|
2023-05-12 07:01:21 +00:00 |
|
Saeed Maleki
|
31851ad82c
|
host epoch removed
|
2023-05-12 06:11:12 +00:00 |
|
Saeed Maleki
|
ef558a42e8
|
wip
|
2023-05-12 05:54:32 +00:00 |
|
Saeed Maleki
|
260c3e35f0
|
Merge pull request #73 from microsoft/binyli/exception
Refine exception
|
2023-05-11 14:29:41 -07:00 |
|
Saeed Maleki
|
62f96f316c
|
Merge branch 'api-extension' into binyli/exception
|
2023-05-11 21:24:18 +00:00 |
|
Binyang2014
|
643771bf93
|
Merge pull request #71 from microsoft/binyli/merge-main
Resolve conflict and merge main branch to api-extension
|
2023-05-11 17:39:06 +08:00 |
|
Binyang Li
|
e63aae7142
|
Merge api-extension
|
2023-05-11 09:20:41 +00:00 |
|
Binyang Li
|
5704fb7c6a
|
update
|
2023-05-11 08:55:51 +00:00 |
|
Binyang Li
|
1487596dc8
|
update cpplint
|
2023-05-11 08:34:57 +00:00 |
|
Binyang Li
|
785a973ace
|
refine exception
|
2023-05-11 08:25:25 +00:00 |
|
Ziyue Yang
|
e257f19cb8
|
add doc section in readme
|
2023-05-11 00:46:02 +00:00 |
|
Olli Saarikivi
|
96a0c45fb4
|
Remove makefile
|
2023-05-11 00:23:21 +00:00 |
|
Olli Saarikivi
|
9f6c48cbf9
|
Format all files
|
2023-05-11 00:23:14 +00:00 |
|
Olli Saarikivi
|
ccf45b33a2
|
Delete old init code and other C-style code
|
2023-05-10 22:03:42 +00:00 |
|