Changho Hwang
|
869cdba00c
|
Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
|
2025-01-15 09:44:52 -08:00 |
|
Changho Hwang
|
f2b52c6318
|
Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
|
2025-01-09 11:58:23 -08:00 |
|
Changho Hwang
|
1e82dd444f
|
Make ibverbs optional at compile time (#340)
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
|
2024-08-21 12:47:05 -07:00 |
|
Caio Rocha
|
ead4efc315
|
Dynamically load libibverbs (#337)
|
2024-08-13 23:48:39 -07:00 |
|
Changho Hwang
|
5fa5bd2706
|
Check nvidia_peermem during runtime (#234)
|
2023-12-25 12:02:10 +08:00 |
|
Changho Hwang
|
e710701728
|
Warning ahead of CQ being full (#202)
|
2023-11-15 08:03:29 +00:00 |
|
Saeed Maleki
|
015e29c138
|
adding signal for atomic op (#178)
This addresses [this issue](https://github.com/microsoft/mscclpp/issues/177).
|
2023-09-11 10:46:25 -07:00 |
|
Saeed Maleki
|
8d1b984bed
|
Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
|
2023-08-16 20:00:56 +08:00 |
|
Changho Hwang
|
6ec585f3d8
|
Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
|
2023-06-28 10:39:31 -07:00 |
|
Changho Hwang
|
21eed722af
|
Add license comments (#106)
|
2023-06-25 12:40:12 +08:00 |
|
Saeed Maleki
|
cd69704c7d
|
Minor IB bug fix (#111)
`wr_->next` for IB is set to `nullptr` always.
|
2023-06-19 12:28:38 +08:00 |
|
Changho Hwang
|
c4a5958dfc
|
Fix hanging bootstrap issues (#100)
* Renew socket interfaces and error handling into C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
|
2023-06-15 11:29:49 +08:00 |
|
Changho Hwang
|
76718e4015
|
Saemal/atomic signal (#96)
* code complete
* fix correctness issue
* Fix correctness issue
* fix lint
* pass compile
* Fix build issue
* Fix runtime error
* Fix correctness issue
* Fix crash issue
* minor change
* Fix memory leak
* Fix review comments
* Finish allgather
* address comments
* load element to register first then store to remote address
* Finish allGather
* init
* Build connections
* allreduce_test works
* Bug fix
* Add CUDA flags
* Add packet copy (LL)
* Lint
* Set tmpPtr from constructors
* Lint
* Multiple blocks per peer
* Beautify
* Temporal ring reduce
* Ring reduce works correctly
* Overlapping
* Fix overlapping
* Improve vector sum
* figuring out how to use atomics
* working now
* wip
* Enhance LL AllReduce
* Support multiple blocks per peer
* Fix a ring reduce bug
* Fix an AllReduce kernel 2 bug
* Bug fix
* wip
* Make it compilable
* Lint
* Lint
* Minor changes
* Unit test to reproduce memory consistency bugs
* Unit test bug fixes
* Fixes
* Typo
* wip
* done with core
* wip
* wip
* compiles
* only the atomic is failing
* almost working
* all tests pass now
* clang-12
* More jailbreaks
* bug fix for common.cu
* adding stdint to concurrency.hpp
* Out-of-place for AllReduce kernel 2
* Optimize `sync()`
* Fix mp_unit_tests
* Init TestEngine with TestArgs
* Change common.cu into common.cc
* Cleanup common.hpp
* Lint
* fixes to the mscclpp-tests
* fixed common.cc
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
|
2023-06-12 21:38:06 -07:00 |
|
Changho Hwang
|
9cee6c4a74
|
Cleanup old files and functions (#86)
|
2023-06-01 17:34:57 +08:00 |
|
Olli Saarikivi
|
4e4d1972e3
|
Cuda smart pointers
|
2023-05-16 16:16:00 -07:00 |
|
Olli Saarikivi
|
d83343ef4e
|
Make getWc not return a void pointer
|
2023-05-16 22:52:38 +00:00 |
|
Olli Saarikivi
|
dee55997e9
|
Remove free and most reinterpret_casts in IB code
|
2023-05-16 22:48:16 +00:00 |
|
Binyang Li
|
e63aae7142
|
Merge apt-extension
|
2023-05-11 09:20:41 +00:00 |
|
Olli Saarikivi
|
9f6c48cbf9
|
Format all files
|
2023-05-11 00:23:14 +00:00 |
|
Olli Saarikivi
|
ccf45b33a2
|
Delete old init code and other C-style code
|
2023-05-10 22:03:42 +00:00 |
|
Olli Saarikivi
|
beaf2aea39
|
Move public headers under include/
|
2023-05-10 20:46:49 +00:00 |
|
Saeed Maleki
|
1769138568
|
Host Epoch + Error code
|
2023-05-09 23:10:12 +00:00 |
|
Binyang Li
|
9c40d616d9
|
Merge main branch
|
2023-05-09 10:59:04 +00:00 |
|
Binyang2014
|
8650dbaff8
|
Add exception class for mscclpp (#67)
|
2023-05-06 16:27:25 +08:00 |
|
Binyang Li
|
bb3239fd6b
|
Fix IB write issue
|
2023-05-04 11:03:45 +00:00 |
|
Olli Saarikivi
|
4ba8516832
|
allgather_test_cpp functional again
|
2023-05-02 23:14:13 +00:00 |
|
Saeed Maleki
|
82c27625e6
|
ipc uses a base ptr now
|
2023-04-27 21:33:15 +00:00 |
|
Olli Saarikivi
|
06c6df2350
|
Separate out Transport and TransportFlags
|
2023-04-27 19:06:35 +00:00 |
|
Saeed Maleki
|
8eda6369ee
|
testing connection setup
|
2023-04-27 06:08:35 +00:00 |
|
Changho Hwang
|
08e80f1754
|
IB: completely replaced with C++ interfaces
|
2023-04-27 04:01:46 +00:00 |
|
Changho Hwang
|
35ade686ff
|
IB in cpp style WIP
|
2023-04-23 14:47:07 +00:00 |
|
Saeed Maleki
|
17e8ba17a7
|
lint + typo fix
|
2023-04-17 19:06:58 +00:00 |
|
Saeed Maleki
|
b885d46607
|
add grh flags
|
2023-04-15 01:43:37 +00:00 |
|
Saeed Maleki
|
151ea7658e
|
some changes
|
2023-04-14 18:56:54 +00:00 |
|
Saeed Maleki
|
8927dd4d72
|
great allgather numbers with the current binding mechanism
|
2023-04-01 18:54:42 +00:00 |
|
Saeed Maleki
|
e2cfd5ac83
|
a lot of documentation
|
2023-03-30 00:37:33 +00:00 |
|
Saeed Maleki
|
19bf369dc1
|
link format correction
|
2023-03-27 20:40:15 +00:00 |
|
Binyang Li
|
68a258fce5
|
Fix postSend bug
|
2023-03-24 05:31:10 +00:00 |
|
Changho Hwang
|
48a23243a4
|
Dealloc more resources
|
2023-03-22 12:06:35 +00:00 |
|
Olli Saarikivi
|
0cfe2dcffb
|
Add allpairs allreduce test
To support this, include separate source and destination offsets in the trigger.
Add functions for getting the rank and world size from a communicator.
|
2023-03-21 19:00:13 +00:00 |
|
Changho Hwang
|
e357beef00
|
One fifo per proxy
|
2023-03-13 14:19:36 +00:00 |
|
Changho Hwang
|
2c10142c89
|
IB fixes
|
2023-03-03 08:37:23 +00:00 |
|
Changho Hwang
|
9e5573f16b
|
Misc changes and comments
|
2023-03-03 08:32:47 +00:00 |
|
Changho Hwang
|
a78b78aa43
|
Erase unnecessary memsets
|
2023-02-23 09:42:55 +00:00 |
|
Changho Hwang
|
29a430e7a8
|
NUMA binding
|
2023-02-23 08:18:12 +00:00 |
|
Changho Hwang
|
48b81edf6d
|
Move some files
|
2023-02-22 11:07:22 +00:00 |
|