Commit Graph

46 Commits

Author SHA1 Message Date
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Changho Hwang
f2b52c6318 Fix Python binding of exceptions (#444)
* Fixed errors to be catchable from Python code
* Skip IB tests in Python unit tests when IB ports are down
2025-01-09 11:58:23 -08:00
Changho Hwang
1e82dd444f Make ibverbs optional at compile time (#340)
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-08-21 12:47:05 -07:00
Caio Rocha
ead4efc315 Dynamically load libibverbs (#337) 2024-08-13 23:48:39 -07:00
Changho Hwang
5fa5bd2706 Check nvidia_peermem during runtime (#234) 2023-12-25 12:02:10 +08:00
Changho Hwang
e710701728 Warning ahead of CQ being full (#202) 2023-11-15 08:03:29 +00:00
Saeed Maleki
015e29c138 adding signal for atomic op (#178)
This addresses [this issue](https://github.com/microsoft/mscclpp/issues/177).
2023-09-11 10:46:25 -07:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Changho Hwang
6ec585f3d8 Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
2023-06-28 10:39:31 -07:00
Changho Hwang
21eed722af Add license comments (#106) 2023-06-25 12:40:12 +08:00
Saeed Maleki
cd69704c7d Minor IB bug fix (#111)
`wr_->next` for IB is set to `nullptr` always.
2023-06-19 12:28:38 +08:00
Changho Hwang
c4a5958dfc Fix hanging bootstrap issues (#100)
* Renew socket interfaces and error handling into C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Changho Hwang
76718e4015 Saemal/atomic signal (#96)
* code complete

* fix correctness issue

* Fix correctness issue

* fix lint

* pass compile

* Fix build issue

* Fix runtime error

* Fix correctness issue

* Fix crash issue

* minor change

* Fix memory leak

* Fix review comments

* Finish allgather

* address comments

* load element to register first then store to remote address

* Finish allGather

* init

* Build connections

* allreduce_test works

* Bug fix

* Add CUDA flags

* Add packet copy (LL)

* Lint

* Set tmpPtr from constructors

* Lint

* Multiple blocks per peer

* Beautify

* Temporal ring reduce

* Ring reduce works correctly

* Overlapping

* Fix overlapping

* Improve vector sum

* figuring out how to use atomics

* working now

* wip

* Enhance LL AllReduce

* Support multiple blocks per peer

* Fix a ring reduce bug

* Fix an AllReduce kernel 2 bug

* Bug fix

* wip

* Make it compilable

* Lint

* Lint

* Minor changes

* Unit test to reproduce memory consistency bugs

* Unit test bug fixes

* Fixes

* Typo

* wip

* done with core

* wip

* wip

* compiles

* only the atomic is failing

* almost working

* all tests pass now

* clang-12

* More jailbreaks

* bug fix for common.cu

* adding stdint to concurrency.hpp

* Out-of-place for AllReduce kernel 2

* Optimize `sync()`

* Fix mp_unit_tests

* Init TestEngine with TestArgs

* Change common.cu into common.cc

* Cleanup common.hpp

* Lint

* fixes to the mscclpp-tests

* fixed common.cc

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-12 21:38:06 -07:00
Changho Hwang
9cee6c4a74 Cleanup old files and functions (#86) 2023-06-01 17:34:57 +08:00
Olli Saarikivi
4e4d1972e3 Cuda smart pointers 2023-05-16 16:16:00 -07:00
Olli Saarikivi
d83343ef4e Make getWc not return a void pointer 2023-05-16 22:52:38 +00:00
Olli Saarikivi
dee55997e9 Remove free and most reinterpret_casts in IB code 2023-05-16 22:48:16 +00:00
Binyang Li
e63aae7142 Merge apt-extension 2023-05-11 09:20:41 +00:00
Olli Saarikivi
9f6c48cbf9 Format all files 2023-05-11 00:23:14 +00:00
Olli Saarikivi
ccf45b33a2 Delete old init code and other C-style code 2023-05-10 22:03:42 +00:00
Olli Saarikivi
beaf2aea39 Move public headers under include/ 2023-05-10 20:46:49 +00:00
Saeed Maleki
1769138568 Host Epoch + Error code 2023-05-09 23:10:12 +00:00
Binyang Li
9c40d616d9 Merge main branch 2023-05-09 10:59:04 +00:00
Binyang2014
8650dbaff8 Add exception class for mscclpp (#67)
Add exception class for mscclpp
2023-05-06 16:27:25 +08:00
Binyang Li
bb3239fd6b Fix IB write issue 2023-05-04 11:03:45 +00:00
Olli Saarikivi
4ba8516832 allgather_test_cpp functional again 2023-05-02 23:14:13 +00:00
Saeed Maleki
82c27625e6 ipc uses a base ptr now 2023-04-27 21:33:15 +00:00
Olli Saarikivi
06c6df2350 Separate out Transport and TransportFlags 2023-04-27 19:06:35 +00:00
Saeed Maleki
8eda6369ee testing connection setup 2023-04-27 06:08:35 +00:00
Changho Hwang
08e80f1754 IB: completely replaced with C++ interfaces 2023-04-27 04:01:46 +00:00
Changho Hwang
35ade686ff IB in cpp style WIP 2023-04-23 14:47:07 +00:00
Saeed Maleki
17e8ba17a7 lint + typo fix 2023-04-17 19:06:58 +00:00
Saeed Maleki
b885d46607 add grh flags 2023-04-15 01:43:37 +00:00
Saeed Maleki
151ea7658e some changes 2023-04-14 18:56:54 +00:00
Saeed Maleki
8927dd4d72 great allgather numbers with the current binding mechanism 2023-04-01 18:54:42 +00:00
Saeed Maleki
e2cfd5ac83 a lot of documentation 2023-03-30 00:37:33 +00:00
Saeed Maleki
19bf369dc1 link format correction 2023-03-27 20:40:15 +00:00
Binyang Li
68a258fce5 Fix postSend bug 2023-03-24 05:31:10 +00:00
Changho Hwang
48a23243a4 Dealloc more resources 2023-03-22 12:06:35 +00:00
Olli Saarikivi
0cfe2dcffb Add allpairs allreduce test
To support this include separate source and destination offsets in the trigger.
Add functions for getting the rank and world size from a communicator.
2023-03-21 19:00:13 +00:00
Changho Hwang
e357beef00 One fifo per proxy 2023-03-13 14:19:36 +00:00
Changho Hwang
2c10142c89 IB fixes 2023-03-03 08:37:23 +00:00
Changho Hwang
9e5573f16b Misc changes and comments 2023-03-03 08:32:47 +00:00
Changho Hwang
a78b78aa43 Erase unnecessary memsets 2023-02-23 09:42:55 +00:00
Changho Hwang
29a430e7a8 NUMA binding 2023-02-23 08:18:12 +00:00
Changho Hwang
48b81edf6d Move some files 2023-02-22 11:07:22 +00:00