Commit Graph

85 Commits

Author SHA1 Message Date
Changho Hwang
ae56698d67 New semaphore constructors (#559)
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
83356957bd Improved documentation & minor interface revision (#541) 2025-06-03 14:26:27 -07:00
Changho Hwang
de664ad200 Fix #514 (#521)
* When the same `tag` is used for receiving data from the same
remote rank, #514 changed the behavior of `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior
of receiving data in the order of the method calls. Since the original
behavior is more intuitive, this change restores it. Now when `get()` is called
on a future, the async function first calls `wait()` on the latest
previously returned future. Recursively, this calls
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
2025-05-13 13:43:35 -07:00
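The ordering rule described in de664ad200 can be illustrated with a minimal sketch. This is not the MSCCL++ implementation: `OrderedReceiver`, `Wire`, and `recv()` are hypothetical names, and an in-memory vector stands in for data arriving from a remote rank. The sketch only assumes what the message states, namely that each deferred task first waits on the previously returned future.

```cpp
#include <cstddef>
#include <future>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for data arriving from one remote rank with one tag.
struct Wire {
  std::vector<std::string> messages;
  std::size_t next = 0;
};

// Hypothetical class; it only mimics the ordering rule, not the real Communicator.
class OrderedReceiver {
 public:
  explicit OrderedReceiver(std::vector<std::string> messages)
      : wire_(std::make_shared<Wire>()) {
    wire_->messages = std::move(messages);
  }

  // Each call returns a future bound to the next message in call order.
  std::shared_future<std::string> recv() {
    std::shared_future<std::string> prev = last_;
    std::shared_ptr<Wire> wire = wire_;
    last_ = std::async(std::launch::deferred, [prev, wire]() {
              // Waiting on the previous future runs its deferred task, which
              // in turn waits on its own predecessor, and so on.
              if (prev.valid()) prev.wait();
              return wire->messages.at(wire->next++);
            }).share();
    return last_;
  }

 private:
  std::shared_ptr<Wire> wire_;
  std::shared_future<std::string> last_;
};
```

With this shape, calling `get()` on the third returned future before the first two still yields the third message, because the earlier tasks are resolved first.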
Qinghua Zhou
a7c364beb8 nccl/rccl integration (#469)
Use dlopen to load NCCL/RCCL APIs from a shared library so that
Allgather, Allreduce, Broadcast, and ReduceScatter can fall back to NCCL/RCCL operations.

Add three related environment variables:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all"
By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all of these operations fall back to the NCCL/RCCL APIs.
---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-20 11:31:37 -07:00
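As a rough illustration of how the variables in a7c364beb8 could select fallback operations, the sketch below checks an operation name against the comma-separated list. The helper name `fallsBackToNccl` and the exact matching rules (case-sensitive, no whitespace trimming) are assumptions made for illustration, not behavior confirmed by the commit.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, not MSCCL++ code.
// fallbackEnabled models MSCCLPP_ENABLE_NCCL_FALLBACK.
// forceList models MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION ("" when unset).
bool fallsBackToNccl(const std::string& op, bool fallbackEnabled,
                     const std::string& forceList) {
  if (!fallbackEnabled) return false;
  // Unset or "all": every supported operation falls back.
  if (forceList.empty() || forceList == "all") return true;
  std::stringstream stream(forceList);
  std::string item;
  while (std::getline(stream, item, ',')) {
    if (item == op) return true;  // assumed: exact, case-sensitive match
  }
  return false;
}
```

For example, with the list set to "allreduce,broadcast", only those two operations would be routed to the fallback library under these assumptions.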
Changho Hwang
869cdba00c Manage runtime environments (#452)
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
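A class that centralizes runtime environment variables, as 869cdba00c describes, might look roughly like the sketch below. `EnvSketch` and its single member are hypothetical; the real `Env` class may expose different members and semantics. Only the variable name `MSCCLPP_NPKIT_DUMP_DIR` comes from the commit message.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch, not the MSCCL++ Env class.
class EnvSketch {
 public:
  // Reads each variable once at construction; later changes to the
  // process environment are not reflected in this object.
  EnvSketch() : npkitDumpDir(read("MSCCLPP_NPKIT_DUMP_DIR", "")) {}

  const std::string npkitDumpDir;

 private:
  // Returns the variable's value, or defaultValue when it is unset.
  static std::string read(const char* name, const char* defaultValue) {
    const char* value = std::getenv(name);
    return value != nullptr ? value : defaultValue;
  }
};
```

Reading everything in one place keeps `getenv` calls out of the rest of the code and makes the set of recognized variables easy to audit.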
Changho Hwang
2127a3ba29 Improve CMake options (#376)
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Changho Hwang
0c150e5166 Fix copyright messages (#367) 2024-10-17 21:25:46 -07:00
Changho Hwang
d4ede480f4 Ethernet support (#284)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-04-25 11:06:43 -07:00
Binyang Li
64d837f9ab Add executor to execute schedule-plan file (#283)
Add executor to execute the JSON schedule file generated by msccl-tools

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
5ba6ce00c7 Fix bootstrapping mechanism (#278)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Saeed Maleki
91d592dcc0 NVLS support. (#250)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
Changho Hwang
a6b24dcbed Fix #163 (#182)
The bug was caused by frequent calls of initialize() temporarily exhausting
all available ephemeral ports. Fixed by retrying `bind()` after a while
upon `EADDRINUSE`.
2023-09-15 08:35:01 +00:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Saeed Maleki
e7d5e652df Python bindings (#125)
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Saeed Maleki
df2f0c14ab bootstrap now takes interface (#113)
This PR fixes the issue regarding taking the interface as an input.
2023-06-29 00:16:06 +08:00
Changho Hwang
21eed722af Add license comments (#106) 2023-06-25 12:40:12 +08:00
Changho Hwang
c4a5958dfc Fix hanging bootstrap issues (#100)
* Rewrite socket interfaces and error handling in C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Olli Saarikivi
5d5e9a1805 Make bootstrap use persistent sockets (#98) 2023-06-12 15:13:30 +08:00
Changho Hwang
5a4885ccbb Misc updates (#95) 2023-06-12 13:53:43 +08:00
Changho Hwang
9cee6c4a74 Cleanup old files and functions (#86) 2023-06-01 17:34:57 +08:00
Olli Saarikivi
457c422791 Remove alloc.h and beef up cuda_utils.hpp (#82) 2023-05-24 08:34:18 +00:00
Olli Saarikivi
4e4d1972e3 Cuda smart pointers 2023-05-16 16:16:00 -07:00
Olli Saarikivi
9f6c48cbf9 Format all files 2023-05-11 00:23:14 +00:00
Olli Saarikivi
ccf45b33a2 Delete old init code and other C-style code 2023-05-10 22:03:42 +00:00
Olli Saarikivi
beaf2aea39 Move public headers under include/ 2023-05-10 20:46:49 +00:00
Saeed Maleki
1769138568 Host Epoch + Error code 2023-05-09 23:10:12 +00:00
Binyang2014
8650dbaff8 Add exception class for mscclpp (#67)
Add exception class for mscclpp
2023-05-06 16:27:25 +08:00
Saeed Maleki
82c27625e6 ipc uses a base ptr now 2023-04-27 21:33:15 +00:00
Saeed Maleki
8fc822c848 more tests for bootstrap 2023-04-25 22:26:48 +00:00
Saeed Maleki
b73b0132ba using find instead of searching 2023-04-25 21:27:23 +00:00
Saeed Maleki
8f2f053f2f more clean up 2023-04-25 21:08:49 +00:00
Changho Hwang
71b075e0d7 Rename 2023-04-25 12:29:32 +00:00
Changho Hwang
4115559c2f cleanup 2023-04-25 12:25:08 +00:00
Changho Hwang
bb195b2f29 PascalCase for type names 2023-04-25 11:57:02 +00:00
Changho Hwang
31f7897d5d integrate with new interfaces in mscclpp.hpp 2023-04-25 11:47:58 +00:00
Saeed Maleki
8428b49858 a few minor changes 2023-04-25 01:51:47 +00:00
Saeed Maleki
3546e80aa0 unique ptr for pimpl_ in bootstrap 2023-04-25 00:47:48 +00:00
Saeed Maleki
3fd95265fd Revert "lint"
This reverts commit 2c52ab37ce.
2023-04-24 23:22:56 +00:00
Saeed Maleki
2c52ab37ce lint 2023-04-24 23:09:12 +00:00
Saeed Maleki
d6e91338d4 bootstrap tests pass 2023-04-24 23:07:38 +00:00
Saeed Maleki
27114d91fb bootstrap tests pass 2023-04-24 21:50:03 +00:00
Saeed Maleki
f0f058410a working bootstrap initialization 2023-04-24 19:25:06 +00:00
Saeed Maleki
6f4dc57331 fixed 2023-04-24 07:45:01 +00:00
Saeed Maleki
a9cfb82fcb wip 2023-04-24 05:58:11 +00:00
Binyang Li
073460c341 fix compile issue 2023-04-23 14:25:56 +00:00
Binyang Li
7e1a77a132 make build pass 2023-04-21 09:41:52 +00:00
Binyang Li
7ac861b1e9 Refactor bootstrap 2023-04-21 08:41:33 +00:00
Binyang2014
804692f282 Binyli/bootstrap (#60)
Bootstrap refactor.
2023-04-21 13:59:42 +08:00
Saeed Maleki
9c8942f7ac wip 2023-04-19 22:09:53 +00:00
Saeed Maleki
ec9737db82 progress 2023-04-19 00:34:47 +00:00