Changho Hwang
ae56698d67
New semaphore constructors ( #559 )
...
More intuitive interfaces for creating semaphores and channels. Also
allows channel construction using third-party bootstrappers directly
without overriding MSCCL++ Bootstrap.
2025-07-12 00:10:46 +00:00
Changho Hwang
83356957bd
Improved documentation & minor interface revision ( #541 )
2025-06-03 14:26:27 -07:00
Changho Hwang
de664ad200
Fix #514 ( #521 )
...
* When the same `tag` is used to receive data from the same
remote rank, #514 changed `Communicator::connect` and
`Communicator::recvMemory` to receive data in the order in which
`std::shared_future::get()` is called, instead of the original behavior
of receiving data in the order of the method calls. Since the original
behavior is more intuitive, this change restores it. Now when `get()` is
called on a future, the async function first calls `wait()` on the latest
previously returned future. Recursively, this calls
`wait()` on all previous futures that are not yet ready.
* Removed all deprecated API calls and replaced them with the new ones.
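The chained-`wait()` behavior described above can be sketched in isolation. This is a minimal illustration using deferred `std::async` tasks, not the MSCCL++ implementation: `OrderedReceiver`, `recv()`, and `completed()` are hypothetical names, and pushing a sequence number stands in for actually receiving data.

```cpp
#include <future>
#include <vector>

// Hypothetical stand-in for an API that returns one future per receive call.
// Not MSCCL++ code; it only illustrates the ordering technique.
class OrderedReceiver {
 public:
  // Returns a future for the next receive. Nothing runs until get()/wait().
  std::shared_future<int> recv() {
    std::shared_future<int> prev = last_;  // latest previously returned future
    const int seq = next_++;
    last_ = std::async(std::launch::deferred, [this, prev, seq]() {
                // Finish every earlier receive first. wait() on a deferred
                // future runs its task, which in turn waits on its predecessor.
                if (prev.valid()) prev.wait();
                completed_.push_back(seq);  // stands in for the actual receive
                return seq;
              }).share();
    return last_;
  }

  // Sequence numbers in the order the receives actually completed.
  const std::vector<int>& completed() const { return completed_; }

 private:
  std::shared_future<int> last_;  // invalid until the first recv() call
  int next_ = 0;
  std::vector<int> completed_;
};
```

With three futures obtained from `recv()`, calling `get()` on the last one first still completes the receives in call order, because each deferred task waits on its predecessor before doing its own work.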
2025-05-13 13:43:35 -07:00
Qinghua Zhou
a7c364beb8
nccl/rccl integration ( #469 )
...
Use dlopen to load NCCL/RCCL APIs from a shared library so that
Allgather, Allreduce, Broadcast, and ReduceScatter can fall back to NCCL/RCCL operations.
Add three related environment variables:
-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path/libnccl.so/librccl.so
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather,broadcast,reducescatter" or "all"
By default, if MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, all of these operations fall back to the NCCL/RCCL APIs.
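One way to interpret the operation list is sketched below. The helper names `parseFallbackOps` and `shouldFallBack` are hypothetical and this is not the parsing code from the PR; treating an empty value like an unset one and ignoring unknown names are assumptions made for the example.

```cpp
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper, not MSCCL++ code: turns the value of
// MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION into a set of operation names.
std::set<std::string> parseFallbackOps(const char* value) {
  const std::set<std::string> all = {"allreduce", "allgather", "broadcast",
                                     "reducescatter"};
  // Unset means every operation falls back, as the commit message states.
  if (value == nullptr) return all;
  const std::string text(value);
  // Assumption: an empty value is treated the same as an unset one.
  if (text.empty() || text == "all") return all;

  std::set<std::string> ops;
  std::stringstream stream(text);
  std::string item;
  while (std::getline(stream, item, ',')) {
    // Assumption: names outside the four known operations are ignored.
    if (all.count(item) != 0) ops.insert(item);
  }
  return ops;
}

// Hypothetical convenience wrapper that reads the environment variable.
bool shouldFallBack(const std::string& op) {
  const char* value = std::getenv("MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION");
  return parseFallbackOps(value).count(op) != 0;
}
```

A real integration would presumably also check `MSCCLPP_ENABLE_NCCL_FALLBACK` before consulting this list; that gate is left out to keep the sketch short.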
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
2025-03-20 11:31:37 -07:00
Changho Hwang
869cdba00c
Manage runtime environments ( #452 )
...
* Add `Env` class that manages all runtime environments.
* Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.
2025-01-15 09:44:52 -08:00
Changho Hwang
2127a3ba29
Improve CMake options ( #376 )
...
* Let all CMake option names start with `MSCCLPP_`
* Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-11-22 01:54:11 +00:00
Changho Hwang
0c150e5166
Fix copyright messages ( #367 )
2024-10-17 21:25:46 -07:00
Changho Hwang
d4ede480f4
Ethernet support ( #284 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-04-25 11:06:43 -07:00
Binyang Li
64d837f9ab
Add executor to execute schedule-plan file ( #283 )
...
Add executor to execute the JSON schedule file generated by msccl-tools
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
5ba6ce00c7
Fix bootstrapping mechanism ( #278 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Saeed Maleki
91d592dcc0
NVLS support. ( #250 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
Changho Hwang
a6b24dcbed
Fix #163 ( #182 )
...
The bug occurred because frequent calls to `initialize()` temporarily exhausted
all available ephemeral ports. Fixed by retrying `bind()` after a while
upon `EADDRINUSE`.
2023-09-15 08:35:01 +00:00
Saeed Maleki
8d1b984bed
Change device handle interfaces & others ( #142 )
...
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Saeed Maleki
e7d5e652df
Python bindings ( #125 )
...
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Saeed Maleki
df2f0c14ab
bootstrap now takes interface ( #113 )
...
This PR fixes the issue regarding taking the interface as an input.
2023-06-29 00:16:06 +08:00
Changho Hwang
21eed722af
Add license comments ( #106 )
2023-06-25 12:40:12 +08:00
Changho Hwang
c4a5958dfc
Fix hanging bootstrap issues ( #100 )
...
* Rewrite socket interfaces and error handling in C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Olli Saarikivi
5d5e9a1805
Make bootstrap use persistent sockets ( #98 )
2023-06-12 15:13:30 +08:00
Changho Hwang
5a4885ccbb
Misc updates ( #95 )
2023-06-12 13:53:43 +08:00
Changho Hwang
9cee6c4a74
Cleanup old files and functions ( #86 )
2023-06-01 17:34:57 +08:00
Olli Saarikivi
457c422791
Remove alloc.h and beef up cuda_utils.hpp ( #82 )
2023-05-24 08:34:18 +00:00
Olli Saarikivi
4e4d1972e3
Cuda smart pointers
2023-05-16 16:16:00 -07:00
Olli Saarikivi
9f6c48cbf9
Format all files
2023-05-11 00:23:14 +00:00
Olli Saarikivi
ccf45b33a2
Delete old init code and other C-style code
2023-05-10 22:03:42 +00:00
Olli Saarikivi
beaf2aea39
Move public headers under include/
2023-05-10 20:46:49 +00:00
Saeed Maleki
1769138568
Host Epoch + Error code
2023-05-09 23:10:12 +00:00
Binyang2014
8650dbaff8
Add exception class for mscclpp ( #67 )
...
Add exception class for mscclpp
2023-05-06 16:27:25 +08:00
Saeed Maleki
82c27625e6
ipc uses a base ptr now
2023-04-27 21:33:15 +00:00
Saeed Maleki
8fc822c848
more tests for bootstrap
2023-04-25 22:26:48 +00:00
Saeed Maleki
b73b0132ba
using find instead of searching
2023-04-25 21:27:23 +00:00
Saeed Maleki
8f2f053f2f
more clean up
2023-04-25 21:08:49 +00:00
Changho Hwang
71b075e0d7
Rename
2023-04-25 12:29:32 +00:00
Changho Hwang
4115559c2f
cleanup
2023-04-25 12:25:08 +00:00
Changho Hwang
bb195b2f29
PascalCase for type names
2023-04-25 11:57:02 +00:00
Changho Hwang
31f7897d5d
integrate with new interfaces in mscclpp.hpp
2023-04-25 11:47:58 +00:00
Saeed Maleki
8428b49858
a few minor changes
2023-04-25 01:51:47 +00:00
Saeed Maleki
3546e80aa0
unique ptr for pimpl_ in bootstrap
2023-04-25 00:47:48 +00:00
Saeed Maleki
3fd95265fd
Revert "lint"
...
This reverts commit 2c52ab37ce.
2023-04-24 23:22:56 +00:00
Saeed Maleki
2c52ab37ce
lint
2023-04-24 23:09:12 +00:00
Saeed Maleki
d6e91338d4
bootstrap tests pass
2023-04-24 23:07:38 +00:00
Saeed Maleki
27114d91fb
bootstrap tests pass
2023-04-24 21:50:03 +00:00
Saeed Maleki
f0f058410a
working bootstrap initialization
2023-04-24 19:25:06 +00:00
Saeed Maleki
6f4dc57331
fixed
2023-04-24 07:45:01 +00:00
Saeed Maleki
a9cfb82fcb
wip
2023-04-24 05:58:11 +00:00
Binyang Li
073460c341
fix compile issue
2023-04-23 14:25:56 +00:00
Binyang Li
7e1a77a132
make build pass
2023-04-21 09:41:52 +00:00
Binyang Li
7ac861b1e9
Refactor bootstrap
2023-04-21 08:41:33 +00:00
Binyang2014
804692f282
Binyli/bootstrap ( #60 )
...
Bootstrap refactor.
2023-04-21 13:59:42 +08:00
Saeed Maleki
9c8942f7ac
wip
2023-04-19 22:09:53 +00:00
Saeed Maleki
ec9737db82
progress
2023-04-19 00:34:47 +00:00