Changho Hwang
4cdb100265
Release GIL for Python APIs with wait (#190)
2023-11-14 21:11:01 +08:00
Changho Hwang
3521fb0280
Clear minor warnings (#214)
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Binyang2014
0863e862f5
minor fix (#211)
Minor fix for the pipeline.
2023-11-03 15:52:42 +08:00
Binyang2014
8a938de9c5
fix pipeline (#209)
Fix the pipeline for the multi-node test.
2023-11-03 05:18:32 +00:00
Binyang2014
db0528ca7f
minor fix (#203)
Minor fix: make sure `int4` values are loaded and stored in the correct byte order.
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-27 15:19:56 +08:00
Changho Hwang
f68820436c
Explicit build dependency on nvidia_peermem (#201)
2023-10-23 04:29:30 +00:00
Changho Hwang
7686e15fbd
Allow infinite waiting (#200)
2023-10-23 12:28:05 +08:00
Binyang2014
6f43282c1d
Fix allreduce bug (#197)
Fix an allreduce correctness issue.
2023-10-18 23:16:57 +08:00
Saeed Maleki
85e8017535
Atomic for semaphores instead of fences (#188)
Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-13 18:57:08 +08:00
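As a rough host-side analogy of letting the semaphore's own atomic operations carry the ordering instead of separate fences, the sketch below uses standard C++ `std::atomic` with release/acquire ordering. It is not the project's device code; the `HostSemaphoreSketch` class and its members are hypothetical, and the non-blocking `poll()` only mirrors the idea named in #181.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical host-side analogy of a counter-based semaphore.
// Release/acquire ordering on the counter itself replaces a separate fence.
class HostSemaphoreSketch {
 public:
  // Producer side: publish one more signal. memory_order_release makes all
  // writes made before signal() visible to a thread that acquires the new value.
  void signal() { inbound_.fetch_add(1, std::memory_order_release); }

  // Consumer side: non-blocking check. Returns true and consumes one signal
  // if more signals have arrived than have been consumed so far.
  bool poll() {
    if (inbound_.load(std::memory_order_acquire) > expected_) {
      ++expected_;
      return true;
    }
    return false;
  }

  // Consumer side: spin until the next signal arrives.
  void wait() {
    while (!poll()) {
      std::this_thread::yield();
    }
  }

 private:
  std::atomic<uint64_t> inbound_{0};
  uint64_t expected_ = 0;  // touched only by the single consumer thread
};
```

A producer thread writes its data and then calls `signal()`; once the consumer's `wait()` returns, the acquire load guarantees it observes that data.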
Saeed Maleki
c4785c9591
Improve debugging messages (#195)
Add debugging information that shows which connections are being made.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-13 16:55:52 +08:00
Saeed Maleki
148681b4bc
Fix a pytest bug (#196)
2023-10-13 16:39:43 +08:00
Changho Hwang
8c0f9e84d0
v0.3.0 (#171)
v0.3.0
2023-10-11 22:35:54 +08:00
Changho Hwang
11ac824cc7
Align interfaces of put/get/putPackets/getPackets (#185)
2023-10-07 22:18:26 +08:00
Changho Hwang
497a9e0c82
Add backup workflows (#189)
2023-10-07 15:13:49 +08:00
Changho Hwang
b3d0fdb8df
Add an atomic signal perf test (#183)
2023-09-18 08:12:14 +00:00
Changho Hwang
6c0ee72916
Construct ProxyChannel with shared pointers (#184)
2023-09-18 05:46:23 +00:00
Changho Hwang
a6b24dcbed
Fix #163 (#182)
The bug was caused by frequent calls to `initialize()` temporarily exhausting
all available ephemeral ports. Fixed by retrying `bind()` after a delay
upon `EADDRINUSE`.
2023-09-15 08:35:01 +00:00
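A minimal sketch of the retry-on-`EADDRINUSE` idea in standard C++. The `retryOnAddrInUse` helper, the retry budget, and the delay are hypothetical; the socket `bind()` attempt is abstracted as any callable that returns 0 on success or an `errno` value on failure.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: call tryBind() until it succeeds, fails with an error
// other than EADDRINUSE, or the retry budget is exhausted.
// tryBind returns 0 on success or an errno value on failure.
int retryOnAddrInUse(const std::function<int()>& tryBind, int maxRetries,
                     std::chrono::milliseconds delay) {
  int err = tryBind();
  for (int attempt = 0; err == EADDRINUSE && attempt < maxRetries; ++attempt) {
    // Give the OS time to release ephemeral ports before trying again.
    std::this_thread::sleep_for(delay);
    err = tryBind();
  }
  return err;
}
```

With a real socket, the callable could be `[&] { return ::bind(fd, addr, len) == 0 ? 0 : errno; }`. Errors other than `EADDRINUSE` are returned immediately, since waiting would not fix them.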
Changho Hwang
3aa72098d9
Add poll() for semaphores (#181)
2023-09-15 07:40:44 +00:00
Changho Hwang
d2f13f1e54
Fix #174 (#180)
Added `extern "C"` based on another specification in
`/usr/local/cuda/include/crt/common_functions.h`.
2023-09-15 06:44:41 +00:00
Changho Hwang
e11109f6e8
Add CUDA 11.8 Dockerfiles (#146)
2023-09-15 05:53:03 +00:00
Changho Hwang
bb64f68d74
Update issue templates (#179)
2023-09-15 04:05:09 +00:00
Binyang2014
952f2da9cc
Improve single node allreduce performance (#169)
Improve allreduce performance on a single node.
New numbers:
| n_ctx | size | target latency (us) | allreduce5 (us) | allreduce6 (us) |
|---------|---------|----------------|------------|------------|
| 1 | 24.0kB | 7.7 | | 7.23|
| 2 | 48.0kB | 7.7 | | 7.69|
| 4 | 96.0kB | 8 | | 8.34|
| 8 | 192.0kB | 12.6 | | 9.75|
| 12 | 288.0kB | 13 | | 11.34|
| 16 | 384.0kB | 13.3 | | 12.99|
| 768 | 18.0MB | 158.7 | 160.3| |
| 896 | 21.0MB | 184.5 | 183.8| |
| 1024 | 24.0MB | 209.5 | 207.5| |
| 1152 | 27.0MB | 234.3 | 231.9| |
| 1280 | 30.0MB | 260 | 255.6| |
| 1408 | 33.0MB | 284.9 | 278.7| |
| 1536 | 36.0MB | 310.3 | 302.0| |
| 1664 | 39.0MB | 336.2 | 325.3| |
| 1792 | 42.0MB | 361.4 | 348.8| |
| 1920 | 45.0MB | 384.6 | 372.2| |
| 2048 | 48.0MB | 409.1 | 395.4| |
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-09-13 14:30:08 +00:00
Saeed Maleki
015e29c138
adding signal for atomic op (#178)
This addresses [issue #177](https://github.com/microsoft/mscclpp/issues/177).
2023-09-11 10:46:25 -07:00
Binyang2014
097aa8843a
Fix pytest unstable issue. (#170)
- Remove `#include <cstdint>` from `poll.hpp` so that it contains only
device-side code.
- Fix a compilation issue that caused pytest to fail randomly. Reuse
the compiled result for the same kernel with different arguments.
2023-09-06 17:09:04 -07:00
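Reusing the compiled result amounts to memoizing compilation on the kernel's identity rather than on its launch arguments. The C++ sketch below shows that idea only; `KernelCache`, `CompiledKernel`, and the `compile` callback are hypothetical stand-ins, not the test suite's actual code.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a compiled kernel (e.g., a loaded module handle).
struct CompiledKernel {
  std::string source;
};

// Compile each distinct kernel source once; later requests for the same
// source reuse the cached result, whatever launch arguments they use.
class KernelCache {
 public:
  using Compiler = std::function<std::shared_ptr<CompiledKernel>(const std::string&)>;

  explicit KernelCache(Compiler compile) : compile_(std::move(compile)) {}

  std::shared_ptr<CompiledKernel> get(const std::string& source) {
    auto it = cache_.find(source);
    if (it != cache_.end()) return it->second;
    auto kernel = compile_(source);
    cache_.emplace(source, kernel);
    return kernel;
  }

 private:
  Compiler compile_;
  std::unordered_map<std::string, std::shared_ptr<CompiledKernel>> cache_;
};
```

In practice the cache key should also include anything that changes the compiled binary (compiler options, target architecture), but never the runtime arguments.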
Olli Saarikivi
828be48b21
Add Context and Endpoint classes to enable non-Communicator use-cases (#166)
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.
The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because, with the way `Context` is
implemented, a `Connection` is now fully initialized on construction.
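For readers unfamiliar with the pattern, a non-blocking future pairs a `ready()` check with a `get()` that is only valid once the value exists. The sketch below builds one on `std::shared_future`; `NonblockingFutureSketch` is a hypothetical name, and this is not the library's implementation.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <utility>

// Hypothetical sketch of a future whose get() never blocks.
template <typename T>
class NonblockingFutureSketch {
 public:
  explicit NonblockingFutureSketch(std::shared_future<T> future) : future_(std::move(future)) {}

  // True once the value has been set; does not block.
  bool ready() const {
    return future_.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
  }

  // Returns the value, or throws if it is not ready yet instead of waiting.
  T get() const {
    if (!ready()) throw std::runtime_error("future is not ready");
    return future_.get();
  }

 private:
  std::shared_future<T> future_;
};
```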
Smaller breaking API changes that follow from this are that
`RegisteredMemory` no longer has a `rank()` function (as there may be no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.
A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.
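The implicit conversion works because a non-`explicit` constructor that can be called with a single `Transport` lets a `Transport` stand in wherever an `EndpointConfig` is expected. The simplified types below are hypothetical stand-ins (the real classes carry more fields, such as the IB parameters) meant only to show the C++ mechanism.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the real mscclpp types.
enum class Transport { CudaIpc, IB0 };

struct EndpointConfig {
  Transport transport;
  int ibMaxCqSize;  // example of an IB-specific parameter (hypothetical)

  // Not marked explicit: a bare Transport converts implicitly,
  // with the IB parameters left at their defaults.
  EndpointConfig(Transport t = Transport::CudaIpc, int maxCqSize = 1024)
      : transport(t), ibMaxCqSize(maxCqSize) {}
};

// Hypothetical API that accepts the full configuration.
int connectSketch(const EndpointConfig& config) { return config.ibMaxCqSize; }
```

Callers that only care about the transport keep writing `connectSketch(Transport::IB0)`, while callers that need to tune IB settings construct an `EndpointConfig` explicitly.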
Miscellaneous changes:
- Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
- Enables ctest to be run from the build root directory.
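The PIMPL arrangement described above, with both the `Impl` struct and the `pimpl_` pointer private, looks roughly like the sketch below. `Widget` and its members are hypothetical; in a real library the `Impl` definition would live in a source file rather than next to the class declaration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// --- header part: only the declaration of Impl is visible to users ---
class Widget {
 public:
  explicit Widget(std::string name);
  ~Widget();  // defined where Impl is a complete type
  std::string name() const;

 private:
  struct Impl;                   // private: users cannot name Widget::Impl
  std::unique_ptr<Impl> pimpl_;  // private: users cannot reach the internals
};

// --- source part: the full definition of Impl and the member functions ---
struct Widget::Impl {
  std::string name;
};

Widget::Widget(std::string name) : pimpl_(std::make_unique<Impl>(Impl{std::move(name)})) {}
Widget::~Widget() = default;
std::string Widget::name() const { return pimpl_->name; }
```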
2023-09-06 13:10:04 +08:00
Binyang2014
858e381829
Pytest (#162)
Port Python tests to mscclpp.
Please run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest
---------
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>
2023-09-01 21:22:11 +08:00
Changho Hwang
3df18d20a3
Update install guidelines (#159)
2023-08-30 10:40:40 -07:00
Saeed Maleki
8d1b984bed
Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Moved device code into separate files
* Fixed FIFO polling issues
* Added configuration arguments to several interface functions
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Olli Saarikivi
4865b2017b
Add Python get_include() (#141)
Introduces `mscclpp.get_include()` in the Python module.
The extension module is now named _mscclpp so that we can have
Python code in the mscclpp module.
Also does some miscellaneous cleanup.
2023-07-25 10:23:16 -07:00
Binyang2014
9a488f0da2
update python binding (#136)
Update the Python bindings for `device_handle`.
2023-07-24 03:00:33 +00:00
Binyang2014
a58e2e9623
Make sure the semaphore not be released during the lifecycle of SmChannel (#131)
Fix #126
- Put `std::shared_ptr<SmDevice2DeviceSemaphore>` into `SmChannel`
- Add a `DeviceHandle` struct in `SmChannel`
- Add a `DeviceHandle` template
To use a channel on the device side, users write code like this:
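```
// DeviceHandle<T> is an alias template, so it needs its own template parameter.
template <class T>
using DeviceHandle = mscclpp::DeviceHandle<T>;

__device__ DeviceHandle<mscclpp::SimpleProxyChannel> channel;
__device__ DeviceHandle<mscclpp::SmChannel> smChannel;
```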
To convert a channel to its device handle, call:
`mscclpp::deviceHandle(SimpleProxyChannel or SmChannel)`
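One way to implement such a `DeviceHandle<T>` alias and `deviceHandle()` helper is to have each host-side channel declare a nested `DeviceHandle` type plus a method that builds it. The plain C++ sketch below (no CUDA) uses that approach with hypothetical `ToyChannel` and `ToySemaphore` types; it illustrates the pattern, not the library's exact definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <type_traits>

// Alias template: DeviceHandle<T> names the device-side view that T declares.
template <class T>
using DeviceHandle = typename T::DeviceHandle;

// Converts a host-side channel into its device handle.
template <class T>
DeviceHandle<std::decay_t<T>> deviceHandle(T&& channel) {
  return channel.deviceHandle();
}

// Hypothetical semaphore owned by the host-side channel.
struct ToySemaphore {
  uint64_t value = 0;
};

// Hypothetical host-side channel. Holding the semaphore in a shared_ptr keeps
// it alive for as long as the channel exists.
struct ToyChannel {
  // Plain data that could be copied into GPU memory; it owns nothing.
  struct DeviceHandle {
    uint64_t* semaphore;
  };

  std::shared_ptr<ToySemaphore> semaphore = std::make_shared<ToySemaphore>();

  DeviceHandle deviceHandle() const { return DeviceHandle{&semaphore->value}; }
};
```

The split keeps ownership (the `shared_ptr`) on the host object, while the handle copied to the device holds only raw pointers, which is why the host channel must outlive any kernel using its handle.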
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-07-20 12:18:22 +08:00
Saeed Maleki
e7d5e652df
Python bindings (#125)
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Changho Hwang
2e1645782e
v0.2.0 (#123)
v0.2.0
2023-07-11 10:54:19 +08:00
Changho Hwang
1d71715d19
Separate mscclpp-test kernels (#122)
Separate different kernel implementations in mscclpp-test to reduce the
number of registers required by the kernels.
2023-07-10 10:11:20 -07:00
Changho Hwang
2b983334a2
Add Dockerfiles (#104)
2023-07-10 18:07:32 +08:00
Binyang2014
56bdbc2f32
Enable test for both cuda11 and cuda12 (#124)
Update the pipeline: enable tests for both CUDA 11 and CUDA 12.
2023-07-10 13:19:14 +08:00
Changho Hwang
4114d65c60
Documents & minor updates (#119)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-07 17:35:05 +08:00
Changho Hwang
bb7b85a810
2-node AllReduce improvements (#118)
* Added `get()` interfaces to `SmChannel`
* Improved 2-node (8 GPUs/node) AllReduce: algbw 139 GB/s for 1 GB
(kernel 3) and 99 GB/s for 48 MB (kernel 4)
* Fixed a FIFO perf bug
* Several fixes & validations in mscclpp-test
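For orientation, if algbw follows the nccl-tests convention (message size divided by elapsed time, with 1 GB = 2^30 bytes and GB/s meaning 10^9 bytes per second; both are assumptions here), the figures above correspond roughly to these per-operation times:

$$
t = \frac{S}{\text{algbw}}, \qquad
t_{1\,\text{GB}} = \frac{2^{30}\,\text{B}}{139 \times 10^{9}\,\text{B/s}} \approx 7.72\,\text{ms}, \qquad
t_{48\,\text{MB}} = \frac{48 \times 2^{20}\,\text{B}}{99 \times 10^{9}\,\text{B/s}} \approx 508\,\mu\text{s}
$$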
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-07-07 07:05:46 +00:00
Changho Hwang
6ec585f3d8
Packet copy for IB (#109)
* Extend channels to support LL with IB
* Rename classes and interfaces
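LL (low-latency) packets avoid a separate flag write and fence by storing each piece of payload together with a flag in one word, so a reader that sees the expected flag also sees the data. The host-side C++ sketch below shows that general idea (as in NCCL's LL protocol); the `LLPacketSketch` layout is an assumption, not this project's packet format.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

// Hypothetical LL-style packet: each 64-bit word holds 32 bits of payload in
// the low half and a 32-bit flag in the high half, so a single 64-bit store
// publishes a payload word together with its flag.
class LLPacketSketch {
 public:
  LLPacketSketch() {
    word_[0].store(0, std::memory_order_relaxed);
    word_[1].store(0, std::memory_order_relaxed);
  }

  // Writer: no separate flag write or fence is needed, because each flag
  // travels in the same atomic word as the data it guards.
  void write(uint32_t data1, uint32_t data2, uint32_t flag) {
    word_[0].store(pack(data1, flag), std::memory_order_relaxed);
    word_[1].store(pack(data2, flag), std::memory_order_relaxed);
  }

  // Reader: returns the payload only if both words carry the expected flag.
  // Flag 0 matches the initial state, so callers should use nonzero flags.
  std::optional<std::pair<uint32_t, uint32_t>> tryRead(uint32_t flag) const {
    uint64_t w0 = word_[0].load(std::memory_order_relaxed);
    uint64_t w1 = word_[1].load(std::memory_order_relaxed);
    if (static_cast<uint32_t>(w0 >> 32) != flag || static_cast<uint32_t>(w1 >> 32) != flag) {
      return std::nullopt;
    }
    return std::make_pair(static_cast<uint32_t>(w0), static_cast<uint32_t>(w1));
  }

 private:
  static uint64_t pack(uint32_t data, uint32_t flag) {
    return (static_cast<uint64_t>(flag) << 32) | data;
  }

  std::atomic<uint64_t> word_[2];
};
```

The trade-off is bandwidth: half of every word is flag, which is why LL-style copies favor small, latency-sensitive messages.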
2023-06-28 10:39:31 -07:00
Saeed Maleki
df2f0c14ab
bootstrap now takes interface (#113)
This PR fixes the issue regarding taking the interface as an input.
2023-06-29 00:16:06 +08:00
Changho Hwang
21eed722af
Add license comments (#106)
2023-06-25 12:40:12 +08:00
Binyang2014
2640578b22
Add performance check for mscclpp-test (#110)
- Add an ndmv4 perf baseline
- Change mscclpp-test to output perf numbers into a JSON file
- Add a Python script to check the perf results against the baseline
2023-06-21 07:42:53 +00:00
Saeed Maleki
cd7797fd5e
FIFO optimization (#112)
This saves 2 us of IB latency.
2023-06-19 05:36:56 +00:00
Saeed Maleki
cd69704c7d
Minor IB bug fix (#111)
Always set `wr_->next` to `nullptr` for IB work requests.
2023-06-19 12:28:38 +08:00
Changho Hwang
60b3dd5a61
Bug fixes & resolve warnings (#107)
* Fix a bug in host hashing
* Fix a bug in `HostEpoch::wait()`
* Remove misc warnings
2023-06-16 09:31:23 +00:00
Binyang2014
8410fcd8fc
Fix allgather kernel 2 perf bug (#108)
Fix #105
2023-06-16 15:36:20 +08:00
Changho Hwang
6cd8960394
DirectChannel Unit Tests (#102)
* Add DirectChannel unit tests
* Split mp_unit_tests.cu into multiple files
2023-06-15 20:55:57 +08:00
Changho Hwang
c4a5958dfc
Fix hanging bootstrap issues (#100)
* Rewrite socket interfaces and error handling in C++ style
* Fix bootstrap hanging bugs
* Misc code cleanup
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-06-15 11:29:49 +08:00
Binyang2014
8efacae332
update pipeline (#103)
Update the Azure pipeline:
- Use the mscclpp:base-cuda12.1 image for building and testing
- Add mp-ut tests for multi-node setups
2023-06-14 20:14:57 +08:00
Changho Hwang
4d0b0a650f
Remove vulnerable sscanf (#101)
2023-06-14 10:02:46 +08:00