Commit Graph

71 Commits

Author SHA1 Message Date
Roshan Dathathri
41e0964d93 Allow binding allocated memory to NVLS multicast pointer (#290)
And change NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab Add executor to execute schedule-plan file (#283)
Add executor to execute the JSON schedule file generated by msccl-tools

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711 Fix a typo name (#286) 2024-04-17 23:45:46 +00:00
Changho Hwang
1a7cb98e3a v0.4.3 (#279) 2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7 Fix bootstrapping mechanism (#278)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Saeed Maleki
a3d0799963 Fix the comm.py for nvls (#267)
2024-02-19 10:39:21 +08:00
Binyang Li
5971508eed Remove cuda-python from project (#245)
Remove cuda-python and use CuPy APIs instead

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-13 21:44:11 +08:00
aashaka
d97fef4395 Allow semaphores and memory to be registered separately in ProxyService (#264)
This is needed in use cases where SimpleProxyChannel does not suffice.
For example, when a single semaphore is to be used for multiple tensors
or when multiple semaphores should be associated with a tensor.
2024-02-08 09:55:29 -08:00
Binyang Li
7c229fbdd8 Fix multi-nodes test failure (#262)
Fix the multi-node CI pipeline.

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-07 18:21:05 -08:00
aashaka
2101f5251e Allow MSCCL++ CommGroup to take PyTorch tensors in args (#255)
Obtain data_ptr and tensor_size accordingly for torch.Tensor

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-06 19:47:25 -08:00
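The pointer and size lookup that #255 describes can be pictured roughly as follows. This is a minimal sketch, not the code from the commit: the helper name `buffer_info` is hypothetical, and it assumes the non-tensor case is a CuPy array. `data_ptr()`, `numel()` and `element_size()` are `torch.Tensor` methods; `data.ptr` and `nbytes` are `cupy.ndarray` attributes.

```python
def buffer_info(buf):
    """Return (device_pointer, size_in_bytes) for a tensor-like object.

    Hypothetical helper for illustration; not part of the mscclpp API.
    """
    if hasattr(buf, "data_ptr"):
        # torch.Tensor: data_ptr() is the address of the first element.
        # numel() * element_size() is the byte size of the elements,
        # which matches the memory span only for contiguous tensors.
        return buf.data_ptr(), buf.numel() * buf.element_size()
    if hasattr(buf, "data") and hasattr(buf.data, "ptr"):
        # cupy.ndarray: data is a MemoryPointer exposing the raw address.
        return buf.data.ptr, buf.nbytes
    raise TypeError(f"unsupported buffer type: {type(buf).__name__}")
```

The checks are duck-typed, so the sketch runs without importing either library; real code would more likely test the type explicitly.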
Changho Hwang
6a19b19ece Fix NVLS support (#258)
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Saeed Maleki
91d592dcc0 NVLS support. (#250)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
Binyang Li
422c81f0f8 remove make pylib-copy command (#249)
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Binyang Li
163cba08c8 Update interface to let user change fifo size (#243)
Related to issue https://github.com/microsoft/mscclpp/issues/242.
A user may use more threads than the number specified in `fifo_size` to
interact with the FIFO, which leads to unexpected behavior.
Update the interface to let users change the FIFO size as needed.
2024-01-09 22:14:36 -08:00
Changho Hwang
544ff0c21d ROCm support (#213)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-11-24 16:41:56 +08:00
Changho Hwang
dab19e00c1 Templatize Dockerfiles & update workflows (#223)
Images are now built by a script from a shared Dockerfile template.

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 13:29:12 -08:00
Changho Hwang
15f6dcca49 Update documentation (#217)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 12:58:04 -08:00
Changho Hwang
7bd66a938c Robust correctness test (#221)
Co-authored-by: Aashaka Shah <aashaka96@gmail.com>
2023-11-22 12:06:50 +08:00
Saeed Maleki
70eb6d7328 Fixing the bug in allreduce1 (#220) 2023-11-18 10:34:52 -08:00
Saeed Maleki
1d1199703a Auto-tune single-node AllReduce (#219)
single node auto-tuner + graph plotter + bug fix for illegal memory access

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-11-17 21:42:05 +08:00
Changho Hwang
060fda12e6 mscclpp-test in Python (#204)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Changho Hwang
4cdb100265 Release GIL for Python APIs with wait (#190) 2023-11-14 21:11:01 +08:00
Changho Hwang
3521fb0280 Clear minor warnings (#214)
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Saeed Maleki
85e8017535 Atomic for semaphores instead of fences (#188)
Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-13 18:57:08 +08:00
Saeed Maleki
148681b4bc Fix a pytest bug (#196) 2023-10-13 16:39:43 +08:00
Changho Hwang
8c0f9e84d0 v0.3.0 (#171) 2023-10-11 22:35:54 +08:00
Changho Hwang
11ac824cc7 Align interfaces of put/get/putPackets/getPackets (#185) 2023-10-07 22:18:26 +08:00
Changho Hwang
6c0ee72916 Construct ProxyChannel with shared pointers (#184) 2023-09-18 05:46:23 +00:00
Changho Hwang
3aa72098d9 Add poll() for semaphores (#181) 2023-09-15 07:40:44 +00:00
Binyang2014
097aa8843a Fix pytest unstable issue. (#170)
- Remove `#include <cstdint>` from `poll.hpp` so that it contains only
device-side code.
- Fix a compilation issue that caused pytest to fail randomly. Reuse
the compiled result for the same kernel with different arguments.
2023-09-06 17:09:04 -07:00
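Reusing a compiled result for the same kernel, as #170 mentions, amounts to caching on the kernel source rather than on the launch arguments. The sketch below is only an illustration of that idea; `KernelCache` and the `compile_fn` callback are hypothetical names, and the actual mechanism used in the test suite is not shown in the commit message.

```python
import hashlib


class KernelCache:
    """Cache compiled kernels by a hash of their source and compile options."""

    def __init__(self, compile_fn):
        # compile_fn(source, options) -> compiled object; supplied by the
        # caller (hypothetical, stands in for the real compiler invocation).
        self._compile_fn = compile_fn
        self._cache = {}

    def get(self, source, options=()):
        options = tuple(options)
        # Launch arguments are deliberately not part of the key.
        key = hashlib.sha256(repr((source, options)).encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._compile_fn(source, options)
        return self._cache[key]
```

Two tests that launch the same kernel with different arguments would then call `get()` with identical source and options and trigger only one compilation.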
Olli Saarikivi
828be48b21 Add Context and Endpoint classes to enable non-Communicator use-cases (#166)
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.

The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because with the way `Context` is
implemented a `Connection` is now fully initialized on construction.

Some smaller breaking API changes from this change are that
`RegisteredMemory` no longer has a `rank()` function (as there may be no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.

A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.

Miscellaneous changes:
- Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
- Enables ctest to be run from the build root directory.
2023-09-06 13:10:04 +08:00
Binyang2014
858e381829 Pytest (#162)
Port the Python tests to mscclpp.
To start pytest, run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x`.

---------

Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>
2023-09-01 21:22:11 +08:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Olli Saarikivi
4865b2017b Add Python get_include() (#141)
Introduces `mscclpp.get_include()` in the Python module.
The extension module is now named _mscclpp so that we can have
Python code in the mscclpp module.
Also does some miscellaneous cleanup.
2023-07-25 10:23:16 -07:00
Binyang2014
9a488f0da2 update python binding (#136)
Update the Python bindings for `device_handle`.
2023-07-24 03:00:33 +00:00
Saeed Maleki
e7d5e652df Python bindings (#125)
Co-authored-by: Olli Saarikivi <olsaarik@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-07-19 15:35:54 +08:00
Madan Musuvathi
c042d9af54 Merge branch 'cpp-api' into saemal/api-extension 2023-04-13 22:32:38 +00:00
Crutcher Dunnavant
272097fb9b [python] switch to python setup.py build and wheels 2023-04-12 12:40:25 -07:00
Crutcher Dunnavant
d9077e5795 [python] switch to setup.py to build package 2023-04-12 12:29:17 -07:00
Felipe Petroski Such
516349c282 fix write_all 2023-04-12 12:29:17 -07:00
Crutcher Dunnavant
b93cfa3ca4 registeredmemory wip 2023-04-12 12:29:17 -07:00
Crutcher Dunnavant
19962a8002 format 2023-04-12 12:29:17 -07:00
Crutcher Dunnavant
b25bcf5f93 register buffers 2023-04-12 12:29:17 -07:00
Crutcher Dunnavant
00d382dbf7 format 2023-04-07 19:12:05 -07:00
Crutcher Dunnavant
34464b40bb register buffers 2023-04-07 19:11:50 -07:00
Crutcher Dunnavant
44a8a539ad types 2023-04-07 12:08:32 -07:00
Crutcher Dunnavant
d014693288 cleanup tests 2023-04-07 11:37:24 -07:00
Crutcher Dunnavant
68eff98bbc update ci.sh 2023-04-07 11:27:45 -07:00
Crutcher Dunnavant
e65def8657 bug 2023-04-07 11:27:45 -07:00
Crutcher Dunnavant
7753c38eb1 working on connect 2023-04-07 11:27:45 -07:00