Caio Rocha
08a0cec2eb
Fixing RegisterMemory Allocation for ProxyChannels ( #353 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-09-24 23:01:41 -07:00
Binyang Li
b30bb260e3
Tune threads per block for mscclpp executor ( #345 )
2024-09-18 17:21:47 -07:00
Binyang Li
0c7311e83f
Add CI for rocm ( #346 )
2024-09-15 22:30:54 +00:00
Roshan Dathathri
7ed13ec4b5
Auto-tune vector sizes for NVLS allreduce6 ( #338 )
...
Also fixes bugs in MscclppAllReduce6
Below is the performance when the algorithm is fixed to
MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.
Float16:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 2.0 KiB | 11.15 | 0.18 | PASS | 13.82 | 0.15 | PASS | 1.24 |
| 4.0 KiB | 11.15 | 0.37 | PASS | 14.74 | 0.28 | PASS | 1.32 |
| 8.0 KiB | 11.14 | 0.74 | PASS | 15.17 | 0.54 | PASS | 1.36 |
| 16.0 KiB | 11.16 | 1.47 | PASS | 15.77 | 1.04 | PASS | 1.41 |
| 32.0 KiB | 11.15 | 2.94 | PASS | 17.50 | 1.87 | PASS | 1.57 |
| 64.0 KiB | 11.18 | 5.86 | PASS | 17.64 | 3.71 | PASS | 1.58 |
| 128.0 KiB | 11.16 | 11.74 | PASS | 17.83 | 7.35 | PASS | 1.60 |
| 256.0 KiB | 11.21 | 23.38 | PASS | 18.00 | 14.57 | PASS | 1.60 |
| 512.0 KiB | 11.70 | 44.81 | PASS | 18.42 | 28.46 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.87 | PASS | 20.23 | 51.83 | PASS | 1.48 |
| 2.0 MiB | 17.29 | 121.27 | PASS | 31.60 | 66.36 | PASS | 1.83 |
| 4.0 MiB | 25.26 | 166.02 | PASS | 38.74 | 108.26 | PASS | 1.53 |
| 8.0 MiB | 40.17 | 208.83 | PASS | 62.86 | 133.45 | PASS | 1.56 |
| 16.0 MiB | 70.92 | 236.56 | PASS | 113.36 | 147.99 | PASS | 1.60 |
| 32.0 MiB | 131.38 | 255.41 | PASS | 203.21 | 165.13 | PASS | 1.55 |
| 64.0 MiB | 253.39 | 264.84 | PASS | 342.12 | 196.15 | PASS | 1.35 |
| 128.0 MiB | 496.74 | 270.20 | PASS | 670.62 | 200.14 | PASS | 1.35 |
| 256.0 MiB | 982.42 | 273.24 | PASS | 1318.36 | 203.61 | PASS | 1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
Float32:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 4.0 KiB | 11.04 | 0.37 | PASS | 14.79 | 0.28 | PASS | 1.34 |
| 8.0 KiB | 11.15 | 0.73 | PASS | 15.25 | 0.54 | PASS | 1.37 |
| 16.0 KiB | 11.12 | 1.47 | PASS | 15.87 | 1.03 | PASS | 1.43 |
| 32.0 KiB | 11.13 | 2.95 | PASS | 17.21 | 1.90 | PASS | 1.55 |
| 64.0 KiB | 11.11 | 5.90 | PASS | 17.37 | 3.77 | PASS | 1.56 |
| 128.0 KiB | 11.08 | 11.83 | PASS | 17.54 | 7.47 | PASS | 1.58 |
| 256.0 KiB | 11.15 | 23.50 | PASS | 17.71 | 14.80 | PASS | 1.59 |
| 512.0 KiB | 11.56 | 45.34 | PASS | 18.21 | 28.79 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.90 | PASS | 19.87 | 52.77 | PASS | 1.46 |
| 2.0 MiB | 17.24 | 121.67 | PASS | 31.63 | 66.30 | PASS | 1.84 |
| 4.0 MiB | 25.19 | 166.47 | PASS | 38.63 | 108.57 | PASS | 1.53 |
| 8.0 MiB | 40.38 | 207.72 | PASS | 62.65 | 133.89 | PASS | 1.55 |
| 16.0 MiB | 70.72 | 237.23 | PASS | 114.57 | 146.44 | PASS | 1.62 |
| 32.0 MiB | 131.49 | 255.18 | PASS | 200.79 | 167.11 | PASS | 1.53 |
| 64.0 MiB | 253.98 | 264.23 | PASS | 342.58 | 195.89 | PASS | 1.35 |
| 128.0 MiB | 496.96 | 270.08 | PASS | 670.64 | 200.13 | PASS | 1.35 |
| 256.0 MiB | 982.83 | 273.12 | PASS | 1318.90 | 203.53 | PASS | 1.34 |
| 512.0 MiB | 1954.07 | 274.75 | PASS | 2609.04 | 205.77 | PASS | 1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
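The derived columns in the tables above can be cross-checked from the raw timings. The commit message does not state the formulas, so the sketch below assumes AlgBW is message size divided by time (with 1 GB = 1e9 bytes) and Speed Up is NCCL time divided by MSCCL++ time; both assumptions agree with the 256.0 MiB fp16 row. The function names are illustrative and are not part of mscclpp.

```python
MIB = 1024 ** 2  # bytes per MiB


def alg_bw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s, assuming 1 GB = 1e9 bytes."""
    return size_bytes / (time_us * 1e-6) / 1e9


def speed_up(nccl_time_us: float, time_us: float) -> float:
    """Ratio of NCCL time to MSCCL++ time (assumed definition)."""
    return nccl_time_us / time_us


# Values copied from the 256.0 MiB row of the fp16 table.
size = 256 * MIB
mscclpp_bw = alg_bw_gbps(size, 982.42)
nccl_bw = alg_bw_gbps(size, 1318.36)
ratio = speed_up(1318.36, 982.42)
print(f"AlgBW {mscclpp_bw:.2f} GB/s, NCCL AlgBW {nccl_bw:.2f} GB/s, speed-up {ratio:.2f}")
```

Rounded to two decimals, the three values should match the AlgBW, NCCL AlgBW, and Speed Up columns of that row.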
2024-08-16 11:11:54 +08:00
Changho Hwang
8c6fb429e9
bfloat16 support ( #336 )
...
* Add bfloat16 support for executor and NCCL interface
* Changed `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00
Ziyue Yang
faadc75649
Fix missing import in executor test ( #334 )
2024-08-06 14:24:50 -07:00
caiomcbr
67eb9b04cc
NCCL API Executor Integration ( #331 )
...
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-07-25 15:05:02 -07:00
Changho Hwang
c4ca2fbc8c
Resolve clang++ warnings ( #325 )
2024-07-11 07:48:35 +00:00
Angelica Moreira
0f796bbdf7
Update allreduce_bench.py ( #318 )
...
Replacing hardcoded network interface name for generic discovery
strategy.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-06-29 03:41:13 +00:00
Roshan Dathathri
91550dab4c
Simplify/improve barrier in AllReduce6 ( #317 )
...
Drop superfluous __threadfence_system()
2024-06-23 21:08:59 +00:00
Roshan Dathathri
93ed8e1e58
Add support for multicast reduce instruction ( #316 )
2024-06-19 13:28:12 -07:00
Ziyue Yang
76328fe623
Add NPKit GPU event support ( #310 )
2024-06-13 13:59:50 +08:00
Changho Hwang
1f62dfd7cd
Add C++ executor test ( #304 )
...
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-05-29 10:54:36 +00:00
Changho Hwang
d35a2f2dc2
Rename executor.cpp to executor_py.cpp ( #301 )
2024-05-17 13:31:27 -07:00
aashaka
0650371b54
Allow obtaining cuda stream handle from PyTorch stream when launching kernel ( #297 )
...
Use `cuda_stream` attribute of a torch stream if the stream is not an
instance of the cupy stream.
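A minimal sketch of the dispatch described above, not the actual mscclpp code. It assumes the raw handle is read from `ptr` on a CuPy stream and from `cuda_stream` on a PyTorch stream; the two dataclasses are hypothetical stand-ins so the example runs without a GPU, and `raw_stream_handle` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class FakeCupyStream:
    """Hypothetical stand-in for cupy.cuda.Stream (exposes `ptr`)."""
    ptr: int


@dataclass
class FakeTorchStream:
    """Hypothetical stand-in for torch.cuda.Stream (exposes `cuda_stream`)."""
    cuda_stream: int


def raw_stream_handle(stream, cupy_stream_type=FakeCupyStream) -> int:
    """Return the raw CUDA stream handle for either kind of stream."""
    if isinstance(stream, cupy_stream_type):
        return stream.ptr
    # Not a CuPy stream: fall back to the PyTorch-style attribute.
    return stream.cuda_stream


cupy_handle = raw_stream_handle(FakeCupyStream(ptr=0x1000))
torch_handle = raw_stream_handle(FakeTorchStream(cuda_stream=0x2000))
```

With real libraries, `cupy_stream_type` would be `cupy.cuda.Stream`; the fallback branch relies only on the `cuda_stream` attribute, so it does not need to import torch.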
2024-05-04 04:57:07 +00:00
Changho Hwang
6c1fa5307c
Refactoring NVLS interfaces ( #293 )
...
Move NVLS details from the core to a separate interface
2024-04-24 10:05:41 -07:00
Roshan Dathathri
41e0964d93
Allow binding allocated memory to NVLS multicast pointer ( #290 )
...
And change NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab
Add executor to execute schedule-plan file ( #283 )
...
Add executor to execute the JSON schedule file generated by msccl-tools
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711
Fix a typo name ( #286 )
2024-04-17 23:45:46 +00:00
Changho Hwang
1a7cb98e3a
v0.4.3 ( #279 )
2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7
Fix bootstrapping mechanism ( #278 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com >
2024-03-27 10:24:24 +08:00
Saeed Maleki
a3d0799963
Fix the comm.py for nvls ( #267 )
...
Fix the comm.py for nvls
2024-02-19 10:39:21 +08:00
Binyang Li
5971508eed
Remove cuda-python from project ( #245 )
...
Remove cuda-python and use CuPy APIs instead
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-13 21:44:11 +08:00
aashaka
d97fef4395
Allow semaphores and memory to be registered separately in ProxyService ( #264 )
...
This is needed in use cases where SimpleProxyChannel does not suffice.
For example, when a single semaphore is to be used for multiple tensors
or when multiple semaphores should be associated with a tensor.
2024-02-08 09:55:29 -08:00
Binyang Li
7c229fbdd8
Fix multi-nodes test failure ( #262 )
...
fix multi-nodes CI pipeline
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-07 18:21:05 -08:00
aashaka
2101f5251e
Allow MSCCL++ CommGroup to take PyTorch tensors in args ( #255 )
...
Obtain data_ptr and tensor_size accordingly for torch.Tensor
Co-authored-by: Binyang Li <binyli@microsoft.com >
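One way to picture the pointer and size lookup for both buffer types is the sketch below. It is an assumption-laden illustration rather than the CommGroup implementation: it treats the size as a byte count, uses `data_ptr()`, `numel()`, and `element_size()` for torch-like tensors, and `data.ptr` and `nbytes` for CuPy-like arrays. `FakeTensor` and the `SimpleNamespace` array are hypothetical stand-ins so the example runs without either library.

```python
from types import SimpleNamespace


class FakeTensor:
    """Hypothetical stand-in exposing the torch.Tensor methods used here."""

    def __init__(self, ptr: int, numel: int, element_size: int):
        self._ptr, self._numel, self._element_size = ptr, numel, element_size

    def data_ptr(self) -> int:
        return self._ptr

    def numel(self) -> int:
        return self._numel

    def element_size(self) -> int:
        return self._element_size


def buffer_ptr_and_nbytes(buf):
    """Return (device pointer, size in bytes) for a torch-like or CuPy-like buffer."""
    if hasattr(buf, "data_ptr"):
        # torch.Tensor-like: size is element count times bytes per element.
        return buf.data_ptr(), buf.numel() * buf.element_size()
    # cupy.ndarray-like: pointer lives on `data`, size on `nbytes`.
    return buf.data.ptr, buf.nbytes


tensor_info = buffer_ptr_and_nbytes(FakeTensor(ptr=0x10, numel=1024, element_size=2))
array_info = buffer_ptr_and_nbytes(SimpleNamespace(data=SimpleNamespace(ptr=0x20), nbytes=4096))
```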
2024-02-06 19:47:25 -08:00
Changho Hwang
6a19b19ece
Fix NVLS support ( #258 )
...
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Saeed Maleki
91d592dcc0
NVLS support. ( #250 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-04 20:46:10 -08:00
Binyang Li
422c81f0f8
remove make pylib-copy command ( #249 )
...
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Binyang Li
163cba08c8
Update interface to let user change fifo size ( #243 )
...
Related to this issue:
https://github.com/microsoft/mscclpp/issues/242. A user may use more
threads than the number specified in `fifo_size` to interact with the
FIFO, which leads to unexpected behavior.
Update the interface to let users change the FIFO size as needed.
2024-01-09 22:14:36 -08:00
Changho Hwang
544ff0c21d
ROCm support ( #213 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-11-24 16:41:56 +08:00
Changho Hwang
dab19e00c1
Templatize Dockerfiles & update workflows ( #223 )
...
Now build images by a script with a shared Dockerfile template
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-11-22 13:29:12 -08:00
Changho Hwang
15f6dcca49
Update documentation ( #217 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-11-22 12:58:04 -08:00
Changho Hwang
7bd66a938c
Robust correctness test ( #221 )
...
Co-authored-by: Aashaka Shah <aashaka96@gmail.com >
2023-11-22 12:06:50 +08:00
Saeed Maleki
70eb6d7328
Fixing the bug in allreduce1 ( #220 )
2023-11-18 10:34:52 -08:00
Saeed Maleki
1d1199703a
Auto-tune single-node AllReduce ( #219 )
...
single node auto-tuner + graph plotter + bug fix for illegal memory access
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-11-17 21:42:05 +08:00
Changho Hwang
060fda12e6
mscclpp-test in Python ( #204 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Esha Choukse <eschouks@microsoft.com >
2023-11-16 12:45:25 +08:00
Changho Hwang
4cdb100265
Release GIL for Python APIs with wait ( #190 )
2023-11-14 21:11:01 +08:00
Changho Hwang
3521fb0280
Clear minor warnings ( #214 )
...
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Saeed Maleki
85e8017535
Atomic for semaphores instead of fences ( #188 )
...
Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com >
Co-authored-by: Esha Choukse <eschouks@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-10-13 18:57:08 +08:00
Saeed Maleki
148681b4bc
Fix a pytest bug ( #196 )
2023-10-13 16:39:43 +08:00
Changho Hwang
8c0f9e84d0
v0.3.0 ( #171 )
2023-10-11 22:35:54 +08:00
Changho Hwang
11ac824cc7
Align interfaces of put/get/putPackets/getPackets ( #185 )
2023-10-07 22:18:26 +08:00
Changho Hwang
6c0ee72916
Construct ProxyChannel with shared pointers ( #184 )
2023-09-18 05:46:23 +00:00
Changho Hwang
3aa72098d9
Add poll() for semaphores ( #181 )
2023-09-15 07:40:44 +00:00
Binyang2014
097aa8843a
Fix pytest unstable issue. ( #170 )
...
- Remove `#include <cstdint>` from `poll.hpp` so that it contains only
device-side code
- Fix a compilation issue that caused pytest to fail randomly. Reuse
the compiled result for the same kernel with different arguments
2023-09-06 17:09:04 -07:00
Olli Saarikivi
828be48b21
Add Context and Endpoint classes to enable non-Communicator use-cases ( #166 )
...
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.
The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because with the way `Context` is
implemented a `Connection` is now fully initialized on construction.
Some smaller breaking API changes from this change are that
`RegisteredMemory` no longer has a `rank()` function (as there may be no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.
A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.
Miscellaneous changes:
- Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
- Enables ctest to be run from the build root directory.
2023-09-06 13:10:04 +08:00
Binyang2014
858e381829
Pytest ( #162 )
...
Port python tests to mscclpp.
Please run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest
---------
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com >
2023-09-01 21:22:11 +08:00
Saeed Maleki
8d1b984bed
Change device handle interfaces & others ( #142 )
...
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net >
2023-08-16 20:00:56 +08:00
Olli Saarikivi
4865b2017b
Add Python get_include() ( #141 )
...
Introduces a mscclpp.get_include() in the Python module.
The extension module is now named _mscclpp so that we can have
Python code in the mscclpp module.
Also does some miscellaneous cleanup.
2023-07-25 10:23:16 -07:00