Commit Graph

87 Commits

Author SHA1 Message Date
Caio Rocha
08a0cec2eb Fixing RegisterMemory Allocation for ProxyChannels (#353)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-09-24 23:01:41 -07:00
Binyang Li
b30bb260e3 Tune threads per block for mscclpp executor (#345) 2024-09-18 17:21:47 -07:00
Binyang Li
0c7311e83f Add CI for rocm (#346) 2024-09-15 22:30:54 +00:00
Roshan Dathathri
7ed13ec4b5 Auto-tune vector sizes for NVLS allreduce6 (#338)
Also fixes bugs in MscclppAllReduce6

Below is the performance when the algorithm is fixed to
MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.

Float16:

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us)
| NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 2.0 KiB | 11.15 | 0.18 | PASS | 13.82 | 0.15 | PASS | 1.24 |
| 4.0 KiB | 11.15 | 0.37 | PASS | 14.74 | 0.28 | PASS | 1.32 |
| 8.0 KiB | 11.14 | 0.74 | PASS | 15.17 | 0.54 | PASS | 1.36 |
| 16.0 KiB | 11.16 | 1.47 | PASS | 15.77 | 1.04 | PASS | 1.41 |
| 32.0 KiB | 11.15 | 2.94 | PASS | 17.50 | 1.87 | PASS | 1.57 |
| 64.0 KiB | 11.18 | 5.86 | PASS | 17.64 | 3.71 | PASS | 1.58 |
| 128.0 KiB | 11.16 | 11.74 | PASS | 17.83 | 7.35 | PASS | 1.60 |
| 256.0 KiB | 11.21 | 23.38 | PASS | 18.00 | 14.57 | PASS | 1.60 |
| 512.0 KiB | 11.70 | 44.81 | PASS | 18.42 | 28.46 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.87 | PASS | 20.23 | 51.83 | PASS | 1.48 |
| 2.0 MiB | 17.29 | 121.27 | PASS | 31.60 | 66.36 | PASS | 1.83 |
| 4.0 MiB | 25.26 | 166.02 | PASS | 38.74 | 108.26 | PASS | 1.53 |
| 8.0 MiB | 40.17 | 208.83 | PASS | 62.86 | 133.45 | PASS | 1.56 |
| 16.0 MiB | 70.92 | 236.56 | PASS | 113.36 | 147.99 | PASS | 1.60 |
| 32.0 MiB | 131.38 | 255.41 | PASS | 203.21 | 165.13 | PASS | 1.55 |
| 64.0 MiB | 253.39 | 264.84 | PASS | 342.12 | 196.15 | PASS | 1.35 |
| 128.0 MiB | 496.74 | 270.20 | PASS | 670.62 | 200.14 | PASS | 1.35 |
| 256.0 MiB | 982.42 | 273.24 | PASS | 1318.36 | 203.61 | PASS | 1.34 |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
 
 Float32:

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us)
| NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| 4.0 KiB | 11.04 | 0.37 | PASS | 14.79 | 0.28 | PASS | 1.34 |
| 8.0 KiB | 11.15 | 0.73 | PASS | 15.25 | 0.54 | PASS | 1.37 |
| 16.0 KiB | 11.12 | 1.47 | PASS | 15.87 | 1.03 | PASS | 1.43 |
| 32.0 KiB | 11.13 | 2.95 | PASS | 17.21 | 1.90 | PASS | 1.55 |
| 64.0 KiB | 11.11 | 5.90 | PASS | 17.37 | 3.77 | PASS | 1.56 |
| 128.0 KiB | 11.08 | 11.83 | PASS | 17.54 | 7.47 | PASS | 1.58 |
| 256.0 KiB | 11.15 | 23.50 | PASS | 17.71 | 14.80 | PASS | 1.59 |
| 512.0 KiB | 11.56 | 45.34 | PASS | 18.21 | 28.79 | PASS | 1.57 |
| 1.0 MiB | 13.64 | 76.90 | PASS | 19.87 | 52.77 | PASS | 1.46 |
| 2.0 MiB | 17.24 | 121.67 | PASS | 31.63 | 66.30 | PASS | 1.84 |
| 4.0 MiB | 25.19 | 166.47 | PASS | 38.63 | 108.57 | PASS | 1.53 |
| 8.0 MiB | 40.38 | 207.72 | PASS | 62.65 | 133.89 | PASS | 1.55 |
| 16.0 MiB | 70.72 | 237.23 | PASS | 114.57 | 146.44 | PASS | 1.62 |
| 32.0 MiB | 131.49 | 255.18 | PASS | 200.79 | 167.11 | PASS | 1.53 |
| 64.0 MiB | 253.98 | 264.23 | PASS | 342.58 | 195.89 | PASS | 1.35 |
| 128.0 MiB | 496.96 | 270.08 | PASS | 670.64 | 200.13 | PASS | 1.35 |
| 256.0 MiB | 982.83 | 273.12 | PASS | 1318.90 | 203.53 | PASS | 1.34 |
| 512.0 MiB | 1954.07 | 274.75 | PASS | 2609.04 | 205.77 | PASS | 1.34 |

+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
2024-08-16 11:11:54 +08:00
Changho Hwang
8c6fb429e9 bfloat16 support (#336)
* Add bfloat16 support for executor and NCCL interface
* Changed `gpu_data_types.hpp` into an internal header file
2024-08-12 15:41:58 -07:00
Ziyue Yang
faadc75649 Fix missing import in executor test (#334) 2024-08-06 14:24:50 -07:00
caiomcbr
67eb9b04cc NCCL API Executor Integration (#331)
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-25 15:05:02 -07:00
Changho Hwang
c4ca2fbc8c Resolve clang++ warnings (#325) 2024-07-11 07:48:35 +00:00
Angelica Moreira
0f796bbdf7 Update allreduce_bench.py (#318)
Replacing hardcoded network interface name for generic discovery
strategy.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-06-29 03:41:13 +00:00
Roshan Dathathri
91550dab4c Simplify/improve barrier in AllReduce6 (#317)
Drop superfluous __threadfence_system()
2024-06-23 21:08:59 +00:00
Roshan Dathathri
93ed8e1e58 Add support for multicast reduce insruction (#316) 2024-06-19 13:28:12 -07:00
Ziyue Yang
76328fe623 Add NPKit GPU event support (#310) 2024-06-13 13:59:50 +08:00
Changho Hwang
1f62dfd7cd Add C++ executor test (#304)
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-05-29 10:54:36 +00:00
Changho Hwang
d35a2f2dc2 Rename executor.cpp to executor_py.cpp (#301) 2024-05-17 13:31:27 -07:00
aashaka
0650371b54 Allow obtaining cuda stream handle from PyTorch stream when launching kernel (#297)
Use `cuda_stream` attribute of a torch stream if the stream is not an
instance of the cupy stream.
2024-05-04 04:57:07 +00:00
Changho Hwang
6c1fa5307c Refactoring NVLS interfaces (#293)
Move NVLS details from the core to a separate interface
2024-04-24 10:05:41 -07:00
Roshan Dathathri
41e0964d93 Allow binding allocated memory to NVLS multicast pointer (#290)
And change NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab Add executor to execute schedule-plan file (#283)
Add executor to execute the JSON schedule file generated by msccl-tools

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711 Fix a typo name (#286) 2024-04-17 23:45:46 +00:00
Changho Hwang
1a7cb98e3a v0.4.3 (#279) 2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7 Fix bootstrapping mechanism (#278)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Saeed Maleki
a3d0799963 Fix the comm.py for nvls (#267)
Fix the comm.py for nvls
2024-02-19 10:39:21 +08:00
Binyang Li
5971508eed Remove cuda-python from project (#245)
Remove cuda-python and use CuPy APIs instead

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-13 21:44:11 +08:00
aashaka
d97fef4395 Allow semaphores and memory to be registered separately in ProxyService (#264)
This is needed in use cases where SimpleProxyChannel does not suffice.
For example, when a single semaphore is to be used for multiple tensors
or when multiple semaphores should be associated with a tensor.
2024-02-08 09:55:29 -08:00
Binyang Li
7c229fbdd8 Fix multi-nodes test failure (#262)
fix multi-nodes CI pipeline

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-07 18:21:05 -08:00
aashaka
2101f5251e Allow MSCCL++ CommGroup to take PyTorch tensors in args (#255)
Obtain data_ptr and tensor_size accordingly for torch.Tensor

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-06 19:47:25 -08:00
Changho Hwang
6a19b19ece Fix NVLS support (#258)
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Saeed Maleki
91d592dcc0 NVLS support. (#250)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
Binyang Li
422c81f0f8 remove make pylib-copy command (#249)
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Binyang Li
163cba08c8 Update interface to let user change fifo size (#243)
Related with this issue:
https://github.com/microsoft/mscclpp/issues/242. The user may use more
threads than the number specified in `fifo_size` to interact with the
FIFO. In this case, there will be unexpected behavior.
Update the interface to let user change fifo size on their demands.
2024-01-09 22:14:36 -08:00
Changho Hwang
544ff0c21d ROCm support (#213)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-11-24 16:41:56 +08:00
Changho Hwang
dab19e00c1 Templatize Dockerfiles & update workflows (#223)
Now build images by a script with a shared Dockerfile template

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 13:29:12 -08:00
Changho Hwang
15f6dcca49 Update documentation (#217)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 12:58:04 -08:00
Changho Hwang
7bd66a938c Robust correctness test (#221)
Co-authored-by: Aashaka Shah <aashaka96@gmail.com>
2023-11-22 12:06:50 +08:00
Saeed Maleki
70eb6d7328 Fixing the bug in allreduce1 (#220) 2023-11-18 10:34:52 -08:00
Saeed Maleki
1d1199703a Auto-tune single-node AllReduce (#219)
single node auto-tuner + graph plotter + bug fix for illegal memory access

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-11-17 21:42:05 +08:00
Changho Hwang
060fda12e6 mscclpp-test in Python (#204)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Changho Hwang
4cdb100265 Release GIL for Python APIs with wait (#190) 2023-11-14 21:11:01 +08:00
Changho Hwang
3521fb0280 Clear minor warnings (#214)
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Saeed Maleki
85e8017535 Atomic for semaphores instead of fences (#188)
Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-13 18:57:08 +08:00
Saeed Maleki
148681b4bc Fix a pytest bug (#196) 2023-10-13 16:39:43 +08:00
Changho Hwang
8c0f9e84d0 v0.3.0 (#171) 2023-10-11 22:35:54 +08:00
Changho Hwang
11ac824cc7 Align interfaces of put/get/putPackets/getPackets (#185) 2023-10-07 22:18:26 +08:00
Changho Hwang
6c0ee72916 Construct ProxyChannel with shared pointers (#184) 2023-09-18 05:46:23 +00:00
Changho Hwang
3aa72098d9 Add poll() for semaphores (#181) 2023-09-15 07:40:44 +00:00
Binyang2014
097aa8843a Fix pytest unstable issue. (#170)
- remove `#include <cstdint>` from `poll.hpp`. To make it only contains
device-side code
- Fix compilation issue, which will cause pytest fail randomly. Reuse
the compiled result for same kernel with different arguments
2023-09-06 17:09:04 -07:00
Olli Saarikivi
828be48b21 Add Context and Endpoint classes to enable non-Communicator use-cases (#166)
This PR implements and closes #137. The new `Endpoint` and `Context`
classes expose the connection establishing functionality from
`Communicator`, which now is only responsible for tying together the
bootstrapper with a context.

The largest breaking change here is that
`Communicator.connectOnSetup(...)` now returns the `Connection` wrapped
inside a `NonblockingFuture`. This is because with the way `Context` is
implemented a `Connection` is now fully initialized on construction.

Some smaller breaking API changes from this change are that
`RegisteredMemory` no longer has a `rank()` function (as there maybe no
concept of rank), and similarly `Connection` has no `remoteRank()` and
`tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf`
functions in `Communicator`.

A new `EndpointConfig` class is introduced to avoid duplication of the
IB configuration parameters in the APIs of `Context` and `Communicator`.
The usual usage pattern of just passing in a `Transport` still works due
to an implicit conversion into `EndpointConfig`.

Miscellaneous changes:
-Cleans up how the PIMPL pattern is applied by making both the `Impl`
struct and the `pimpl_` pointers private for all relevant classes in the
core API.
-Enables ctest to be run from the build root directory.
2023-09-06 13:10:04 +08:00
Binyang2014
858e381829 Pytest (#162)
Port python tests to mscclpp.
Please run
`mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest

---------

Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>
2023-09-01 21:22:11 +08:00
Saeed Maleki
8d1b984bed Change device handle interfaces & others (#142)
* Changed device handle interfaces
* Changed proxy service interfaces
* Move device code into separate files
* Fixed FIFO polling issues
* Add configuration arguments in several interface functions

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>
2023-08-16 20:00:56 +08:00
Olli Saarikivi
4865b2017b Add Python get_include() (#141)
Introduces a mscclpp.get_include() in the Python module.
The extension module is now named _mscclpp so that we can have
Python code in the mscclpp module.
Also does some miscellaneous cleanup.
2023-07-25 10:23:16 -07:00