mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-07-01 03:47:27 +00:00

Author	SHA1	Message	Date
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	0c7311e83f	Add CI for rocm (#346 )	2024-09-15 22:30:54 +00:00
Roshan Dathathri	7ed13ec4b5	Auto-tune vector sizes for NVLS allreduce6 (#338 ) Also fixes bugs in MscclppAllReduce6. Below is the performance when the algorithm is fixed to MscclppAllReduce6 on 8 H100 GPUs connected with NVLink using CUDA 12.2.	2024-08-16 11:11:54 +08:00
Float16:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp16) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
|     2.0 KiB |     11.15 |         0.18 |        PASS |          13.82 |              0.15 |             PASS |     1.24 |
|     4.0 KiB |     11.15 |         0.37 |        PASS |          14.74 |              0.28 |             PASS |     1.32 |
|     8.0 KiB |     11.14 |         0.74 |        PASS |          15.17 |              0.54 |             PASS |     1.36 |
|    16.0 KiB |     11.16 |         1.47 |        PASS |          15.77 |              1.04 |             PASS |     1.41 |
|    32.0 KiB |     11.15 |         2.94 |        PASS |          17.50 |              1.87 |             PASS |     1.57 |
|    64.0 KiB |     11.18 |         5.86 |        PASS |          17.64 |              3.71 |             PASS |     1.58 |
|   128.0 KiB |     11.16 |        11.74 |        PASS |          17.83 |              7.35 |             PASS |     1.60 |
|   256.0 KiB |     11.21 |        23.38 |        PASS |          18.00 |             14.57 |             PASS |     1.60 |
|   512.0 KiB |     11.70 |        44.81 |        PASS |          18.42 |             28.46 |             PASS |     1.57 |
|     1.0 MiB |     13.64 |        76.87 |        PASS |          20.23 |             51.83 |             PASS |     1.48 |
|     2.0 MiB |     17.29 |       121.27 |        PASS |          31.60 |             66.36 |             PASS |     1.83 |
|     4.0 MiB |     25.26 |       166.02 |        PASS |          38.74 |            108.26 |             PASS |     1.53 |
|     8.0 MiB |     40.17 |       208.83 |        PASS |          62.86 |            133.45 |             PASS |     1.56 |
|    16.0 MiB |     70.92 |       236.56 |        PASS |         113.36 |            147.99 |             PASS |     1.60 |
|    32.0 MiB |    131.38 |       255.41 |        PASS |         203.21 |            165.13 |             PASS |     1.55 |
|    64.0 MiB |    253.39 |       264.84 |        PASS |         342.12 |            196.15 |             PASS |     1.35 |
|   128.0 MiB |    496.74 |       270.20 |        PASS |         670.62 |            200.14 |             PASS |     1.35 |
|   256.0 MiB |    982.42 |       273.24 |        PASS |        1318.36 |            203.61 |             PASS |     1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
Float32:
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
| Size (fp32) | Time (us) | AlgBW (GB/s) | Correctness | NCCL Time (us) | NCCL AlgBW (GB/s) | NCCL Correctness | Speed Up |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
|     4.0 KiB |     11.04 |         0.37 |        PASS |          14.79 |              0.28 |             PASS |     1.34 |
|     8.0 KiB |     11.15 |         0.73 |        PASS |          15.25 |              0.54 |             PASS |     1.37 |
|    16.0 KiB |     11.12 |         1.47 |        PASS |          15.87 |              1.03 |             PASS |     1.43 |
|    32.0 KiB |     11.13 |         2.95 |        PASS |          17.21 |              1.90 |             PASS |     1.55 |
|    64.0 KiB |     11.11 |         5.90 |        PASS |          17.37 |              3.77 |             PASS |     1.56 |
|   128.0 KiB |     11.08 |        11.83 |        PASS |          17.54 |              7.47 |             PASS |     1.58 |
|   256.0 KiB |     11.15 |        23.50 |        PASS |          17.71 |             14.80 |             PASS |     1.59 |
|   512.0 KiB |     11.56 |        45.34 |        PASS |          18.21 |             28.79 |             PASS |     1.57 |
|     1.0 MiB |     13.64 |        76.90 |        PASS |          19.87 |             52.77 |             PASS |     1.46 |
|     2.0 MiB |     17.24 |       121.67 |        PASS |          31.63 |             66.30 |             PASS |     1.84 |
|     4.0 MiB |     25.19 |       166.47 |        PASS |          38.63 |            108.57 |             PASS |     1.53 |
|     8.0 MiB |     40.38 |       207.72 |        PASS |          62.65 |            133.89 |             PASS |     1.55 |
|    16.0 MiB |     70.72 |       237.23 |        PASS |         114.57 |            146.44 |             PASS |     1.62 |
|    32.0 MiB |    131.49 |       255.18 |        PASS |         200.79 |            167.11 |             PASS |     1.53 |
|    64.0 MiB |    253.98 |       264.23 |        PASS |         342.58 |            195.89 |             PASS |     1.35 |
|   128.0 MiB |    496.96 |       270.08 |        PASS |         670.64 |            200.13 |             PASS |     1.35 |
|   256.0 MiB |    982.83 |       273.12 |        PASS |        1318.90 |            203.53 |             PASS |     1.34 |
|   512.0 MiB |   1954.07 |       274.75 |        PASS |        2609.04 |            205.77 |             PASS |     1.34 |
+-------------+-----------+--------------+-------------+----------------+-------------------+------------------+----------+
Changho Hwang	8c6fb429e9	bfloat16 support (#336 ) * Add bfloat16 support for executor and NCCL interface * Changed `gpu_data_types.hpp` into an internal header file	2024-08-12 15:41:58 -07:00
Ziyue Yang	faadc75649	Fix missing import in executor test (#334 )	2024-08-06 14:24:50 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
Changho Hwang	c4ca2fbc8c	Resolve clang++ warnings (#325 )	2024-07-11 07:48:35 +00:00
Angelica Moreira	0f796bbdf7	Update allreduce_bench.py (#318 ) Replacing hardcoded network interface name for generic discovery strategy. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-06-29 03:41:13 +00:00
Roshan Dathathri	91550dab4c	Simplify/improve barrier in AllReduce6 (#317 ) Drop superfluous __threadfence_system()	2024-06-23 21:08:59 +00:00
Roshan Dathathri	93ed8e1e58	Add support for multicast reduce instruction (#316 )	2024-06-19 13:28:12 -07:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Changho Hwang	d35a2f2dc2	Rename executor.cpp to executor_py.cpp (#301 )	2024-05-17 13:31:27 -07:00
aashaka	0650371b54	Allow obtaining cuda stream handle from PyTorch stream when launching kernel (#297 ) Use `cuda_stream` attribute of a torch stream if the stream is not an instance of the cupy stream.	2024-05-04 04:57:07 +00:00
Changho Hwang	6c1fa5307c	Refactoring NVLS interfaces (#293 ) Move NVLS details from the core to a separate interface	2024-04-24 10:05:41 -07:00
Roshan Dathathri	41e0964d93	Allow binding allocated memory to NVLS multicast pointer (#290 ) And change NVLS multimem instructions to static functions	2024-04-18 17:11:31 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	9406123711	Fix a typo name (#286 )	2024-04-17 23:45:46 +00:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 )	2024-03-27 11:53:09 -07:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Saeed Maleki	a3d0799963	Fix the comm.py for nvls (#267 )	2024-02-19 10:39:21 +08:00
Binyang Li	5971508eed	Remove cuda-python from project (#245 ) Remove cuda-python and use CuPy APIs instead --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-13 21:44:11 +08:00
aashaka	d97fef4395	Allow semaphores and memory to be registered separately in ProxyService (#264 ) This is needed in use cases where SimpleProxyChannel does not suffice. For example, when a single semaphore is to be used for multiple tensors or when multiple semaphores should be associated with a tensor.	2024-02-08 09:55:29 -08:00
Binyang Li	7c229fbdd8	Fix multi-nodes test failure (#262 ) fix multi-nodes CI pipeline Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-07 18:21:05 -08:00
aashaka	2101f5251e	Allow MSCCL++ CommGroup to take PyTorch tensors in args (#255 ) Obtain data_ptr and tensor_size accordingly for torch.Tensor Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-06 19:47:25 -08:00
Changho Hwang	6a19b19ece	Fix NVLS support (#258 ) * Do not compile nvls_test with ROCm * Fix multi-node tests	2024-02-06 23:24:13 +00:00
Saeed Maleki	91d592dcc0	NVLS support. (#250 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Binyang Li	422c81f0f8	remove make pylib-copy command (#249 ) Fix #216 Remove `make pylib-copy`	2024-01-19 12:29:15 -08:00
Binyang Li	163cba08c8	Update interface to let user change fifo size (#243 ) Related to this issue: https://github.com/microsoft/mscclpp/issues/242. The user may use more threads than the number specified in `fifo_size` to interact with the FIFO, which leads to unexpected behavior. Update the interface to let users change the FIFO size on demand.	2024-01-09 22:14:36 -08:00
Changho Hwang	544ff0c21d	ROCm support (#213 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223 ) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	15f6dcca49	Update documentation (#217 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 12:58:04 -08:00
Changho Hwang	7bd66a938c	Robust correctness test (#221 ) Co-authored-by: Aashaka Shah <aashaka96@gmail.com>	2023-11-22 12:06:50 +08:00
Saeed Maleki	70eb6d7328	Fixing the bug in allreduce1 (#220 )	2023-11-18 10:34:52 -08:00
Saeed Maleki	1d1199703a	Auto-tune single-node AllReduce (#219 ) single node auto-tuner + graph plotter + bug fix for illegal memory access --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-11-17 21:42:05 +08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Changho Hwang	4cdb100265	Release GIL for Python APIs with wait (#190 )	2023-11-14 21:11:01 +08:00
Changho Hwang	3521fb0280	Clear minor warnings (#214 ) Clear warnings from the clang compiler.	2023-11-14 09:28:48 +08:00
Saeed Maleki	85e8017535	Atomic for semaphores instead of fences (#188 ) Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-10-13 18:57:08 +08:00
Saeed Maleki	148681b4bc	Fix a pytest bug (#196 )	2023-10-13 16:39:43 +08:00
Changho Hwang	8c0f9e84d0	v0.3.0 (#171 )	2023-10-11 22:35:54 +08:00
Changho Hwang	11ac824cc7	Align interfaces of put/get/putPackets/getPackets (#185 )	2023-10-07 22:18:26 +08:00
Changho Hwang	6c0ee72916	Construct `ProxyChannel` with shared pointers (#184 )	2023-09-18 05:46:23 +00:00
Changho Hwang	3aa72098d9	Add `poll()` for semaphores (#181 )	2023-09-15 07:40:44 +00:00
Binyang2014	097aa8843a	Fix pytest unstable issue. (#170 ) - Remove `#include <cstdint>` from `poll.hpp` so that it contains only device-side code - Fix a compilation issue that caused pytest to fail randomly. Reuse the compiled result for the same kernel with different arguments	2023-09-06 17:09:04 -07:00
Olli Saarikivi	828be48b21	Add Context and Endpoint classes to enable non-Communicator use-cases (#166 ) This PR implements and closes #137. The new `Endpoint` and `Context` classes expose the connection-establishing functionality from `Communicator`, which is now only responsible for tying together the bootstrapper with a context. The largest breaking change here is that `Communicator.connectOnSetup(...)` now returns the `Connection` wrapped inside a `NonblockingFuture`. This is because, with the way `Context` is implemented, a `Connection` is now fully initialized on construction. Some smaller breaking API changes from this change are that `RegisteredMemory` no longer has a `rank()` function (as there may be no concept of rank), and similarly `Connection` has no `remoteRank()` and `tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf` functions in `Communicator`. A new `EndpointConfig` class is introduced to avoid duplication of the IB configuration parameters in the APIs of `Context` and `Communicator`. The usual usage pattern of just passing in a `Transport` still works due to an implicit conversion into `EndpointConfig`. Miscellaneous changes: - Cleans up how the PIMPL pattern is applied by making both the `Impl` struct and the `pimpl_` pointers private for all relevant classes in the core API. - Enables ctest to be run from the build root directory.	2023-09-06 13:10:04 +08:00
Binyang2014	858e381829	Pytest (#162 ) Port python tests to mscclpp. Please run `mpirun -tag-output -np 8 pytest ./python/test/test_mscclpp.py -x` to start pytest --------- Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Saeed Maleki <30272783+saeedmaleki@users.noreply.github.com>	2023-09-01 21:22:11 +08:00
Saeed Maleki	8d1b984bed	Change device handle interfaces & others (#142 ) * Changed device handle interfaces * Changed proxy service interfaces * Move device code into separate files * Fixed FIFO polling issues * Add configuration arguments in several interface functions --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: root <root@a100-saemal0.qxveptpukjsuthqvv514inp03c.gx.internal.cloudapp.net>	2023-08-16 20:00:56 +08:00
Olli Saarikivi	4865b2017b	Add Python get_include() (#141 ) Introduces a mscclpp.get_include() in the Python module. The extension module is now named _mscclpp so that we can have Python code in the mscclpp module. Also does some miscellaneous cleanup.	2023-07-25 10:23:16 -07:00


87 Commits