mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 09:17:06 +00:00

Author	SHA1	Message	Date
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) * Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	c65f19ad1a	Move pipeline to official org (#406 ) Move pipeline to official org. Unify all pipelines	2024-12-16 09:43:00 -08:00
Binyang Li	7a3dcb0627	Setup pipeline for mscclpp over nccl (#401 ) Setup pipeline for mscclpp over nccl Run `all_reduce_perf` via nccl API	2024-12-07 08:57:45 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: `mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20`	2024-12-03 22:39:20 +00:00
Binyang Li	593478e1b7	Add cross threadblock barrier (#383 )	2024-11-26 05:13:30 +00:00
Changho Hwang	2127a3ba29	Improve CMake options (#376 ) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which now returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Changho Hwang	0c150e5166	Fix copyright messages (#367 )	2024-10-17 21:25:46 -07:00
Changho Hwang	74130c7c5e	Use IB transport flags only when an IB device exists (#355 )	2024-09-19 07:13:11 +00:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	7bedb25054	Add proxy channel related operations (#351 ) Add Flush, PutWithSignal, PutWithFlushAndSignal operation	2024-09-15 13:24:57 -07:00
Caio Rocha	4eca6f1e95	Support executors to send packets over ProxyChannel (#344 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-08-30 22:10:33 +00:00
Changho Hwang	1e82dd444f	Make ibverbs optional at compile time (#340 ) Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-08-21 12:47:05 -07:00
Binyang Li	1351f9f1c5	Add "packet type" option for executor test (#313 ) Add "packet type" option for executor test	2024-06-14 09:53:58 +00:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Changho Hwang	f76eae4dca	Fix assert declaration & add a compile test (#303 )	2024-05-20 02:39:30 +00:00
Changho Hwang	a3cd95bd42	Upgrade gtest (#300 ) The new gtest version resolves a type casting issue: `3044657e7a`	2024-05-07 20:49:26 -07:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294 )	2024-05-03 11:54:50 -07:00
Changho Hwang	d4ede480f4	Ethernet support (#284 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-04-25 11:06:43 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 )	2024-03-27 11:53:09 -07:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273 )	2024-03-25 15:55:37 -07:00
Binyang Li	4734d8718f	Fix multi-node ci pipeline (#272 ) Add `__launch_bounds__` to fix perf regression issue in CI pipeline	2024-03-12 09:39:00 -07:00
Changho Hwang	cdaf3aea3d	New packet format & optimizations (#256 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-20 20:01:37 -08:00
Changho Hwang	6a19b19ece	Fix NVLS support (#258 ) * Do not compile nvls_test with ROCm * Fix multi-node tests	2024-02-06 23:24:13 +00:00
Saeed Maleki	91d592dcc0	NVLS support. (#250 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Changho Hwang	4eb0a08b8c	Add `putWithSignal()` latency tests (#246 )	2024-01-24 01:10:35 +00:00
Changho Hwang	70e28b3c76	Do not check value of `__HIP_PLATFORM_AMD__` (#240 ) According to the [document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary), `__HIP_PLATFORM_AMD__` takes effect merely by being defined, regardless of its value.	2023-12-25 13:51:18 +08:00
Changho Hwang	5ff8bc5ef2	Fix & improve perf for ROCm (#232 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-12-18 11:30:08 +08:00
Changho Hwang	544ff0c21d	ROCm support (#213 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223 ) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Changho Hwang	3521fb0280	Clear minor warnings (#214 ) Clear warnings from the clang compiler.	2023-11-14 09:28:48 +08:00
Binyang2014	8a938de9c5	fix pipeline (#209 ) fix pipeline for multi-node test	2023-11-03 05:18:32 +00:00
Binyang2014	6f43282c1d	Fix allreduce bug (#197 ) Fix allreduce correctness issue	2023-10-18 23:16:57 +08:00
Changho Hwang	8c0f9e84d0	v0.3.0 (#171 )	2023-10-11 22:35:54 +08:00
Changho Hwang	11ac824cc7	Align interfaces of put/get/putPackets/getPackets (#185 )	2023-10-07 22:18:26 +08:00
Changho Hwang	b3d0fdb8df	Add an atomic signal perf test (#183 )	2023-09-18 08:12:14 +00:00
Changho Hwang	6c0ee72916	Construct `ProxyChannel` with shared pointers (#184 )	2023-09-18 05:46:23 +00:00
Changho Hwang	a6b24dcbed	Fix #163 (#182 ) The bug occurred because frequent calls to `initialize()` temporarily exhausted all available ephemeral ports. Fixed by retrying `bind()` after a while upon `EADDRINUSE`.	2023-09-15 08:35:01 +00:00
Changho Hwang	3aa72098d9	Add `poll()` for semaphores (#181 )	2023-09-15 07:40:44 +00:00
Changho Hwang	d2f13f1e54	Fix #174 (#180 ) Added `extern "C"` based on another specification in `/usr/local/cuda/include/crt/common_functions.h`.	2023-09-15 06:44:41 +00:00
Binyang2014	952f2da9cc	Improve single node allreduce performance (#169 ) Improve allreduce performance for single node. New numbers:

| n_ctx | size | target latency (us) | allreduce5 | allreduce6 |
|---------|---------|----------------|------------|------------|
| 1 | 24.0kB | 7.7 | | 7.23 |
| 2 | 48.0kB | 7.7 | | 7.69 |
| 4 | 96.0kB | 8 | | 8.34 |
| 8 | 192.0kB | 12.6 | | 9.75 |
| 12 | 288.0kB | 13 | | 11.34 |
| 16 | 384.0kB | 13.3 | | 12.99 |
| 768 | 18.0MB | 158.7 | 160.3 | |
| 896 | 21.0MB | 184.5 | 183.8 | |
| 1024 | 24.0MB | 209.5 | 207.5 | |
| 1152 | 27.0MB | 234.3 | 231.9 | |
| 1280 | 30.0MB | 260 | 255.6 | |
| 1408 | 33.0MB | 284.9 | 278.7 | |
| 1536 | 36.0MB | 310.3 | 302.0 | |
| 1664 | 39.0MB | 336.2 | 325.3 | |
| 1792 | 42.0MB | 361.4 | 348.8 | |
| 1920 | 45.0MB | 384.6 | 372.2 | |
| 2048 | 48.0MB | 409.1 | 395.4 | |

--------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-09-13 14:30:08 +00:00
Saeed Maleki	015e29c138	adding signal for atomic op (#178 ) This addresses [this issue](https://github.com/microsoft/mscclpp/issues/177).	2023-09-11 10:46:25 -07:00
Binyang2014	097aa8843a	Fix pytest unstable issue. (#170 ) - Remove `#include <cstdint>` from `poll.hpp` so that it contains only device-side code - Fix a compilation issue that caused pytest to fail randomly. Reuse the compiled result for the same kernel with different arguments	2023-09-06 17:09:04 -07:00
Olli Saarikivi	828be48b21	Add Context and Endpoint classes to enable non-Communicator use-cases (#166 ) This PR implements and closes #137. The new `Endpoint` and `Context` classes expose the connection-establishing functionality from `Communicator`, which is now only responsible for tying together the bootstrapper with a context. The largest breaking change here is that `Communicator.connectOnSetup(...)` now returns the `Connection` wrapped inside a `NonblockingFuture`. This is because, with the way `Context` is implemented, a `Connection` is now fully initialized on construction. Some smaller breaking API changes are that `RegisteredMemory` no longer has a `rank()` function (as there may be no concept of rank), and similarly `Connection` has no `remoteRank()` and `tag()` functions. The latter are replaced by `remoteRankOf` and `tagOf` functions in `Communicator`. A new `EndpointConfig` class is introduced to avoid duplication of the IB configuration parameters in the APIs of `Context` and `Communicator`. The usual usage pattern of just passing in a `Transport` still works due to an implicit conversion into `EndpointConfig`. Miscellaneous changes: - Cleans up how the PIMPL pattern is applied by making both the `Impl` struct and the `pimpl_` pointers private for all relevant classes in the core API. - Enables ctest to be run from the build root directory.	2023-09-06 13:10:04 +08:00
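
Commit a6b24dcbed in the log above describes its fix as retrying `bind()` after a while upon `EADDRINUSE`. The general pattern can be sketched as follows. This is a minimal illustration in Python and not the code from that commit: the helper name `bind_with_retry`, its `retries` and `delay_s` parameters, and the `FlakySocket` test double are all hypothetical.

```python
import errno
import time


def bind_with_retry(sock, address, retries=5, delay_s=0.1):
    """Call sock.bind(address), retrying when the address is in use.

    Hypothetical helper for illustration only. Returns the number of
    failed attempts that preceded the successful bind.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            sock.bind(address)
            return attempt
        except OSError as exc:
            # Treat only EADDRINUSE as transient; re-raise any other error,
            # and give up once the last allowed attempt has failed.
            if exc.errno != errno.EADDRINUSE or attempt == retries - 1:
                raise
            time.sleep(delay_s)


class FlakySocket:
    """Test double: bind() fails with EADDRINUSE a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.bound = None

    def bind(self, address):
        if self.failures > 0:
            self.failures -= 1
            raise OSError(errno.EADDRINUSE, "Address already in use")
        self.bound = address
```

Any object with a compatible `bind()` method, such as a `socket.socket` instance, can be passed as `sock`. The wait between attempts gives the operating system time to release ports, which is the situation the commit message describes.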


93 Commits