mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-13 09:46:00 +00:00

Author	SHA1	Message	Date
Binyang Li	f124dc1df9	Add min operation for allreduce (#481)	2025-03-16 20:47:36 -07:00
Binyang Li	0b840baa05	Update allgather fallback algo (#476) Enhancements to all-gather operation, a temporary solution to fix the memory overhead when integrating msccl++ with pytorch. This solution will not register input/output buffer to msccl++, so the temp output buffer for allgather could be reused by torch automatically. * Introduced a new `allgather8` kernel function in `apps/nccl/src/allgather.hpp` to handle larger data sizes more efficiently. This includes double buffering to hide synchronization overhead and support for both in-place and out-of-place operations. * Modified the `allgather` function to decide between `allgather6` and `allgather8` based on data size and platform, improving performance for large data sizes. Configuration and environment improvements: * Added a new environment variable `MSCCLPP_DISABLE_CHANNEL_CACHE` to control whether the channel cache is disabled, enhancing configurability. This variable is now part of the `Env` class and is logged during environment initialization. * Removed the redundant global variable `mscclppDisableChannelCache` from `src/debug.cc` and updated its usage to refer to the new environment variable.	2025-03-14 11:18:03 -07:00
Qinghua Zhou	591276f9d0	Disable channel cache (#463) Add workaround of disabling channel cache. Related runtime parameter: -x MSCCLPP_DISABLE_CHANNEL_CACHE=TRUE (Default value: False) In this PR, some other features (e.g., ncclCommSplit) come from branch binyangli/nccl-api --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-02-19 19:26:12 +00:00
Caio Rocha	55789bc551	Support ReduceScatter in the NCCL interface (#460) Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net> Co-authored-by: Caio Rocha <aiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-02-11 13:28:19 -08:00
Changho Hwang	3565bfdf6d	Renaming channels (#436) Renamed `ProxyChannel` to `PortChannel` and `SmChannel` to `MemoryChannel`	2025-01-24 14:25:31 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) * Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	6d26b92665	Fix azure pipeline (#437)	2025-01-04 19:41:10 -08:00
Pedram Alizadeh	97eaca2bd2	[NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399)	2025-01-03 20:38:57 +00:00
Qinghua Zhou	ba0d0d68b8	Enhance the nccl error message handling (#434) Add WARN or INFO before returning the nccl error message. Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.	2025-01-03 00:50:36 +00:00
Changho Hwang	e2230aab26	Tackle build warnings (#422) * Comply with [CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html) * Tackle other warnings during build	2024-12-19 16:51:50 -08:00
SreevatsaAnantharamu	0c7ed2c674	Add ncclBcast / ncclBroadcast support (#419) A simple broadcast using a scratch buffer, with an option to use an executor.	2024-12-19 01:16:30 +00:00
David Sidler	d8d0dfbffa	Fix synchronization in allreduce8 kernel (#407) Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a synchronization bug. The PR addresses it by ensuring that signals are only issued after all threads in the block have issued their writes to guarantee correct ordering between data writes and signal writes. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 17:10:22 -08:00
Caio Rocha	774602d49c	Supporting Executor multi node in NCCL API (#412) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-12-18 15:50:58 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	88d28e07a7	Select algo according to json config (#396) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Caio Rocha	d9c297ba14	AllGather Executor Support in NCCL Interface (#393) Co-authored-by: Ziyue Yang <ziyyang@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-27 17:05:51 -08:00
Caio Rocha	93628d2066	Fixing Message Boundary AllReduce Fallback Code (#391)	2024-11-23 12:15:56 -08:00
Changho Hwang	2127a3ba29	Improve CMake options (#376) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375) - Support more datatypes for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Binyang Li	4136153a76	[Doc] mscclpp docs (#348) Generate docs for mscclpp. Set up a GitHub Action to auto-deploy the docs to GitHub Pages. Doc link: https://microsoft.github.io/mscclpp --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-10-18 06:08:31 +00:00
Changho Hwang	f8c0bcca2b	Perf optimization & support clipping (#364) Co-authored-by: Nusrat Islam <Nusrat.Islam@amd.com>	2024-10-16 14:35:08 -07:00
Changho Hwang	e9294357c5	Fix NCCL API bugs (#363)	2024-10-16 14:16:34 -07:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345)	2024-09-18 17:21:47 -07:00
Changho Hwang	1e82dd444f	Make ibverbs optional at compile time (#340) Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-08-21 12:47:05 -07:00
Changho Hwang	8c6fb429e9	bfloat16 support (#336) * Add bfloat16 support for executor and NCCL interface * Changed `gpu_data_types.hpp` into an internal header file	2024-08-12 15:41:58 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
caiomcbr	7493e2f075	Double buffering for NCCL APIs (#324) Using two scratch buffers in each peer to exchange data. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-15 22:18:53 +00:00
Changho Hwang	c4ca2fbc8c	Resolve clang++ warnings (#325)	2024-07-11 07:48:35 +00:00
caiomcbr	f4c3c8f916	AllReduce Kernel for Small Messages (#322) Adding allreduce kernel code for message sizes smaller than 32 bytes, when the number of elements is smaller than the number of ranks. --------- Co-authored-by: Caio Rocha <caio.rocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-05 21:08:43 +00:00
caiomcbr	b1b9d0626c	Support NCCL APIs (#319) Start supporting NCCL APIs with a few limitations. --------- Co-authored-by: Caio Rocha <caio.rocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-06-27 23:54:06 +00:00

31 Commits