mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-04-19 22:39:11 +00:00

Author	SHA1	Message	Date
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Binyang Li	8ac50dc85d	Resolve cuMemMap error (#451 ) * Updated `RegisteredMemory::Impl::Impl(const std::vector<char>& serialization)` to use both minimum and recommended granularities for memory address reservation and mapping. This will resolve the cuMemMap error	2025-01-10 14:22:14 -08:00
Changho Hwang	2b54af7e27	Auto-update version numbers in CMakeLists.txt (#450 )	2025-01-09 17:54:10 -08:00
Changho Hwang	f2b52c6318	Fix Python binding of exceptions (#444 ) * Fixed errors to be catchable from Python code * Skip IB tests in Python unit tests when IB ports are down	2025-01-09 11:58:23 -08:00
Caio Rocha	80abce59ef	Flushing Proxy Channels at CPU side upon reaching the Inflight Request Limit (#415 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-01-08 09:02:36 -08:00
Changho Hwang	1989d4be9c	Fix CMake build messages (#443 )	2025-01-08 02:44:01 +00:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	6d26b92665	Fix azure pipeline (#437 )	2025-01-04 19:41:10 -08:00
Pedram Alizadeh	97eaca2bd2	[NPKIT] Adding the NPKIT support for kernel allreduce7 in mscclpp-nccl (#399 )	2025-01-03 20:38:57 +00:00
Qinghua Zhou	ba0d0d68b8	Enhance the nccl error message handling (#434 ) Add WARN or INFO before returning the nccl error message. Change NCCL_DEBUG to MSCCLPP_DEBUG in debug message.	2025-01-03 00:50:36 +00:00
Binyang Li	3d6bfed2cf	Update version number (#433 ) Co-authored-by: github-actions <github-actions@github.com>	2025-01-02 16:45:08 -08:00
Changho Hwang	dff21905e3	Fix typos in the pipeline (#420 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-01-02 09:49:50 -08:00
Binyang Li	3e7801b1a8	Fix CI trigger issue (#428 )	2024-12-20 16:27:52 -08:00
Binyang Li	f18a440feb	trigger ci for release branches (#426 )	2024-12-21 00:05:13 +00:00
Changho Hwang	e2230aab26	Tackle build warnings (#422 ) * Comply with [CMP0165](https://cmake.org/cmake/help/latest/policy/CMP0165.html) * Tackle other warnings during build	2024-12-19 16:51:50 -08:00
Binyang Li	6fedb7c0e8	Fix nccl-test failure issue (#421 )	2024-12-19 12:07:00 -08:00
Binyang Li	776f24e787	update README (#414 )	2024-12-19 05:54:27 +00:00
SreevatsaAnantharamu	0c7ed2c674	Add ncclBcast / ncclBroadcast support (#419 ) A simple broadcast using scratch buffer and option to use an executor.	2024-12-19 01:16:30 +00:00
David Sidler	d8d0dfbffa	Fix synchronization in allreduce8 kernel (#407 ) Running kernel allreduce8 across 64 vGPUs (in CPX mode) revealed a synchronization bug. The PR addresses it by ensuring that signals are only issued after all threads in the block have issued their writes to guarantee correct ordering between data writes and signal writes. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 17:10:22 -08:00
Caio Rocha	774602d49c	Supporting Executor multi node in NCCL API (#412 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-12-18 15:50:58 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410 ) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	863a599360	Disable CuMemMap check for ROCm (#411 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-17 08:36:25 +00:00
Binyang Li	c65f19ad1a	Move pipeline to official org (#406 ) Move pipeline to official org. Unify all pipelines	2024-12-16 09:43:00 -08:00
Binyang Li	ee75caf365	Reduce memory usage for scratch buffer (#403 ) In the executor, we allocate the scratch buffer based on `sendMemRange`. However, for certain execution plans, this allocation may be unsuitable, as the plan does not support messages of this size. To avoid allocating too much memory and causing an OOM error, set the scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan), scratchBufferSize(sendMemRange))`	2024-12-13 13:00:04 -08:00
Caio Rocha	01fd813f1b	Exception Max Number Operation per Tb (#405 )	2024-12-11 16:06:15 -08:00
Binyang Li	7a3dcb0627	Setup pipeline for mscclpp over nccl (#401 ) Setup pipeline for mscclpp over nccl Run `all_reduce_perf` via nccl API	2024-12-07 08:57:45 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Ziyue Yang	f6305a3c1d	Add connection events for NPKit (#386 )	2024-12-05 00:06:37 +08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: `mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20`	2024-12-03 22:39:20 +00:00
Caio Rocha	ff18bb8d0b	Providing reduce-scatter test support (#390 )	2024-11-28 09:19:30 -08:00
Caio Rocha	d9c297ba14	AllGather Executor Support in NCCL Interface (#393 ) Co-authored-by: Ziyue Yang <ziyyang@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-27 17:05:51 -08:00
Binyang Li	593478e1b7	Add cross threadblock barrier (#383 )	2024-11-26 05:13:30 +00:00
Binyang Li	1b8d020650	Fix mscclpp_benchmark (#392 ) Enable 1GB message size for NVLS transport in mscclpp_benchmark	2024-11-25 19:59:51 +00:00
Caio Rocha	93628d2066	Fixing Message Boundary AllReduce Fallback Code (#391 )	2024-11-23 12:15:56 -08:00
Changho Hwang	2127a3ba29	Improve CMake options (#376 ) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	db8e187407	Fix typo (#389 )	2024-11-21 22:45:50 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operations - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda to return std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Ziyue Yang	3e51e9b359	Fix missing packet parameter for executor (#385 )	2024-11-19 08:36:37 +08:00
Caio Rocha	b3dc74c020	Small Adjust in Test Data AllGather at Executor Test (#384 )	2024-11-16 15:21:00 +08:00
Binyang Li	1baea89fa0	Fix light load bug (#379 ) Fix lightLoadExecutionPlan issue. An execution context may have multiple device execution plans. These plans share the channel connections that were constructed earlier. A deviceExecutionPlanKey is introduced to identify these plans. We can get the current device execution plan key via: `contexts.currentDevicePlan`	2024-11-13 07:58:43 +00:00
Caio Rocha	d5d608abdc	Fixing Bug Const Offset in Execution Plan (#380 ) The offset was not differentiating between the buffer types, causing the offset to be incorrect when the buffer type was not `SCRATCH`. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-11-11 20:02:02 -08:00
Changho Hwang	85fdde7a73	Lazily create the context stream (#381 ) Create the context stream only when needed.	2024-11-11 10:39:32 +08:00
Ziyue Yang	9526d76fc7	Add kernel-based verification for executor_test (#378 ) Add kernels to fill and test data for correctness test in executor_test.py.	2024-11-07 14:14:20 +08:00
Jeff Rasley	449c274326	[docs] fix quickstart link (#374 ) Small fix to update quickstart link	2024-10-30 13:13:33 +08:00
Ziyue Yang	95ab1088ef	Fix in-place all-gather input buffer in executor_test (#372 )	2024-10-24 23:04:11 +08:00
Binyang Li	b72decbfeb	Update docker image for cuda12.4 (#370 ) Update docker image for cuda12.4 Image pushed to registry --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-10-22 12:51:28 +08:00
Binyang Li	582d386b3b	Fix algo repo name (#369 ) Change algo repo name from azure-mscclpp to msccl-users Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-10-22 10:59:15 +08:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Binyang Li	4136153a76	[Doc] mscclpp docs (#348 ) Generate docs for mscclpp. Set up a GitHub Action to auto-deploy the GitHub Pages site. Doc link: https://microsoft.github.io/mscclpp --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-10-18 06:08:31 +00:00
Changho Hwang	0c150e5166	Fix copyright messages (#367 )	2024-10-17 21:25:46 -07:00

703 Commits