mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-25 07:14:40 +00:00

Author	SHA1	Message	Date
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	6fedb7c0e8	Fix nccl-test failure issue (#421 )	2024-12-19 12:07:00 -08:00
Binyang Li	fcb2e46cb1	NVLS support for NCCL API (#410 ) Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-18 09:55:35 +00:00
Binyang Li	863a599360	Disable CuMemMap check for ROCm (#411 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-12-17 08:36:25 +00:00
Binyang Li	ee75caf365	Reduce memory usage for scratch buffer (#403 ) In the executor, we allocate the scratch buffer based on `sendMemRange`. However, for certain execution plans, this allocation may be unsuitable, as the plan does not support messages of this size. To avoid allocating too much data and causing an OOM error, set the scratch buffer size to `min(scratchBufferSize(maxMessageSizeSupportedForPlan), scratchBufferSize(sendMemRange))`	2024-12-13 13:00:04 -08:00
Caio Rocha	01fd813f1b	Exception Max Number Operation per Tb (#405 )	2024-12-11 16:06:15 -08:00
Changho Hwang	756f24c697	Revised ProxyChannel interfaces (#400 ) * Renamed `ProxyChannel` -> `BaseProxyChannel` and `SimpleProxyChannel` -> `ProxyChannel`. It makes the interface more consistent by defining channels to be associated with a certain src/dst memory region: `ProxyChannel` as "sema + src/dst + fifo" and `SmChannel` as "sema + src/dst". BaseProxyChannel is not associated with any memory regions, as "sema + fifo". * `ProxyChannelDeviceHandle` now inherits from `BaseProxyChannelDeviceHandle`, instead of having one as a member.	2024-12-06 10:53:34 -08:00
Ziyue Yang	f6305a3c1d	Add connection events for NPKit (#386 )	2024-12-05 00:06:37 +08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Binyang Li	593478e1b7	Add cross threadblock barrier (#383 )	2024-11-26 05:13:30 +00:00
Changho Hwang	2127a3ba29	Improve CMake options (#376 ) * Let all CMake option names start with `MSCCLPP_` * Explain the `MSCCLPP_BUILD_PYTHON_BINDINGS` option in readme --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-11-22 01:54:11 +00:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support more datatypes for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which returns std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Ziyue Yang	3e51e9b359	Fix missing packet parameter for executor (#385 )	2024-11-19 08:36:37 +08:00
Binyang Li	1baea89fa0	Fix light load bug (#379 ) Fix lightLoadExecutionPlan issue. An execution context may have multiple device execution plans. These plans share the channel connections that were constructed earlier. A deviceExecutionPlanKey is introduced to identify these plans. We can get the current device execution plan key via: `contexts.currentDevicePlan`	2024-11-13 07:58:43 +00:00
Caio Rocha	d5d608abdc	Fixing Bug Const Offset in Execution Plan (#380 ) The offset was not differentiating between the buffer types, causing the offset to be incorrect when the buffer type was not `SCRATCH`. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-11-11 20:02:02 -08:00
Changho Hwang	85fdde7a73	Lazily create the context stream (#381 ) Create the context stream only when needed.	2024-11-11 10:39:32 +08:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Changho Hwang	0c150e5166	Fix copyright messages (#367 )	2024-10-17 21:25:46 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Ziyue Yang	5c4e105814	Fix NPKit exit event offset (#356 )	2024-09-19 13:35:44 +08:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Binyang Li	7bedb25054	Add proxy channel related operations (#351 ) Add Flush, PutWithSignal, PutWithFlushAndSignal operation	2024-09-15 13:24:57 -07:00
Binyang Li	26a87535f9	Fix bug for constructing semaphore (#341 ) Current semaphore construction requires two-way communication, e.g., to construct a semaphore signaling from rank 0 to rank 1, both rank 0 and rank 1 need to send a message to each other. This PR fixes an executor bug that fails to conduct two-way communication for constructing such one-way semaphores, and instead hangs during the semaphore construction. In the future, we may need to change the implementation to construct semaphores via one-way communication. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-04 19:42:03 +08:00
Changho Hwang	72b99a4229	Fix for ROCm 6.0 (#347 )	2024-09-01 20:22:33 -07:00
Caio Rocha	4eca6f1e95	Support executors to send packets over ProxyChannel (#344 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-08-30 22:10:33 +00:00
Caio Rocha	1af62ea43d	ProxyChannel Support in Executor (#342 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-08-27 10:09:44 -07:00
Changho Hwang	1e82dd444f	Make ibverbs optional at compile time (#340 ) Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-08-21 12:47:05 -07:00
Caio Rocha	ead4efc315	Dynamically load libibverbs (#337 )	2024-08-13 23:48:39 -07:00
Changho Hwang	8c6fb429e9	bfloat16 support (#336 ) * Add bfloat16 support for executor and NCCL interface * Changed `gpu_data_types.hpp` into an internal header file	2024-08-12 15:41:58 -07:00
caiomcbr	67eb9b04cc	NCCL API Executor Integration (#331 ) Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-25 15:05:02 -07:00
Roshan Dathathri	f131fae3ec	Add support for different vector sizes in multimem instructions (#332 )	2024-07-25 10:14:02 -07:00
Ziyue Yang	b5a48f836c	Separate NPKit CPU timestamp access from different blocks for AMD platform (#321 ) Reference: https://github.com/ROCm/rccl/pull/1229	2024-07-02 19:36:48 +08:00
Ziyue Yang	f29095b3b1	Fix NPKit support for AMD (#312 )	2024-06-14 16:22:14 +08:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Binyang Li	80aefe55bc	Cumulative Updates (#309 ) Bug fix: Unable to execute communication primitives with the same execution plan but varying message sizes. Add reduce_packets OP	2024-06-12 19:17:57 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Binyang Li	3a18068cd4	Fix security issue (#305 ) Change sprintf to snprintf to avoid potential security issue	2024-05-25 23:12:57 -07:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294 )	2024-05-03 11:54:50 -07:00
Binyang Li	5628362715	Resolve multi-nodes test failure issue (#295 ) Fix bug, resolve multi-nodes test failure issue.	2024-04-26 13:06:57 +08:00
Changho Hwang	d4ede480f4	Ethernet support (#284 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-04-25 11:06:43 -07:00
Changho Hwang	89896ff94f	Include GPU data types only for kernel code (#292 )	2024-04-24 20:55:02 -07:00
Changho Hwang	6c1fa5307c	Refactoring NVLS interfaces (#293 ) Move NVLS details from the core to a separate interface	2024-04-24 10:05:41 -07:00
Changho Hwang	9934c982a8	Separate headers for GPU data types (#291 ) Prevent unnecessarily including data type headers everywhere.	2024-04-19 05:52:43 +00:00
Roshan Dathathri	41e0964d93	Allow binding allocated memory to NVLS multicast pointer (#290 ) And change NVLS multimem instructions to static functions	2024-04-18 17:11:31 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	9406123711	Fix a typo name (#286 )	2024-04-17 23:45:46 +00:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Changho Hwang	d34e097b40	Fix wrong offset calculation (#257 )	2024-02-06 08:55:43 +08:00
Saeed Maleki	91d592dcc0	NVLS support. (#250 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Binyang Li	163cba08c8	Update interface to let user change fifo size (#243 ) Related to this issue: https://github.com/microsoft/mscclpp/issues/242. The user may use more threads than the number specified in `fifo_size` to interact with the FIFO. In this case, there will be unexpected behavior. Update the interface to let users change the fifo size as needed.	2024-01-09 22:14:36 -08:00
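
Commit ee75caf365 above caps the executor's scratch buffer at the smaller of two candidate sizes. The rule can be sketched in Python roughly as follows. Only the `min(...)` structure comes from the commit message; `scratch_buffer_size` and its 2x scaling factor are hypothetical stand-ins for illustration, not the mscclpp implementation.

```python
def scratch_buffer_size(message_size: int, scale: int = 2) -> int:
    """Hypothetical sizing rule (assumption): scratch grows linearly with message size."""
    return message_size * scale


def capped_scratch_size(send_mem_range: int, max_message_size_for_plan: int) -> int:
    """Pick the smaller of the two candidate sizes, as the commit message describes.

    This avoids sizing the scratch buffer for a message the plan cannot handle.
    """
    return min(
        scratch_buffer_size(max_message_size_for_plan),
        scratch_buffer_size(send_mem_range),
    )
```

Under these assumptions, a 1 GiB `send_mem_range` combined with a plan limited to 1 MiB messages would be sized from the 1 MiB limit, while a send range smaller than the plan limit would still be sized from the send range.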


396 Commits