mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-03-24 00:57:47 +00:00

Author	SHA1	Message	Date
Caio Rocha	dff3bc7bbb	Support Fusion for ReadPutPacket Operation at DSL (#742 ) Support is being added for fusing the ReadPutPacket operation on DSL, which reduces the overhead caused by reading packet data multiple times in the scratch buffer. Fusion will occur when two rppkt operations are executed consecutively with the same src_buffer: rppkt(src, dst0) + rppkt(src, dst1) -> rppkt(src, [dst0, dst1] Co-authored-by: Binyang Li <binyli@microsoft.com>	2026-02-12 17:27:20 -08:00
Binyang Li	a707273701	Torch integration (#692 ) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo Provide interface for pytorch integration with native API and DSL --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two integration ways for MSCCL++ DSL. 1. Integrate with customized communication group 2. Integrate with NCCL API Introduce new Python APIs to make it work: ```python mscclpp.compile # compile dsl to json based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Caio Rocha	9261b1d278	AlltoAll Test Support (#606 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-15 16:00:41 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579 ) The PR contains following changes: Python side: - Channel based DSL implementation: decouple channel with chunk. - Users create channel explicitly, only need local_rank, remote_rank and channel_type - Adjust executor json file, add remote_buffer fields, different op can use different channel and remote buffers combination. - Reimplement operation fusion, data dependency check mechanism - Add new op such as semaphore, pipeline - Clean code and enhance document C++ side: - Support new execution file json format - Support semaphore and pipeline operation - code clean, support non-zero copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Caio Rocha	55789bc551	Support ReduceScatter in the NCCL interface (#460 ) Co-authored-by: root <root@mscclpp-000002.tn3ujtlnlkjehmmeegdavazkfg.jx.internal.cloudapp.net> Co-authored-by: Caio Rocha <aiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-02-11 13:28:19 -08:00
Changho Hwang	869cdba00c	Manage runtime environments (#452 ) * Add `Env` class that manages all runtime environments. * Changed `NPKIT_DUMP_DIR` to `MSCCLPP_NPKIT_DUMP_DIR`.	2025-01-15 09:44:52 -08:00
Changho Hwang	34945fb107	Add `GpuBuffer` class (#423 ) * Renamed and moved mem alloc functions into the `mscclpp::detail::` namespace (now `mscclpp::detail::gpuCalloc<T>()`) Deprecated constructor-calling mem alloc functions (`mscclpp::makeShared<T>()` and `mscclpp::makeUnique<T>()`) * Added a new `mscclpp::GpuBuffer<T>()` class that should be used in general for allocating communication buffers * Added a new `mscclpp.utils.GpuBuffer` Python class that inherits `cupy.ndarray` and allocates using `mscclpp::gpuMemAlloc` * Renamed `mscclpp::memcpyCuda<T>()` functions into `mscclpp::gpuMemcpy<T>()` for name consistency * A few fixes in NVLS memory allocation * Tackled minor compiler warnings	2025-01-07 18:40:01 -08:00
Binyang Li	88d28e07a7	Select algo according to json config (#396 ) The way to run nccl-test over mscclpp: mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$(pwd)/build/apps/nccl/libmscclpp_nccl.so -x NCCL_DEBUG=WARN -x MSCCLPP_EXECUTION_PLAN_DIR=/execution-files /root/nccl-tests/build/all_reduce_perf -b 1K -e 1G -f 2 -d half -G 20 -w 10 -n 20	2024-12-03 22:39:20 +00:00
Caio Rocha	ff18bb8d0b	Providing reduce-scatter test support (#390 )	2024-11-28 09:19:30 -08:00
Binyang Li	28a57b0610	NVLS support for msccl++ executor (#375 ) - Support mote datatype for multicast operation - Add new OP MULTI_LOAD_REDUCE_STORE to support NVLS - Modify allocSharedPhysicalCuda, which return std::shared_ptr<T> instead of std::shared_ptr<PhysicalCudaMemory> - Add Python support for allocSharedPhysicalCuda Test passed for `allreduce_nvls.json`	2024-11-20 06:43:28 +00:00
Ziyue Yang	9526d76fc7	Add kernel-based verification for executor_test (#378 ) Add kernels to fill and test data for correctness test in executor_test.py.	2024-11-07 14:14:20 +08:00
Ziyue Yang	95ab1088ef	Fix in-place all-gather input buffer in executor_test (#372 )	2024-10-24 23:04:11 +08:00
Caio Rocha	c6e06cfad7	Executor AllGather In-Place Support (#365 )	2024-10-21 05:45:56 -07:00
Caio Rocha	08a0cec2eb	Fixing RegisterMemory Allocation for ProxyChannels (#353 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-09-24 23:01:41 -07:00
Binyang Li	b30bb260e3	Tune threads per block for mscclpp executor (#345 )	2024-09-18 17:21:47 -07:00
Ziyue Yang	faadc75649	Fix missing import in executor test (#334 )	2024-08-06 14:24:50 -07:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310 )	2024-06-13 13:59:50 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00

20 Commits