mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-24 23:06:17 +00:00

Author	SHA1	Message	Date
Binyang Li	42ece408b9	Fix memory leak	2026-05-24 05:56:11 +00:00
Binyang Li	7308c321a0	merge main	2026-05-22 23:06:04 +00:00
Binyang Li	08ee18be64	Add check to filter invalid nblock/nthread candidates (#811) Add check for invalid nblock/nthread candidate	2026-05-22 09:18:41 -07:00
Binyang Li	9e177b388c	remove useless sync (#809)	2026-05-20 16:49:49 -07:00
Binyang Li	ac44e98d96	update	2026-05-20 20:33:27 +00:00
Binyang Li	35331cf91d	Fix collective topology sizing Rename native collective context workSize to worldSize and use nRanksPerIpcDomain for allpair peer topology. Include the staged DSL signal/wait pairing validation changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-20 20:21:06 +00:00
Binyang Li	72621e7221	add nBlocks check for allreduce_allpair_packet algo (#807) - Fix the correctness issue for allreduce_allpair_packet algo. Make sure the input buffer is not overwritten. Use the same tb for send/reduce/write-back. - Check that nBlocks/nthreads are valid for the packet algorithm. - Add more logs - Modify the flag update logic so it works for the case nthreadPerNBlock < nflags --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-20 09:29:55 -07:00
Caio Rocha	c1071318c8	Include a static synchronization check in the DSL. (#806)	2026-05-19 13:06:53 -07:00
Binyang Li	4db71b93b7	Move barrier into setupNvlsChannels and clean up NVLS pipeline state - setupNvlsChannels now takes the Communicator and barriers internally after binding all switch channels, replacing the explicit bootstrap()->barrier() previously done only in AllreduceNvlsPacket. - Demote nRanksPerIpcDomain_ / nBaseChannels_ to locals in AllreduceNvlsBlockPipeline and AllreduceNvlsWarpPipeline; they were never read outside initialize(). - Drive-by: pick up in-tree edits to switch_channel_device.hpp, executor.cc, communicator.hpp, and allreduce_rsag.cu. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-18 20:50:01 +00:00
Qinghua Zhou	18d37379d2	Tighten NVML IPC domain hash lookup Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-16 23:23:30 +00:00
Qinghua Zhou	594dc79657	Address NVLS review feedback Handle unsupported FP8 NVLS paths safely, tighten IPC-domain guards, align IPC-domain naming, and add IPC-domain fabric hash logging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-16 23:19:25 +00:00
Binyang Li	f32cfb1fb8	update	2026-05-16 19:29:18 +00:00
Binyang Li	94af88d88d	Fix tuning example hang Avoid probing invalid packet allreduce configurations and reduce the default tuning sweep so the 8-rank tuning example completes reliably. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-16 01:24:56 +00:00
Binyang Li	0744e806fc	detect ipc domain automatically	2026-05-16 00:39:49 +00:00
Binyang Li	93b43547cc	temp solution	2026-05-15 23:15:40 +00:00
Binyang Li	dbebde2b58	Configure IPC domain per communicator Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 22:26:53 +00:00
Binyang Li	ee82cc4c41	Merge branch 'main' into binyli/mnnvl Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 21:36:36 +00:00
Binyang Li	60a6d7219f	Clean up completed communicator receives (#804) ## Summary - Release the reference after last requests are ready. - Keep ordered receive chaining for repeated rank/tag operations while cleaning up completed receive bookkeeping. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 21:06:50 +00:00
Changho Hwang	252a422030	Handle PortChannel flush asynchronously from the host proxy (#802) When a `PortChannel` requests `flush()`, the host-side proxy was being blocked, which may cause head-of-line blocking of other parallel `PortChannel`s' requests. Now the proxy handles `flush()` requests asynchronously. This feature especially helps performance when we need multiple IB QPs and need to flush QPs.	2026-05-15 11:50:43 -07:00
Binyang Li	24850ef2de	Merge branch 'main' into binyli/mnnvl	2026-05-15 09:22:02 -07:00
Changho Hwang	5d608feaa5	Enhance cross-node CudaIpc availability check (#803)	2026-05-14 14:06:12 -07:00
Caio Rocha	40295df4c4	Adding Support to bf16 Executor Tests (#801) This pull request adds support for the `bfloat16` (bf16) data type to the test executor, including both Python and CUDA components. The changes ensure that `bfloat16` is handled consistently across argument parsing, data type conversion, and test kernel implementations. Additionally, the CUDA verification kernels are refactored to use parameterized tolerances for improved numerical accuracy checks. Support for bfloat16 data type: * Added handling for `bfloat16`/`bf16` in the Python test executor's argument parsing, data type conversion (`parse_dtype`, `dtype_to_mscclpp_dtype`), and help text. * Updated output to display the correct data type string for `bfloat16`. CUDA kernel and test improvements: * Included `bfloat16` headers and defined test data fill and gather kernels for `bfloat16` on both CUDA and HIP platforms. * Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to use an explicit tolerance parameter (`Eps`) and added correct tolerances for each data type, including `bfloat16`. These changes ensure full support for `bfloat16` in the test executor and improve the accuracy and maintainability of the CUDA test kernels. --------- Co-authored-by: Caio Rocha <caiorocha@microsof.com>	2026-05-14 09:56:11 -07:00
copilot-swe-agent[bot]	7724e49f31	Fix lint and ROCm error alias Agent-Logs-Url: https://github.com/microsoft/mscclpp/sessions/0f0e525d-a69c-4ff7-8913-983243b5cbf7 Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-05-13 03:26:53 +00:00
Binyang Li	0c09239b06	Merge branch 'main' into binyli/mnnvl	2026-05-12 20:17:02 -07:00
Binyang Li	224b3deb84	Clean up completed communicator receives Erase completed receive bookkeeping from the communicator once the deferred receive future finishes, while preserving ordered receive chaining for repeated rank/tag operations. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-13 01:22:51 +00:00
Caio Rocha	0c9b9abfd5	Adding Support 4 Nodes AllReduce Small Message Size (#794) Results on 4 Nodes H200:
| Size | NCCL   | MSCCL++ 57TB | MSCCL++ 29TB |
|------|--------|--------------|--------------|
| 8K   | 45.75  | 17.74        | 18.18        |
| 16K  | 47.08  | 18.9         | 18.42        |
| 32K  | 47.29  | 19.48        | 19.12        |
| 64K  | 50.34  | 20.51        | 19.29        |
| 128K | 59.65  | 21.37        | 20.25        |
| 256K | 87.46  | 23.87        | 23.51        |
| 512K | 106.55 | 29.15        | 29.51        |
| 1M   | 115    | 40.64        | 41.83        |
| 2M   | 135.89 | 63.73        | 70.45        |
| 4M   | 177.59 | 121.76       | 128.79       |
| 8M   | 251.17 | 228.5        | 251.36       |
--------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsof.com>	2026-05-12 13:45:55 -07:00
Binyang Li	825fc124a5	address hang issue	2026-05-09 03:16:33 +00:00
Binyang Li	e208cc326b	WIP	2026-05-08 04:30:05 +00:00
Binyang Li	5516bdbb6b	fix	2026-05-08 04:22:50 +00:00
Binyang Li	654bcfa6ba	update	2026-05-08 03:54:32 +00:00
Binyang Li	9ff7e1c2c3	update	2026-05-08 03:43:34 +00:00
Binyang Li	113d859d13	fix	2026-05-08 03:29:05 +00:00
Binyang Li	d1b04a3b26	NVLS zero-copy allreduce: support FP16 accumulator for FP8 inputs multimem.ld_reduce on FP8 inputs accumulates in FP32 by default. The ISA also exposes an .acc::f16 variant that keeps the reduction in FP16, which is faster but lower precision. Plumb AccumT through: - include/mscclpp/switch_channel_device.hpp: Extend SwitchChannelDeviceHandle::multimemLoadReduce with an optional AccumT template parameter. When VectorType is one of the FP8 vector types (f8_e4m3x{4,8,16} / f8_e5m2x{4,8,16}) and AccumT is __half, emit the .acc::f16 form of the instruction; otherwise unchanged. - src/ext/collectives/include/allreduce/common.hpp: Make handleMultiLoadReduceStore template on AccumT and forward it to multimemLoadReduce<vectorType, AccumT>(...). - src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu: Template allreduceNvls and NvlsAdapter on AccumT and forward to handleMultiLoadReduceStore<T, AccumT>; the existing dispatch<> machinery already plumbs AccumT through from the algorithm context. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-07 00:38:31 +00:00
Binyang Li	7d80a33360	Default torch example SYMMETRIC_MEMORY env to 1 The non-symmetric rsag_zero_copy path uses an incrementing tag in its context key, so cross-rank memory registration handshakes happen on every call rather than being cached. At single-host x 8 GPUs and sizes >= 512 KB this becomes the only candidate (since nvls_zero_copy is filtered out without symmetric memory) and degrades into apparent hang. Defaulting SYMMETRIC_MEMORY=1 lets a plain `mpirun ...` invocation work out of the box; users can still override with `SYMMETRIC_MEMORY=0` to exercise the non-symmetric path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 23:43:37 +00:00
Binyang Li	e8caab7c8e	Strip preflight validation blocks from NVLS pipeline allreduce kernels allreduce_nvls_block_pipeline.cu and allreduce_nvls_warp_pipeline.cu were carrying ~45 lines of per-call invariant-checking added during the MNNVL work. Restore main's simple defaulting pattern (just `if (==0) set defaults`); incorrect inputs will manifest as CUDA errors via the existing error-handling path. Also drop the unreachable `6 * ipcDomainNranks > NUM_SEMAPHORES` throw in the block_pipeline initialize (max ipcDomainNranks=72, NUM_SEMAPHORES=512), the now-unused `<mscclpp/errors.hpp>` include, and trim the verbose comments around `nBaseChannels_` sizing in both files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 23:04:41 +00:00
Binyang Li	639b80de7b	Tie AllreduceAllpairPacket maxBlockNum_ to MAX_IPC_DOMAIN_NRANKS - 1 The hard-coded 72 was off by one from what the comment claims is the minimum (MAX_IPC_DOMAIN_NRANKS - 1 = 71). Express the value via the constant so the relationship is self-documenting and any future change to MAX_IPC_DOMAIN_NRANKS propagates automatically. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 22:31:15 +00:00
Binyang Li	095cfff11d	Revert RSAG nBlocks default to 64 The 128-block default fires only when the caller passes nBlocks=0 (i.e. no tuning). Tuning explicitly drives nBlocks via the adapter, so the historical default of 64 is fine. Keep nChannelsPerConnection_=128 so the tuner can still request up to 128 blocks for MNNVL configs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 22:23:18 +00:00
Binyang Li	bde23ce38e	Revert verbose RSAG zero-copy comment; rename NRanksPerNode template param - Restore the original two-line note about the templated peer-loop unrolling instead of the multi-paragraph rationale block. - Rename the kernel template parameter from NRanksPerNode to NRanks. The IPC domain can span multiple physical hosts under MNNVL, so the 'PerNode' suffix is misleading; NRanks matches the runtime ipcDomainNranks parameter that drives template dispatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 22:16:08 +00:00
Binyang Li	f0c6ac081f	Fold validateIpcDomainSpansWorld into getIpcDomainNranks getIpcDomainNranks now performs the range / world-size / rank checks itself and throws on violation, so the separate validateIpcDomainSpansWorld helper is unnecessary. Update the 3 NVLS callsites (block_pipeline, warp_pipeline, nvls_zero_copy) to call getIpcDomainNranks directly. The non-NVLS callers also pick up the strict validation, which is fine because they are only invoked in single-host or multi-host MNNVL scenarios where worldSize == ipcDomainNranks (the NCCL adapter's multi-node path returns nullptr, falling back to NCCL/RCCL). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 21:49:48 +00:00
Binyang Li	307a471888	Shorten verbose comments and use THROW in validateIpcDomainSpansWorld - Collapse the duplicated 3-line warp-strided-load comment in 5 kernels (allgather_fullmesh, allreduce_fullmesh, allreduce_packet, allreduce_nvls_zero_copy, allreduce_nvls_warp_pipeline) into a single one-line 'Peer count may exceed WARP_SIZE on MNNVL.' note. - Drop the algName parameter from validateIpcDomainSpansWorld; switch its 3 throws to use the THROW logger macro (LogSubsys::ALGO), which already captures file/line/function. Update the 3 callsites (nvls_block_pipeline, nvls_warp_pipeline, nvls_zero_copy) and trim the Doxygen comment accordingly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 21:37:09 +00:00
Binyang Li	4a0d5b29d5	Simplify torch-integration tuning example - Drop the multi_host_mnnvl-specific rsag fallback in _default_ar_config; fall through to default_allreduce_packet when NVLS is unavailable. - Add SYMMETRIC_MEMORY env var so the tuning sweep can include the zero-copy NVLS / RSAG candidates without editing the source. - Make _algo() raise on miss (direct dict lookup) and drop the defensive 'if a:' guards in _ar_candidates / _ag_candidates / _default_ar_config; merge existence checks into the platform conditions (self._nvls, self.symmetric_memory). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 21:14:36 +00:00
Binyang Li	905b23d9a8	Drop non-MNNVL multi_node regime from torch-integration example The example is now MNNVL-only: a run is either single-host (everything fits in one node) or multi-host MNNVL (one cross-host NVLink domain). Plain multi-node-without-MNNVL had its own algorithm branch that this example will never exercise, so remove the multi_node flag and the intermediate mnnvl_domain variable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 19:00:22 +00:00
Binyang Li	9aeeaf0f12	Simplify torch-integration tuning example for MPI-only multi-node testing Use mpi4py for bootstrap and local-rank discovery; drop the torchrun / gloo / manual MSCCLPP_MASTER_ADDR paths and the netifaces dependency. Add MNNVL/multi-node algorithm selection (rsag, rsag_zero_copy, nvls_zero_copy) and route barrier / timing-sync allreduces through the configured symmetric_memory flag so they work across hosts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-06 18:51:29 +00:00
Mahdieh Ghazi	822fbb2351	Adding necessary macros for enabling mrc support (#797) This PR adds necessary macros and instructions for enabling mrc support with no atomic.	2026-05-05 17:17:41 -04:00
Binyang Li	6296803d87	Make NVLS non-zero-copy allreduce algorithms MNNVL-ready Both default_allreduce_nvls_warp_pipeline and default_allreduce_nvls_block_pipeline were only partially MNNVL-aware: their kernels had been updated to use ipcDomainNranks (with shared-memory channel arrays sized for the global NVLink-domain bound), but the host-side context init still hard-coded ctx->ipcDomainNranks = bootstrap->getNranksPerNode(). On a fully populated MNNVL fabric (e.g. NVL72 where world == ipcDomainNranks but the per-physical-host nranksPerNode is much smaller), this mismatched the multicast group span and produced wrong/missing data plus out-of-bounds scratch indexing. Changes: - Rename MAX_NRANKS_PER_NODE -> MAX_IPC_DOMAIN_NRANKS to match the rest of the IPC-domain naming (getIpcDomainNranks, ipcDomainNranks, MSCCLPP_IPC_DOMAIN_NRANKS env var). Pure rename, no semantic change. - Add validateIpcDomainSpansWorld(comm, algName) helper in collective_utils that wraps getIpcDomainNranks() and asserts the IPC-domain == whole-comm invariant required by NVLS algorithms (worldSize == ipcDomainNranks, rank < ipcDomainNranks, ipcDomainNranks in [2, MAX_IPC_DOMAIN_NRANKS]), throwing Error(InvalidUsage) on violation and returning the validated value. - nvls_zero_copy / nvls_block_pipeline / nvls_warp_pipeline initialize() each now call the helper instead of repeating the same ~20-line check inline. - initAllreduceContext() in both pipelines now uses getIpcDomainNranks(comm) instead of bootstrap->getNranksPerNode(). - Per-peer base channel allocation (nBaseChannels_) is sized in initialize() as max(64, 4*ipc) for block pipeline and max(64, 8*ipc) for warp pipeline so the kernel's per-block channel addressing remains in-bounds at NVL72 scale. - Block pipeline initialize() also asserts 6*ipcDomainNranks <= NUM_SEMAPHORES. 
- allreduceKernelFunc() in both pipelines now validates launch shape and the user-supplied scratch buffer size before launching, returning CommInvalidArgument with a clear WARN on mismatch: - Block: nBlocks must equal 5*ipcDomainNranks (structurally required by the kernel's three-phase block partition), nThreads == 1024, inputSize aligned to (ipc * 16) bytes, scratchSizePerBlock >= unitSize. - Warp: nBlocks >= NUM_NVLS_CONNECTION and a multiple of it (kernel does nBlocks / NUM_NVLS_CONNECTION partitioning of the multicast handles), 2*nBlocks <= nBaseChannels_, nThreads == 1024 (32 warps hard-coded in the bar.sync member counts), inputSize divisible by ipcDomainNranks, scratchSizePerBlock >= copyPerIter. - Default nBlocks for warp pipeline is rounded up to a multiple of NUM_NVLS_CONNECTION so the structural constraint holds for any ipcDomainNranks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-05 04:41:14 +00:00
Binyang Li	987f80025a	Merge remote-tracking branch 'origin/main' into binyli/mnnvl	2026-05-04 23:53:25 +00:00
Binyang Li	9ec26fa4d1	Reset GPU tokens before reuse (#795) Fixes a token-reuse bug in `TokenPool` that's independent of MNNVL. ## Bug `TokenPool` hands out 8-byte device-memory slots used as device-semaphore counters. The deleter only cleared the bitmap — the underlying GPU memory was left as-is. When a token was freed and later re-allocated, the new semaphore inherited the previous counter value instead of starting at 0, breaking subsequent `signal()/wait()` math. ## Fix * Add a synchronous `gpuMemset` host helper (mirrors `gpuMemcpy` / `gpuMemcpyAsync`). * Zero the slot inside the `TokenPool` deleter so recycled tokens hand out a clean counter. The very-first allocation is already zeroed by `gpuCallocPhysical` (`src/core/gpu_utils.cc:227-228`), so first-time tokens are also clean — the deleter only has to handle the recycle case. ## Notes * Public wrapper is named `mscclpp::gpuMemset` (not `mscclpp::memset`) for symmetry with `gpuMemcpy` and to avoid shadowing `std::memset` in TUs that pull the namespace in. * Zeroing happens on release rather than acquire so the cost is paid in the typically less perf-sensitive teardown path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-04 15:11:47 -07:00
Binyang Li	9a36884369	Rename gpuMemset wrapper and zero TokenPool slots in deleter Two follow-ups to commit `7bc5e040`: * Rename mscclpp::memset to mscclpp::gpuMemset for symmetry with gpuMemcpy / gpuMemcpyAsync, and avoid shadowing std::memset for callers that pull the namespace in. Also add the missing doc comment. * Move the per-slot zeroing from getToken() into the deleter so the cost is paid on release rather than acquire. This is safe because gpuCallocPhysical already zeros the underlying buffer at TokenPool construction, so first-time tokens are clean and recycled tokens are scrubbed on release. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-02 04:28:36 +00:00
Binyang Li	7bc5e0406b	Reset GPU tokens before reuse Clear recycled TokenPool entries before handing them out so device-to-device semaphores start from a clean counter value. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-02 03:19:31 +00:00
Binyang Li	1c29817566	Revert AllreduceRsAgZeroCopy non-symmetric ctx key tag back to ++tag Commit `533f3299` dropped the static tag counter from generateAllreduceContextKey, causing every non-symmetric call to return the same key (zero) and reuse a stale context. Restore the pre-MNNVL behavior of returning a unique key per non-symmetric call so the context cache rebuilds when buffers change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-01 23:40:11 +00:00
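
The token-reuse fix described in commit 9ec26fa4d1 (a recycled 8-byte slot must be zeroed so the next semaphore starts at 0) can be illustrated with a small host-only sketch. This is not the MSCCL++ implementation: `SketchTokenPool` and its methods are hypothetical names, and a `bytearray` stands in for device memory. It only shows why zeroing on release gives a clean counter on the next acquire.

```python
class SketchTokenPool:
    """Host-only stand-in for a pool of 8-byte counter slots (hypothetical API)."""

    SLOT_BYTES = 8

    def __init__(self, n_slots: int) -> None:
        # bytearray() is zero-filled, standing in for a zeroed initial allocation.
        self.memory = bytearray(n_slots * self.SLOT_BYTES)
        self.in_use = [False] * n_slots  # allocation bitmap

    def acquire(self) -> int:
        """Mark the first free slot as used and return its index."""
        for slot, used in enumerate(self.in_use):
            if not used:
                self.in_use[slot] = True
                return slot
        raise RuntimeError("token pool exhausted")

    def read(self, slot: int) -> int:
        """Read the slot's counter value."""
        start = slot * self.SLOT_BYTES
        return int.from_bytes(self.memory[start:start + self.SLOT_BYTES], "little")

    def add(self, slot: int, delta: int) -> None:
        """Increment the slot's counter, as a signal() would."""
        start = slot * self.SLOT_BYTES
        value = self.read(slot) + delta
        self.memory[start:start + self.SLOT_BYTES] = value.to_bytes(self.SLOT_BYTES, "little")

    def release(self, slot: int, zero_on_release: bool = True) -> None:
        """Return the slot to the pool; optionally scrub its counter first."""
        if zero_on_release:
            start = slot * self.SLOT_BYTES
            self.memory[start:start + self.SLOT_BYTES] = bytes(self.SLOT_BYTES)
        self.in_use[slot] = False


def counter_after_recycle(zero_on_release: bool) -> int:
    """Use a slot, release it, re-acquire it, and report the counter seen."""
    pool = SketchTokenPool(n_slots=2)
    slot = pool.acquire()
    pool.add(slot, 5)
    pool.release(slot, zero_on_release=zero_on_release)
    return pool.read(pool.acquire())
```

With `zero_on_release=False` the re-acquired slot still carries the old counter (the bug the commit describes); with the default it starts from zero. Scrubbing in `release` rather than `acquire` mirrors the commit's note that the cost is paid on the teardown path.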

994 Commits