Rename native collective context workSize to worldSize and use nRanksPerIpcDomain for allpair peer topology. Include the staged DSL signal/wait pairing validation changes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix the correctness issue for allreduce_allpair_packet algo. Make sure
no overwrite for input buffer. Use same tb for send/reduce/write-back.
- Check if nBlocks/nthreads validate for packet algorithm.
- Add more logs
- Modify flag update logic, make it work for the case: nthreadPerNBlock
< nflags
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- setupNvlsChannels now takes the Communicator and barriers internally
after binding all switch channels, replacing the explicit
bootstrap()->barrier() previously done only in AllreduceNvlsPacket.
- Demote nRanksPerIpcDomain_ / nBaseChannels_ to locals in
AllreduceNvlsBlockPipeline and AllreduceNvlsWarpPipeline; they were
never read outside initialize().
- Drive-by: pick up in-tree edits to switch_channel_device.hpp,
executor.cc, communicator.hpp, and allreduce_rsag.cu.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Avoid probing invalid packet allreduce configurations and reduce the default tuning sweep so the 8-rank tuning example completes reliably.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- Release the reference after last requests are ready.
- Keep ordered receive chaining for repeated rank/tag operations while
cleaning up completed receive bookkeeping.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a `PortChannel` requests `flush()`, the host-side proxy was being
blocked, which may cause head-of-line blocking of other parallel
`PortChannel`s' requests. Now the proxy handles `flush()` requests
asynchronously. This feature especially helps performance when we need
multiple IB QPs and need to flush QPs.
This pull request adds support for the `bfloat16` (bf16) data type to
the test executor, including both Python and CUDA components. The
changes ensure that `bfloat16` is handled consistently across argument
parsing, data type conversion, and test kernel implementations.
Additionally, the CUDA verification kernels are refactored to use
parameterized tolerances for improved numerical accuracy checks.
**Support for bfloat16 data type:**
* Added handling for `bfloat16`/`bf16` in the Python test executor's
argument parsing, data type conversion (`parse_dtype`,
`dtype_to_mscclpp_dtype`), and help text.
[[1]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcR27-R28)
[[2]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL122-R135)
[[3]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL246-R251)
* Updated output to display the correct data type string for `bfloat16`.
**CUDA kernel and test improvements:**
* Included `bfloat16` headers and defined test data fill and gather
kernels for `bfloat16` on both CUDA and HIP platforms.
[[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R8-R11)
[[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R35)
[[3]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R54-R59)
[[4]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R133)
* Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to
use an explicit tolerance parameter (`Eps`) and added correct tolerances
for each data type, including `bfloat16`.
[[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L69-R85)
[[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L94-R113)
These changes ensure full support for `bfloat16` in the test executor
and improve the accuracy and maintainability of the CUDA test kernels.
---------
Co-authored-by: Caio Rocha <caiorocha@microsof.com>
Erase completed receive bookkeeping from the communicator once the deferred receive future finishes, while preserving ordered receive chaining for repeated rank/tag operations.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
multimem.ld_reduce on FP8 inputs accumulates in FP32 by default. The
ISA also exposes an .acc::f16 variant that keeps the reduction in
FP16, which is faster but lower precision. Plumb AccumT through:
- include/mscclpp/switch_channel_device.hpp:
Extend SwitchChannelDeviceHandle::multimemLoadReduce with an optional
AccumT template parameter. When VectorType is one of the FP8 vector
types (f8_e4m3x{4,8,16} / f8_e5m2x{4,8,16}) and AccumT is __half,
emit the .acc::f16 form of the instruction; otherwise unchanged.
- src/ext/collectives/include/allreduce/common.hpp:
Make handleMultiLoadReduceStore template on AccumT and forward it to
multimemLoadReduce<vectorType, AccumT>(...).
- src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:
Template allreduceNvls and NvlsAdapter on AccumT and forward to
handleMultiLoadReduceStore<T, AccumT>; the existing dispatch<>
machinery already plumbs AccumT through from the algorithm context.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The non-symmetric rsag_zero_copy path uses an incrementing tag in its
context key, so cross-rank memory registration handshakes happen on
every call rather than being cached. At single-host x 8 GPUs and
sizes >= 512 KB this becomes the only candidate (since nvls_zero_copy
is filtered out without symmetric memory) and degrades into apparent
hang. Defaulting SYMMETRIC_MEMORY=1 lets a plain `mpirun ...`
invocation work out of the box; users can still override with
`SYMMETRIC_MEMORY=0` to exercise the non-symmetric path.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
allreduce_nvls_block_pipeline.cu and allreduce_nvls_warp_pipeline.cu
were carrying ~45 lines of per-call invariant-checking added during the
MNNVL work. Restore main's simple defaulting pattern (just `if
(==0) set defaults`); incorrect inputs will manifest as CUDA errors via
the existing error-handling path. Also drop the unreachable
`6 * ipcDomainNranks > NUM_SEMAPHORES` throw in the block_pipeline
initialize (max ipcDomainNranks=72, NUM_SEMAPHORES=512), the now-unused
`<mscclpp/errors.hpp>` include, and trim the verbose comments around
`nBaseChannels_` sizing in both files.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The hard-coded 72 was off by one from what the comment claims is the
minimum (MAX_IPC_DOMAIN_NRANKS - 1 = 71). Express the value via the
constant so the relationship is self-documenting and any future change
to MAX_IPC_DOMAIN_NRANKS propagates automatically.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 128-block default fires only when the caller passes nBlocks=0
(i.e. no tuning). Tuning explicitly drives nBlocks via the adapter, so
the historical default of 64 is fine. Keep nChannelsPerConnection_=128
so the tuner can still request up to 128 blocks for MNNVL configs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore the original two-line note about the templated peer-loop
unrolling instead of the multi-paragraph rationale block.
- Rename the kernel template parameter from NRanksPerNode to NRanks.
The IPC domain can span multiple physical hosts under MNNVL, so the
'PerNode' suffix is misleading; NRanks matches the runtime
ipcDomainNranks parameter that drives template dispatch.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
getIpcDomainNranks now performs the range / world-size / rank checks
itself and throws on violation, so the separate
validateIpcDomainSpansWorld helper is unnecessary. Update the 3 NVLS
callsites (block_pipeline, warp_pipeline, nvls_zero_copy) to call
getIpcDomainNranks directly. The non-NVLS callers also pick up the
strict validation, which is fine because they are only invoked in
single-host or multi-host MNNVL scenarios where worldSize ==
ipcDomainNranks (the NCCL adapter's multi-node path returns nullptr,
falling back to NCCL/RCCL).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Collapse the duplicated 3-line warp-strided-load comment in 5 kernels
(allgather_fullmesh, allreduce_fullmesh, allreduce_packet,
allreduce_nvls_zero_copy, allreduce_nvls_warp_pipeline) into a single
one-line 'Peer count may exceed WARP_SIZE on MNNVL.' note.
- Drop the algName parameter from validateIpcDomainSpansWorld; switch
its 3 throws to use the THROW logger macro (LogSubsys::ALGO), which
already captures file/line/function. Update the 3 callsites
(nvls_block_pipeline, nvls_warp_pipeline, nvls_zero_copy) and trim the
Doxygen comment accordingly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Drop the multi_host_mnnvl-specific rsag fallback in _default_ar_config;
fall through to default_allreduce_packet when NVLS is unavailable.
- Add SYMMETRIC_MEMORY env var so the tuning sweep can include the
zero-copy NVLS / RSAG candidates without editing the source.
- Make _algo() raise on miss (direct dict lookup) and drop the
defensive 'if a:' guards in _ar_candidates / _ag_candidates /
_default_ar_config; merge existence checks into the platform
conditions (self._nvls, self.symmetric_memory).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The example is now MNNVL-only: a run is either single-host (everything
fits in one node) or multi-host MNNVL (one cross-host NVLink domain).
Plain multi-node-without-MNNVL had its own algorithm branch that this
example will never exercise, so remove the multi_node flag and the
intermediate mnnvl_domain variable.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use mpi4py for bootstrap and local-rank discovery; drop the torchrun /
gloo / manual MSCCLPP_MASTER_ADDR paths and the netifaces dependency.
Add MNNVL/multi-node algorithm selection (rsag, rsag_zero_copy,
nvls_zero_copy) and route barrier / timing-sync allreduces through the
configured symmetric_memory flag so they work across hosts.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both default_allreduce_nvls_warp_pipeline and default_allreduce_nvls_block_pipeline
were only partially MNNVL-aware: their kernels had been updated to use
ipcDomainNranks (with shared-memory channel arrays sized for the global
NVLink-domain bound), but the host-side context init still hard-coded
ctx->ipcDomainNranks = bootstrap->getNranksPerNode(). On a fully populated MNNVL
fabric (e.g. NVL72 where world == ipcDomainNranks but the per-physical-host
nranksPerNode is much smaller), this mismatched the multicast group span and
produced wrong/missing data plus out-of-bounds scratch indexing.
Changes:
- Rename MAX_NRANKS_PER_NODE -> MAX_IPC_DOMAIN_NRANKS to match the rest of the
IPC-domain naming (getIpcDomainNranks, ipcDomainNranks,
MSCCLPP_IPC_DOMAIN_NRANKS env var). Pure rename, no semantic change.
- Add validateIpcDomainSpansWorld(comm, algName) helper in collective_utils
that wraps getIpcDomainNranks() and asserts the IPC-domain == whole-comm
invariant required by NVLS algorithms (worldSize == ipcDomainNranks,
rank < ipcDomainNranks, ipcDomainNranks in [2, MAX_IPC_DOMAIN_NRANKS]),
throwing Error(InvalidUsage) on violation and returning the validated value.
- nvls_zero_copy / nvls_block_pipeline / nvls_warp_pipeline initialize() each
now call the helper instead of repeating the same ~20-line check inline.
- initAllreduceContext() in both pipelines now uses getIpcDomainNranks(comm)
instead of bootstrap->getNranksPerNode().
- Per-peer base channel allocation (nBaseChannels_) is sized in initialize() as
max(64, 4*ipc) for block pipeline and max(64, 8*ipc) for warp pipeline so
the kernel's per-block channel addressing remains in-bounds at NVL72 scale.
- Block pipeline initialize() also asserts 6*ipcDomainNranks <= NUM_SEMAPHORES.
- allreduceKernelFunc() in both pipelines now validates launch shape and the
user-supplied scratch buffer size before launching, returning
CommInvalidArgument with a clear WARN on mismatch:
- Block: nBlocks must equal 5*ipcDomainNranks (structurally required by the
kernel's three-phase block partition), nThreads == 1024, inputSize aligned
to (ipc * 16) bytes, scratchSizePerBlock >= unitSize.
- Warp: nBlocks >= NUM_NVLS_CONNECTION and a multiple of it (kernel does
nBlocks / NUM_NVLS_CONNECTION partitioning of the multicast handles),
2*nBlocks <= nBaseChannels_, nThreads == 1024 (32 warps hard-coded in the
bar.sync member counts), inputSize divisible by ipcDomainNranks,
scratchSizePerBlock >= copyPerIter.
- Default nBlocks for warp pipeline is rounded up to a multiple of
NUM_NVLS_CONNECTION so the structural constraint holds for any
ipcDomainNranks.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes a token-reuse bug in `TokenPool` that's independent of MNNVL.
## Bug
`TokenPool` hands out 8-byte device-memory slots used as
device-semaphore counters. The deleter only cleared the bitmap — the
underlying GPU memory was left as-is. When a token was freed and later
re-allocated, the new semaphore inherited the previous counter value
instead of starting at 0, breaking subsequent `signal()/wait()` math.
## Fix
* Add a synchronous `gpuMemset` host helper (mirrors `gpuMemcpy` /
`gpuMemcpyAsync`).
* Zero the slot inside the `TokenPool` deleter so recycled tokens hand
out a clean counter. The very-first allocation is already zeroed by
`gpuCallocPhysical` (`src/core/gpu_utils.cc:227-228`), so first-time
tokens are also clean — the deleter only has to handle the recycle case.
## Notes
* Public wrapper is named `mscclpp::gpuMemset` (not `mscclpp::memset`)
for symmetry with `gpuMemcpy` and to avoid shadowing `std::memset` in
TUs that pull the namespace in.
* Zeroing happens on release rather than acquire so the cost is paid in
the typically less perf-sensitive teardown path.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two follow-ups to commit 7bc5e040:
* Rename mscclpp::memset to mscclpp::gpuMemset for symmetry with
gpuMemcpy / gpuMemcpyAsync, and avoid shadowing std::memset for
callers that pull the namespace in. Also add the missing doc
comment.
* Move the per-slot zeroing from getToken() into the deleter so the
cost is paid on release rather than acquire. This is safe because
gpuCallocPhysical already zeros the underlying buffer at TokenPool
construction, so first-time tokens are clean and recycled tokens
are scrubbed on release.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clear recycled TokenPool entries before handing them out so device-to-device semaphores start from a clean counter value.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Commit 533f3299 dropped the static tag counter from
generateAllreduceContextKey, causing every non-symmetric call to
return the same key (zero) and reuse a stale context. Restore the
pre-MNNVL behavior of returning a unique key per non-symmetric call
so the context cache rebuilds when buffers change.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>