mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-12 01:10:22 +00:00
Recovers the per-thread int4 register array + #pragma unroll for the
{4, 8} rank cases. All NPeers remote reads are issued in parallel so
their latency overlaps instead of being serialized by the runtime
fused load+reduce loop. The runtime-domain (NVL72) fallback is
removed; the algo now returns cudaErrorInvalidValue for unsupported
ipcDomainNranks, and rsag_zero_copy is dropped from the MNNVL
candidate list in the tuning example.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>