mscclpp/examples
Binyang Li 2efda4d819 Restore compile-time templated NRanksPerNode for rsag_zero_copy
Recovers the per-thread int4 register array + #pragma unroll for the
{4, 8} rank cases. All NPeers remote reads are issued in parallel so
their latency overlaps instead of being serialized by the runtime
fused load+reduce loop. The runtime-domain (NVL72) fallback is
removed; the algorithm now returns cudaErrorInvalidValue for unsupported
ipcDomainNranks values, and rsag_zero_copy is dropped from the MNNVL
candidate list in the tuning example.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 23:09:22 +00:00
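
The pattern the commit message describes can be illustrated with a host-side C++ sketch. This is not the mscclpp kernel: the names `Status`, `reduceFixed`, and `reduceDispatch` are hypothetical, plain `int` elements stand in for the per-thread int4 registers, and `Status::InvalidValue` stands in for `cudaErrorInvalidValue` so the sketch builds without the CUDA toolkit. It only shows why a compile-time rank count helps: the staging array has a fixed size, every peer read is issued before any is consumed, and unsupported rank counts are rejected instead of falling back to a runtime loop.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cudaSuccess / cudaErrorInvalidValue.
enum class Status { Ok, InvalidValue };

// NRanksPerNode is a template parameter, so `staged` has a fixed size and both
// inner loops have constant trip counts that a compiler can fully unroll
// (the real kernel uses #pragma unroll over a register array).
template <int NRanksPerNode>
void reduceFixed(const std::vector<const int*>& rankBufs, int myRank,
                 int* out, std::size_t n) {
  constexpr int NPeers = NRanksPerNode - 1;
  for (std::size_t i = 0; i < n; ++i) {
    std::array<int, NPeers> staged;
    // Phase 1: issue all NPeers remote reads before consuming any of them,
    // so on a GPU their latencies can overlap.
    for (int p = 0; p < NPeers; ++p) {
      const int peer = (myRank + 1 + p) % NRanksPerNode;
      staged[p] = rankBufs[peer][i];
    }
    // Phase 2: reduce the local element with the staged peer values.
    int acc = rankBufs[myRank][i];
    for (int p = 0; p < NPeers; ++p) {
      acc += staged[p];
    }
    out[i] = acc;
  }
}

// Maps the runtime rank count onto the supported compile-time instantiations.
// Anything other than 4 or 8 is rejected; there is no runtime-sized fallback.
Status reduceDispatch(int ipcDomainNranks,
                      const std::vector<const int*>& rankBufs, int myRank,
                      int* out, std::size_t n) {
  if (ipcDomainNranks != 4 && ipcDomainNranks != 8) return Status::InvalidValue;
  if (rankBufs.size() != static_cast<std::size_t>(ipcDomainNranks)) {
    return Status::InvalidValue;
  }
  if (myRank < 0 || myRank >= ipcDomainNranks) return Status::InvalidValue;
  switch (ipcDomainNranks) {
    case 4:
      reduceFixed<4>(rankBufs, myRank, out, n);
      return Status::Ok;
    default:  // only 8 remains after the check above
      reduceFixed<8>(rankBufs, myRank, out, n);
      return Status::Ok;
  }
}
```

In this sketch, calling `reduceDispatch` with a rank count of 4 or 8 sums the corresponding elements of every rank's buffer into `out`, while a count such as 72 returns `Status::InvalidValue`, mirroring the removal of the NVL72 fallback.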