Commit Graph

5 Commits

Author SHA1 Message Date
Qinghua Zhou
594dc79657 Address NVLS review feedback
Handle unsupported FP8 NVLS paths safely, tighten IPC-domain guards, align IPC-domain naming, and add IPC-domain fabric hash logging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-16 23:19:25 +00:00
Binyang Li
0744e806fc Detect IPC domain automatically
2026-05-16 00:39:49 +00:00
Binyang Li
45a651b2c8 Decouple IPC-domain hint from bootstrap nRanksPerNode
Replace MSCCLPP_MNNVL_NRANKS_PER_NODE (which overrode TcpBootstrap and
silently changed getNranksPerNode() for every consumer) with a single
algorithm-level helper getIpcDomainNranks(comm) backed by a new
MSCCLPP_IPC_DOMAIN_NRANKS env. The neutral IPC name covers both NVLink/
MNNVL on NV and XGMI on AMD. Bootstrap is unchanged and continues to
report physical-host detection.

Collapse the two getCollectiveDomainNranksPerNode overloads into one
canonical helper and route all six allreduce algos (packet,
allpair_packet, nvls_packet, nvls_zero_copy, rsag, rsag_zero_copy)
through it. Update the standalone tuning example to use the new env
name; drop the undeclared MSCCLPP_ENABLE_MNNVL gate; fix
multi_host_mnnvl detection now that nranks_per_node is no longer
overridden by the bootstrap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-01 18:27:17 +00:00
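
The env-backed helper described in 45a651b2c8 can be sketched roughly as follows. This is an illustration only, not the MSCCL++ implementation: the Python function name get_ipc_domain_nranks, its parameters, the fallback to the bootstrap's per-node count when the env is unset, and the validation rules are all assumptions. Only the env name MSCCLPP_IPC_DOMAIN_NRANKS and the idea that the bootstrap value stays untouched come from the commit message. Automatic detection (added later in 0744e806fc) is not modeled here.

```python
import os

# Env name taken from the commit message; everything else is hypothetical.
ENV_NAME = "MSCCLPP_IPC_DOMAIN_NRANKS"


def get_ipc_domain_nranks(world_size, nranks_per_node, env=None):
    """Return the number of ranks assumed to share one IPC domain.

    world_size      -- total number of ranks in the communicator
    nranks_per_node -- physical-host count reported by the bootstrap
    env             -- mapping to read the override from (defaults to os.environ)

    Assumed behavior: the override is honored only if it is a positive
    integer that evenly divides world_size; otherwise the physical
    per-node count is returned unchanged.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is None or raw.strip() == "":
        return nranks_per_node
    try:
        value = int(raw)
    except ValueError:
        # Unparseable override: fall back rather than fail (assumption).
        return nranks_per_node
    if value <= 0 or value > world_size or world_size % value != 0:
        return nranks_per_node
    return value
```

In this sketch the bootstrap's per-node count is only an input and is never modified, which mirrors the commit's stated goal of decoupling the IPC-domain hint from getNranksPerNode(). An algorithm would call the helper itself instead of relying on an overridden bootstrap value.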
Binyang Li
893a08e69c Enable MNNVL allreduce tuning
Add an MNNVL rank-domain override so MSCCL++ collectives can treat multi-host NVLink fabrics as a single CUDA IPC/NVLS peer group. Update packet, RSAG, and NVLS allreduce paths to use the collective domain size and teach the torch integration tuning example to select MNNVL-capable allreduce algorithms.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-28 05:38:59 +00:00
Binyang Li
a707273701 Torch integration (#692)
Reorganize the current native and DSL algorithm implementations.
Provide a unified API for DSL and native algorithms, plus an interface
for tuning them.
Provide an interface for PyTorch integration with the native API and the DSL.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
2026-01-21 20:32:24 -08:00