mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 01:10:22 +00:00

Files

Binyang Li 2efda4d819 Restore compile-time templated NRanksPerNode for rsag_zero_copy

Recovers the per-thread int4 register array + #pragma unroll for the
{4, 8} rank cases. All NPeers remote reads are issued in parallel so
their latency overlaps instead of being serialized by the runtime
fused load+reduce loop. The runtime-domain (NVL72) fallback is
removed; the algo now returns cudaErrorInvalidValue for unsupported
ipcDomainNranks, and rsag_zero_copy is dropped from the MNNVL
candidate list in the tuning example.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-01 23:09:22 +00:00
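
The commit message above describes two ideas: a peer count fixed at compile time, so per-thread staging storage and loops can be fully unrolled, and an error return for unsupported rank counts in place of a runtime fallback. The host-only C++ sketch below illustrates that pattern under stated assumptions; it is not the mscclpp kernel. The names `Status`, `reduceFixed`, and `reduceDispatch` are hypothetical, plain `int` stands in for `int4`, and `Status::InvalidValue` stands in for `cudaErrorInvalidValue`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical status codes standing in for cudaSuccess / cudaErrorInvalidValue.
enum class Status { Ok, InvalidValue };

// NRanksPerNode is a template parameter, so NPeers is a compile-time constant:
// the staging array has a fixed size and both peer loops have a fixed trip
// count that a compiler can fully unroll.
template <int NRanksPerNode>
void reduceFixed(const std::vector<const int*>& peers, const int* local,
                 int* out, std::size_t n) {
  constexpr int NPeers = NRanksPerNode - 1;
  assert(peers.size() == static_cast<std::size_t>(NPeers));
  for (std::size_t i = 0; i < n; ++i) {
    std::array<int, NPeers> staged;
    // Phase 1: issue every peer load before touching any result, so the
    // loads are independent of each other.
    for (int p = 0; p < NPeers; ++p) staged[p] = peers[p][i];
    // Phase 2: reduce the staged values with the local contribution.
    int acc = local[i];
    for (int p = 0; p < NPeers; ++p) acc += staged[p];
    out[i] = acc;
  }
}

// Runtime-to-compile-time dispatch: only the {4, 8} cases are instantiated;
// anything else is rejected instead of falling back to a runtime-sized loop.
Status reduceDispatch(int nRanksPerNode, const std::vector<const int*>& peers,
                      const int* local, int* out, std::size_t n) {
  if (nRanksPerNode < 1 ||
      peers.size() != static_cast<std::size_t>(nRanksPerNode - 1)) {
    return Status::InvalidValue;
  }
  switch (nRanksPerNode) {
    case 4:
      reduceFixed<4>(peers, local, out, n);
      return Status::Ok;
    case 8:
      reduceFixed<8>(peers, local, out, n);
      return Status::Ok;
    default:
      return Status::InvalidValue;
  }
}
```

Separating the load phase from the reduce phase is what distinguishes this from a fused load+reduce loop, where each load feeds the accumulator before the next one is issued.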

core

Decouple IPC-domain hint from bootstrap nRanksPerNode

2026-05-01 18:27:17 +00:00

ext

Restore compile-time templated NRanksPerNode for rsag_zero_copy

2026-05-01 23:09:22 +00:00

.gitignore

[python] switch to setup.py to build package

2023-04-12 12:29:17 -07:00

CMakeLists.txt

Torch integration (#692)

2026-01-21 20:32:24 -08:00