Qinghua Zhou 9e96bf3b5d ep: document internode HT validation on 2x H100x8
Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).

- src/ext/ep/README.md: consolidate the previously duplicated README
  into a single document; mark intranode and internode HT dispatch and
  combine as validated in the status table; add a 'Running the tests'
  section with torchrun examples for both the intranode and the 2x8
  internode setups; record the dispatch->combine
  torch.cuda.synchronize() + dist.barrier() requirement under Known
  limitations; mark Phase 2 DONE and keep Phase 3 (LL) as a structural
  port, untested.
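
  The new 'Running the tests' section uses launch lines along these
  lines (a sketch, not the README verbatim; exact flags, ports, and
  env-var names are assumptions, and the test path is the one named
  above):

  ```shell
  # Intranode: 8 ranks on a single H100x8 node.
  torchrun --nnodes=1 --nproc_per_node=8 \
      test/python/ext/ep/test_internode_multirank.py

  # Internode HT: 2 nodes x 8 GPUs = 16 ranks. Run once per node,
  # with NODE_RANK set to 0 or 1 and MASTER_ADDR pointing at node 0
  # (both variable names are placeholders for this sketch).
  torchrun --nnodes=2 --nproc_per_node=8 --node_rank=$NODE_RANK \
      --master_addr=$MASTER_ADDR --master_port=29500 \
      test/python/ext/ep/test_internode_multirank.py
  ```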

- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
  Buffer constructor docstring to say internode HT is validated and
  clarify that LL mode is untested on multi-node hardware.

- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported'
  and 'low-latency paths still stubbed' comments. mscclpp_ep does not
  use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
  paths are a structural port that is present but untested, not stubbed.
  Note validation on 2x H100x8 in the internode section header.
2026-04-22 03:56:16 +00:00