ep: document internode HT validation on 2x H100x8

Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).

- src/ext/ep/README.md: consolidate the previously duplicated README
  into a single document; mark intranode and internode HT dispatch and
  combine as validated in the status table; add a 'Running the tests'
  section with torchrun examples for both the intranode and the 2x8
  internode setups; record the dispatch->combine
  torch.cuda.synchronize() + dist.barrier() requirement under Known
  limitations; mark Phase 2 DONE and keep Phase 3 (LL) marked as a
  structural, untested port.
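The Known-limitations entry recorded above is a host-side ordering rule: after internode HT dispatch, each rank must drain its CUDA stream and then pass a process-group barrier before calling combine. A minimal sketch of that pattern (the helper name is hypothetical, not part of the mscclpp_ep API; `torch_mod`/`dist_mod` stand in for `torch` and `torch.distributed` and are injected so the sketch runs without a CUDA build):

```python
def barrier_between_dispatch_and_combine(torch_mod=None, dist_mod=None):
    """Enforce the dispatch -> combine ordering from Known limitations.

    (1) torch.cuda.synchronize() drains this rank's stream so dispatched
        payloads are fully written before any rank reads them back;
    (2) dist.barrier() keeps every rank from entering combine early.

    Both arguments may be None so the sketch is importable without CUDA;
    in real use pass `torch` and `torch.distributed`.
    """
    if torch_mod is not None and torch_mod.cuda.is_available():
        torch_mod.cuda.synchronize()  # flush dispatch kernels on this rank
    if dist_mod is not None and dist_mod.is_initialized():
        dist_mod.barrier()  # align all ranks before combine starts
    return True

# Illustrative call site between the two collectives:
#   buffer.dispatch(...)
#   barrier_between_dispatch_and_combine(torch, torch.distributed)
#   buffer.combine(...)
```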

- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
  Buffer constructor docstring to say internode HT is validated and
  clarify that LL mode is untested on multi-node hardware.

- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported'
  and 'low-latency paths still stubbed' comments. mscclpp_ep does not
  use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
  paths are a structural port that is present but untested, not stubbed.
  Note validation on 2x H100x8 in the internode section header.
Commit 9e96bf3b5d (parent c351b871a1) by Qinghua Zhou, 2026-04-22 03:56:16 +00:00.
3 changed files with 103 additions and 159 deletions.

@@ -11,11 +11,14 @@ DeepEP users can port with minimal changes.
 Current status (see ``src/ext/ep/README.md``):
-* Intranode (NVLink-only) dispatch and combine are fully ported.
-* ``get_dispatch_layout`` is ported.
-* Internode HT (MSCCL++ PortChannel + MemoryChannel) is ported.
-* Internode low-latency kernels are ported structurally (NVSHMEM/IBGDA ->
-  MSCCL++ PortChannel) but **untested on multi-node H100**.
+* Intranode (NVLink-only) dispatch and combine: ported and validated on
+  one node with 8 GPUs.
+* ``get_dispatch_layout``: ported.
+* Internode HT (MSCCL++ PortChannel + MemoryChannel) dispatch and combine:
+  ported and validated on 2 nodes x 8 H100 GPUs with
+  ``test/python/ext/ep/test_internode_multirank.py``.
+* Internode low-latency kernels: structural port (NVSHMEM/IBGDA ->
+  MSCCL++ PortChannel), **untested on multi-node H100**.
 """
 from __future__ import annotations
@@ -54,7 +57,7 @@ class Buffer:
         low-latency modes.
     low_latency_mode:
         Enable the low-latency dispatch/combine path (structural port,
-        untested).
+        untested on multi-node hardware).
     num_qps_per_rank:
         Ignored for intranode mode.
     """