ep: document internode HT validation on 2x H100x8

Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).

- src/ext/ep/README.md: consolidate the previously duplicated README
  into a single document; mark intranode and internode HT dispatch and
  combine as validated in the status table; add a 'Running the tests'
  section with torchrun examples for both the intranode and the 2x8
  internode setups; record the dispatch->combine
  torch.cuda.synchronize() + dist.barrier() requirement under Known
  limitations; mark Phase 2 DONE and keep Phase 3 (LL) marked as a
  structural, untested port.
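The Known-limitations entry recorded above is a host-side ordering rule: after internode HT dispatch, each rank must drain its CUDA stream and then pass a process-group barrier before calling combine. A minimal sketch of that pattern (the helper name is hypothetical, not part of the mscclpp_ep API; `torch_mod`/`dist_mod` stand in for `torch` and `torch.distributed` and are injected so the sketch runs without a CUDA build):

```python
def barrier_between_dispatch_and_combine(torch_mod=None, dist_mod=None):
    """Enforce the dispatch -> combine ordering from Known limitations.

    (1) torch.cuda.synchronize() drains this rank's stream so dispatched
        payloads are fully written before any rank reads them back;
    (2) dist.barrier() keeps every rank from entering combine early.

    Both arguments may be None so the sketch is importable without CUDA;
    in real use pass `torch` and `torch.distributed`.
    """
    if torch_mod is not None and torch_mod.cuda.is_available():
        torch_mod.cuda.synchronize()  # flush dispatch kernels on this rank
    if dist_mod is not None and dist_mod.is_initialized():
        dist_mod.barrier()  # align all ranks before combine starts
    return True

# Illustrative call site between the two collectives:
#   buffer.dispatch(...)
#   barrier_between_dispatch_and_combine(torch, torch.distributed)
#   buffer.combine(...)
```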

- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
  Buffer constructor docstring to say internode HT is validated and
  clarify that LL mode is untested on multi-node hardware.

- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported'
  and 'low-latency paths still stubbed' comments. mscclpp_ep does not
  use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
  paths are a structural port that is present but untested, not stubbed.
  Note validation on 2x H100x8 in the internode section header.
Commit 9e96bf3b5d (parent c351b871a1) by Qinghua Zhou, 2026-04-22 03:56:16 +00:00.
3 changed files with 103 additions and 159 deletions.

@@ -11,11 +11,14 @@ DeepEP users can port with minimal changes.
 Current status (see ``src/ext/ep/README.md``):
-* Intranode (NVLink-only) dispatch and combine are fully ported.
-* ``get_dispatch_layout`` is ported.
-* Internode HT (MSCCL++ PortChannel + MemoryChannel) is ported.
-* Internode low-latency kernels are ported structurally (NVSHMEM/IBGDA ->
-  MSCCL++ PortChannel) but **untested on multi-node H100**.
+* Intranode (NVLink-only) dispatch and combine: ported and validated on
+  one node with 8 GPUs.
+* ``get_dispatch_layout``: ported.
+* Internode HT (MSCCL++ PortChannel + MemoryChannel) dispatch and combine:
+  ported and validated on 2 nodes x 8 H100 GPUs with
+  ``test/python/ext/ep/test_internode_multirank.py``.
+* Internode low-latency kernels: structural port (NVSHMEM/IBGDA ->
+  MSCCL++ PortChannel), **untested on multi-node H100**.
 """
 from __future__ import annotations
@@ -54,7 +57,7 @@ class Buffer:
         low-latency modes.
     low_latency_mode:
         Enable the low-latency dispatch/combine path (structural port,
-        untested).
+        untested on multi-node hardware).
     num_qps_per_rank:
         Ignored for intranode mode.
     """