ext/ep: unfilter LL sync + add LL multirank test (intra-node WIP)

- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
  DeepEP's original filter was safe because its LL path went through NVSHMEM;
  this port drives LL via MSCCL++ PortChannel, so the kernel indexes
  port_channel_handles[local_expert*num_ranks + dst_rank] for every
  dst_rank. After this change, every peer gets a real memory, connection,
  semaphore, and port-channel entry.
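The indexing above can be sketched with a toy model (the names and shapes here are illustrative, not the actual src/ext/ep layout). The point is that the kernel reads an entry for every dst_rank, so sync() must populate all of them; filtering out some peers would leave holes in the flattened table.

```python
# Toy model of the flattened per-expert, per-rank handle table the LL
# kernel indexes. Hypothetical names; only the indexing scheme matches
# the commit message.
num_ranks = 4
num_local_experts = 2

# One entry per (local_expert, dst_rank) pair, flattened row-major.
port_channel_handles = [
    (e, r) for e in range(num_local_experts) for r in range(num_ranks)
]

def handle_for(local_expert: int, dst_rank: int):
    """Mirror the kernel's indexing: local_expert * num_ranks + dst_rank."""
    return port_channel_handles[local_expert * num_ranks + dst_rank]
```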
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
  functional round-trip, BF16 only). Works cross-node in DeepEP's
  1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
  intra-node 8-GPU LL currently hangs because every peer transfer routes
  through the CPU proxy over IB loopback between distinct HCAs on the same
  host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op which
  mis-aligns the 32-bit rdma_recv_count slots when used for same-node
  peers. Proper fix needs a mixed-transport LL variant (MemoryChannel for
  same-node, PortChannel for cross-node) or 64-bit counters.
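The counter-alignment problem can be illustrated with a minimal Python sketch. The atomic_add_u64 helper below is hypothetical, standing in for the device-side 64-bit add, and assumes the 32-bit counter slots are packed contiguously, as the limitation above implies: a 64-bit add requires 8-byte alignment, so odd-numbered uint32 slots (at byte offset 4 mod 8) cannot be targeted at all, and an aligned add aimed at an even slot rewrites its neighbor's bytes too.

```python
import struct

# Hypothetical sketch (not the real CudaIpcConnection code) of why a
# 64-bit atomic add is a bad fit for packed 32-bit counter slots.
slots = bytearray(4 * 8)  # eight zeroed uint32 slots, packed contiguously

def atomic_add_u64(buf: bytearray, byte_offset: int, value: int) -> None:
    """Model of a 64-bit atomic add: needs 8-byte alignment and
    touches a full 8-byte window."""
    if byte_offset % 8 != 0:
        # Odd-numbered uint32 slots sit at byte offset 4 mod 8.
        raise ValueError("misaligned 64-bit atomic")
    (old,) = struct.unpack_from("<Q", buf, byte_offset)
    struct.pack_into("<Q", buf, byte_offset, (old + value) % (1 << 64))

# An aligned 64-bit add aimed at slot 0 also rewrites slot 1's bytes:
atomic_add_u64(slots, 0 * 4, 1 << 32)
clobbered_slot1 = struct.unpack_from("<I", slots, 4)[0]

# And slot 1 itself cannot be targeted at all:
try:
    atomic_add_u64(slots, 1 * 4, 1)
    odd_slot_ok = True
except ValueError:
    odd_slot_ok = False
```

Widening the counters to 64 bits (or using a 32-bit atomic for same-node peers) removes both failure modes, which is what the proposed fix amounts to.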
Author: Qinghua Zhou
Date: 2026-04-22 06:11:30 +00:00
Parent: 9e96bf3b5d
Commit: f0a72263c8
4 changed files with 253 additions and 20 deletions


@@ -56,8 +56,10 @@ class Buffer:
         Size of the RDMA scratch buffer. Required (>0) for internode HT and
         low-latency modes.
     low_latency_mode:
-        Enable the low-latency dispatch/combine path (structural port,
-        untested on multi-node hardware).
+        Enable the low-latency dispatch/combine path. This mode uses only
+        the RDMA buffer (``num_rdma_bytes``) and drives every peer through
+        MSCCL++ ``PortChannel``; consequently, it works cross-node with any
+        topology but is still pending H100 hardware validation.
     num_qps_per_rank:
         Ignored for intranode mode.
     """