ext/ep: unfilter LL sync + add LL multirank test (intra-node WIP)

- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
  DeepEP's original filter was safe because its LL path went through NVSHMEM;
  this port drives LL via MSCCL++ PortChannel, so the kernel indexes
  port_channel_handles[local_expert*num_ranks + dst_rank] for every
  dst_rank. After this change, every peer gets a real memory, connection,
  semaphore, and port-channel entry.
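The indexing above can be sketched with a toy model (the names and shapes here are illustrative, not the actual src/ext/ep layout). The point is that the kernel reads an entry for every dst_rank, so sync() must populate all of them; filtering out some peers would leave holes in the flattened table.

```python
# Toy model of the flattened per-expert, per-rank handle table the LL
# kernel indexes. Hypothetical names; only the indexing scheme matches
# the commit message.
num_ranks = 4
num_local_experts = 2

# One entry per (local_expert, dst_rank) pair, flattened row-major.
port_channel_handles = [
    (e, r) for e in range(num_local_experts) for r in range(num_ranks)
]

def handle_for(local_expert: int, dst_rank: int):
    """Mirror the kernel's indexing: local_expert * num_ranks + dst_rank."""
    return port_channel_handles[local_expert * num_ranks + dst_rank]
```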
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
  functional round-trip, BF16 only). Works cross-node in DeepEP's
  1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
  intra-node 8-GPU LL currently hangs because every peer transfer routes
  through the CPU proxy over IB loopback between distinct HCAs on the same
  host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op which
  mis-aligns the 32-bit rdma_recv_count slots when used for same-node
  peers. Proper fix needs a mixed-transport LL variant (MemoryChannel for
  same-node, PortChannel for cross-node) or 64-bit counters.
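The counter-alignment problem can be illustrated with a minimal Python sketch. The atomic_add_u64 helper below is hypothetical, standing in for the device-side 64-bit add, and assumes the 32-bit counter slots are packed contiguously, as the limitation above implies: a 64-bit add requires 8-byte alignment, so odd-numbered uint32 slots (at byte offset 4 mod 8) cannot be targeted at all, and an aligned add aimed at an even slot rewrites its neighbor's bytes too.

```python
import struct

# Hypothetical sketch (not the real CudaIpcConnection code) of why a
# 64-bit atomic add is a bad fit for packed 32-bit counter slots.
slots = bytearray(4 * 8)  # eight zeroed uint32 slots, packed contiguously

def atomic_add_u64(buf: bytearray, byte_offset: int, value: int) -> None:
    """Model of a 64-bit atomic add: needs 8-byte alignment and
    touches a full 8-byte window."""
    if byte_offset % 8 != 0:
        # Odd-numbered uint32 slots sit at byte offset 4 mod 8.
        raise ValueError("misaligned 64-bit atomic")
    (old,) = struct.unpack_from("<Q", buf, byte_offset)
    struct.pack_into("<Q", buf, byte_offset, (old + value) % (1 << 64))

# An aligned 64-bit add aimed at slot 0 also rewrites slot 1's bytes:
atomic_add_u64(slots, 0 * 4, 1 << 32)
clobbered_slot1 = struct.unpack_from("<I", slots, 4)[0]

# And slot 1 itself cannot be targeted at all:
try:
    atomic_add_u64(slots, 1 * 4, 1)
    odd_slot_ok = True
except ValueError:
    odd_slot_ok = False
```

Widening the counters to 64 bits (or using a 32-bit atomic for same-node peers) removes both failure modes, which is what the proposed fix amounts to.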
Author: Qinghua Zhou
Date: 2026-04-22 06:11:30 +00:00
Parent: 9e96bf3b5d
Commit: f0a72263c8
4 changed files with 253 additions and 20 deletions


@@ -56,8 +56,10 @@ class Buffer:
         Size of the RDMA scratch buffer. Required (>0) for internode HT and
         low-latency modes.
     low_latency_mode:
-        Enable the low-latency dispatch/combine path (structural port,
-        untested on multi-node hardware).
+        Enable the low-latency dispatch/combine path. This mode uses only
+        the RDMA buffer (``num_rdma_bytes``) and drives every peer through
+        MSCCL++ ``PortChannel``; consequently, it works cross-node with any
+        topology but is still pending H100 hardware validation.
     num_qps_per_rank:
         Ignored for intranode mode.
     """