Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).
Applies to intranode HT, internode HT, and the (currently unreachable on
intra-node 8-GPU) LL test. Internode HT keeps the sync+barrier guard
between dispatch and combine but excludes it from either phase's timing.
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
DeepEP's original filter was safe because its LL path used NVSHMEM; this
port drives LL via PortChannel so the kernel indexes
port_channel_handles[local_expert*num_ranks + dst_rank] for every
dst_rank. All peers now get a real memory/connection/semaphore/port
channel entry.
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
functional round-trip, BF16 only). Works cross-node in DeepEP's
1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
intra-node 8-GPU LL currently hangs because every peer transfer routes
through the CPU proxy over IB loopback between distinct HCAs on the same
host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op which
mis-aligns the 32-bit rdma_recv_count slots when used for same-node
peers. Proper fix needs a mixed-transport LL variant (MemoryChannel for
same-node, PortChannel for cross-node) or 64-bit counters.