Mirror of https://github.com/microsoft/mscclpp.git (synced 2026-05-11 17:00:22 +00:00)
ext/ep: unfilter LL sync + add LL multirank test (intra-node WIP)
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode. DeepEP's original filter was safe because its LL path used NVSHMEM; this port drives LL via PortChannel, so the kernel indexes port_channel_handles[local_expert * num_ranks + dst_rank] for every dst_rank. All peers now get a real memory/connection/semaphore/port-channel entry.
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch + combine functional round-trip, BF16 only). Works cross-node in DeepEP's 1-GPU-per-node topology.
- Known limitation, documented in src/ext/ep/README.md and the test docstring: intra-node 8-GPU LL currently hangs because every peer transfer routes through the CPU proxy over IB loopback between distinct HCAs on the same host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit operation, which mis-aligns the 32-bit rdma_recv_count slots when used for same-node peers. A proper fix needs either a mixed-transport LL variant (MemoryChannel for same-node peers, PortChannel for cross-node peers) or 64-bit counters.
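For illustration only, a minimal Python sketch of the flat handle layout described above — not the actual ext/ep sources; the function name and the make_port_channel callback are placeholders. It shows why sync() can no longer skip same-GPU-id peers: the LL kernel dereferences a slot for every dst_rank, so every remote peer needs a backing channel.

# Illustrative sketch, not the real Buffer.sync; names are placeholders.
def build_ll_port_channel_table(num_local_experts, num_ranks, my_rank, make_port_channel):
    # The LL kernel indexes port_channel_handles[local_expert * num_ranks + dst_rank]
    # for every dst_rank, so every remote peer needs a real channel entry --
    # including the same-node peers that the old NVSHMEM-era filter used to drop.
    table = [None] * (num_local_experts * num_ranks)
    for local_expert in range(num_local_experts):
        for dst_rank in range(num_ranks):
            if dst_rank == my_rank:
                continue  # no channel to ourselves; this slot stays unused
            table[local_expert * num_ranks + dst_rank] = make_port_channel(dst_rank)
    return table

The sketch's only point is the indexing: with the old filter in place, some slots would have no backing connection, and the kernel would stall on the first missing peer.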
@@ -56,8 +56,10 @@ class Buffer:
             Size of the RDMA scratch buffer. Required (>0) for internode HT and
             low-latency modes.
         low_latency_mode:
-            Enable the low-latency dispatch/combine path (structural port,
-            untested on multi-node hardware).
+            Enable the low-latency dispatch/combine path. This mode uses only
+            the RDMA buffer (``num_rdma_bytes``) and drives every peer through
+            MSCCL++ ``PortChannel``; consequently, it works cross-node with any
+            topology but is still pending H100 hardware validation.
         num_qps_per_rank:
             Ignored for intranode mode.
         """
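For orientation, a hedged construction sketch derived only from the docstring diff above. The Buffer import path is not shown in this commit, the positional group argument is an assumption, and buffer_cls stands in for whatever class is actually imported.

# Hedged sketch based only on the docstring above; buffer_cls and group are
# stand-ins because the module path and group type are not visible in this diff.
def make_ll_buffer(buffer_cls, group, num_rdma_bytes=1 << 30):
    # Low-latency mode uses only the RDMA scratch buffer, so num_rdma_bytes
    # must be > 0; every peer is then driven through a MSCCL++ PortChannel.
    return buffer_cls(
        group,
        num_rdma_bytes=num_rdma_bytes,
        low_latency_mode=True,   # LL dispatch/combine path from the docstring above
        num_qps_per_rank=1,      # per the docstring, ignored for intranode mode
    )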