Commit Graph

4 Commits

Qinghua Zhou
10cd0012f1 tests/ep: LL bench prints per_rank_bw and accepts size env vars
- Report both per-rank and aggregate BW to align with NCCL-EP's ep_bench
  (which reports per-rank GB/s).
- Accept MSCCLPP_EP_LL_TOKENS/HIDDEN/TOPK/EXPERTS_PER_RANK env overrides
  so we can match external benchmark problem sizes (NCCL-EP LL defaults
  are num_tokens=128, hidden=7168, top_k=8).
2026-04-23 22:20:40 +00:00
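
A minimal sketch, assuming nothing beyond the env-variable names listed in the
commit above, of how such size overrides are typically read, with defaults
matching NCCL-EP's LL benchmark (the helper name and the EXPERTS_PER_RANK
default are illustrative, not the test's actual code):

    import os

    def _env_int(name, default):
        """Integer override from the environment, falling back to the default."""
        value = os.environ.get(name)
        return int(value) if value is not None else default

    num_tokens       = _env_int("MSCCLPP_EP_LL_TOKENS", 128)   # NCCL-EP LL default
    hidden           = _env_int("MSCCLPP_EP_LL_HIDDEN", 7168)  # NCCL-EP LL default
    top_k            = _env_int("MSCCLPP_EP_LL_TOPK", 8)       # NCCL-EP LL default
    experts_per_rank = _env_int("MSCCLPP_EP_LL_EXPERTS_PER_RANK", 2)  # placeholder default

Per-rank GB/s then divides one rank's payload bytes by its elapsed time, and the
aggregate figure sums that over all ranks.
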
Qinghua Zhou
63afb25ab3 tests/ep: LL bench combine uses recv_tokens×hidden for payload bytes
Each local expert sends one copy per dispatched token back to its owner,
so the bytes actually on the wire during combine match those of dispatch.
The previous num_tokens×hidden accounting under-counted by roughly a factor
of num_topk, making combine BW look artificially low next to dispatch.
2026-04-23 21:53:34 +00:00
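
The byte accounting described in the commit above can be checked with a short
back-of-the-envelope sketch (illustrative numbers, assuming roughly balanced
routing; this is not the test's code):

    bytes_per_elem = 2                         # BF16
    num_tokens, hidden, top_k = 128, 7168, 8   # per-rank problem size

    # Each token is dispatched to top_k experts, so with balanced routing the
    # tokens received across all local experts on a rank total ~num_tokens * top_k.
    recv_tokens = num_tokens * top_k

    combine_bytes = recv_tokens * hidden * bytes_per_elem   # matches dispatch traffic
    old_estimate  = num_tokens * hidden * bytes_per_elem    # prior accounting

    print(combine_bytes / old_estimate)   # ~= top_k, i.e. the ~num_topk x under-count
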
Qinghua Zhou
c51a8a5305 ext/ep tests: time dispatch and combine separately in MSCCLPP_EP_BENCH
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).

Applies to intranode HT, internode HT, and the LL test (currently unreachable
on an intra-node 8-GPU setup). Internode HT keeps the sync+barrier guard
between dispatch and combine but excludes it from either phase's timing.
2026-04-22 23:11:04 +00:00
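
A minimal sketch of the split-phase timing structure described above;
bench_phase and run_split_bench are stand-ins, and the dispatch/combine/barrier
callables are assumed to be supplied by the caller rather than being MSCCLPP's
actual API:

    import time

    def bench_phase(fn, iters):
        """Average seconds per call of fn over `iters` timed iterations."""
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

    def run_split_bench(dispatch, combine, barrier, iters=20):
        """dispatch() returns an output; combine(output) and barrier() return None."""
        dispatch_s = bench_phase(dispatch, iters)
        # One untimed dispatch produces the output every combine iteration reuses;
        # the sync/barrier guard between the phases stays outside both timing loops.
        out = dispatch()
        barrier()
        combine_s = bench_phase(lambda: combine(out), iters)
        return dispatch_s, combine_s

Per-phase latency is then reduced with a max across ranks, and effective
bandwidth (payload bytes divided by phase time) with a sum.
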
Qinghua Zhou
f0a72263c8 ext/ep: unfilter LL sync + add LL multirank test (intra-node WIP)
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
  DeepEP's original filter was safe because its LL path used NVSHMEM; this
  port drives LL via PortChannel so the kernel indexes
  port_channel_handles[local_expert*num_ranks + dst_rank] for every
  dst_rank. All peers now get a real memory/connection/semaphore/port
  channel entry.
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
  functional round-trip, BF16 only). Works cross-node in DeepEP's
  1-GPU-per-node topology.
- Known limitation, documented in src/ext/ep/README.md and the test docstring:
  intra-node 8-GPU LL currently hangs because every peer transfer routes
  through the CPU proxy over IB loopback between distinct HCAs on the same
  host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op, which
  misaligns the 32-bit rdma_recv_count slots when used for same-node peers.
  A proper fix needs a mixed-transport LL variant (MemoryChannel for
  same-node, PortChannel for cross-node) or 64-bit counters.
2026-04-22 06:11:30 +00:00
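
The flattened handle indexing mentioned in the first bullet of the commit above
can be illustrated with a small sketch (placeholder strings stand in for the
real handles; num_local_experts is an arbitrary example value):

    num_ranks = 8
    num_local_experts = 2   # illustrative only

    # One handle per (local_expert, dst_rank) pair, laid out expert-major so the
    # kernel's local_expert * num_ranks + dst_rank arithmetic picks the right slot.
    port_channel_handles = [
        f"handle[e={e}, r={r}]"
        for e in range(num_local_experts)
        for r in range(num_ranks)
    ]

    def handle_for(local_expert, dst_rank):
        # Same index arithmetic the kernel uses for every dst_rank.
        return port_channel_handles[local_expert * num_ranks + dst_rank]

    assert handle_for(1, 3) == "handle[e=1, r=3]"

With the sync filter removed, every slot of this table is backed by a real
connection, which is what allows the kernel to address any dst_rank.
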