mscclpp/test/python/ext
qinghuazhou f9f0d0fcb7 test/ext/ep: intranode HT bench — cached-mode iter loop
Run one uncached dispatch to capture (rank_prefix_matrix,
channel_prefix_matrix, num_recv_tokens), then time iters in cached mode.
This replaces notify_dispatch + host busy-wait on mapped pinned counters
with the cheap cached_notify_dispatch (one barrier + memcpy + memset),
matching NCCL-EP ep_bench convention.

Cached mode forces num_experts=0 (buffer.cc:807), so topk_idx must be
None in the timed iterations; combine still works because
recv_topk_weights is optional.
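
The bench pattern described above can be sketched as follows. This is a minimal illustration, not the actual test code: `StubBuffer`, its `dispatch` signature, and `bench_cached_dispatch` are hypothetical stand-ins for the real buffer API, and host wall-clock timing is used here only to keep the sketch runnable without a GPU.

```python
import time


class StubBuffer:
    """Hypothetical stand-in for the EP buffer (not the mscclpp API)."""

    def __init__(self):
        self.uncached_calls = 0
        self.cached_calls = 0

    def dispatch(self, x, topk_idx=None, handle=None):
        if handle is None:
            # Uncached path: the layout is computed and returned as a handle.
            self.uncached_calls += 1
            handle = {
                "rank_prefix_matrix": [[len(x)]],
                "channel_prefix_matrix": [[len(x)]],
                "num_recv_tokens": len(x),
            }
        else:
            # Cached path: the expert count is zero, so topk_idx must be None.
            if topk_idx is not None:
                raise ValueError("topk_idx must be None in cached mode")
            self.cached_calls += 1
        return list(x), handle


def bench_cached_dispatch(buf, x, topk_idx, iters):
    # One uncached dispatch, outside the timed region, captures the layout.
    _, handle = buf.dispatch(x, topk_idx=topk_idx)

    start = time.perf_counter()
    for _ in range(iters):
        # Timed iterations reuse the captured handle and pass topk_idx=None.
        buf.dispatch(x, topk_idx=None, handle=handle)
    elapsed = time.perf_counter() - start

    return elapsed / iters * 1e6, handle  # mean microseconds per iteration
```

A real GPU bench would presumably synchronize the device or use device-side timers around the loop rather than `time.perf_counter()`; that choice is outside what this message states.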

Per-iter dispatch latency drops ~21% (4247→3373µs). Confirms host-side
notify_dispatch overhead is only ~20% of total dispatch time; the
remaining 14.4× send-total asymmetry vs combine is intrinsic (3× recv/
send byte fan-out × 3.8× dispatch-kernel-vs-combine-kernel work).
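
For reference, the latency figure above can be rechecked from the two quoted per-iter numbers. The variable names are illustrative only; the inputs are the values from this message.

```python
# Per-iter dispatch latency quoted above, in microseconds.
uncached_us = 4247
cached_us = 3373

saved_us = uncached_us - cached_us   # host-side cost removed per iteration
drop = saved_us / uncached_us        # fraction of the uncached latency

print(f"saved {saved_us} us per iter ({drop:.1%} of uncached latency)")
```

This gives 874 µs saved per iteration, about 20.6% of the uncached latency, which is the ~21% drop and ~20% host-side share quoted above.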
2026-05-12 19:02:31 +00:00