mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-25 07:14:40 +00:00
Run one uncached dispatch to capture (rank_prefix_matrix, channel_prefix_matrix, num_recv_tokens), then time the iterations in cached mode. This replaces notify_dispatch and its host busy-wait on mapped pinned counters with the cheaper cached_notify_dispatch (one barrier + memcpy + memset), matching the NCCL-EP ep_bench convention.

Cached mode forces num_experts=0 (buffer.cc:807), so topk_idx must be None in the timed iterations; combine still works because recv_topk_weights is optional.

Per-iteration dispatch latency drops ~21% (4247→3373 µs). This confirms that host-side notify_dispatch overhead is only ~20% of total dispatch time; the remaining 14.4× send-total asymmetry vs. combine is intrinsic (3× recv/send byte fan-out × 3.8× dispatch-kernel-vs-combine-kernel work).
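The measurement pattern described above (one uncached call to capture the layout, then timed cached iterations with topk_idx set to None) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: `bench_cached_dispatch`, `fake_dispatch`, and the `(x, topk_idx, handle)` call signature are hypothetical stand-ins, and host-side `time.perf_counter` is used only for simplicity where a real GPU benchmark would need device synchronization or event timing.

```python
import time


def bench_cached_dispatch(dispatch, x, topk_idx, iters=20):
    """Time `iters` cached dispatches after one uncached warm-up call.

    `dispatch` is a hypothetical callable taking (x, topk_idx, handle) and
    returning (recv_x, handle); it is an assumption for this sketch.
    """
    # Uncached call: captures the layout handle (stand-in for
    # rank_prefix_matrix, channel_prefix_matrix, num_recv_tokens).
    _recv, handle = dispatch(x, topk_idx=topk_idx, handle=None)

    start = time.perf_counter()
    for _ in range(iters):
        # Cached mode: reuse the handle and pass topk_idx=None.
        dispatch(x, topk_idx=None, handle=handle)
    per_iter_us = (time.perf_counter() - start) / iters * 1e6
    return handle, per_iter_us


# Hypothetical fake dispatch that only records how it was called.
calls = []


def fake_dispatch(x, topk_idx, handle):
    calls.append((topk_idx is None, handle is None))
    if handle is None:
        handle = ("rank_prefix_matrix", "channel_prefix_matrix", "num_recv_tokens")
    return x, handle


handle, per_iter_us = bench_cached_dispatch(
    fake_dispatch, x=[0.0] * 8, topk_idx=[1, 2], iters=5
)
```

With the fake in place, `calls` shows one uncached call followed by five cached calls, which is the shape of the loop the timing numbers above refer to.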