mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-07-11 09:49:39 +00:00

Files

qinghuazhou f9f0d0fcb7 test/ext/ep: intranode HT bench — cached-mode iter loop

Run one uncached dispatch to capture (rank_prefix_matrix,
channel_prefix_matrix, num_recv_tokens), then time iters in cached mode.
This replaces notify_dispatch + host busy-wait on mapped pinned counters
with the cheap cached_notify_dispatch (one barrier + memcpy + memset),
matching NCCL-EP ep_bench convention.
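
The measurement pattern described above (one uncached dispatch to capture the layout handle, then a timed loop that reuses it) can be sketched roughly as follows. This is a minimal illustration, not the code in test_intranode_multirank.py: `bench_cached`, `fake_uncached`, and `fake_cached` are hypothetical stand-ins, and a real GPU benchmark would also need to synchronize the device before reading the timer.

```python
import time


def bench_cached(dispatch_uncached, dispatch_cached, iters):
    """Run one uncached dispatch, then time `iters` cached dispatches.

    Both callables are hypothetical stand-ins for the real dispatch calls.
    """
    # Uncached call: pays the full notify cost once and returns the handle
    # (rank_prefix_matrix, channel_prefix_matrix, num_recv_tokens).
    handle = dispatch_uncached()

    start = time.perf_counter()
    for _ in range(iters):
        # Cached call: reuses the captured handle instead of recomputing it.
        dispatch_cached(handle)
    elapsed = time.perf_counter() - start
    return handle, elapsed / iters


# Stand-in callables so the sketch runs without a GPU or the real library.
calls = {"uncached": 0, "cached": 0}


def fake_uncached():
    calls["uncached"] += 1
    return ("rank_prefix_matrix", "channel_prefix_matrix", 0)


def fake_cached(handle):
    calls["cached"] += 1


handle, per_iter_s = bench_cached(fake_uncached, fake_cached, iters=10)
```

Keeping the uncached call outside the timed region is the point of the design: the per-iteration figure then reflects only the cached path.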

Cached mode forces num_experts=0 (buffer.cc:807), so topk_idx must be
None in iters; combine still works because recv_topk_weights is optional.

Per-iter dispatch latency drops ~21% (4247→3373µs). This confirms that
host-side notify_dispatch overhead is only ~20% of total dispatch time; the
remaining 14.4× send-total asymmetry vs combine is intrinsic (3× recv/send
byte fan-out × 3.8× dispatch-kernel-vs-combine-kernel work).
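
As a quick check of the arithmetic, the figures quoted above can be recomputed directly. The variable names below are illustrative only; the numbers are taken from the commit message as written.

```python
# Latencies quoted in the commit message, in microseconds.
before_us = 4247
after_us = 3373

# Relative drop in per-iteration dispatch latency.
drop_fraction = (before_us - after_us) / before_us
drop_percent = round(drop_fraction * 100, 1)

# Product of the two factors quoted in parentheses.
fanout_factor = 3.0
kernel_work_factor = 3.8
factor_product = round(fanout_factor * kernel_work_factor, 1)

print(f"latency drop: {drop_percent}%")
print(f"product of quoted factors: {factor_product}x")
```

The drop works out to about 20.6%, consistent with the "~21%" and "~20%" figures. The two parenthesized factors multiply to roughly 11.4, which is lower than the quoted 14.4× total; the commit message does not say where the difference comes from.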

2026-05-12 19:02:31 +00:00

test_internode_multirank.py

test/ext/ep: HT — scale combine tolerance with bf16 ulp

2026-05-12 05:37:42 +00:00

test_intranode_multirank.py

test/ext/ep: intranode HT bench — cached-mode iter loop

2026-05-12 19:02:31 +00:00

test_low_latency_multirank.py

ext/ep: apply clang-format and black to fix CI lint failures

2026-05-06 04:12:20 +00:00