mscclpp/python
Qinghua Zhou 00e41b8976 ep(python): MoECommunicator mode="ht" (FLAT) + HT benchmarks via the high-level API
Unify the high-level MoECommunicator to select its backend from
MoECommunicatorConfig.mode:
- mode="ll": low-latency (EXPERT_MAJOR) via MoERuntime (reused from binyli/ep,
  PR #818). The LL runtime is built lazily, so a build that only binds the HT
  Buffer can still use mode="ht" without MoERuntime being present.
- mode="ht": high-throughput (FLAT) via the DeepEP-style Buffer; intranode vs
  internode is auto-selected from the RDMA buffer-size hint.
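
The mode-based selection and lazy construction described above can be sketched
roughly as follows. This is an illustration only, not the mscclpp code:
SketchConfig, SketchCommunicator, the factories mapping, and rdma_bytes_hint are
hypothetical names, and the rule "a positive RDMA hint means internode" is an
assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SketchConfig:
    mode: str = "ll"          # "ll" (low-latency) or "ht" (high-throughput)
    rdma_bytes_hint: int = 0  # hypothetical field; assumed 0 means no RDMA buffer


class SketchCommunicator:
    """Hypothetical stand-in showing lazy, mode-driven backend selection."""

    def __init__(self, config: SketchConfig,
                 factories: Dict[str, Callable[[], object]]):
        if config.mode not in ("ll", "ht"):
            raise ValueError(f"unknown mode: {config.mode!r}")
        self.config = config
        # Only the factories that this build actually binds are present.
        self._factories = factories
        self._backend: Optional[object] = None

    @property
    def transport(self) -> Optional[str]:
        # Assumed rule: a positive RDMA buffer-size hint selects internode.
        if self.config.mode != "ht":
            return None
        return "internode" if self.config.rdma_bytes_hint > 0 else "intranode"

    def backend(self) -> object:
        # Build on first use, so a missing "ll" binding never breaks mode="ht".
        if self._backend is None:
            factory = self._factories.get(self.config.mode)
            if factory is None:
                raise RuntimeError(
                    f"backend for mode={self.config.mode!r} is not available")
            self._backend = factory()
        return self._backend
```

Because the backend is resolved only inside backend(), constructing a
communicator never touches a runtime that the current build does not provide.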

dispatch() gains an optional previous_handle argument that reuses the routing
layout from a prior dispatch with identical topk_ids. A cached intranode
dispatch also skips notify_dispatch's host-side counter wait, so a benchmark
can isolate the on-GPU dispatch-kernel cost (the NCCL-EP ep_bench convention).
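
A toy model of the handle-reuse idea, in plain Python. It is a sketch under
stated assumptions, not the real dispatch(): SketchHandle and SketchDispatcher
are hypothetical names, and the "routing layout" is reduced to a per-expert
token count so that the caching behaviour is easy to see.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SketchHandle:
    topk_ids: Tuple[Tuple[int, ...], ...]  # routing decisions the layout was built for
    tokens_per_expert: Tuple[int, ...]     # simplified stand-in for the routing layout


class SketchDispatcher:
    """Hypothetical model: a previous handle lets dispatch skip the layout step."""

    def __init__(self, num_experts: int):
        self.num_experts = num_experts
        self.layout_builds = 0  # counts how often the layout was recomputed

    def dispatch(self, topk_ids: List[List[int]],
                 previous_handle: Optional[SketchHandle] = None) -> SketchHandle:
        key = tuple(tuple(row) for row in topk_ids)
        if previous_handle is not None:
            # Reuse is only valid when the routing decisions are identical.
            if previous_handle.topk_ids != key:
                raise ValueError("previous_handle was built for different topk_ids")
            return previous_handle
        # Uncached path: recompute the layout (here, tokens routed to each expert).
        self.layout_builds += 1
        counts = [0] * self.num_experts
        for row in key:
            for expert in row:
                counts[expert] += 1
        return SketchHandle(key, tuple(counts))
```

In a benchmark loop written this way, the first call pays for the layout and
every later call with the same topk_ids passes the returned handle back in, so
the timed iterations exclude the layout cost.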

Rewrite the intranode/internode HT benchmark loops to drive the public
MoECommunicator(mode="ht") API instead of raw Buffer calls. Export MoERuntime.

Validated on 1 node x 4 GB200 GPUs: correctness PASS; dispatch/combine results
match the raw-Buffer baseline under an identical environment, i.e. the
high-level API adds no measurable overhead.
2026-06-26 02:44:35 +00:00