mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-06-29 10:57:27 +00:00
Unify the high-level MoECommunicator to select its backend from MoECommunicatorConfig.mode:

- mode="ll": low-latency (EXPERT_MAJOR) via MoERuntime (reused from binyli/ep, PR #818). The LL runtime is built lazily, so a build that only binds the HT Buffer can still use mode="ht" without MoERuntime being present.
- mode="ht": high-throughput (FLAT) via the DeepEP-style Buffer; intranode vs internode is auto-selected from the RDMA buffer-size hint.

dispatch() gains an optional previous_handle that reuses the routing layout from a prior dispatch with identical topk_ids (cached intranode dispatch also skips notify_dispatch's host-side counter wait), letting a benchmark isolate the on-GPU dispatch-kernel cost (NCCL-EP ep_bench convention).

Rewrite the intranode/internode HT benchmark loops to drive the public MoECommunicator(mode="ht") API instead of raw Buffer calls. Export MoERuntime.

Validated on 1 node x 4 GB200 GPUs: correctness PASS; dispatch/combine match the raw-Buffer baseline under identical env (no high-level overhead).
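
The control flow described in this commit can be sketched in plain Python. This is a toy model only, not the mscclpp implementation: CommunicatorSketch, CommConfigSketch, HandleSketch, rdma_buffer_bytes, ll_factory, and layout_computations are hypothetical names invented for illustration, and the rule "a positive RDMA buffer size selects internode" is an assumption about how the buffer-size hint might be interpreted. The sketch shows mode-based backend selection, a lazily built LL runtime, and a previous_handle that skips recomputing the routing layout.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CommConfigSketch:
    mode: str = "ll"            # "ll" or "ht", mirroring the two modes in the commit
    rdma_buffer_bytes: int = 0  # hypothetical stand-in for the RDMA buffer-size hint


@dataclass
class HandleSketch:
    topk_ids: tuple  # expert ids the layout was computed for
    layout: tuple    # token indices grouped by expert id


class CommunicatorSketch:
    def __init__(self, config: CommConfigSketch,
                 ll_factory: Optional[Callable[[], object]] = None):
        if config.mode not in ("ll", "ht"):
            raise ValueError(f"unknown mode: {config.mode!r}")
        self.config = config
        self._ll_factory = ll_factory  # None models a build without the LL runtime
        self._ll_runtime = None        # built on first use, not in __init__
        self.layout_computations = 0   # counts how often the layout is rebuilt

    @property
    def backend(self) -> str:
        if self.config.mode == "ll":
            return "ll"
        # Assumption: a positive RDMA buffer size means the internode path.
        return "ht-internode" if self.config.rdma_buffer_bytes > 0 else "ht-intranode"

    def _ll(self):
        # Lazy construction: mode="ht" never reaches this, so it works
        # even when no LL runtime factory was provided.
        if self._ll_runtime is None:
            if self._ll_factory is None:
                raise RuntimeError("LL runtime is not available in this build")
            self._ll_runtime = self._ll_factory()
        return self._ll_runtime

    def dispatch(self, topk_ids, previous_handle: Optional[HandleSketch] = None):
        if self.config.mode == "ll":
            self._ll()
        ids = tuple(topk_ids)
        if previous_handle is not None:
            # Reuse is only valid when the routing decisions are identical.
            if previous_handle.topk_ids != ids:
                raise ValueError("previous_handle was built for different topk_ids")
            return previous_handle
        self.layout_computations += 1
        # Stable sort of token indices by expert id stands in for the real layout.
        layout = tuple(sorted(range(len(ids)), key=lambda i: ids[i]))
        return HandleSketch(ids, layout)


comm = CommunicatorSketch(CommConfigSketch(mode="ht"))
first = comm.dispatch([2, 0, 1, 0])
again = comm.dispatch([2, 0, 1, 0], previous_handle=first)
```

In this toy model the second dispatch returns the cached handle without touching the layout counter, which is the property a benchmark would rely on to time only the dispatch step itself.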