mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-06-29 10:57:27 +00:00
Implement the high-level MoE API from python/mscclpp/ext/ep/README.md for mode="ht" on top of the existing low-level Buffer (DeepEP-style) runtime. The user passes the tensors the model owns (input, topk_ids, weights, scales); dispatch returns MLP-ready FLAT tokens + per-local-expert counts; combine reverses it from an opaque DispatchHandle.

No kernel changes: this is a contract-preserving Python wrapper over the validated HT dispatch/combine kernels (TMA-staged combine gather + all-sender dispatch).

New module python/mscclpp/ext/ep/communicator.py:

- MoECommunicator / MoECommunicatorConfig: configure comm + expert placement + shape once; mode="ht" -> DispatchLayout.FLAT.
- dispatch(): runs get_dispatch_layout (count/membership), then intranode or internode dispatch; returns DispatchOutput(tokens [total_recv_tokens, H] grouped by local expert, num_tokens_per_expert, expert_offsets=cumsum) + DispatchHandle.
- combine(): reverses from the handle only; returns [T, H] token-major.
- DispatchHandle carries a transport-tagged combine_meta bundle (intranode vs internode reverse-dispatch tensors differ), opaque to the MLP.
- Intranode vs internode is auto-selected from get_rdma_buffer_size_hint (0 bytes <=> world_size <= NUM_MAX_NVL_PEERS), so the user never picks transport.
- Optional dispatch_async/combine_async + create_overlap_config scaffolding.
- DispatchOutput, DispatchHandle, QuantScales, DispatchLayout, CommOverlapConfig dataclasses matching the README contract.

__init__.py now exports the high-level API alongside Buffer/Config/EventHandle.

First version: BF16 only (FP8 scales + block-level overlap are follow-ups; they raise NotImplementedError and need no signature change).

Imports + API shape + layout/cumsum/guards verified on a GB200 node; full multi-rank run pending.
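
To make the FLAT layout and the transport rule above concrete, here is a minimal pure-Python sketch. It is not the code in communicator.py. The helper names (expert_offsets, expert_row_range, select_transport) are hypothetical, the counts are made up, and it assumes that "cumsum" means an inclusive cumulative sum, so local expert e owns the half-open row range [offsets[e-1], offsets[e]) of the flat token buffer.

```python
from itertools import accumulate
from typing import List, Tuple


def expert_offsets(num_tokens_per_expert: List[int]) -> List[int]:
    """Inclusive cumulative sum of per-local-expert token counts.

    Assumption: the commit's "expert_offsets=cumsum" is inclusive.
    """
    return list(accumulate(num_tokens_per_expert))


def expert_row_range(offsets: List[int], expert: int) -> Tuple[int, int]:
    """Half-open row range [start, end) of one local expert in the flat buffer."""
    start = offsets[expert - 1] if expert > 0 else 0
    return start, offsets[expert]


def select_transport(rdma_buffer_size_hint: int) -> str:
    """Hypothetical helper: a 0-byte RDMA hint means the intranode path."""
    return "intranode" if rdma_buffer_size_hint == 0 else "internode"


# Made-up counts for four local experts; expert 1 receives no tokens.
counts = [3, 0, 5, 2]
offsets = expert_offsets(counts)
ranges = [expert_row_range(offsets, e) for e in range(len(counts))]

print(offsets)  # [3, 3, 8, 10]
print(ranges)   # [(0, 3), (3, 3), (3, 8), (8, 10)]
print(select_transport(0))
```

Under these assumptions the last offset equals total_recv_tokens, and an expert with no tokens gets an empty range, so an MLP loop can slice the flat buffer per expert without special-casing.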