mscclpp/python
Qinghua Zhou e9a5acc7d4 ep(python): high-level MoECommunicator HT (FLAT) dispatch/combine API
Implement the high-level MoE API from python/mscclpp/ext/ep/README.md for
mode="ht" on top of the existing low-level Buffer (DeepEP-style) runtime.
The user passes the tensors the model owns (input, topk_ids, weights, scales);
dispatch returns MLP-ready FLAT tokens + per-local-expert counts; combine
reverses it from an opaque DispatchHandle. No kernel changes — this is a
contract-preserving Python wrapper over the validated HT dispatch/combine
kernels (TMA-staged combine gather + all-sender dispatch).

New module python/mscclpp/ext/ep/communicator.py:
- MoECommunicator / MoECommunicatorConfig: configure comm + expert placement +
  shape once; mode="ht" -> DispatchLayout.FLAT.
- dispatch(): runs get_dispatch_layout (count/membership) then intranode or
  internode dispatch; returns DispatchOutput(tokens [total_recv_tokens, H]
  grouped by local expert, num_tokens_per_expert, expert_offsets=cumsum) +
  DispatchHandle.
- combine(): reverses from the handle only; returns [T, H] token-major.
- DispatchHandle carries a transport-tagged combine_meta bundle (intranode vs
  internode reverse-dispatch tensors differ), opaque to the MLP.
- Intranode vs internode is auto-selected from get_rdma_buffer_size_hint: a
  hint of 0 bytes (<=> world_size <= NUM_MAX_NVL_PEERS) selects intranode,
  anything else selects internode, so the user never picks transport.
- Optional dispatch_async/combine_async + create_overlap_config scaffolding.
- DispatchOutput, DispatchHandle, QuantScales, DispatchLayout, CommOverlapConfig
  dataclasses matching the README contract.
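
The FLAT contract described in the list above (tokens grouped by local expert,
per-expert counts, offsets as a cumulative sum, and a combine that reverses the
grouping into token-major order) can be illustrated with a minimal
single-process sketch. It is not the mscclpp implementation: flat_dispatch,
flat_combine and the src list are hypothetical names, no cross-rank transport
is modelled, combine is an unweighted sum, and the offsets are assumed to be an
inclusive cumsum.

```python
from itertools import accumulate


def flat_dispatch(tokens, topk_ids, num_local_experts):
    """Toy FLAT dispatch: one output row per (token, selected expert) pair."""
    # Pair every token index with each expert it was routed to.
    pairs = [(e, t) for t, experts in enumerate(topk_ids) for e in experts]
    # Stable sort by expert id keeps token order within each expert group.
    pairs.sort(key=lambda p: p[0])
    flat = [tokens[t] for _, t in pairs]
    src = [t for _, t in pairs]  # stand-in for the opaque handle contents
    counts = [0] * num_local_experts
    for e, _ in pairs:
        counts[e] += 1
    # Assumption: inclusive cumsum, so expert e owns rows
    # [offsets[e] - counts[e], offsets[e]).
    offsets = list(accumulate(counts))
    return flat, counts, offsets, src


def flat_combine(flat_rows, src, num_tokens):
    """Toy combine: sum expert rows back into [T, H] token-major order."""
    hidden = len(flat_rows[0])
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for row, t in zip(flat_rows, src):
        for h, value in enumerate(row):
            out[t][h] += value
    return out


tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # T=3, H=2
topk_ids = [[0, 1], [1, 2], [1, 0]]             # top-2 routing per token
flat, counts, offsets, src = flat_dispatch(tokens, topk_ids, 3)
combined = flat_combine(flat, src, len(tokens))
```

In this sketch the MLP would consume flat together with counts and offsets and
never look at src, which mirrors the role the commit gives DispatchHandle.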

__init__.py now exports the high-level API alongside Buffer/Config/EventHandle.
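
The transport auto-selection mentioned in the communicator.py notes can be
sketched as below. This is an illustration under stated assumptions, not the
library code: select_transport and toy_rdma_hint are hypothetical helpers, the
peer limit of 8 is a placeholder for NUM_MAX_NVL_PEERS, and the non-zero hint
value is arbitrary. Only the rule "0 bytes <=> world_size <=
NUM_MAX_NVL_PEERS" comes from the commit message.

```python
def toy_rdma_hint(world_size: int, num_max_nvl_peers: int = 8) -> int:
    """Hypothetical stand-in for an RDMA buffer size hint, in bytes."""
    # Per the commit message, the hint is 0 exactly when every rank fits
    # within the NVLink peer limit. The non-zero value here is a placeholder.
    return 0 if world_size <= num_max_nvl_peers else 1 << 20


def select_transport(rdma_buffer_size_hint: int) -> str:
    """Pick the dispatch/combine path from the hint alone."""
    if rdma_buffer_size_hint < 0:
        raise ValueError("buffer size hint must be non-negative")
    return "intranode" if rdma_buffer_size_hint == 0 else "internode"


single_node = select_transport(toy_rdma_hint(world_size=8))
multi_node = select_transport(toy_rdma_hint(world_size=16))
```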

First version: BF16 only. FP8 scales and block-level overlap are follow-ups;
both currently raise NotImplementedError and can be enabled later without a
signature change. Imports, API shape, and the layout/cumsum/guard logic were
verified on a GB200 node; a full multi-rank run is still pending.
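
One way to keep the signature stable while deferring features, as described
above, is to accept the future arguments now and reject them at runtime. The
sketch below is an assumption about the shape of such a guard, not the code in
communicator.py; check_first_version and its parameter names are hypothetical.

```python
def check_first_version(scales=None, block_overlap=False):
    """Hypothetical guard: accept follow-up arguments but reject their use."""
    # FP8 scales are already part of the signature, so enabling them later
    # would only remove this check.
    if scales is not None:
        raise NotImplementedError("FP8 scales are a follow-up")
    # Same pattern for block-level overlap.
    if block_overlap:
        raise NotImplementedError("block-level overlap is a follow-up")


# The BF16 path without scales or overlap passes the guard.
check_first_version()
```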
2026-06-25 02:35:16 +00:00