mscclpp/test/python
Qinghua Zhou 1805ad0db6 ep(ncclep): inc6 flat all-sender dispatch (kEpFlat, dispatch validated)
Adds the FLAT all-sender dispatch path, gated behind MSCCLPP_EP_FLAT
(requires MSCCLPP_EP_DIRECT). Eliminates the forwarder + coordinator +
receiver roles so every SM block is a sender, and delivers per-token
metadata straight to the destination recv pool instead of via the
2-hop RDMA ring -> forwarder -> NVL receiver pipeline.
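
The gating described above can be sketched on the host side as follows. This is only an illustration: the helper names (`_flag`, `resolve_ep_flat`) are hypothetical, and the rule that a variable counts as "set" when it is non-empty and not "0" is an assumption, not something this commit states. The one behaviour taken from the commit is that FLAT has no effect unless DIRECT is also enabled.

```python
import os
from typing import Mapping


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Assumed parsing rule: a flag is on when set to anything but '' or '0'."""
    return env.get(name, "0") not in ("", "0")


def resolve_ep_flat(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when both MSCCLPP_EP_FLAT and MSCCLPP_EP_DIRECT are on.

    Hypothetical helper: it mirrors the 'FLAT requires DIRECT' rule from the
    commit message, not the actual parsing code in the library.
    """
    direct = _flag(env, "MSCCLPP_EP_DIRECT")
    flat = _flag(env, "MSCCLPP_EP_FLAT")
    return flat and direct


if __name__ == "__main__":
    print("flat all-sender dispatch enabled:", resolve_ep_flat())
```

Passing a plain dict instead of `os.environ` makes the rule easy to check in a test without touching the process environment.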

- config.hpp: append a per-token 128B metadata region after the
  worst-case hidden region in the recv pool (kEpRecvPoolMetaBytes,
  get_recv_pool_meta_base). The region is allocated unconditionally (it is
  cheap relative to the hidden region) and only touched when kEpFlat is set.
- internode_ncclep.cuh: kEpFlat __constant__ gate. Sender writes
  SourceMeta + scales + topk straight into each destination pool's meta
  region at the token's final recv slot. Under kEpFlat the launch is
  all-sender: num_channels = gridDim.x, channel_id = sm_id,
  is_forwarder = false; the sender-coordinator, forwarder, and NVL
  receiver roles early-return, and the sender skips the ring head/tail
  flow-control wait (no forwarder drains it).
- internode.cu / api.cuh: flat_meta_drain kernel + launcher copies the
  pool meta region into the recv_* output tensors, rebasing topk_idx to
  this rank's local expert range (weight 0 when out of range). The host
  grid launches num_channels blocks (rather than 2 x num_channels) when
  both FLAT and DIRECT are set.
- buffer.cc: call flat_meta_drain on the comm stream after dispatch when
  ep_flat && ep_use_direct && uncached && topk present.
- test: MSCCLPP_EP_BENCH_DISPATCH_ONLY skips combine (which still needs
  the forwarder/receiver breadcrumbs) so the all-sender dispatch ceiling
  can be measured; dispatch correctness check is retained.
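
The arithmetic behind the bullets above can be modelled in a few lines of host-side Python. This is a reference sketch under stated assumptions, not the kernel code. All function names are hypothetical. The worst-case hidden region is assumed to be `max_tokens * hidden_bytes_per_token`, and out-of-range expert indices are assumed to be marked with -1. The commit itself only states the 128B per-token metadata size, the zeroed weight for out-of-range experts, and the block count under FLAT+DIRECT.

```python
from typing import List, Tuple

META_BYTES_PER_TOKEN = 128  # per-token metadata size stated in the commit


def recv_pool_meta_base(max_tokens: int, hidden_bytes_per_token: int) -> int:
    """Byte offset of the metadata region, placed after the hidden region.

    Assumption: the worst-case hidden region is max_tokens * hidden bytes.
    """
    return max_tokens * hidden_bytes_per_token


def meta_slot_offset(meta_base: int, recv_slot: int) -> int:
    """Byte offset of the metadata record for a token's final recv slot."""
    return meta_base + recv_slot * META_BYTES_PER_TOKEN


def launch_blocks(num_channels: int, flat: bool, direct: bool) -> int:
    """All-sender launch uses one block per channel; otherwise two."""
    return num_channels if (flat and direct) else 2 * num_channels


def rebase_topk(
    topk_idx: List[int],
    topk_weights: List[float],
    local_begin: int,
    num_local_experts: int,
) -> Tuple[List[int], List[float]]:
    """Rebase global expert ids to this rank's local range.

    Experts outside [local_begin, local_begin + num_local_experts) get
    weight 0 (per the commit) and index -1 (an assumption of this sketch).
    """
    out_idx: List[int] = []
    out_w: List[float] = []
    for idx, w in zip(topk_idx, topk_weights):
        local = idx - local_begin
        if 0 <= local < num_local_experts:
            out_idx.append(local)
            out_w.append(w)
        else:
            out_idx.append(-1)
            out_w.append(0.0)
    return out_idx, out_w
```

For example, with 256 experts over 8 ranks (32 local experts each), the rank that owns experts 32..63 would call `rebase_topk(idx, w, 32, 32)`. A token routed to global expert 40 then maps to local expert 8, while one routed to expert 5 is masked out.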

Validated on 2-node GB200 (e256 / topk8, 8 ranks): dispatch correct
(recv 21820 tokens, per-source ranges exact). flat_meta_drain proven
byte-correct by overwriting the receiver's metadata output and still
passing combine (~1e-6). All-sender dispatch is 13-66% faster than the
2-hop path at equal block count and reaches better-than-baseline-peak
throughput with ~1/4 the blocks. Combine under the flat path (forwarder
breadcrumb rework) is the remaining follow-up.
2026-06-17 21:13:13 +00:00