mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-06-29 19:07:30 +00:00
Adds the FLAT all-sender dispatch path, gated behind MSCCLPP_EP_FLAT (requires MSCCLPP_EP_DIRECT). Eliminates the forwarder + coordinator + receiver roles so every SM block is a sender, and delivers per-token metadata straight to the destination recv pool instead of via the 2-hop RDMA ring -> forwarder -> NVL receiver pipeline. - config.hpp: append a per-token 128B metadata region after the worst-case hidden region in the recv pool (kEpRecvPoolMetaBytes, get_recv_pool_meta_base). Allocated unconditionally (cheap vs hidden), only touched when kEpFlat is set. - internode_ncclep.cuh: kEpFlat __constant__ gate. Sender writes SourceMeta + scales + topk straight into each destination pool's meta region at the token's final recv slot. Under kEpFlat the launch is all-sender: num_channels = gridDim.x, channel_id = sm_id, is_forwarder = false; the sender-coordinator, forwarder, and NVL receiver roles early-return, and the sender skips the ring head/tail flow-control wait (no forwarder drains it). - internode.cu / api.cuh: flat_meta_drain kernel + launcher copies the pool meta region into the recv_* output tensors, rebasing topk_idx to this rank's local expert range (weight 0 when out of range). Host grid launches num_channels blocks (not x2) when FLAT+DIRECT. - buffer.cc: call flat_meta_drain on the comm stream after dispatch when ep_flat && ep_use_direct && uncached && topk present. - test: MSCCLPP_EP_BENCH_DISPATCH_ONLY skips combine (which still needs the forwarder/receiver breadcrumbs) so the all-sender dispatch ceiling can be measured; dispatch correctness check is retained. Validated 2-node GB200 (e256 / topk8, 8 ranks): dispatch correct (recv 21820 tokens, per-source ranges exact). flat_meta_drain proven byte-correct by overwriting the receiver's metadata output and still passing combine (~1e-6). All-sender dispatch is 13-66% faster than the 2-hop path at equal block count and reaches better-than-baseline-peak throughput with ~1/4 the blocks. Combine under the flat path (forwarder breadcrumb rework) is the remaining follow-up.