mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-06-29 10:57:27 +00:00
Completes the flat all-sender path (MSCCLPP_EP_FLAT) end-to-end and makes its dispatch and combine SM counts independently tunable.

- buffer.cc internode_combine: under the flat path, skip the pre-combine cached_notify + fifo-advance. The flat combine is the inc5 direct-gather (it already needs no forwarder), but cached_notify's sm_id>=3 branch indexes rdma_channel_prefix_matrix (the forwarder-produced recv_rdma_channel_prefix_matrix, which is never written under all-sender flat) to derive token ranges and then writes combined_nvl_head over that range. The result was an out-of-bounds write and an intermittent illegal memory access (IMA) in the bench. The skip is gated on ep_flat + recv_pool_global_ptrs + ep_combine_recv_idx; the 2-hop path is unchanged.
- buffer.cc: decouple the flat dispatch/combine grids from config.num_sms. MSCCLPP_EP_DISPATCH_NSM caps the all-sender dispatch block count (clamped to [1, num_sms/2]); MSCCLPP_EP_COMBINE_NSM caps the combine block count (clamped to [2, num_sms]). Both are flat-only. num_channels is the dispatch token-partitioning granularity (it flows into notify_dispatch, the prefix-matrix allocations and the grid), so lowering it stays self-consistent; the now-vestigial combine prefix-matrix shape asserts are relaxed under flat.
- test: unify the SM-count env var to MSCCLPP_EP_NSM for both the internode and intranode tests, with MSCCLPP_EP_NUM_SMS kept as a legacy fallback (internode default 152, intranode default 20).
- README: document MSCCLPP_EP_FLAT, MSCCLPP_EP_DISPATCH_NSM and MSCCLPP_EP_COMBINE_NSM, add a flat all-sender subsection, and correct the SM-count row to MSCCLPP_EP_NSM.

Validated on 2-node GB200 (e256/topk8, 8 ranks): full flat dispatch+combine PASS (~1e-6, no IMA), dispatch 500us / combine 451us. The dispatch and combine SM sweeps are independent: combine stays constant at ~451us while dispatch scales 698->495us, and vice versa. The FLAT=0 inc5 baseline is unchanged (568/488us).
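
The clamping described for MSCCLPP_EP_DISPATCH_NSM and MSCCLPP_EP_COMBINE_NSM can be sketched as below. This is a minimal illustration, not the code in buffer.cc: the helper names (env_int_or, flat_dispatch_blocks, flat_combine_blocks) are hypothetical, and the choice of what value to use when a variable is unset is left to the caller because the commit message does not state it. Only the clamp ranges [1, num_sms/2] and [2, num_sms] are taken from the description above.

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical helper: read an integer environment variable, returning
// `fallback` when the variable is unset, empty, or not a plain integer.
inline int env_int_or(const char* name, int fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') return fallback;
  char* end = nullptr;
  long value = std::strtol(raw, &end, 10);
  if (end == raw || *end != '\0') return fallback;
  return static_cast<int>(value);
}

// Dispatch block count under the flat path: clamped to [1, num_sms/2].
// The std::max guard keeps the range valid for very small num_sms.
inline int flat_dispatch_blocks(int requested, int num_sms) {
  const int upper = std::max(1, num_sms / 2);
  return std::clamp(requested, 1, upper);
}

// Combine block count under the flat path: clamped to [2, num_sms].
inline int flat_combine_blocks(int requested, int num_sms) {
  const int upper = std::max(2, num_sms);
  return std::clamp(requested, 2, upper);
}
```

A caller could combine the pieces as flat_dispatch_blocks(env_int_or("MSCCLPP_EP_DISPATCH_NSM", num_sms / 2), num_sms), where the num_sms / 2 fallback is an assumption of this sketch. Because the two functions share no state, sweeping one value leaves the other grid size untouched, which matches the independent sweeps reported in the validation note.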