mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 10:57:27 +00:00

Files

Qinghua Zhou 3b5270e5d5 ep(ncclep): inc6 flat combine + decoupled dispatch/combine SM knobs

Completes the flat all-sender path (MSCCLPP_EP_FLAT) end-to-end and makes
its dispatch and combine SM counts independently tunable.

- buffer.cc internode_combine: under the flat path, skip the pre-combine
  cached_notify + fifo-advance. The flat combine is the inc5 direct-gather
  and already needs no forwarder, but cached_notify's sm_id>=3 branch
  indexes rdma_channel_prefix_matrix (the forwarder-produced
  recv_rdma_channel_prefix_matrix, which is never written under all-sender
  flat) to derive token ranges, then writes combined_nvl_head over that
  range. The result was an out-of-bounds write that showed up as an
  intermittent illegal memory access in the bench. The skip is gated on
  ep_flat + recv_pool_global_ptrs + ep_combine_recv_idx; the 2-hop path is
  unchanged.
- buffer.cc: decouple the flat dispatch/combine grids from config.num_sms.
  MSCCLPP_EP_DISPATCH_NSM caps the all-sender dispatch block count
  (clamped to [1, num_sms/2]); MSCCLPP_EP_COMBINE_NSM caps the combine
  block count (clamped to [2, num_sms]). Both flat-only. num_channels is
  the dispatch token-partitioning granularity (flows into notify_dispatch,
  the prefix-matrix allocations and the grid), so lowering it stays
  self-consistent; the now-vestigial combine prefix-matrix shape asserts
  are relaxed under flat.
- test: unify the SM-count env var to MSCCLPP_EP_NSM for both the
  internode and intranode tests, with MSCCLPP_EP_NUM_SMS kept as a legacy
  fallback (internode default 152, intranode 20).
- README: document MSCCLPP_EP_FLAT, MSCCLPP_EP_DISPATCH_NSM and
  MSCCLPP_EP_COMBINE_NSM, add a flat all-sender subsection, and correct
  the SM-count row to MSCCLPP_EP_NSM.
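
The knob handling described above can be sketched roughly as follows. This is
an illustration only, not the code in buffer.cc or the tests: the helper names
(env_int_or, clamp_to, flat_dispatch_blocks, flat_combine_blocks, test_num_sms)
are hypothetical, and the behavior when a variable is unset (fall back to the
upper bound of the clamp range) is an assumption. Only the variable names, the
clamp ranges and the test defaults come from this commit message.

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical helper: read an integer env var, or return `fallback` when the
// variable is unset, empty, or not a plain decimal integer.
inline int env_int_or(const char* name, int fallback) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    long v = std::strtol(raw, &end, 10);
    if (end == raw || *end != '\0') return fallback;
    return static_cast<int>(v);
}

// Clamp that still returns `lo` when hi < lo (e.g. num_sms / 2 == 0).
inline int clamp_to(int v, int lo, int hi) {
    return std::max(lo, std::min(v, hi));
}

// Flat dispatch block count: clamped to [1, num_sms/2].
// Assumption: an unset variable means "use the upper bound".
inline int flat_dispatch_blocks(int num_sms) {
    int want = env_int_or("MSCCLPP_EP_DISPATCH_NSM", num_sms / 2);
    return clamp_to(want, 1, num_sms / 2);
}

// Flat combine block count: clamped to [2, num_sms].
// Assumption: an unset variable means "use the upper bound".
inline int flat_combine_blocks(int num_sms) {
    int want = env_int_or("MSCCLPP_EP_COMBINE_NSM", num_sms);
    return clamp_to(want, 2, num_sms);
}

// Test-side SM count: MSCCLPP_EP_NSM wins, MSCCLPP_EP_NUM_SMS is the legacy
// fallback, then the per-test default (152 internode, 20 intranode).
inline int test_num_sms(int default_sms) {
    int legacy = env_int_or("MSCCLPP_EP_NUM_SMS", default_sms);
    return env_int_or("MSCCLPP_EP_NSM", legacy);
}
```

With num_sms = 152, for example, MSCCLPP_EP_DISPATCH_NSM=200 would resolve to
76 blocks and MSCCLPP_EP_COMBINE_NSM=1 would resolve to 2 blocks under this
sketch.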

Validated on 2-node GB200 (e256/topk8, 8 ranks): full flat dispatch+combine
PASS (~1e-6, no IMA), dispatch 500us / combine 451us. Dispatch and combine
SM sweeps are independent (combine holds steady at ~451us while dispatch
scales 698->495us, and vice versa). The FLAT=0 inc5 baseline is unchanged
(568/488us).

2026-06-17 23:52:44 +00:00

ext/ep
