mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 10:57:27 +00:00


qinghuazhou 2ebf81aa35 ep(ncclep): increment-3 de-risk - DRAIN_NOOP probe + SKIP_VERIFY test gate

internode_ncclep.cuh: add the EP_NCCLEP_DRAIN_NOOP compile-time gate (default 0, inert). When set to 1, the NVL receiver keeps all control flow but skips the data copies, which measures an upper bound on the dispatch-time saving from eliminating the cross-GPU receiver drain. Probe result (4-node {38,41,59,75}): dispatch time inc2 1124 us -> DRAIN_NOOP 1048 us (-6.8%); agg_bw 836 -> 896 GB/s (+7.2%). This confirms real headroom for the cross-GPU peer-map direct-write rework (ceiling ~-16.6% cumulative vs. baseline).
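The relative changes quoted in the probe result can be re-derived from the four raw numbers. The sketch below is only an arithmetic check; `pct_change` is a hypothetical helper name, not part of the repository. The baseline behind the ~-16.6% cumulative ceiling is not given in this message, so that figure is not recomputed here.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Numbers copied from the probe result in this commit message.
dispatch_delta = pct_change(1124.0, 1048.0)  # dispatch time in us, inc2 -> DRAIN_NOOP
agg_bw_delta = pct_change(836.0, 896.0)      # aggregate bandwidth in GB/s

print(f"dispatch: {dispatch_delta:+.1f}%")   # prints "dispatch: -6.8%"
print(f"agg_bw:   {agg_bw_delta:+.1f}%")     # prints "agg_bw:   +7.2%"
```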

test_internode_multirank.py: gate the dispatch range assert and the combine assert behind the MSCCLPP_EP_SKIP_VERIFY environment variable, so that dispatch timing can still be measured when recv_x is intentionally incomplete (perf probing).
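A minimal sketch of what such an environment gate might look like, for illustration only. Only the variable name MSCCLPP_EP_SKIP_VERIFY comes from this commit; the helper names `skip_verify` and `check_dispatch_range`, the set of accepted truthy values, and the shape of the range check are assumptions and may not match the actual test.

```python
import os

def skip_verify(env=None):
    """True when MSCCLPP_EP_SKIP_VERIFY is set to a truthy value.

    Assumption: "1", "true", "yes" and "on" count as truthy; the real
    test may parse the variable differently.
    """
    env = os.environ if env is None else env
    value = env.get("MSCCLPP_EP_SKIP_VERIFY", "0")
    return value.strip().lower() in {"1", "true", "yes", "on"}

def check_dispatch_range(recv_values, lo, hi, env=None):
    """Hypothetical stand-in for the dispatch range assert.

    Returns False when verification is skipped, True when it ran and passed.
    """
    if skip_verify(env):
        # Perf probing: recv_x may be intentionally incomplete, so do not assert.
        return False
    assert all(lo <= v <= hi for v in recv_values), "recv_x value outside expected range"
    return True
```

With the variable unset the check runs as usual; with it set (for example `MSCCLPP_EP_SKIP_VERIFY=1` in the environment of the test run), the assert is bypassed and only the timing is meaningful.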

2026-06-08 16:25:53 +00:00
