Unify the high-level MoECommunicator to select its backend from
MoECommunicatorConfig.mode:
- mode="ll": low-latency (EXPERT_MAJOR) via MoERuntime (reused from binyli/ep,
PR #818). The LL runtime is built lazily, so a build that only binds the HT
Buffer can still use mode="ht" without MoERuntime being present.
- mode="ht": high-throughput (FLAT) via the DeepEP-style Buffer; intranode vs
internode is auto-selected from the RDMA buffer-size hint.
dispatch() gains an optional previous_handle that reuses the routing layout from
a prior dispatch with identical topk_ids (cached intranode dispatch also skips
notify_dispatch's host-side counter wait), letting a benchmark isolate the
on-GPU dispatch-kernel cost (NCCL-EP ep_bench convention).
Rewrite the intranode/internode HT benchmark loops to drive the public
MoECommunicator(mode="ht") API instead of raw Buffer calls. Export MoERuntime.
Validated on 1 node x 4 GB200 GPUs: correctness PASS; dispatch/combine match the
raw-Buffer baseline under identical env (no high-level overhead).
Implement the high-level MoE API from python/mscclpp/ext/ep/README.md for
mode="ht" on top of the existing low-level Buffer (DeepEP-style) runtime.
The user passes the tensors the model owns (input, topk_ids, weights, scales);
dispatch returns MLP-ready FLAT tokens + per-local-expert counts; combine
reverses it from an opaque DispatchHandle. No kernel changes — this is a
contract-preserving Python wrapper over the validated HT dispatch/combine
kernels (TMA-staged combine gather + all-sender dispatch).
New module python/mscclpp/ext/ep/communicator.py:
- MoECommunicator / MoECommunicatorConfig: configure comm + expert placement +
shape once; mode="ht" -> DispatchLayout.FLAT.
- dispatch(): runs get_dispatch_layout (count/membership) then intranode or
internode dispatch; returns DispatchOutput(tokens [total_recv_tokens, H]
grouped by local expert, num_tokens_per_expert, expert_offsets=cumsum) +
DispatchHandle.
- combine(): reverses from the handle only; returns [T, H] token-major.
- DispatchHandle carries a transport-tagged combine_meta bundle (intranode vs
internode reverse-dispatch tensors differ), opaque to the MLP.
- Intranode vs internode is auto-selected from get_rdma_buffer_size_hint (0
bytes <=> world_size <= NUM_MAX_NVL_PEERS), so the user never picks transport.
- Optional dispatch_async/combine_async + create_overlap_config scaffolding.
- DispatchOutput, DispatchHandle, QuantScales, DispatchLayout, CommOverlapConfig
dataclasses matching the README contract.
__init__.py now exports the high-level API alongside Buffer/Config/EventHandle.
First version: BF16 only (FP8 scales + block-level overlap are follow-ups; they
raise NotImplementedError and need no signature change). Imports + API shape +
layout/cumsum/guards verified on a GB200 node; full multi-rank run pending.
Port the GB200 internode flat-path optimizations to the single-node (intranode,
NVLink/IPC) HT path and add independent dispatch/combine SM controls. All under
MSCCLPP_EP_INTRA_DIRECT; the legacy 2-hop kernels remain as byte-identical
fallbacks. Single node, 4 ranks, t=4096 d=7168 e=256 topk=8, all PASS.
TMA direct-gather combine (combine_intranode_gather_tma)
- Faithful port of combine_flat_gather_tma: for each output token, discover the
contributing ranks (send_head>=0) and gather each contributor's hidden row
straight from that rank's IPC-mapped recv pool (slot = ep_combine_recv_idx)
through a kStages-deep cp.async.bulk (TMA) SMEM pipeline, reduce from SMEM,
st_na_global. Token-parallel grid (no channel partitioning).
- Default ON under INTRA_DIRECT; MSCCLPP_EP_COMBINE_TMA=0 -> 2-hop ring fallback.
- combine: 1273 -> 474 us @ 16 SMs (2.7x), 703 -> 269 us @ 32 SMs (2.6x).
All-sender dispatch (dispatch_allsender + intranode_meta_drain)
- The 2-hop dispatch splits the grid 50/50 sender/receiver; under INTRA_DIRECT
the receiver blocks idle (hidden is written directly), wasting half the SMs.
Make every block a sender (one channel per block, num_channels=num_sms) so all
blocks move hidden directly to the dest pools, matching NCCL-EP's all-sender
block count. Per-token metadata (src_idx, rebased topk_idx, weights, scales)
is packed into the dest pool META region; intranode_meta_drain unpacks the
local META region into the recv_* tensors (mirrors internode flat_meta_drain).
- Default ON under INTRA_DIRECT + TMA combine (the layout needs the TMA combine);
auto-disables when the ring combine is forced. Override MSCCLPP_EP_INTRA_ALLSENDER.
- dispatch: 1006 -> 454 us @ 16 SMs (beats NCCL-EP's 464 us), 545 -> 287 us @ 32 SMs.
Independent SM knobs (intranode)
- MSCCLPP_EP_DISPATCH_NSM sets the dispatch block count (clamped to config.num_sms);
MSCCLPP_EP_COMBINE_NSM sets the token-parallel TMA combine grid independently.
- The dispatch gather map (ep_combine_recv_idx) is now also allocated + written on
the INTRA_DIRECT path (previously internode/MSCCLPP_EP_DIRECT only).
Select the TMA-staged flat-combine gather's warp (token) count per block by
channel count instead of a single fixed value: 14 warps (WIDE) up to 12
channels, 12 warps (NARROW) above. More warps add token-parallelism that wins
when the grid has few blocks (low SM, latency-bound), but at higher block
counts the marginal warp costs more scheduling than it buys.
This removes the only remaining regression of the flat-14-warp default: 14
warps was -2..-9% at <=10 channels but +3% at 16 channels (SMs=32). The
adaptive switch keeps the low-SM wins and takes the faster 12-warp path at
SMs=32, so combine is now Pareto-optimal across the SM range.
Combine (us) @ e256/topk8 hidden=7168 tokens=4096, vs the flat-14w default:
SMs=16 (8 ch -> 14w): 2n 650 4n 735 8n 776 (unchanged, all WIDE)
SMs=20 (10 ch -> 14w): 2n 526 4n 610 8n 675 (unchanged, all WIDE)
SMs=32 (16 ch -> 12w): 2n 442 4n 564 8n 645 (was 455/584/665 -> -3%)
All PASS, recv counts unchanged.
Both warp counts instantiate their own kernel + dynamic-SMEM attribute
(WIDE 14w = 229KB, fits the sm_100 cap). Threshold and both counts overridable
via -D (EP_CMB_TMA_WARPS_WIDE/_NARROW/_MAXCH). MSCCLPP_EP_COMBINE_TMA=0
fallback unchanged.
Raise the TMA-staged flat-combine gather's default warp (token) count per
block from 12 to 14. 14 is the throughput sweet spot at hidden=7168 bf16:
beyond it, the per-block SMEM budget forces a smaller TMA chunk, and the
smaller descriptors then cost more than the extra token-parallelism buys
(measured: 18-28 warps all regress).
Combine (us) @ e256/topk8 hidden=7168 tokens=4096, NSM16, vs 12 warps:
2n 695 -> 648 (-7%) 4n 748 -> 735 (-2%) 8n 790 -> 776 (-2%)
Also at 2n: NSM20 575 -> 526 (-9%); NSM32 442 -> 455 (+3%, already far ahead).
All PASS, recv counts unchanged.
One-line constant change (EP_CMB_TMA_WARPS); kernel logic, SMEM model
(229KB at 14w/2-stage/chunk-64, fits the dynamic-shared opt-in), and the
MSCCLPP_EP_COMBINE_TMA=0 fallback are unchanged. Still overridable via -D.
Make the flat direct-gather combine stage each contributor's hidden chunks
from the remote NVLink recv pools into shared memory via cp.async.bulk (TMA),
then reduce from SMEM, instead of the synchronous ld_nc_global register-MLP
gather. The async copy engine hides the remote-NVLink read latency at low
register cost, so latency hiding no longer competes with occupancy.
This wins at every channel count (no crossover) and every node scale, so it
replaces both the lean and the unified flat-gather paths and needs no channel
gate. New kernel combine_flat_gather_tma: token-per-warp, hidden split into
1KB (chunk=64 int4) TMA descriptors streamed through a 2-stage software
pipeline; SMEM is independent of hidden, so it opts in to >48KB dynamic shared
via cudaFuncSetAttribute. Defaults tuned on GB200 sm_100 @ hidden=7168 bf16.
Default on; MSCCLPP_EP_COMBINE_TMA=0 falls back to the prior synchronous lean
gather (<=14 channels) / unified flat branch.
Combine (us) @ e256/topk8 hidden=7168 tokens=4096, vs the prior lean gather:
2n NSM16 705 -> 695 NSM20 619 -> 575 NSM32 577 -> 442 (-23%)
4n NSM16 912 -> 748 NSM20 732 -> 635 NSM24 704 -> 594 NSM32 665 -> 565
8n NSM16 953 -> 790 NSM20 813 -> 697 NSM32 749 -> 645
All beat NCCL-EP where measured (2n/4n/8n NSM16: 747/800/959). All PASS.
Topology is single-hop NVLink (one MNNVL LSA domain), so this is an
apples-to-apples gather, not a 2-hop reduce. Default-unset (non-flat / non-direct)
paths are unchanged.
The unified `combine` kernel carries the heavy 2-hop forwarder path and
compiles to ~80 registers, which on sm_100 limits it to 1 resident block/SM
(25 warps, ~39% occupancy). Under the flat all-sender path only the direct
gather actually runs (the forwarder is dead code), yet it inherits that
register ceiling -- starving the NVLink pull-gather of the warp-level
parallelism it needs to hide the (latency-, not bandwidth-bound) remote reads.
Extract the flat direct-gather into its own lean kernel `combine_flat_gather`,
with logic byte-for-byte identical to the unified kernel's flat branch
(combine_token for <= 8 nodes, chunked-ballot discovery for > 8). Freed from
the forwarder code it compiles to 47 registers, so a full 1024-thread block
(32 warps) fits per SM at the same SM budget -> more warps in flight -> better
latency hiding. The launcher picks it on the existing flat-direct gate
(recv_pool_global_ptrs / ep_combine_recv_idx non-null, i.e. MSCCLPP_EP_DIRECT),
so the 2-hop path is untouched.
This is a CROSSOVER like the inc7 dispatch atomic path: the lean kernel's
bigger blocks win when blocks are few (low SM, many tokens/block) but add
scheduling overhead when blocks are many (high SM). Gate to num_channels <= 14
(== the inc7 dispatch threshold); above it, fall through to the unified kernel.
8n combine (flat, all PASS, recv 29815):
num_channels 8 (NSM16, lean) 1086 -> 953 us (-12%, beats NCCL-EP 959)
num_channels 16 (NSM32, unified) 753 -> 749 us (unchanged)
2n NSM16 1006 -> 707 us (-30%, beats NCCL-EP 747); 4n NSM16 1040 -> 912 us.
The flat all-sender dispatch launches one sender block per channel and has no
forwarder, yet num_channels (== the dispatch block count) was hard-capped at
config.num_sms/2 -- the 2-hop ring requirement. So at "num_sms=16" the flat
dispatch ran on only 8 blocks while NCCL-EP always uses num_sms (16) blocks,
which fully accounts for the apparent low-SM dispatch gap (the block count, not
the kernel, was the difference).
Raise the MSCCLPP_EP_DISPATCH_NSM clamp from [1, num_sms/2] to [1, num_sms] under
the flat path, and size the RDMA/NVL buffers for the resolved channel count so the
per-channel kernel offsets and clean range stay in bounds. A single knob now does
it: MSCCLPP_EP_NSM=16 + MSCCLPP_EP_DISPATCH_NSM=16 -> 16 dispatch blocks (matching
NCCL-EP), no num_sms*2 workaround. The combine grid still uses num_sms/2, so the
buffer is sized for max(dispatch_channels, num_sms/2). Unset -> num_sms/2 (the
shipped default is byte-identical). Non-flat 2-hop path unchanged.
config.hpp: add ep_flat_dispatch_channels() (resolved dispatch channel count) and
ep_buffer_channels() (buffer sizing = max of dispatch channels and num_sms/2); both
get_*_buffer_size_hint use the latter. buffer.cc: internode_dispatch resolves
num_channels via ep_flat_dispatch_channels(). README: document the raised clamp.
8n dispatch (flat, all PASS, recv 29815):
num_sms=16, DISPATCH_NSM unset (8 blocks) 1101 us (default, unchanged)
num_sms=16, DISPATCH_NSM=16 (16 blocks) 784 us (-29%, beats NCCL-EP 973)
Under the FLAT all-sender path (kEpFlat), the per-channel sequential lock
(rdma_send_next_token_idx spin-wait) serialized slot assignment for every
token within a channel by token_idx. At low channel count (num_channels =
num_sms/2) there are many tokens per channel, so this serialization
dominated the dispatch time -- the NSM16 cliff vs NCCL-EP.
Replace the spin-lock with atomicAdd on the shared slot counters
(rdma_send_channel_next_tail, ep_count) when num_channels <= 14. Slot
assignment becomes dense but order-independent, which is correct here
because (a) the receiver validates each source's contiguous recv_x block
by value range (order-agnostic), (b) combine gathers via the
self-consistent ep_combine_recv_idx map, and (c) rdma_send_channel_tail
is unused under kEpFlat (the coordinator warp returns early). The
per-token breadcrumb stores (ep_combine_recv_idx, send_rdma_head) are
also moved out of the critical section since their addresses are unique
per (token, dst).
This is a CROSSOVER optimization: at high channel count the lock was
already cheap and the 6 sender warps contend on the shared atomics, so
the proven spin-lock + plain-increment path is kept for num_channels > 14
(byte-equivalent to the previous behavior). The non-flat 2-hop path is
unchanged.
8n dispatch (vs prev full-token TMA), all PASS:
NSM16 (8ch) 1404 -> 1099 us (-21.7%)
NSM24 (12ch) 984 -> 777 us (-21.1%)
NSM28 (14ch) 862 -> 823 us (-4.5%)
NSM32 (16ch) 786 -> 780 us (lock path, ~unchanged)
2n dispatch (8ch) 830 -> 760 us (-8.5%). Cumulative 8n NSM16 vs pre-opt -42.9%.
Push the flat all-sender TMA S2G tile to the WHOLE bf16 token (7168B->14336B), so each destination is a single cp.async.bulk S2G per token (nchunks=1) -- matching NCCL-EP whole-token granularity (was 2 ops for half-token, 14 for the original 1024B sub-chunk). Full-token forces NSTAGE 4->2 to keep the 6x2x14336 = 168KB dynamic ring under the GB200 ~227KB/block opt-in cap (confirmed cudaDevAttrMaxSharedMemoryPerBlockOptin=232448); kEpTmaSndInFlight clamped to >=1 so NSTAGE=2 keeps 1 cross-token S2G in flight, plus a new InFlight<NStage drain-before-reuse static_assert. Kernel-only (host launcher already sizes dynamicSmemBytes from EP_TMA_SND_* macros). Static SMEM 18688->18560; REG 168. Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). A/B vs half-token Bd3: 2n NSM16 866.7->829.8 (-4.3%); 8n NSM16 1464.5->1404.1 (-4.1%), NSM20 1223.6->1155.4 (-5.6%), NSM32 809.6->786.0 (-2.9%). Cumulative (StageA+Bd1+Bd3+full) vs pre-opt baseline: 8n NSM16 1924.2->1404.1 (-27.0%), NSM32 1030.3->786.0 (-23.7%, beats NCCL-EP 8n 973us by ~19%).
Carry the TMA S2G pipeline (in-flight groups + ring stage index) ACROSS token boundaries in the flat all-sender hidden-copy, removing the per-token full wait_group 0 drain. Last chunks of token N now overlap first chunks of token N+1 instead of a hard drain at every token seam (~1 NVLink RT/token, dominant at low SM). Global stage = (ep_tma_gbase + chunk) % NStage with ep_tma_gbase += nchunks per token; drain-before-reuse preserved (NStage 4 > InFlight 2) across seams; single wait_group 0 + fence.proxy.async.global at warp end. SMEM/REG byte-identical to Stage A (43328/168, pure control-flow). Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). 8-node FLAT A/B vs Stage A: NSM16 1826.3->1781.8 (-2.4%), NSM20 1497.5->1452.8 (-3.0%), NSM32 983.3->957.6 (-2.6%, now beats NCCL-EP 8n 973us). 2n NSM16 1228.6->1195.6 (-2.7%). Cumulative w/ Stage A vs pre-opt baseline: 8n NSM16 -7.4%, NSM32 -7.1%.
Replace the per-chunk full-drain (cp.async.bulk.wait_group 0 after every chunk) in the flat all-sender TMA hidden-copy with an NCCL-EP-style produce-1-ahead + lazy in-flight drain (cp.async.bulk.wait_group.read N), so consecutive chunks S2G NVLink writes overlap each other and the next chunks G2S HBM load. Ring deepened NStage 2->4, chunk 2048->1024 (SMEM-neutral: 6*4*1024==6*2*2048==24576B; +96B for 2 extra mbarriers; REG unchanged at 168). Dispatch-only; combine byte-identical. 8-node FLAT A/B (all PASS recv29815): NSM16 1924.2->1826.3 (-5.1%), NSM20 1565.0->1497.5 (-4.3%), NSM32 1030.3->983.3 (-4.6%, ~= NCCL-EP 8n 973us). 2-node NSM16 1285.2->1228.6 (-4.4%).
Completes the flat all-sender path (MSCCLPP_EP_FLAT) end-to-end and makes
its dispatch and combine SM counts independently tunable.
- buffer.cc internode_combine: under the flat path, skip the pre-combine
cached_notify + fifo-advance. The flat combine is the inc5 direct-gather
(it already needs no forwarder), but cached_notify's sm_id>=3 branch
indexes rdma_channel_prefix_matrix (the forwarder-produced
recv_rdma_channel_prefix_matrix, never written under all-sender flat) to
derive token ranges and then writes combined_nvl_head over that range ->
out-of-bounds write / intermittent illegal memory access in the bench.
Gated on ep_flat + recv_pool_global_ptrs + ep_combine_recv_idx; the
2-hop path is unchanged.
- buffer.cc: decouple the flat dispatch/combine grids from config.num_sms.
MSCCLPP_EP_DISPATCH_NSM caps the all-sender dispatch block count
(clamped to [1, num_sms/2]); MSCCLPP_EP_COMBINE_NSM caps the combine
block count (clamped to [2, num_sms]). Both flat-only. num_channels is
the dispatch token-partitioning granularity (flows into notify_dispatch,
the prefix-matrix allocations and the grid), so lowering it stays
self-consistent; the now-vestigial combine prefix-matrix shape asserts
are relaxed under flat.
- test: unify the SM-count env var to MSCCLPP_EP_NSM for both the
internode and intranode tests, with MSCCLPP_EP_NUM_SMS kept as a legacy
fallback (internode default 152, intranode 20).
- README: document MSCCLPP_EP_FLAT, MSCCLPP_EP_DISPATCH_NSM and
MSCCLPP_EP_COMBINE_NSM, add a flat all-sender subsection, and correct
the SM-count row to MSCCLPP_EP_NSM.
Validated 2-node GB200 (e256/topk8, 8 ranks): full flat dispatch+combine
PASS (~1e-6, no IMA), dispatch 500us / combine 451us. Dispatch and combine
SM sweeps are independent (combine flat at ~451us while dispatch scales
698->495us, and vice versa). FLAT=0 inc5 baseline unchanged (568/488us).
Adds the FLAT all-sender dispatch path, gated behind MSCCLPP_EP_FLAT
(requires MSCCLPP_EP_DIRECT). Eliminates the forwarder + coordinator +
receiver roles so every SM block is a sender, and delivers per-token
metadata straight to the destination recv pool instead of via the
2-hop RDMA ring -> forwarder -> NVL receiver pipeline.
- config.hpp: append a per-token 128B metadata region after the
worst-case hidden region in the recv pool (kEpRecvPoolMetaBytes,
get_recv_pool_meta_base). Allocated unconditionally (cheap vs hidden),
only touched when kEpFlat is set.
- internode_ncclep.cuh: kEpFlat __constant__ gate. Sender writes
SourceMeta + scales + topk straight into each destination pool's meta
region at the token's final recv slot. Under kEpFlat the launch is
all-sender: num_channels = gridDim.x, channel_id = sm_id,
is_forwarder = false; the sender-coordinator, forwarder, and NVL
receiver roles early-return, and the sender skips the ring head/tail
flow-control wait (no forwarder drains it).
- internode.cu / api.cuh: flat_meta_drain kernel + launcher copies the
pool meta region into the recv_* output tensors, rebasing topk_idx to
this rank's local expert range (weight 0 when out of range). Host grid
launches num_channels blocks (not x2) when FLAT+DIRECT.
- buffer.cc: call flat_meta_drain on the comm stream after dispatch when
ep_flat && ep_use_direct && uncached && topk present.
- test: MSCCLPP_EP_BENCH_DISPATCH_ONLY skips combine (which still needs
the forwarder/receiver breadcrumbs) so the all-sender dispatch ceiling
can be measured; dispatch correctness check is retained.
Validated 2-node GB200 (e256 / topk8, 8 ranks): dispatch correct
(recv 21820 tokens, per-source ranges exact). flat_meta_drain proven
byte-correct by overwriting the receiver's metadata output and still
passing combine (~1e-6). All-sender dispatch is 13-66% faster than the
2-hop path at equal block count and reaches better-than-baseline-peak
throughput with ~1/4 the blocks. Combine under the flat path (forwarder
breadcrumb rework) is the remaining follow-up.
Port the NCCL-EP hybrid_ep.cuh TMA dispatch path into the inc5 sender-direct
kEpDirect branch, gated by EP_NCCLEP_TMA. kRDMASender streams each token hidden
HBM->SMEM via cp.async.bulk (G2S, mbarrier complete_tx) then fans the staged
tile S2G straight to every destination GPU recv-pool fabric VA.
Two correctness mechanisms from the standalone probe + NCCL-EP reference:
- per-CHUNK SMEM ring (kEpTmaSndNStage=2, chunkb=2048) so no two back-to-back
cp.async.bulk ops reuse a single tile+mbar to a cross-node fabric VA, which is
the unspecified-launch-failure trigger. SMEM-neutral vs the single-slot 4096
layout (6*2*2048 == 6*4096); dispatch static SHARED stays 43360B (< 48KB).
- LANE-STRIPED S2G broadcast (lane j -> dst pool j, j+=32) instead of a serial
lane-0 loop; lane 0 issues G2S, the whole warp waits on the G2S mbarrier, each
lane drives its own destination, drain wait_group 0 + syncwarp before tile reuse.
Validated 2 nodes x 4 GPU (e256/topk8, hidden=7168, tokens=4096, MSCCLPP_EP_DIRECT=1):
dispatch PASS (no launch failure, byte-correct combine), 569.5us vs 618.5us
warp-store baseline (-7.9%). Off by default (EP_NCCLEP_TMA=0); baseline byte-
identical when unset.
Add the two GB200 direct-path env flags to the EP README runtime-knobs
table plus a dedicated subsection:
- MSCCLPP_EP_DIRECT=1: internode sender direct-write dispatch + receiver
gather-direct combine (one master flag), with the MSCCLPP_EP_FABRIC_IPC
prerequisite and recv-pool input contract.
- MSCCLPP_EP_INTRA_DIRECT=1: separate single-node intra-node sender
direct-write flag.
Includes measured GB200 NVL72 numbers and the single-node launch note.
The kEpDirect combine-gather path used the flat combine_token<kEpNumRanks>
helper, whose discovery maps one warp lane per rank. That is correct only
while kEpNumRanks <= 32; at >= 16 nodes (kEpNumRanks > 32) the __shfl-by-rank
source lane wraps mod 32, so ranks >= 32 alias 0..31 and num_topk_ranks
overflows topk_ranks[8] -> combined_x came out all zeros.
Split the gather block with if constexpr (kEpNumRanks <= 32): <= 8 nodes
compile to exactly the original fast path (byte-identical, no perf change),
while > 32 ranks use a chunked-ballot discovery (scan ranks in 32-wide
chunks, compact set lanes via __ballot_sync) plus a pre-loaded register-array
reduction. Verified combine correctness 2/4/8/16n and combine timing
unchanged at <= 8n (2n 494us, 4n 634us, 8n 740us).
Eliminate the 2-hop ring + receiver hidden-drain in intranode dispatch. When
MSCCLPP_EP_INTRA_DIRECT is set, the sender computes each token's final recv_x
index (rank_prefix + channel_prefix + running count, identical to the receiver's
total_offset+chunk_idx) and writes hidden straight to the destination's
peer-mapped recv pool; recv_x becomes a zero-copy view of that pool and the
receiver skips the hidden drain (metadata still flows through the ring, keeping
indices aligned). Reuses the Increment-4 recv pool (recv_pool_ptrs_gpu).
Baseline byte-identical when unset.
Single-node 4-GPU A/B @4096/7168/256-exp/top8, 152 SMs: dispatch 3829.7us ->
283.9us (-92.6%, 13.5x), now SM-scaling; correctness exact-match (0.0) at 128
and 4096 token scale.
cached_notify serves both cached dispatch (shrunk ring, is_cached_dispatch=true)
and combine (is_cached_dispatch=false). Combine still ships hidden through the
rdma ring, so its clean must stay full-slot; only shrink the dispatch clean.
Fixes combine desync/timeout under MSCCLPP_EP_DIRECT in the bench loop.
Combine A/B (SKIP_COMBINE=0, correctness PASS): 4n/4096 inc5 vs inc4a
combine 1423 vs 1435us; 8n/4096 2281 vs 2233us (combine path unchanged).
Under kEpDirect the rdma ring no longer carries hidden (ring_hidden_int4=0);
hidden goes straight to the dest pool. Clean meta (notify_dispatch +
cached_notify) matches the small slot so the bench loop does not desync.
Sender reads x once and broadcasts to all destination pools (gather-pointer
st_pool_broadcast) instead of re-reading per destination.
Dispatch A/B (4 nodes, SKIP_COMBINE): 2048 557.8 vs 586.9us (-5.0%);
4096 925.2 vs 1033.4us (-10.5%).
On the NVL72 single NVLink domain, eliminate the 2nd NVLink hop in HT
dispatch: the RDMA sender writes each token's hidden straight into the
destination GPU's domain-wide recv_x pool over fabric VA, instead of
staging through the rdma_channel ring + forwarder transpose.
Gated by env MSCCLPP_EP_DIRECT (default off => byte-for-byte inc4a;
verified 584.6us == inc4a 583.3us with the gate off).
- buffer.{cc,hpp}: domain-wide recv-pool base exchange
(recv_pool_global_ptrs, registerMemory all_transport + send/recvMemory,
env-gated). Validated the cuMem pool is fabric-mappable cross-node
(all 16 peer bases imported at 4-node).
- kernels/internode_ncclep.cuh: kEpDirect gate. Per-destination-GPU base
index = peer-read recv_gbl_rank_prefix_sum[rank-1] + channel start;
seq-locked per-GPU running count (matches the receiver drain order).
Sender writes hidden direct to recv_pool_global_ptrs[dst_global];
forwarder skips the hidden copy (fuse + xnode) under kEpDirect.
__threadfence_system at the sender epilogue for cross-node visibility.
- kernels/api.cuh, kernels/internode.cu: thread recv_pool_global_ptrs
through the dispatch signature/macro/launch + kEpDirect upload.
Dispatch correctness validated at 4-node (recv_x per-source block check
PASS, recv 13465, MSCCLPP_EP_DIRECT=1 SKIP_COMBINE=1).
WIP / not yet a perf win: the coordinator still RDMAs the full ring slot
(including the now-unused hidden region), so hidden is shipped twice.
Pending: shrink the ring slot to metadata-only under kEpDirect (the perf
win), and the parallel combine rework (combine still expects the 2-hop
breadcrumbs, so inc5 is dispatch-only for now -> run with SKIP_COMBINE).
Replace the increment-3 cudaMalloc-carve + cudaIpc recv-output pool with a
VMM pool (mscclpp::detail::gpuCallocPhysical, cuMem FABRIC/POSIX-FD) exchanged
via registerMemory(ipc)+sendMemory/recvMemory so peers import directly
dereferenceable, TMA-eligible fabric VAs (recv_pool_ptrs[]) instead of the
cudaIpc carve. Kernel param recv_pool_offset(int64 into buffer_ptrs) ->
recv_pool_ptrs(void** peer bases). Falls back to the inc3 inline carve when
fabric-IPC is unsupported. PASS, dispatch 1052us (== inc3, perf-neutral);
removes the ~1GB inline pool from the buffer_ptrs allocation.
Also lands a gated (EP_NCCLEP_TMA, default OFF) cross-GPU forwarder TMA write
path targeting the VMM peer pool. NOTE: enabling it currently faults
(cudaErrorLaunchFailure) in-kernel even to local VMM; single-process VMM+TMA
probe was a false positive. Kept gated off for future investigation; the
default build is unaffected.
Carve a fixed-capacity recv-output pool onto the END of the NVL cudaMalloc (covered by the same IPC handle as buffer_ptrs, NOT added to num_nvl_bytes so it dodges the INT_MAX registered-size cap). recv_x becomes a zero-copy from_blob view of the local pool; the pool header holds this rank recv_gbl_rank_prefix_sum (peer-readable). The cross-GPU forwarder computes the destination final recv_x index from the peer-read prefix + channel-local start_sum + per-src count and writes hidden STRAIGHT to the destination peer pool; the receiver skips its hidden drain (keeps scales/topk/meta). Same-GPU path (inc1) unchanged (local pool == recv_x). Falls back to torch::empty + receiver when num_recv_tokens > pool cap.
Files: config.hpp (pool sizing static helpers), buffer.hpp (recv_pool_off_), buffer.cc (enlarge ctor alloc, from_blob recv_x, publish prefix header), api.cuh + internode.cu (recv_pool_offset param, guarded EP_DISPATCH_EXTRA_ARGS), internode_ncclep.cuh (xnode_direct forwarder + receiver hidden-skip).
Clean interleaved 4-node A/B {38,41,59,75}: inc2 ~1132us -> inc3 ~1065us (-5.9%), order-independent, FULL PASS (recv 26809, combine correct). Captures ~the DRAIN_NOOP ceiling (1048us). Cumulative vs baseline 1257us: -15.4%. Gap to NCCL-EP 902us now ~156us (was 355us).
internode_ncclep.cuh: add EP_NCCLEP_DRAIN_NOOP compile gate (default 0, inert) - when 1 the NVL receiver keeps all control flow but skips the data copies, to measure the dispatch-time upper bound of eliminating the cross-GPU receiver drain. Probe result (4-node {38,41,59,75}): dispatch inc2 1124us -> DRAIN_NOOP 1048us (-6.8%), agg_bw 836->896 GB/s => confirms real headroom for the cross-GPU peer-map direct-write rework (ceiling ~-16.6% cumulative vs baseline).
test_internode_multirank.py: gate the dispatch range-assert and combine assert behind MSCCLPP_EP_SKIP_VERIFY env so dispatch timing can be measured when recv_x is intentionally incomplete (perf probing).
Full-token outstanding int4 loads (hidden_int4=896 => 28*32) on the 3 dest-side drain copies: same-GPU direct-write, forwarder cross-GPU staging write, receiver staging->recv_x drain. Sender packing copy left at 5. Hides HBM/fabric store-completion latency (more memory-level parallelism). This is the proven Lever-C MLP lever; its prior net-regression came from combine register inflation in the SHARED translation unit, which cannot happen here (dispatch_ncclep is a separate TU; combine untouched). Clean interleaved 4-node A/B {38,41,59,75}: inc1 ~1176us -> inc2 ~1124us (-4.4%), full PASS. Cumulative vs baseline 1257us: -10.6%.
Forwarder warps whose dst_nvl==nvl_rank (token arrived on its destination GPU, no NVLink transpose) write straight to final recv_x, computing the same index the receiver would (recv_gbl_rank_prefix_sum base + channel-local start_sum + per-src running count). The matching receiver warp (src_nvl==nvl_rank) is skipped. Eliminates the dest-side HBM double-bounce (nvl_channel_x staging + receiver copy) for the same-GPU token fraction (~25%). recv_x contents + send_nvl_head breadcrumb preserved bit-identical => combine unaffected. 4-node {38,41,59,75}: dispatch 1257->1174-1185us (-6%), full PASS (recv 26809).
internode_ncclep.cuh: dispatch_ncclep<> = production dispatch<> kernel under EP_DISPATCH_NCCLEP, to be restructured into NCCL-EP warp-specialized overlap (concurrent N2N fabric-put + intra-node G2S/S2G drain via MemoryChannel put/signal). internode.cu: guarded include + EP_DISPATCH_KERNEL alias selects dispatch_ncclep when guard ON, else production dispatch (byte-identical). Builds clean with -DMSCCLPP_EP_DISPATCH_NCCLEP=ON.
- src/core/atomicadd_kernel.cu: restore the legacy 3-arg
cuCtxCreate(&proxyAtomicCtx_, 0, cuDevice) in the
'#else' branch of the CUDA_VERSION >= 12050 guard. A prior
edit had corrupted it to 'cuCtxCreate(&proxyAtomicCtx_vice)',
which broke the CUDA 11.8 build (CodeQL CUDA cuda11.8 and
MSCCLPPLang cuda11.8 jobs).
- Apply clang-format to src/ext/ep/* (no logic changes,
fixes the cpplint CI job).
- Apply black to test/python/ext/ep/test_internode_multirank.py
and test_intranode_multirank.py (no logic changes, fixes
the pylint CI job).
## Summary
- Release the reference after last requests are ready.
- Keep ordered receive chaining for repeated rank/tag operations while
cleaning up completed receive bookkeeping.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a `PortChannel` requests `flush()`, the host-side proxy was being
blocked, which may cause head-of-line blocking of other parallel
`PortChannel`s' requests. Now the proxy handles `flush()` requests
asynchronously. This feature especially helps performance when we need
multiple IB QPs and need to flush QPs.
The NVLS HT B2 path introduced in 3ab2e43b activated whenever
isNvlsSupported() && num_rdma_ranks > 1. On H100 NDv5 / Azure CX-7 RoCE
that is true (H100 has intra-node NVLink multicast), but there is no
cross-host NVSwitch fabric. mscclpp's GpuIpcMem::create then falls back
to CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR whose handle exchange routes
through /tmp/mscclpp_bootstrap_<pid>.sock -- a master-rank-0 unix-domain
socket worker ranks cannot reach. Symptom on every commit since 3ab2e43b:
RuntimeError: connect() failed for unix socket to
/tmp/mscclpp_bootstrap_<pid>.sock
MSCCLPP_EP_FABRIC_IPC=0 was being silently ignored.
src/ext/ep/buffer.cc: add resolve_fabric_ipc_supported() helper.
Resolution:
1. MSCCLPP_EP_FABRIC_IPC env var (0/off/false/no => off,
1/on/true/yes/force => on, otherwise auto).
2. Auto-detect: requires both
- CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1
- device compute capability >= sm_100 (Blackwell+).
Gate both use_fabric_ipc_alloc (RDMA buffer allocator) and nvls_ht_enabled
(HT B2 multicast region) on fabric_ipc_supported. On H100 both fall back
to cudaMalloc + legacy PortChannel; on GB200 NVL72 both remain enabled.
Diagnostic prints now show fabric_ipc=.
test/python/ext/ep/test_internode_multirank.py: replace hardcoded
NUM_MAX_NVL_PEERS=4 with a runtime _detect_local_world_size() helper
that reads MSCCLPP_EP_LOCAL_WORLD_SIZE / LOCAL_WORLD_SIZE /
OMPI_COMM_WORLD_LOCAL_SIZE, falling back to torch.cuda.device_count().
Makes the test correct on both H100 (8 GPUs/node) and GB200 (4 GPUs/node)
without code changes.
src/core/atomicadd_kernel.cu: use cuCtxCreate_v4 for CUDA >= 12.5 (the
underlying symbol was renamed); preserve legacy 3-arg cuCtxCreate for
older toolkits.
Verified on 2x H100 NDv5 at HEAD:
LL intranode (8 GPUs) PASS
LL internode (16 GPUs, 2 nodes) PASS
HT intranode (8 GPUs) PASS
HT internode (16 GPUs, 2 nodes) PASS
Diagnostic on H100:
[mscclpp_ep] rdma_buffer allocator: cudaMalloc (low_latency=0, nvls=1, fabric_ipc=0)
[mscclpp_ep] NVLS HT multicast: disabled (low_latency=0, num_rdma_ranks=2, nvls_supported=1, fabric_ipc=0)
This pull request adds support for the `bfloat16` (bf16) data type to
the test executor, including both Python and CUDA components. The
changes ensure that `bfloat16` is handled consistently across argument
parsing, data type conversion, and test kernel implementations.
Additionally, the CUDA verification kernels are refactored to use
parameterized tolerances for improved numerical accuracy checks.
**Support for bfloat16 data type:**
* Added handling for `bfloat16`/`bf16` in the Python test executor's
argument parsing, data type conversion (`parse_dtype`,
`dtype_to_mscclpp_dtype`), and help text.
[[1]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcR27-R28)
[[2]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL122-R135)
[[3]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL246-R251)
* Updated output to display the correct data type string for `bfloat16`.
**CUDA kernel and test improvements:**
* Included `bfloat16` headers and defined test data fill and gather
kernels for `bfloat16` on both CUDA and HIP platforms.
[[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R8-R11)
[[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R35)
[[3]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R54-R59)
[[4]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R133)
* Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to
use an explicit tolerance parameter (`Eps`) and added correct tolerances
for each data type, including `bfloat16`.
[[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L69-R85)
[[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L94-R113)
These changes ensure full support for `bfloat16` in the test executor
and improve the accuracy and maintainability of the CUDA test kernels.
---------
Co-authored-by: Caio Rocha <caiorocha@microsof.com>