Commit Graph

1071 Commits

Author SHA1 Message Date
Qinghua Zhou
00e41b8976 ep(python): MoECommunicator mode="ht" (FLAT) + HT benchmarks via the high-level API
Unify the high-level MoECommunicator to select its backend from
MoECommunicatorConfig.mode:
- mode="ll": low-latency (EXPERT_MAJOR) via MoERuntime (reused from binyli/ep,
  PR #818). The LL runtime is built lazily, so a build that only binds the HT
  Buffer can still use mode="ht" without MoERuntime being present.
- mode="ht": high-throughput (FLAT) via the DeepEP-style Buffer; intranode vs
  internode is auto-selected from the RDMA buffer-size hint.

dispatch() gains an optional previous_handle that reuses the routing layout from
a prior dispatch with identical topk_ids (cached intranode dispatch also skips
notify_dispatch's host-side counter wait), letting a benchmark isolate the
on-GPU dispatch-kernel cost (NCCL-EP ep_bench convention).

Rewrite the intranode/internode HT benchmark loops to drive the public
MoECommunicator(mode="ht") API instead of raw Buffer calls. Export MoERuntime.

Validated on 1 node x 4 GB200 GPUs: correctness PASS; dispatch/combine match the
raw-Buffer baseline under identical env (no high-level overhead).
2026-06-26 02:44:35 +00:00
Qinghua Zhou
e9a5acc7d4 ep(python): high-level MoECommunicator HT (FLAT) dispatch/combine API
Implement the high-level MoE API from python/mscclpp/ext/ep/README.md for
mode="ht" on top of the existing low-level Buffer (DeepEP-style) runtime.
The user passes the tensors the model owns (input, topk_ids, weights, scales);
dispatch returns MLP-ready FLAT tokens + per-local-expert counts; combine
reverses it from an opaque DispatchHandle. No kernel changes — this is a
contract-preserving Python wrapper over the validated HT dispatch/combine
kernels (TMA-staged combine gather + all-sender dispatch).

New module python/mscclpp/ext/ep/communicator.py:
- MoECommunicator / MoECommunicatorConfig: configure comm + expert placement +
  shape once; mode="ht" -> DispatchLayout.FLAT.
- dispatch(): runs get_dispatch_layout (count/membership) then intranode or
  internode dispatch; returns DispatchOutput(tokens [total_recv_tokens, H]
  grouped by local expert, num_tokens_per_expert, expert_offsets=cumsum) +
  DispatchHandle.
- combine(): reverses from the handle only; returns [T, H] token-major.
- DispatchHandle carries a transport-tagged combine_meta bundle (intranode vs
  internode reverse-dispatch tensors differ), opaque to the MLP.
- Intranode vs internode is auto-selected from get_rdma_buffer_size_hint (0
  bytes <=> world_size <= NUM_MAX_NVL_PEERS), so the user never picks transport.
- Optional dispatch_async/combine_async + create_overlap_config scaffolding.
- DispatchOutput, DispatchHandle, QuantScales, DispatchLayout, CommOverlapConfig
  dataclasses matching the README contract.

__init__.py now exports the high-level API alongside Buffer/Config/EventHandle.

First version: BF16 only (FP8 scales + block-level overlap are follow-ups; they
raise NotImplementedError and need no signature change). Imports + API shape +
layout/cumsum/guards verified on a GB200 node; full multi-rank run pending.
2026-06-25 02:35:16 +00:00
Qinghua Zhou
c7a2df6885 ep(intranode): TMA direct-gather combine + all-sender dispatch + per-phase SM knobs
Port the GB200 internode flat-path optimizations to the single-node (intranode,
NVLink/IPC) HT path and add independent dispatch/combine SM controls. All under
MSCCLPP_EP_INTRA_DIRECT; the legacy 2-hop kernels remain as byte-identical
fallbacks. Single node, 4 ranks, t=4096 d=7168 e=256 topk=8, all PASS.

TMA direct-gather combine (combine_intranode_gather_tma)
- Faithful port of combine_flat_gather_tma: for each output token, discover the
  contributing ranks (send_head>=0) and gather each contributor's hidden row
  straight from that rank's IPC-mapped recv pool (slot = ep_combine_recv_idx)
  through a kStages-deep cp.async.bulk (TMA) SMEM pipeline, reduce from SMEM,
  st_na_global. Token-parallel grid (no channel partitioning).
- Default ON under INTRA_DIRECT; MSCCLPP_EP_COMBINE_TMA=0 -> 2-hop ring fallback.
- combine: 1273 -> 474 us @ 16 SMs (2.7x), 703 -> 269 us @ 32 SMs (2.6x).

All-sender dispatch (dispatch_allsender + intranode_meta_drain)
- The 2-hop dispatch splits the grid 50/50 sender/receiver; under INTRA_DIRECT
  the receiver blocks idle (hidden is written directly), wasting half the SMs.
  Make every block a sender (one channel per block, num_channels=num_sms) so all
  blocks move hidden directly to the dest pools, matching NCCL-EP's all-sender
  block count. Per-token metadata (src_idx, rebased topk_idx, weights, scales)
  is packed into the dest pool META region; intranode_meta_drain unpacks the
  local META region into the recv_* tensors (mirrors internode flat_meta_drain).
- Default ON under INTRA_DIRECT + TMA combine (the layout needs the TMA combine);
  auto-disables when the ring combine is forced. Override MSCCLPP_EP_INTRA_ALLSENDER.
- dispatch: 1006 -> 454 us @ 16 SMs (beats NCCL-EP's 464 us), 545 -> 287 us @ 32 SMs.

Independent SM knobs (intranode)
- MSCCLPP_EP_DISPATCH_NSM sets the dispatch block count (clamped to config.num_sms);
  MSCCLPP_EP_COMBINE_NSM sets the token-parallel TMA combine grid independently.
- The dispatch gather map (ep_combine_recv_idx) is now also allocated + written on
  the INTRA_DIRECT path (previously internode/MSCCLPP_EP_DIRECT only).
2026-06-24 03:23:48 +00:00
Qinghua Zhou
efdbd4313c ep(ncclep): channel-adaptive warp count for combine TMA gather
Select the TMA-staged flat-combine gather's warp (token) count per block by
channel count instead of a single fixed value: 14 warps (WIDE) up to 12
channels, 12 warps (NARROW) above. More warps add token-parallelism that wins
when the grid has few blocks (low SM, latency-bound), but at higher block
counts the marginal warp costs more scheduling than it buys.

This removes the only remaining regression of the flat-14-warp default: 14
warps was -2..-9% at <=10 channels but +3% at 16 channels (SMs=32). The
adaptive switch keeps the low-SM wins and takes the faster 12-warp path at
SMs=32, so combine is now Pareto-optimal across the SM range.

Combine (us) @ e256/topk8 hidden=7168 tokens=4096, vs the flat-14w default:
  SMs=16 (8 ch  -> 14w): 2n 650  4n 735  8n 776   (unchanged, all WIDE)
  SMs=20 (10 ch -> 14w): 2n 526  4n 610  8n 675   (unchanged, all WIDE)
  SMs=32 (16 ch -> 12w): 2n 442  4n 564  8n 645   (was 455/584/665 -> -3%)
All PASS, recv counts unchanged.

Both warp counts instantiate their own kernel + dynamic-SMEM attribute
(WIDE 14w = 229KB, fits the sm_100 cap). Threshold and both counts overridable
via -D (EP_CMB_TMA_WARPS_WIDE/_NARROW/_MAXCH). MSCCLPP_EP_COMBINE_TMA=0
fallback unchanged.
2026-06-22 02:20:28 +00:00
Qinghua Zhou
c6c1492679 ep(ncclep): combine TMA gather default 12->14 warps (-2 to -7% combine)
Raise the TMA-staged flat-combine gather's default warp (token) count per
block from 12 to 14. 14 is the throughput sweet spot at hidden=7168 bf16:
beyond it, the per-block SMEM budget forces a smaller TMA chunk, and the
smaller descriptors then cost more than the extra token-parallelism buys
(measured: 18-28 warps all regress).

Combine (us) @ e256/topk8 hidden=7168 tokens=4096, NSM16, vs 12 warps:
  2n 695 -> 648 (-7%)   4n 748 -> 735 (-2%)   8n 790 -> 776 (-2%)
Also at 2n: NSM20 575 -> 526 (-9%); NSM32 442 -> 455 (+3%, already far ahead).
All PASS, recv counts unchanged.

One-line constant change (EP_CMB_TMA_WARPS); kernel logic, SMEM model
(229KB at 14w/2-stage/chunk-64, fits the dynamic-shared opt-in), and the
MSCCLPP_EP_COMBINE_TMA=0 fallback are unchanged. Still overridable via -D.
2026-06-21 07:34:39 +00:00
Qinghua Zhou
42f00577cf ep(ncclep): TMA-staged flat-combine gather (default; -7 to -23% combine)
Make the flat direct-gather combine stage each contributor's hidden chunks
from the remote NVLink recv pools into shared memory via cp.async.bulk (TMA),
then reduce from SMEM, instead of the synchronous ld_nc_global register-MLP
gather. The async copy engine hides the remote-NVLink read latency at low
register cost, so latency hiding no longer competes with occupancy.

This wins at every channel count (no crossover) and every node scale, so it
replaces both the lean and the unified flat-gather paths and needs no channel
gate. New kernel combine_flat_gather_tma: token-per-warp, hidden split into
1KB (chunk=64 int4) TMA descriptors streamed through a 2-stage software
pipeline; SMEM is independent of hidden, so it opts in to >48KB dynamic shared
via cudaFuncSetAttribute. Defaults tuned on GB200 sm_100 @ hidden=7168 bf16.

Default on; MSCCLPP_EP_COMBINE_TMA=0 falls back to the prior synchronous lean
gather (<=14 channels) / unified flat branch.

Combine (us) @ e256/topk8 hidden=7168 tokens=4096, vs the prior lean gather:
  2n  NSM16 705 -> 695   NSM20 619 -> 575   NSM32 577 -> 442 (-23%)
  4n  NSM16 912 -> 748   NSM20 732 -> 635   NSM24 704 -> 594   NSM32 665 -> 565
  8n  NSM16 953 -> 790   NSM20 813 -> 697   NSM32 749 -> 645
All beat NCCL-EP where measured (2n/4n/8n NSM16: 747/800/959). All PASS.

Topology is single-hop NVLink (one MNNVL LSA domain), so this is an
apples-to-apples gather, not a 2-hop reduce. Default-unset (non-flat / non-direct)
paths are unchanged.
2026-06-21 05:17:46 +00:00
Qinghua Zhou
c3a4b641ac ep(ncclep): extract lean flat-combine gather kernel (-12 to -30% combine at low SM)
The unified `combine` kernel carries the heavy 2-hop forwarder path and
compiles to ~80 registers, which on sm_100 limits it to 1 resident block/SM
(25 warps, ~39% occupancy). Under the flat all-sender path only the direct
gather actually runs (the forwarder is dead code), yet it inherits that
register ceiling -- starving the NVLink pull-gather of the warp-level
parallelism it needs to hide the (latency-, not bandwidth-bound) remote reads.

Extract the flat direct-gather into its own lean kernel `combine_flat_gather`,
with logic byte-for-byte identical to the unified kernel's flat branch
(combine_token for <= 8 nodes, chunked-ballot discovery for > 8). Freed from
the forwarder code it compiles to 47 registers, so a full 1024-thread block
(32 warps) fits per SM at the same SM budget -> more warps in flight -> better
latency hiding. The launcher picks it on the existing flat-direct gate
(recv_pool_global_ptrs / ep_combine_recv_idx non-null, i.e. MSCCLPP_EP_DIRECT),
so the 2-hop path is untouched.

This is a CROSSOVER like the inc7 dispatch atomic path: the lean kernel's
bigger blocks win when blocks are few (low SM, many tokens/block) but add
scheduling overhead when blocks are many (high SM). Gate to num_channels <= 14
(== the inc7 dispatch threshold); above it, fall through to the unified kernel.

8n combine (flat, all PASS, recv 29815):
  num_channels  8 (NSM16, lean)     1086 -> 953 us  (-12%, beats NCCL-EP 959)
  num_channels 16 (NSM32, unified)   753 -> 749 us  (unchanged)
2n NSM16 1006 -> 707 us (-30%, beats NCCL-EP 747); 4n NSM16 1040 -> 912 us.
2026-06-20 07:20:37 +00:00
Qinghua Zhou
49a046396a ep(ncclep): raise MSCCLPP_EP_DISPATCH_NSM clamp to [1, num_sms] for the flat path
The flat all-sender dispatch launches one sender block per channel and has no
forwarder, yet num_channels (== the dispatch block count) was hard-capped at
config.num_sms/2 -- the 2-hop ring requirement. So at "num_sms=16" the flat
dispatch ran on only 8 blocks while NCCL-EP always uses num_sms (16) blocks,
which fully accounts for the apparent low-SM dispatch gap (the block count, not
the kernel, was the difference).

Raise the MSCCLPP_EP_DISPATCH_NSM clamp from [1, num_sms/2] to [1, num_sms] under
the flat path, and size the RDMA/NVL buffers for the resolved channel count so the
per-channel kernel offsets and clean range stay in bounds. A single knob now does
it: MSCCLPP_EP_NSM=16 + MSCCLPP_EP_DISPATCH_NSM=16 -> 16 dispatch blocks (matching
NCCL-EP), no num_sms*2 workaround. The combine grid still uses num_sms/2, so the
buffer is sized for max(dispatch_channels, num_sms/2). Unset -> num_sms/2 (the
shipped default is byte-identical). Non-flat 2-hop path unchanged.

config.hpp: add ep_flat_dispatch_channels() (resolved dispatch channel count) and
ep_buffer_channels() (buffer sizing = max of dispatch channels and num_sms/2); both
get_*_buffer_size_hint use the latter. buffer.cc: internode_dispatch resolves
num_channels via ep_flat_dispatch_channels(). README: document the raised clamp.

8n dispatch (flat, all PASS, recv 29815):
  num_sms=16, DISPATCH_NSM unset (8 blocks)  1101 us  (default, unchanged)
  num_sms=16, DISPATCH_NSM=16  (16 blocks)    784 us  (-29%, beats NCCL-EP 973)
2026-06-20 04:07:57 +00:00
Qinghua Zhou
d212966c8d ep(ncclep): inc7 atomic routing-slot assignment at low channel count (-21% dispatch @ NSM16)
Under the FLAT all-sender path (kEpFlat), the per-channel sequential lock
(rdma_send_next_token_idx spin-wait) serialized slot assignment for every
token within a channel by token_idx. At low channel count (num_channels =
num_sms/2) there are many tokens per channel, so this serialization
dominated the dispatch time -- the NSM16 cliff vs NCCL-EP.

Replace the spin-lock with atomicAdd on the shared slot counters
(rdma_send_channel_next_tail, ep_count) when num_channels <= 14. Slot
assignment becomes dense but order-independent, which is correct here
because (a) the receiver validates each source's contiguous recv_x block
by value range (order-agnostic), (b) combine gathers via the
self-consistent ep_combine_recv_idx map, and (c) rdma_send_channel_tail
is unused under kEpFlat (the coordinator warp returns early). The
per-token breadcrumb stores (ep_combine_recv_idx, send_rdma_head) are
also moved out of the critical section since their addresses are unique
per (token, dst).

This is a CROSSOVER optimization: at high channel count the lock was
already cheap and the 6 sender warps contend on the shared atomics, so
the proven spin-lock + plain-increment path is kept for num_channels > 14
(byte-equivalent to the previous behavior). The non-flat 2-hop path is
unchanged.

8n dispatch (vs prev full-token TMA), all PASS:
  NSM16 (8ch)  1404 -> 1099 us  (-21.7%)
  NSM24 (12ch)  984 ->  777 us  (-21.1%)
  NSM28 (14ch)  862 ->  823 us  (-4.5%)
  NSM32 (16ch)  786 ->  780 us  (lock path, ~unchanged)
2n dispatch (8ch) 830 -> 760 us (-8.5%). Cumulative 8n NSM16 vs pre-opt -42.9%.
2026-06-19 21:39:10 +00:00
Qinghua Zhou
683a5a7648 ep(ncclep): inc6 B-depth-3 full-token TMA tile (1 S2G/dst, -3 to -6% more dispatch)
Push the flat all-sender TMA S2G tile to the WHOLE bf16 token (7168B->14336B), so each destination is a single cp.async.bulk S2G per token (nchunks=1) -- matching NCCL-EP whole-token granularity (was 2 ops for half-token, 14 for the original 1024B sub-chunk). Full-token forces NSTAGE 4->2 to keep the 6x2x14336 = 168KB dynamic ring under the GB200 ~227KB/block opt-in cap (confirmed cudaDevAttrMaxSharedMemoryPerBlockOptin=232448); kEpTmaSndInFlight clamped to >=1 so NSTAGE=2 keeps 1 cross-token S2G in flight, plus a new InFlight<NStage drain-before-reuse static_assert. Kernel-only (host launcher already sizes dynamicSmemBytes from EP_TMA_SND_* macros). Static SMEM 18688->18560; REG 168. Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). A/B vs half-token Bd3: 2n NSM16 866.7->829.8 (-4.3%); 8n NSM16 1464.5->1404.1 (-4.1%), NSM20 1223.6->1155.4 (-5.6%), NSM32 809.6->786.0 (-2.9%). Cumulative (StageA+Bd1+Bd3+full) vs pre-opt baseline: 8n NSM16 1924.2->1404.1 (-27.0%), NSM32 1030.3->786.0 (-23.7%, beats NCCL-EP 8n 973us by ~19%).
2026-06-18 23:19:43 +00:00
Qinghua Zhou
aca50d56fd ep(ncclep): inc6 Stage B-depth-3 whole-token TMA tiles via dynamic SMEM (-16 to -27% dispatch)
Enlarge the flat all-sender TMA S2G tile from 1024B sub-chunks to HALF-token 7168B, cutting cp.async.bulk S2G descriptors from 14 to 2 per destination per token (~7x fewer TMA issues -- NCCL-EP issues 1 whole-token op). The 6x4x7168 = 168KB ring exceeds the 48KB static cap, so it moves to DYNAMIC shared memory: kernel uses extern __shared__ ep_tma_tile_snd_dyn indexed (warp_id*NStage+stage)*chunk; host opts in via cudaFuncSetAttribute(MaxDynamicSharedMemorySize) + cfg.dynamicSmemBytes (geometry shared through EP_TMA_SND_CHUNK_BYTES/NSTAGE macros). Static SMEM drops 43328->18688; REG unchanged 168. Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). A/B vs B-depth-1: 2n NSM16 1195.6->866.7 (-27.5%); 8n NSM16 1781.8->1464.5 (-17.8%), NSM20 1452.8->1223.6 (-15.8%), NSM32 957.6->809.6 (-15.5%, beats NCCL-EP 8n 973us by 17%). Cumulative (StageA+Bd1+Bd3) vs pre-opt baseline: 8n NSM16 1924.2->1464.5 (-23.9%), NSM32 1030.3->809.6 (-21.4%).
2026-06-18 22:20:29 +00:00
Qinghua Zhou
5daea8f8cb ep(ncclep): inc6 Stage B-depth-1 cross-token TMA pipeline (-2.5% dispatch)
Carry the TMA S2G pipeline (in-flight groups + ring stage index) ACROSS token boundaries in the flat all-sender hidden-copy, removing the per-token full wait_group 0 drain. Last chunks of token N now overlap first chunks of token N+1 instead of a hard drain at every token seam (~1 NVLink RT/token, dominant at low SM). Global stage = (ep_tma_gbase + chunk) % NStage with ep_tma_gbase += nchunks per token; drain-before-reuse preserved (NStage 4 > InFlight 2) across seams; single wait_group 0 + fence.proxy.async.global at warp end. SMEM/REG byte-identical to Stage A (43328/168, pure control-flow). Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). 8-node FLAT A/B vs Stage A: NSM16 1826.3->1781.8 (-2.4%), NSM20 1497.5->1452.8 (-3.0%), NSM32 983.3->957.6 (-2.6%, now beats NCCL-EP 8n 973us). 2n NSM16 1228.6->1195.6 (-2.7%). Cumulative w/ Stage A vs pre-opt baseline: 8n NSM16 -7.4%, NSM32 -7.1%.
2026-06-18 21:10:12 +00:00
Qinghua Zhou
8386ed2a1f ep(ncclep): inc6 Stage A pipelined TMA sender ring (-4-5% dispatch)
Replace the per-chunk full-drain (cp.async.bulk.wait_group 0 after every chunk) in the flat all-sender TMA hidden-copy with an NCCL-EP-style produce-1-ahead + lazy in-flight drain (cp.async.bulk.wait_group.read N), so consecutive chunks S2G NVLink writes overlap each other and the next chunks G2S HBM load. Ring deepened NStage 2->4, chunk 2048->1024 (SMEM-neutral: 6*4*1024==6*2*2048==24576B; +96B for 2 extra mbarriers; REG unchanged at 168). Dispatch-only; combine byte-identical. 8-node FLAT A/B (all PASS recv29815): NSM16 1924.2->1826.3 (-5.1%), NSM20 1565.0->1497.5 (-4.3%), NSM32 1030.3->983.3 (-4.6%, ~= NCCL-EP 8n 973us). 2-node NSM16 1285.2->1228.6 (-4.4%).
2026-06-18 16:59:11 +00:00
Qinghua Zhou
3b5270e5d5 ep(ncclep): inc6 flat combine + decoupled dispatch/combine SM knobs
Completes the flat all-sender path (MSCCLPP_EP_FLAT) end-to-end and makes
its dispatch and combine SM counts independently tunable.

- buffer.cc internode_combine: under the flat path, skip the pre-combine
  cached_notify + fifo-advance. The flat combine is the inc5 direct-gather
  (it already needs no forwarder), but cached_notify's sm_id>=3 branch
  indexes rdma_channel_prefix_matrix (the forwarder-produced
  recv_rdma_channel_prefix_matrix, never written under all-sender flat) to
  derive token ranges and then writes combined_nvl_head over that range ->
  out-of-bounds write / intermittent illegal memory access in the bench.
  Gated on ep_flat + recv_pool_global_ptrs + ep_combine_recv_idx; the
  2-hop path is unchanged.
- buffer.cc: decouple the flat dispatch/combine grids from config.num_sms.
  MSCCLPP_EP_DISPATCH_NSM caps the all-sender dispatch block count
  (clamped to [1, num_sms/2]); MSCCLPP_EP_COMBINE_NSM caps the combine
  block count (clamped to [2, num_sms]). Both flat-only. num_channels is
  the dispatch token-partitioning granularity (flows into notify_dispatch,
  the prefix-matrix allocations and the grid), so lowering it stays
  self-consistent; the now-vestigial combine prefix-matrix shape asserts
  are relaxed under flat.
- test: unify the SM-count env var to MSCCLPP_EP_NSM for both the
  internode and intranode tests, with MSCCLPP_EP_NUM_SMS kept as a legacy
  fallback (internode default 152, intranode 20).
- README: document MSCCLPP_EP_FLAT, MSCCLPP_EP_DISPATCH_NSM and
  MSCCLPP_EP_COMBINE_NSM, add a flat all-sender subsection, and correct
  the SM-count row to MSCCLPP_EP_NSM.

Validated 2-node GB200 (e256/topk8, 8 ranks): full flat dispatch+combine
PASS (~1e-6, no IMA), dispatch 500us / combine 451us. Dispatch and combine
SM sweeps are independent (combine flat at ~451us while dispatch scales
698->495us, and vice versa). FLAT=0 inc5 baseline unchanged (568/488us).
2026-06-17 23:52:44 +00:00
Qinghua Zhou
1805ad0db6 ep(ncclep): inc6 flat all-sender dispatch (kEpFlat, dispatch validated)
Adds the FLAT all-sender dispatch path, gated behind MSCCLPP_EP_FLAT
(requires MSCCLPP_EP_DIRECT). Eliminates the forwarder + coordinator +
receiver roles so every SM block is a sender, and delivers per-token
metadata straight to the destination recv pool instead of via the
2-hop RDMA ring -> forwarder -> NVL receiver pipeline.

- config.hpp: append a per-token 128B metadata region after the
  worst-case hidden region in the recv pool (kEpRecvPoolMetaBytes,
  get_recv_pool_meta_base). Allocated unconditionally (cheap vs hidden),
  only touched when kEpFlat is set.
- internode_ncclep.cuh: kEpFlat __constant__ gate. Sender writes
  SourceMeta + scales + topk straight into each destination pool's meta
  region at the token's final recv slot. Under kEpFlat the launch is
  all-sender: num_channels = gridDim.x, channel_id = sm_id,
  is_forwarder = false; the sender-coordinator, forwarder, and NVL
  receiver roles early-return, and the sender skips the ring head/tail
  flow-control wait (no forwarder drains it).
- internode.cu / api.cuh: flat_meta_drain kernel + launcher copies the
  pool meta region into the recv_* output tensors, rebasing topk_idx to
  this rank's local expert range (weight 0 when out of range). Host grid
  launches num_channels blocks (not x2) when FLAT+DIRECT.
- buffer.cc: call flat_meta_drain on the comm stream after dispatch when
  ep_flat && ep_use_direct && uncached && topk present.
- test: MSCCLPP_EP_BENCH_DISPATCH_ONLY skips combine (which still needs
  the forwarder/receiver breadcrumbs) so the all-sender dispatch ceiling
  can be measured; dispatch correctness check is retained.

Validated 2-node GB200 (e256 / topk8, 8 ranks): dispatch correct
(recv 21820 tokens, per-source ranges exact). flat_meta_drain proven
byte-correct by overwriting the receiver's metadata output and still
passing combine (~1e-6). All-sender dispatch is 13-66% faster than the
2-hop path at equal block count and reaches better-than-baseline-peak
throughput with ~1/4 the blocks. Combine under the flat path (forwarder
breadcrumb rework) is the remaining follow-up.
2026-06-17 21:13:13 +00:00
Qinghua Zhou
183dcb5daa ep(ncclep): inc5 sender-direct TMA dispatch (lane-striped ring)
Port the NCCL-EP hybrid_ep.cuh TMA dispatch path into the inc5 sender-direct
kEpDirect branch, gated by EP_NCCLEP_TMA. kRDMASender streams each token hidden
HBM->SMEM via cp.async.bulk (G2S, mbarrier complete_tx) then fans the staged
tile S2G straight to every destination GPU recv-pool fabric VA.

Two correctness mechanisms from the standalone probe + NCCL-EP reference:
- per-CHUNK SMEM ring (kEpTmaSndNStage=2, chunkb=2048) so no two back-to-back
  cp.async.bulk ops reuse a single tile+mbar to a cross-node fabric VA, which is
  the unspecified-launch-failure trigger. SMEM-neutral vs the single-slot 4096
  layout (6*2*2048 == 6*4096); dispatch static SHARED stays 43360B (< 48KB).
- LANE-STRIPED S2G broadcast (lane j -> dst pool j, j+=32) instead of a serial
  lane-0 loop; lane 0 issues G2S, the whole warp waits on the G2S mbarrier, each
  lane drives its own destination, drain wait_group 0 + syncwarp before tile reuse.

Validated 2 nodes x 4 GPU (e256/topk8, hidden=7168, tokens=4096, MSCCLPP_EP_DIRECT=1):
dispatch PASS (no launch failure, byte-correct combine), 569.5us vs 618.5us
warp-store baseline (-7.9%). Off by default (EP_NCCLEP_TMA=0); baseline byte-
identical when unset.
2026-06-17 07:12:22 +00:00
Qinghua Zhou
462ab1661a docs(ep): document MSCCLPP_EP_DIRECT and MSCCLPP_EP_INTRA_DIRECT for GB200
Add the two GB200 direct-path env flags to the EP README runtime-knobs
table plus a dedicated subsection:
- MSCCLPP_EP_DIRECT=1: internode sender direct-write dispatch + receiver
  gather-direct combine (one master flag), with the MSCCLPP_EP_FABRIC_IPC
  prerequisite and recv-pool input contract.
- MSCCLPP_EP_INTRA_DIRECT=1: separate single-node intra-node sender
  direct-write flag.
Includes measured GB200 NVL72 numbers and the single-node launch note.
2026-06-15 23:08:43 +00:00
Qinghua Zhou
14f131407b ep(ncclep): inc5 combine-gather correctness at >32 ranks (16n)
The kEpDirect combine-gather path used the flat combine_token<kEpNumRanks>
helper, whose discovery maps one warp lane per rank. That is correct only
while kEpNumRanks <= 32; at >= 16 nodes (kEpNumRanks > 32) the __shfl-by-rank
source lane wraps mod 32, so ranks >= 32 alias 0..31 and num_topk_ranks
overflows topk_ranks[8] -> combined_x came out all zeros.

Split the gather block with if constexpr (kEpNumRanks <= 32): <= 8 nodes
compile to exactly the original fast path (byte-identical, no perf change),
while > 32 ranks use a chunked-ballot discovery (scan ranks in 32-wide
chunks, compact set lanes via __ballot_sync) plus a pre-loaded register-array
reduction. Verified combine correctness 2/4/8/16n and combine timing
unchanged at <= 8n (2n 494us, 4n 634us, 8n 740us).
2026-06-12 05:45:36 +00:00
qinghuazhou
3b6b2ac303 ep(intranode): sender direct-write dispatch (MSCCLPP_EP_INTRA_DIRECT)
Eliminate the 2-hop ring + receiver hidden-drain in intranode dispatch. When
MSCCLPP_EP_INTRA_DIRECT is set, the sender computes each token's final recv_x
index (rank_prefix + channel_prefix + running count, identical to the receiver's
total_offset+chunk_idx) and writes hidden straight to the destination's
peer-mapped recv pool; recv_x becomes a zero-copy view of that pool and the
receiver skips the hidden drain (metadata still flows through the ring, keeping
indices aligned). Reuses the Increment-4 recv pool (recv_pool_ptrs_gpu).
Baseline byte-identical when unset.

Single-node 4-GPU A/B @4096/7168/256-exp/top8, 152 SMs: dispatch 3829.7us ->
283.9us (-92.6%, 13.5x), now SM-scaling; correctness exact-match (0.0) at 128
and 4096 token scale.
2026-06-11 23:07:28 +00:00
qinghuazhou
cc34e72b64 ep(ncclep): inc5 combine-direct gather (kEpDirect)
Stage 1: dispatch persists its per-(token,dst) recv-pool slot (= ep_my_idx)
into a Buffer-owned gather map (ep_combine_recv_idx_gpu), no Python API change.

Stage 2: under MSCCLPP_EP_DIRECT, combine gathers each token contributions
DIRECTLY from the peer recv pools (recv_pool_global_ptrs[r] + header +
recv_idx*hidden) and reduces locally (reusing combine_token<> with a
pool-reading recv_fn), skipping the nvl_channel + forwarder + rdma_channel
2-hop. Receiver-direct dual of the dispatch sender-direct write.

Combine A/B (same .so, env MSCCLPP_EP_DIRECT, SKIP_COMBINE=0, all PASS):
  2n/4096 gather 494.3 vs 2-hop 983.8us (-49.8%)
  4n/4096 gather 646.5 vs 2-hop 1456.2us (-55.6%)
  8n/4096 gather 754.3 vs 2-hop 2358.8us (-68.0%)
Round-trip 8n/4096: 1983us vs inc4a 4141us (-52%), ~= NCCL-EP.
2026-06-11 06:37:25 +00:00
qinghuazhou
1f7942a804 ep(ncclep): inc5 keep full rdma clean for combine under kEpDirect
cached_notify serves both cached dispatch (shrunk ring, is_cached_dispatch=true)
and combine (is_cached_dispatch=false). Combine still ships hidden through the
rdma ring, so its clean must stay full-slot; only shrink the dispatch clean.
Fixes combine desync/timeout under MSCCLPP_EP_DIRECT in the bench loop.

Combine A/B (SKIP_COMBINE=0, correctness PASS): 4n/4096 inc5 vs inc4a
combine 1423 vs 1435us; 8n/4096 2281 vs 2233us (combine path unchanged).
2026-06-11 04:07:58 +00:00
qinghuazhou
ca829f6e8f ep(ncclep): inc5 ring-slot shrink + read-once direct write
Under kEpDirect the rdma ring no longer carries hidden (ring_hidden_int4=0);
hidden goes straight to the dest pool. Clean meta (notify_dispatch +
cached_notify) matches the small slot so the bench loop does not desync.
Sender reads x once and broadcasts to all destination pools (gather-pointer
st_pool_broadcast) instead of re-reading per destination.

Dispatch A/B (4 nodes, SKIP_COMBINE): 2048 557.8 vs 586.9us (-5.0%);
4096 925.2 vs 1033.4us (-10.5%).
2026-06-11 02:31:19 +00:00
qinghuazhou
b6140b0229 ep(ncclep): increment 5 - sender direct-write dispatch (kEpDirect, dispatch-only WIP)
On the NVL72 single NVLink domain, eliminate the 2nd NVLink hop in HT
dispatch: the RDMA sender writes each token's hidden straight into the
destination GPU's domain-wide recv_x pool over fabric VA, instead of
staging through the rdma_channel ring + forwarder transpose.

Gated by env MSCCLPP_EP_DIRECT (default off => byte-for-byte inc4a;
verified 584.6us == inc4a 583.3us with the gate off).

- buffer.{cc,hpp}: domain-wide recv-pool base exchange
  (recv_pool_global_ptrs, registerMemory all_transport + send/recvMemory,
  env-gated). Validated the cuMem pool is fabric-mappable cross-node
  (all 16 peer bases imported at 4-node).
- kernels/internode_ncclep.cuh: kEpDirect gate. Per-destination-GPU base
  index = peer-read recv_gbl_rank_prefix_sum[rank-1] + channel start;
  seq-locked per-GPU running count (matches the receiver drain order).
  Sender writes hidden direct to recv_pool_global_ptrs[dst_global];
  forwarder skips the hidden copy (fuse + xnode) under kEpDirect.
  __threadfence_system at the sender epilogue for cross-node visibility.
- kernels/api.cuh, kernels/internode.cu: thread recv_pool_global_ptrs
  through the dispatch signature/macro/launch + kEpDirect upload.

Dispatch correctness validated at 4-node (recv_x per-source block check
PASS, recv 13465, MSCCLPP_EP_DIRECT=1 SKIP_COMBINE=1).

WIP / not yet a perf win: the coordinator still RDMAs the full ring slot
(including the now-unused hidden region), so hidden is shipped twice.
Pending: shrink the ring slot to metadata-only under kEpDirect (the perf
win), and the parallel combine rework (combine still expects the 2-hop
breadcrumbs, so inc5 is dispatch-only for now -> run with SKIP_COMBINE).
2026-06-11 00:33:36 +00:00
qinghuazhou
31c930d8c5 ep(ncclep): increment 4a - VMM unicast recv pool (TMA-eligible peer mapping)
Replace the increment-3 cudaMalloc-carve + cudaIpc recv-output pool with a
VMM pool (mscclpp::detail::gpuCallocPhysical, cuMem FABRIC/POSIX-FD) exchanged
via registerMemory(ipc)+sendMemory/recvMemory so peers import directly
dereferenceable, TMA-eligible fabric VAs (recv_pool_ptrs[]) instead of the
cudaIpc carve. Kernel param recv_pool_offset(int64 into buffer_ptrs) ->
recv_pool_ptrs(void** peer bases). Falls back to the inc3 inline carve when
fabric-IPC is unsupported. PASS, dispatch 1052us (== inc3, perf-neutral);
removes the ~1GB inline pool from the buffer_ptrs allocation.

Also lands a gated (EP_NCCLEP_TMA, default OFF) cross-GPU forwarder TMA write
path targeting the VMM peer pool. NOTE: enabling it currently faults
(cudaErrorLaunchFailure) in-kernel even to local VMM; single-process VMM+TMA
probe was a false positive. Kept gated off for future investigation; the
default build is unaffected.
2026-06-08 22:16:05 +00:00
qinghuazhou
ac25cf18b6 ep(ncclep): increment 3 - cross-GPU peer-map direct-write (eliminate receiver hidden drain)
Carve a fixed-capacity recv-output pool onto the END of the NVL cudaMalloc (covered by the same IPC handle as buffer_ptrs, NOT added to num_nvl_bytes so it dodges the INT_MAX registered-size cap). recv_x becomes a zero-copy from_blob view of the local pool; the pool header holds this rank recv_gbl_rank_prefix_sum (peer-readable). The cross-GPU forwarder computes the destination final recv_x index from the peer-read prefix + channel-local start_sum + per-src count and writes hidden STRAIGHT to the destination peer pool; the receiver skips its hidden drain (keeps scales/topk/meta). Same-GPU path (inc1) unchanged (local pool == recv_x). Falls back to torch::empty + receiver when num_recv_tokens > pool cap.

Files: config.hpp (pool sizing static helpers), buffer.hpp (recv_pool_off_), buffer.cc (enlarge ctor alloc, from_blob recv_x, publish prefix header), api.cuh + internode.cu (recv_pool_offset param, guarded EP_DISPATCH_EXTRA_ARGS), internode_ncclep.cuh (xnode_direct forwarder + receiver hidden-skip).

Clean interleaved 4-node A/B {38,41,59,75}: inc2 ~1132us -> inc3 ~1065us (-5.9%), order-independent, FULL PASS (recv 26809, combine correct). Captures ~the DRAIN_NOOP ceiling (1048us). Cumulative vs baseline 1257us: -15.4%. Gap to NCCL-EP 902us now ~156us (was 355us).
2026-06-08 16:49:11 +00:00
qinghuazhou
2ebf81aa35 ep(ncclep): increment-3 de-risk - DRAIN_NOOP probe + SKIP_VERIFY test gate
internode_ncclep.cuh: add EP_NCCLEP_DRAIN_NOOP compile gate (default 0, inert) - when 1 the NVL receiver keeps all control flow but skips the data copies, to measure the dispatch-time upper bound of eliminating the cross-GPU receiver drain. Probe result (4-node {38,41,59,75}): dispatch inc2 1124us -> DRAIN_NOOP 1048us (-6.8%), agg_bw 836->896 GB/s => confirms real headroom for the cross-GPU peer-map direct-write rework (ceiling ~-16.6% cumulative vs baseline).

test_internode_multirank.py: gate the dispatch range-assert and combine assert behind MSCCLPP_EP_SKIP_VERIFY env so dispatch timing can be measured when recv_x is intentionally incomplete (perf probing).
2026-06-08 16:25:53 +00:00
qinghuazhou
fa07b496ae ep(ncclep): increment 2 - deepen drain-copy MLP unroll 5->28
Full-token outstanding int4 loads (hidden_int4=896 => 28*32) on the 3 dest-side drain copies: same-GPU direct-write, forwarder cross-GPU staging write, receiver staging->recv_x drain. Sender packing copy left at 5. Hides HBM/fabric store-completion latency (more memory-level parallelism). This is the proven Lever-C MLP lever; its prior net-regression came from combine register inflation in the SHARED translation unit, which cannot happen here (dispatch_ncclep is a separate TU; combine untouched). Clean interleaved 4-node A/B {38,41,59,75}: inc1 ~1176us -> inc2 ~1124us (-4.4%), full PASS. Cumulative vs baseline 1257us: -10.6%.
2026-06-08 15:51:12 +00:00
qinghuazhou
dbef7a5f31 ep(ncclep): increment 1 - same-GPU fused direct-write to recv_x
Forwarder warps whose dst_nvl==nvl_rank (token arrived on its destination GPU, no NVLink transpose) write straight to final recv_x, computing the same index the receiver would (recv_gbl_rank_prefix_sum base + channel-local start_sum + per-src running count). The matching receiver warp (src_nvl==nvl_rank) is skipped. Eliminates the dest-side HBM double-bounce (nvl_channel_x staging + receiver copy) for the same-GPU token fraction (~25%). recv_x contents + send_nvl_head breadcrumb preserved bit-identical => combine unaffected. 4-node {38,41,59,75}: dispatch 1257->1174-1185us (-6%), full PASS (recv 26809).
2026-06-08 15:43:53 +00:00
qinghuazhou
3ad6b70d7d ep: scaffold guarded dispatch_ncclep kernel (NCCL-EP port baseline) + launch-site selection
internode_ncclep.cuh: dispatch_ncclep<> = production dispatch<> kernel under EP_DISPATCH_NCCLEP, to be restructured into NCCL-EP warp-specialized overlap (concurrent N2N fabric-put + intra-node G2S/S2G drain via MemoryChannel put/signal). internode.cu: guarded include + EP_DISPATCH_KERNEL alias selects dispatch_ncclep when guard ON, else production dispatch (byte-identical). Builds clean with -DMSCCLPP_EP_DISPATCH_NCCLEP=ON.
2026-06-08 07:08:55 +00:00
qinghuazhou
3a9ca157f3 ep: add MSCCLPP_EP_DISPATCH_NCCLEP build guard (EP_DISPATCH_NCCLEP) for NCCL-EP-ported warp-specialized HT dispatch (default OFF) 2026-06-08 06:58:13 +00:00
qinghuazhou
ef4f8421a2 ep/internode: P4b - tune kNumDispatchRDMASenderWarps 7->6 (HT 2n dispatch 728->689us on GB200) 2026-05-27 03:46:14 +00:00
qinghuazhou
c58e9b7408 ext/ep/buffer: P2 - shard proxy on Blackwell multi-node by default (num_proxies = local_world_size when num_rdma_ranks > 1) 2026-05-27 02:31:19 +00:00
qinghuazhou
2e73318c89 Revert "Clean up completed communicator receives (#804)" - thread-safety regression causing cudaErrorIllegalAddress on HT internode GB200 2026-05-27 02:15:39 +00:00
copilot-swe-agent[bot]
04ea24da8d Fix python lint formatting in internode multirank test
Agent-Logs-Url: https://github.com/microsoft/mscclpp/sessions/f5220581-e26c-49d8-98fa-e1b8ab011898

Co-authored-by: seagater <7475084+seagater@users.noreply.github.com>
2026-05-20 18:04:00 +00:00
Qinghua Zhou
757c5ec831 Merge qinghuazhou/expert_parallel_gb200 2026-05-20 01:56:34 +00:00
Qinghua Zhou
394b36cd37 Merge branch 'qinghuazhou/expert_parallel_gb200' into qinghuazhou/expert_parallel 2026-05-20 01:50:13 +00:00
Qinghua Zhou
cb93dd585b tests/ep: Unify the name of EP benchmark variables 2026-05-20 01:48:29 +00:00
Qinghua Zhou
009929174f Merge remote-tracking branch 'origin/main' into qinghuazhou/expert_parallel_merge_main_test
# Conflicts:
#	src/core/connection.cc
#	test/mp_unit/port_channel_tests.cu
2026-05-18 22:02:50 +00:00
Qinghua Zhou
20bd1ec55b ext/ep: fix CUDA 11.8 build + apply clang-format/black
- src/core/atomicadd_kernel.cu: restore the legacy 3-arg
  cuCtxCreate(&proxyAtomicCtx_, 0, cuDevice) in the
  '#else' branch of the CUDA_VERSION >= 12050 guard. A prior
  edit had corrupted it to 'cuCtxCreate(&proxyAtomicCtx_vice)',
  which broke the CUDA 11.8 build (CodeQL CUDA cuda11.8 and
  MSCCLPPLang cuda11.8 jobs).

- Apply clang-format to src/ext/ep/* (no logic changes,
  fixes the cpplint CI job).

- Apply black to test/python/ext/ep/test_internode_multirank.py
  and test_intranode_multirank.py (no logic changes, fixes
  the pylint CI job).
2026-05-18 21:44:20 +00:00
Binyang Li
60a6d7219f Clean up completed communicator receives (#804)
## Summary
- Release the reference after last requests are ready.
- Keep ordered receive chaining for repeated rank/tag operations while
cleaning up completed receive bookkeeping.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-15 21:06:50 +00:00
Changho Hwang
252a422030 Handle PortChannel flush asynchronously from the host proxy (#802)
When a `PortChannel` requests `flush()`, the host-side proxy was being
blocked, which may cause head-of-line blocking of other parallel
`PortChannel`s' requests. Now the proxy handles `flush()` requests
asynchronously. This feature especially helps performance when we need
multiple IB QPs and need to flush QPs.
2026-05-15 11:50:43 -07:00
Qinghua Zhou
98f6b1a936 Merge branch 'qinghuazhou/expert_parallel_fix_h100' into qinghuazhou/expert_parallel 2026-05-14 22:28:11 +00:00
Qinghua Zhou
1f0948c8e4 Merge branch 'qinghuazhou/expert_parallel' into qinghuazhou/expert_parallel_fix_h100 2026-05-14 21:37:44 +00:00
Qinghua Zhou
5911998181 ext/ep: gate NVLS HT B2 on cross-host fabric IPC support (H100 fix)
The NVLS HT B2 path introduced in 3ab2e43b activated whenever
isNvlsSupported() && num_rdma_ranks > 1. On H100 NDv5 / Azure CX-7 RoCE
that is true (H100 has intra-node NVLink multicast), but there is no
cross-host NVSwitch fabric. mscclpp's GpuIpcMem::create then falls back
to CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR whose handle exchange routes
through /tmp/mscclpp_bootstrap_<pid>.sock -- a master-rank-0 unix-domain
socket worker ranks cannot reach. Symptom on every commit since 3ab2e43b:

    RuntimeError: connect() failed for unix socket to
    /tmp/mscclpp_bootstrap_<pid>.sock

MSCCLPP_EP_FABRIC_IPC=0 was being silently ignored.

src/ext/ep/buffer.cc: add resolve_fabric_ipc_supported() helper.
Resolution:
  1. MSCCLPP_EP_FABRIC_IPC env var (0/off/false/no => off,
     1/on/true/yes/force => on, otherwise auto).
  2. Auto-detect: requires both
       - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1
       - device compute capability >= sm_100 (Blackwell+).

Gate both use_fabric_ipc_alloc (RDMA buffer allocator) and nvls_ht_enabled
(HT B2 multicast region) on fabric_ipc_supported. On H100 both fall back
to cudaMalloc + legacy PortChannel; on GB200 NVL72 both remain enabled.
Diagnostic prints now show fabric_ipc=.

test/python/ext/ep/test_internode_multirank.py: replace hardcoded
NUM_MAX_NVL_PEERS=4 with a runtime _detect_local_world_size() helper
that reads MSCCLPP_EP_LOCAL_WORLD_SIZE / LOCAL_WORLD_SIZE /
OMPI_COMM_WORLD_LOCAL_SIZE, falling back to torch.cuda.device_count().
Makes the test correct on both H100 (8 GPUs/node) and GB200 (4 GPUs/node)
without code changes.

src/core/atomicadd_kernel.cu: use cuCtxCreate_v4 for CUDA >= 12.5 (the
underlying symbol was renamed); preserve legacy 3-arg cuCtxCreate for
older toolkits.

Verified on 2x H100 NDv5 at HEAD:
  LL intranode  (8 GPUs)            PASS
  LL internode  (16 GPUs, 2 nodes)  PASS
  HT intranode  (8 GPUs)            PASS
  HT internode  (16 GPUs, 2 nodes)  PASS

Diagnostic on H100:
  [mscclpp_ep] rdma_buffer allocator: cudaMalloc (low_latency=0, nvls=1, fabric_ipc=0)
  [mscclpp_ep] NVLS HT multicast: disabled (low_latency=0, num_rdma_ranks=2, nvls_supported=1, fabric_ipc=0)
2026-05-14 21:29:10 +00:00
Changho Hwang
5d608feaa5 Enhance cross-node CudaIpc availability check (#803) 2026-05-14 14:06:12 -07:00
Caio Rocha
40295df4c4 Adding Support to bf16 Executor Tests (#801)
This pull request adds support for the `bfloat16` (bf16) data type to
the test executor, including both Python and CUDA components. The
changes ensure that `bfloat16` is handled consistently across argument
parsing, data type conversion, and test kernel implementations.
Additionally, the CUDA verification kernels are refactored to use
parameterized tolerances for improved numerical accuracy checks.

**Support for bfloat16 data type:**

* Added handling for `bfloat16`/`bf16` in the Python test executor's
argument parsing, data type conversion (`parse_dtype`,
`dtype_to_mscclpp_dtype`), and help text.
[[1]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcR27-R28)
[[2]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL122-R135)
[[3]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL246-R251)
* Updated output to display the correct data type string for `bfloat16`.

**CUDA kernel and test improvements:**

* Included `bfloat16` headers and defined test data fill and gather
kernels for `bfloat16` on both CUDA and HIP platforms.
[[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R8-R11)
[[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R35)
[[3]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R54-R59)
[[4]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R133)
* Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to
use an explicit tolerance parameter (`Eps`) and added correct tolerances
for each data type, including `bfloat16`.
[[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L69-R85)
[[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L94-R113)

These changes ensure full support for `bfloat16` in the test executor
and improve the accuracy and maintainability of the CUDA test kernels.

---------

Co-authored-by: Caio Rocha <caiorocha@microsof.com>
2026-05-14 09:56:11 -07:00
qinghuazhou
7650e699a0 Merge qinghuazhou/expert_parallel_gb200 (Migration Phase 5 GB200) 2026-05-13 22:48:22 +00:00
qinghuazhou
f7cbf1fe7a ext/ep: README - add Migration Phase 5 (Azure GB200 NVL72 port) 2026-05-13 22:48:16 +00:00
qinghuazhou
399745f4c2 Merge qinghuazhou/expert_parallel_gb200 (LL NVLS table) 2026-05-13 22:41:14 +00:00
qinghuazhou
4c13937474 ext/ep: README - document GB200 NVLS variants in LL transport table + GB200 validation rows 2026-05-13 22:41:08 +00:00