mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 02:47:23 +00:00

Author	SHA1	Message	Date
Qinghua Zhou	00e41b8976	ep(python): MoECommunicator mode="ht" (FLAT) + HT benchmarks via the high-level API Unify the high-level MoECommunicator to select its backend from MoECommunicatorConfig.mode: - mode="ll": low-latency (EXPERT_MAJOR) via MoERuntime (reused from binyli/ep, PR #818). The LL runtime is built lazily, so a build that only binds the HT Buffer can still use mode="ht" without MoERuntime being present. - mode="ht": high-throughput (FLAT) via the DeepEP-style Buffer; intranode vs internode is auto-selected from the RDMA buffer-size hint. dispatch() gains an optional previous_handle that reuses the routing layout from a prior dispatch with identical topk_ids (cached intranode dispatch also skips notify_dispatch's host-side counter wait), letting a benchmark isolate the on-GPU dispatch-kernel cost (NCCL-EP ep_bench convention). Rewrite the intranode/internode HT benchmark loops to drive the public MoECommunicator(mode="ht") API instead of raw Buffer calls. Export MoERuntime. Validated on 1 node x 4 GB200 GPUs: correctness PASS; dispatch/combine match the raw-Buffer baseline under identical env (no high-level overhead).	2026-06-26 02:44:35 +00:00
Qinghua Zhou	e9a5acc7d4	ep(python): high-level MoECommunicator HT (FLAT) dispatch/combine API Implement the high-level MoE API from python/mscclpp/ext/ep/README.md for mode="ht" on top of the existing low-level Buffer (DeepEP-style) runtime. The user passes the tensors the model owns (input, topk_ids, weights, scales); dispatch returns MLP-ready FLAT tokens + per-local-expert counts; combine reverses it from an opaque DispatchHandle. No kernel changes — this is a contract-preserving Python wrapper over the validated HT dispatch/combine kernels (TMA-staged combine gather + all-sender dispatch). New module python/mscclpp/ext/ep/communicator.py: - MoECommunicator / MoECommunicatorConfig: configure comm + expert placement + shape once; mode="ht" -> DispatchLayout.FLAT. - dispatch(): runs get_dispatch_layout (count/membership) then intranode or internode dispatch; returns DispatchOutput(tokens [total_recv_tokens, H] grouped by local expert, num_tokens_per_expert, expert_offsets=cumsum) + DispatchHandle. - combine(): reverses from the handle only; returns [T, H] token-major. - DispatchHandle carries a transport-tagged combine_meta bundle (intranode vs internode reverse-dispatch tensors differ), opaque to the MLP. - Intranode vs internode is auto-selected from get_rdma_buffer_size_hint (0 bytes <=> world_size <= NUM_MAX_NVL_PEERS), so the user never picks transport. - Optional dispatch_async/combine_async + create_overlap_config scaffolding. - DispatchOutput, DispatchHandle, QuantScales, DispatchLayout, CommOverlapConfig dataclasses matching the README contract. __init__.py now exports the high-level API alongside Buffer/Config/EventHandle. First version: BF16 only (FP8 scales + block-level overlap are follow-ups; they raise NotImplementedError and need no signature change). Imports + API shape + layout/cumsum/guards verified on a GB200 node; full multi-rank run pending.	2026-06-25 02:35:16 +00:00
Qinghua Zhou	c7a2df6885	ep(intranode): TMA direct-gather combine + all-sender dispatch + per-phase SM knobs Port the GB200 internode flat-path optimizations to the single-node (intranode, NVLink/IPC) HT path and add independent dispatch/combine SM controls. All under MSCCLPP_EP_INTRA_DIRECT; the legacy 2-hop kernels remain as byte-identical fallbacks. Single node, 4 ranks, t=4096 d=7168 e=256 topk=8, all PASS. TMA direct-gather combine (combine_intranode_gather_tma) - Faithful port of combine_flat_gather_tma: for each output token, discover the contributing ranks (send_head>=0) and gather each contributor's hidden row straight from that rank's IPC-mapped recv pool (slot = ep_combine_recv_idx) through a kStages-deep cp.async.bulk (TMA) SMEM pipeline, reduce from SMEM, st_na_global. Token-parallel grid (no channel partitioning). - Default ON under INTRA_DIRECT; MSCCLPP_EP_COMBINE_TMA=0 -> 2-hop ring fallback. - combine: 1273 -> 474 us @ 16 SMs (2.7x), 703 -> 269 us @ 32 SMs (2.6x). All-sender dispatch (dispatch_allsender + intranode_meta_drain) - The 2-hop dispatch splits the grid 50/50 sender/receiver; under INTRA_DIRECT the receiver blocks idle (hidden is written directly), wasting half the SMs. Make every block a sender (one channel per block, num_channels=num_sms) so all blocks move hidden directly to the dest pools, matching NCCL-EP's all-sender block count. Per-token metadata (src_idx, rebased topk_idx, weights, scales) is packed into the dest pool META region; intranode_meta_drain unpacks the local META region into the recv_* tensors (mirrors internode flat_meta_drain). - Default ON under INTRA_DIRECT + TMA combine (the layout needs the TMA combine); auto-disables when the ring combine is forced. Override MSCCLPP_EP_INTRA_ALLSENDER. - dispatch: 1006 -> 454 us @ 16 SMs (beats NCCL-EP's 464 us), 545 -> 287 us @ 32 SMs. Independent SM knobs (intranode) - MSCCLPP_EP_DISPATCH_NSM sets the dispatch block count (clamped to config.num_sms); MSCCLPP_EP_COMBINE_NSM sets the token-parallel TMA combine grid independently. - The dispatch gather map (ep_combine_recv_idx) is now also allocated + written on the INTRA_DIRECT path (previously internode/MSCCLPP_EP_DIRECT only).	2026-06-24 03:23:48 +00:00
Qinghua Zhou	efdbd4313c	ep(ncclep): channel-adaptive warp count for combine TMA gather Select the TMA-staged flat-combine gather's warp (token) count per block by channel count instead of a single fixed value: 14 warps (WIDE) up to 12 channels, 12 warps (NARROW) above. More warps add token-parallelism that wins when the grid has few blocks (low SM, latency-bound), but at higher block counts the marginal warp costs more scheduling than it buys. This removes the only remaining regression of the flat-14-warp default: 14 warps was -2..-9% at <=10 channels but +3% at 16 channels (SMs=32). The adaptive switch keeps the low-SM wins and takes the faster 12-warp path at SMs=32, so combine is now Pareto-optimal across the SM range. Combine (us) @ e256/topk8 hidden=7168 tokens=4096, vs the flat-14w default: SMs=16 (8 ch -> 14w): 2n 650 4n 735 8n 776 (unchanged, all WIDE) SMs=20 (10 ch -> 14w): 2n 526 4n 610 8n 675 (unchanged, all WIDE) SMs=32 (16 ch -> 12w): 2n 442 4n 564 8n 645 (was 455/584/665 -> -3%) All PASS, recv counts unchanged. Both warp counts instantiate their own kernel + dynamic-SMEM attribute (WIDE 14w = 229KB, fits the sm_100 cap). Threshold and both counts overridable via -D (EP_CMB_TMA_WARPS_WIDE/_NARROW/_MAXCH). MSCCLPP_EP_COMBINE_TMA=0 fallback unchanged.	2026-06-22 02:20:28 +00:00
Qinghua Zhou	c6c1492679	ep(ncclep): combine TMA gather default 12->14 warps (-2 to -7% combine) Raise the TMA-staged flat-combine gather's default warp (token) count per block from 12 to 14. 14 is the throughput sweet spot at hidden=7168 bf16: beyond it, the per-block SMEM budget forces a smaller TMA chunk, and the smaller descriptors then cost more than the extra token-parallelism buys (measured: 18-28 warps all regress). Combine (us) @ e256/topk8 hidden=7168 tokens=4096, NSM16, vs 12 warps: 2n 695 -> 648 (-7%) 4n 748 -> 735 (-2%) 8n 790 -> 776 (-2%) Also at 2n: NSM20 575 -> 526 (-9%); NSM32 442 -> 455 (+3%, already far ahead). All PASS, recv counts unchanged. One-line constant change (EP_CMB_TMA_WARPS); kernel logic, SMEM model (229KB at 14w/2-stage/chunk-64, fits the dynamic-shared opt-in), and the MSCCLPP_EP_COMBINE_TMA=0 fallback are unchanged. Still overridable via -D.	2026-06-21 07:34:39 +00:00
Qinghua Zhou	42f00577cf	ep(ncclep): TMA-staged flat-combine gather (default; -7 to -23% combine) Make the flat direct-gather combine stage each contributor's hidden chunks from the remote NVLink recv pools into shared memory via cp.async.bulk (TMA), then reduce from SMEM, instead of the synchronous ld_nc_global register-MLP gather. The async copy engine hides the remote-NVLink read latency at low register cost, so latency hiding no longer competes with occupancy. This wins at every channel count (no crossover) and every node scale, so it replaces both the lean and the unified flat-gather paths and needs no channel gate. New kernel combine_flat_gather_tma: token-per-warp, hidden split into 1KB (chunk=64 int4) TMA descriptors streamed through a 2-stage software pipeline; SMEM is independent of hidden, so it opts in to >48KB dynamic shared via cudaFuncSetAttribute. Defaults tuned on GB200 sm_100 @ hidden=7168 bf16. Default on; MSCCLPP_EP_COMBINE_TMA=0 falls back to the prior synchronous lean gather (<=14 channels) / unified flat branch. Combine (us) @ e256/topk8 hidden=7168 tokens=4096, vs the prior lean gather: 2n NSM16 705 -> 695 NSM20 619 -> 575 NSM32 577 -> 442 (-23%) 4n NSM16 912 -> 748 NSM20 732 -> 635 NSM24 704 -> 594 NSM32 665 -> 565 8n NSM16 953 -> 790 NSM20 813 -> 697 NSM32 749 -> 645 All beat NCCL-EP where measured (2n/4n/8n NSM16: 747/800/959). All PASS. Topology is single-hop NVLink (one MNNVL LSA domain), so this is an apples-to-apples gather, not a 2-hop reduce. Default-unset (non-flat / non-direct) paths are unchanged.	2026-06-21 05:17:46 +00:00
Qinghua Zhou	c3a4b641ac	ep(ncclep): extract lean flat-combine gather kernel (-12 to -30% combine at low SM) The unified `combine` kernel carries the heavy 2-hop forwarder path and compiles to ~80 registers, which on sm_100 limits it to 1 resident block/SM (25 warps, ~39% occupancy). Under the flat all-sender path only the direct gather actually runs (the forwarder is dead code), yet it inherits that register ceiling -- starving the NVLink pull-gather of the warp-level parallelism it needs to hide the (latency-, not bandwidth-bound) remote reads. Extract the flat direct-gather into its own lean kernel `combine_flat_gather`, with logic byte-for-byte identical to the unified kernel's flat branch (combine_token for <= 8 nodes, chunked-ballot discovery for > 8). Freed from the forwarder code it compiles to 47 registers, so a full 1024-thread block (32 warps) fits per SM at the same SM budget -> more warps in flight -> better latency hiding. The launcher picks it on the existing flat-direct gate (recv_pool_global_ptrs / ep_combine_recv_idx non-null, i.e. MSCCLPP_EP_DIRECT), so the 2-hop path is untouched. This is a CROSSOVER like the inc7 dispatch atomic path: the lean kernel's bigger blocks win when blocks are few (low SM, many tokens/block) but add scheduling overhead when blocks are many (high SM). Gate to num_channels <= 14 (== the inc7 dispatch threshold); above it, fall through to the unified kernel. 8n combine (flat, all PASS, recv 29815): num_channels 8 (NSM16, lean) 1086 -> 953 us (-12%, beats NCCL-EP 959) num_channels 16 (NSM32, unified) 753 -> 749 us (unchanged) 2n NSM16 1006 -> 707 us (-30%, beats NCCL-EP 747); 4n NSM16 1040 -> 912 us.	2026-06-20 07:20:37 +00:00
Qinghua Zhou	49a046396a	ep(ncclep): raise MSCCLPP_EP_DISPATCH_NSM clamp to [1, num_sms] for the flat path The flat all-sender dispatch launches one sender block per channel and has no forwarder, yet num_channels (== the dispatch block count) was hard-capped at config.num_sms/2 -- the 2-hop ring requirement. So at "num_sms=16" the flat dispatch ran on only 8 blocks while NCCL-EP always uses num_sms (16) blocks, which fully accounts for the apparent low-SM dispatch gap (the block count, not the kernel, was the difference). Raise the MSCCLPP_EP_DISPATCH_NSM clamp from [1, num_sms/2] to [1, num_sms] under the flat path, and size the RDMA/NVL buffers for the resolved channel count so the per-channel kernel offsets and clean range stay in bounds. A single knob now does it: MSCCLPP_EP_NSM=16 + MSCCLPP_EP_DISPATCH_NSM=16 -> 16 dispatch blocks (matching NCCL-EP), no num_sms2 workaround. The combine grid still uses num_sms/2, so the buffer is sized for max(dispatch_channels, num_sms/2). Unset -> num_sms/2 (the shipped default is byte-identical). Non-flat 2-hop path unchanged. config.hpp: add ep_flat_dispatch_channels() (resolved dispatch channel count) and ep_buffer_channels() (buffer sizing = max of dispatch channels and num_sms/2); both get__buffer_size_hint use the latter. buffer.cc: internode_dispatch resolves num_channels via ep_flat_dispatch_channels(). README: document the raised clamp. 8n dispatch (flat, all PASS, recv 29815): num_sms=16, DISPATCH_NSM unset (8 blocks) 1101 us (default, unchanged) num_sms=16, DISPATCH_NSM=16 (16 blocks) 784 us (-29%, beats NCCL-EP 973)	2026-06-20 04:07:57 +00:00
Qinghua Zhou	d212966c8d	ep(ncclep): inc7 atomic routing-slot assignment at low channel count (-21% dispatch @ NSM16) Under the FLAT all-sender path (kEpFlat), the per-channel sequential lock (rdma_send_next_token_idx spin-wait) serialized slot assignment for every token within a channel by token_idx. At low channel count (num_channels = num_sms/2) there are many tokens per channel, so this serialization dominated the dispatch time -- the NSM16 cliff vs NCCL-EP. Replace the spin-lock with atomicAdd on the shared slot counters (rdma_send_channel_next_tail, ep_count) when num_channels <= 14. Slot assignment becomes dense but order-independent, which is correct here because (a) the receiver validates each source's contiguous recv_x block by value range (order-agnostic), (b) combine gathers via the self-consistent ep_combine_recv_idx map, and (c) rdma_send_channel_tail is unused under kEpFlat (the coordinator warp returns early). The per-token breadcrumb stores (ep_combine_recv_idx, send_rdma_head) are also moved out of the critical section since their addresses are unique per (token, dst). This is a CROSSOVER optimization: at high channel count the lock was already cheap and the 6 sender warps contend on the shared atomics, so the proven spin-lock + plain-increment path is kept for num_channels > 14 (byte-equivalent to the previous behavior). The non-flat 2-hop path is unchanged. 8n dispatch (vs prev full-token TMA), all PASS: NSM16 (8ch) 1404 -> 1099 us (-21.7%) NSM24 (12ch) 984 -> 777 us (-21.1%) NSM28 (14ch) 862 -> 823 us (-4.5%) NSM32 (16ch) 786 -> 780 us (lock path, ~unchanged) 2n dispatch (8ch) 830 -> 760 us (-8.5%). Cumulative 8n NSM16 vs pre-opt -42.9%.	2026-06-19 21:39:10 +00:00
Qinghua Zhou	683a5a7648	ep(ncclep): inc6 B-depth-3 full-token TMA tile (1 S2G/dst, -3 to -6% more dispatch) Push the flat all-sender TMA S2G tile to the WHOLE bf16 token (7168B->14336B), so each destination is a single cp.async.bulk S2G per token (nchunks=1) -- matching NCCL-EP whole-token granularity (was 2 ops for half-token, 14 for the original 1024B sub-chunk). Full-token forces NSTAGE 4->2 to keep the 6x2x14336 = 168KB dynamic ring under the GB200 ~227KB/block opt-in cap (confirmed cudaDevAttrMaxSharedMemoryPerBlockOptin=232448); kEpTmaSndInFlight clamped to >=1 so NSTAGE=2 keeps 1 cross-token S2G in flight, plus a new InFlight<NStage drain-before-reuse static_assert. Kernel-only (host launcher already sizes dynamicSmemBytes from EP_TMA_SND_* macros). Static SMEM 18688->18560; REG 168. Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). A/B vs half-token Bd3: 2n NSM16 866.7->829.8 (-4.3%); 8n NSM16 1464.5->1404.1 (-4.1%), NSM20 1223.6->1155.4 (-5.6%), NSM32 809.6->786.0 (-2.9%). Cumulative (StageA+Bd1+Bd3+full) vs pre-opt baseline: 8n NSM16 1924.2->1404.1 (-27.0%), NSM32 1030.3->786.0 (-23.7%, beats NCCL-EP 8n 973us by ~19%).	2026-06-18 23:19:43 +00:00
Qinghua Zhou	aca50d56fd	ep(ncclep): inc6 Stage B-depth-3 whole-token TMA tiles via dynamic SMEM (-16 to -27% dispatch) Enlarge the flat all-sender TMA S2G tile from 1024B sub-chunks to HALF-token 7168B, cutting cp.async.bulk S2G descriptors from 14 to 2 per destination per token (~7x fewer TMA issues -- NCCL-EP issues 1 whole-token op). The 6x4x7168 = 168KB ring exceeds the 48KB static cap, so it moves to DYNAMIC shared memory: kernel uses extern __shared__ ep_tma_tile_snd_dyn indexed (warp_idNStage+stage)chunk; host opts in via cudaFuncSetAttribute(MaxDynamicSharedMemorySize) + cfg.dynamicSmemBytes (geometry shared through EP_TMA_SND_CHUNK_BYTES/NSTAGE macros). Static SMEM drops 43328->18688; REG unchanged 168. Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). A/B vs B-depth-1: 2n NSM16 1195.6->866.7 (-27.5%); 8n NSM16 1781.8->1464.5 (-17.8%), NSM20 1452.8->1223.6 (-15.8%), NSM32 957.6->809.6 (-15.5%, beats NCCL-EP 8n 973us by 17%). Cumulative (StageA+Bd1+Bd3) vs pre-opt baseline: 8n NSM16 1924.2->1464.5 (-23.9%), NSM32 1030.3->809.6 (-21.4%).	2026-06-18 22:20:29 +00:00
Qinghua Zhou	5daea8f8cb	ep(ncclep): inc6 Stage B-depth-1 cross-token TMA pipeline (-2.5% dispatch) Carry the TMA S2G pipeline (in-flight groups + ring stage index) ACROSS token boundaries in the flat all-sender hidden-copy, removing the per-token full wait_group 0 drain. Last chunks of token N now overlap first chunks of token N+1 instead of a hard drain at every token seam (~1 NVLink RT/token, dominant at low SM). Global stage = (ep_tma_gbase + chunk) % NStage with ep_tma_gbase += nchunks per token; drain-before-reuse preserved (NStage 4 > InFlight 2) across seams; single wait_group 0 + fence.proxy.async.global at warp end. SMEM/REG byte-identical to Stage A (43328/168, pure control-flow). Dispatch-only; combine unchanged. All PASS (recv 21820 2n / 29815 8n). 8-node FLAT A/B vs Stage A: NSM16 1826.3->1781.8 (-2.4%), NSM20 1497.5->1452.8 (-3.0%), NSM32 983.3->957.6 (-2.6%, now beats NCCL-EP 8n 973us). 2n NSM16 1228.6->1195.6 (-2.7%). Cumulative w/ Stage A vs pre-opt baseline: 8n NSM16 -7.4%, NSM32 -7.1%.	2026-06-18 21:10:12 +00:00
Qinghua Zhou	8386ed2a1f	ep(ncclep): inc6 Stage A pipelined TMA sender ring (-4-5% dispatch) Replace the per-chunk full-drain (cp.async.bulk.wait_group 0 after every chunk) in the flat all-sender TMA hidden-copy with an NCCL-EP-style produce-1-ahead + lazy in-flight drain (cp.async.bulk.wait_group.read N), so consecutive chunks S2G NVLink writes overlap each other and the next chunks G2S HBM load. Ring deepened NStage 2->4, chunk 2048->1024 (SMEM-neutral: 641024==622048==24576B; +96B for 2 extra mbarriers; REG unchanged at 168). Dispatch-only; combine byte-identical. 8-node FLAT A/B (all PASS recv29815): NSM16 1924.2->1826.3 (-5.1%), NSM20 1565.0->1497.5 (-4.3%), NSM32 1030.3->983.3 (-4.6%, ~= NCCL-EP 8n 973us). 2-node NSM16 1285.2->1228.6 (-4.4%).	2026-06-18 16:59:11 +00:00
Qinghua Zhou	3b5270e5d5	ep(ncclep): inc6 flat combine + decoupled dispatch/combine SM knobs Completes the flat all-sender path (MSCCLPP_EP_FLAT) end-to-end and makes its dispatch and combine SM counts independently tunable. - buffer.cc internode_combine: under the flat path, skip the pre-combine cached_notify + fifo-advance. The flat combine is the inc5 direct-gather (it already needs no forwarder), but cached_notify's sm_id>=3 branch indexes rdma_channel_prefix_matrix (the forwarder-produced recv_rdma_channel_prefix_matrix, never written under all-sender flat) to derive token ranges and then writes combined_nvl_head over that range -> out-of-bounds write / intermittent illegal memory access in the bench. Gated on ep_flat + recv_pool_global_ptrs + ep_combine_recv_idx; the 2-hop path is unchanged. - buffer.cc: decouple the flat dispatch/combine grids from config.num_sms. MSCCLPP_EP_DISPATCH_NSM caps the all-sender dispatch block count (clamped to [1, num_sms/2]); MSCCLPP_EP_COMBINE_NSM caps the combine block count (clamped to [2, num_sms]). Both flat-only. num_channels is the dispatch token-partitioning granularity (flows into notify_dispatch, the prefix-matrix allocations and the grid), so lowering it stays self-consistent; the now-vestigial combine prefix-matrix shape asserts are relaxed under flat. - test: unify the SM-count env var to MSCCLPP_EP_NSM for both the internode and intranode tests, with MSCCLPP_EP_NUM_SMS kept as a legacy fallback (internode default 152, intranode 20). - README: document MSCCLPP_EP_FLAT, MSCCLPP_EP_DISPATCH_NSM and MSCCLPP_EP_COMBINE_NSM, add a flat all-sender subsection, and correct the SM-count row to MSCCLPP_EP_NSM. Validated 2-node GB200 (e256/topk8, 8 ranks): full flat dispatch+combine PASS (~1e-6, no IMA), dispatch 500us / combine 451us. Dispatch and combine SM sweeps are independent (combine flat at ~451us while dispatch scales 698->495us, and vice versa). FLAT=0 inc5 baseline unchanged (568/488us).	2026-06-17 23:52:44 +00:00
Qinghua Zhou	1805ad0db6	ep(ncclep): inc6 flat all-sender dispatch (kEpFlat, dispatch validated) Adds the FLAT all-sender dispatch path, gated behind MSCCLPP_EP_FLAT (requires MSCCLPP_EP_DIRECT). Eliminates the forwarder + coordinator + receiver roles so every SM block is a sender, and delivers per-token metadata straight to the destination recv pool instead of via the 2-hop RDMA ring -> forwarder -> NVL receiver pipeline. - config.hpp: append a per-token 128B metadata region after the worst-case hidden region in the recv pool (kEpRecvPoolMetaBytes, get_recv_pool_meta_base). Allocated unconditionally (cheap vs hidden), only touched when kEpFlat is set. - internode_ncclep.cuh: kEpFlat __constant__ gate. Sender writes SourceMeta + scales + topk straight into each destination pool's meta region at the token's final recv slot. Under kEpFlat the launch is all-sender: num_channels = gridDim.x, channel_id = sm_id, is_forwarder = false; the sender-coordinator, forwarder, and NVL receiver roles early-return, and the sender skips the ring head/tail flow-control wait (no forwarder drains it). - internode.cu / api.cuh: flat_meta_drain kernel + launcher copies the pool meta region into the recv_* output tensors, rebasing topk_idx to this rank's local expert range (weight 0 when out of range). Host grid launches num_channels blocks (not x2) when FLAT+DIRECT. - buffer.cc: call flat_meta_drain on the comm stream after dispatch when ep_flat && ep_use_direct && uncached && topk present. - test: MSCCLPP_EP_BENCH_DISPATCH_ONLY skips combine (which still needs the forwarder/receiver breadcrumbs) so the all-sender dispatch ceiling can be measured; dispatch correctness check is retained. Validated 2-node GB200 (e256 / topk8, 8 ranks): dispatch correct (recv 21820 tokens, per-source ranges exact). flat_meta_drain proven byte-correct by overwriting the receiver's metadata output and still passing combine (~1e-6). All-sender dispatch is 13-66% faster than the 2-hop path at equal block count and reaches better-than-baseline-peak throughput with ~1/4 the blocks. Combine under the flat path (forwarder breadcrumb rework) is the remaining follow-up.	2026-06-17 21:13:13 +00:00
Qinghua Zhou	183dcb5daa	ep(ncclep): inc5 sender-direct TMA dispatch (lane-striped ring) Port the NCCL-EP hybrid_ep.cuh TMA dispatch path into the inc5 sender-direct kEpDirect branch, gated by EP_NCCLEP_TMA. kRDMASender streams each token hidden HBM->SMEM via cp.async.bulk (G2S, mbarrier complete_tx) then fans the staged tile S2G straight to every destination GPU recv-pool fabric VA. Two correctness mechanisms from the standalone probe + NCCL-EP reference: - per-CHUNK SMEM ring (kEpTmaSndNStage=2, chunkb=2048) so no two back-to-back cp.async.bulk ops reuse a single tile+mbar to a cross-node fabric VA, which is the unspecified-launch-failure trigger. SMEM-neutral vs the single-slot 4096 layout (622048 == 6*4096); dispatch static SHARED stays 43360B (< 48KB). - LANE-STRIPED S2G broadcast (lane j -> dst pool j, j+=32) instead of a serial lane-0 loop; lane 0 issues G2S, the whole warp waits on the G2S mbarrier, each lane drives its own destination, drain wait_group 0 + syncwarp before tile reuse. Validated 2 nodes x 4 GPU (e256/topk8, hidden=7168, tokens=4096, MSCCLPP_EP_DIRECT=1): dispatch PASS (no launch failure, byte-correct combine), 569.5us vs 618.5us warp-store baseline (-7.9%). Off by default (EP_NCCLEP_TMA=0); baseline byte- identical when unset.	2026-06-17 07:12:22 +00:00
Qinghua Zhou	462ab1661a	docs(ep): document MSCCLPP_EP_DIRECT and MSCCLPP_EP_INTRA_DIRECT for GB200 Add the two GB200 direct-path env flags to the EP README runtime-knobs table plus a dedicated subsection: - MSCCLPP_EP_DIRECT=1: internode sender direct-write dispatch + receiver gather-direct combine (one master flag), with the MSCCLPP_EP_FABRIC_IPC prerequisite and recv-pool input contract. - MSCCLPP_EP_INTRA_DIRECT=1: separate single-node intra-node sender direct-write flag. Includes measured GB200 NVL72 numbers and the single-node launch note.	2026-06-15 23:08:43 +00:00
Qinghua Zhou	14f131407b	ep(ncclep): inc5 combine-gather correctness at >32 ranks (16n) The kEpDirect combine-gather path used the flat combine_token<kEpNumRanks> helper, whose discovery maps one warp lane per rank. That is correct only while kEpNumRanks <= 32; at >= 16 nodes (kEpNumRanks > 32) the __shfl-by-rank source lane wraps mod 32, so ranks >= 32 alias 0..31 and num_topk_ranks overflows topk_ranks[8] -> combined_x came out all zeros. Split the gather block with if constexpr (kEpNumRanks <= 32): <= 8 nodes compile to exactly the original fast path (byte-identical, no perf change), while > 32 ranks use a chunked-ballot discovery (scan ranks in 32-wide chunks, compact set lanes via __ballot_sync) plus a pre-loaded register-array reduction. Verified combine correctness 2/4/8/16n and combine timing unchanged at <= 8n (2n 494us, 4n 634us, 8n 740us).	2026-06-12 05:45:36 +00:00
qinghuazhou	3b6b2ac303	ep(intranode): sender direct-write dispatch (MSCCLPP_EP_INTRA_DIRECT) Eliminate the 2-hop ring + receiver hidden-drain in intranode dispatch. When MSCCLPP_EP_INTRA_DIRECT is set, the sender computes each token's final recv_x index (rank_prefix + channel_prefix + running count, identical to the receiver's total_offset+chunk_idx) and writes hidden straight to the destination's peer-mapped recv pool; recv_x becomes a zero-copy view of that pool and the receiver skips the hidden drain (metadata still flows through the ring, keeping indices aligned). Reuses the Increment-4 recv pool (recv_pool_ptrs_gpu). Baseline byte-identical when unset. Single-node 4-GPU A/B @4096/7168/256-exp/top8, 152 SMs: dispatch 3829.7us -> 283.9us (-92.6%, 13.5x), now SM-scaling; correctness exact-match (0.0) at 128 and 4096 token scale.	2026-06-11 23:07:28 +00:00
qinghuazhou	cc34e72b64	ep(ncclep): inc5 combine-direct gather (kEpDirect) Stage 1: dispatch persists its per-(token,dst) recv-pool slot (= ep_my_idx) into a Buffer-owned gather map (ep_combine_recv_idx_gpu), no Python API change. Stage 2: under MSCCLPP_EP_DIRECT, combine gathers each token contributions DIRECTLY from the peer recv pools (recv_pool_global_ptrs[r] + header + recv_idx*hidden) and reduces locally (reusing combine_token<> with a pool-reading recv_fn), skipping the nvl_channel + forwarder + rdma_channel 2-hop. Receiver-direct dual of the dispatch sender-direct write. Combine A/B (same .so, env MSCCLPP_EP_DIRECT, SKIP_COMBINE=0, all PASS): 2n/4096 gather 494.3 vs 2-hop 983.8us (-49.8%) 4n/4096 gather 646.5 vs 2-hop 1456.2us (-55.6%) 8n/4096 gather 754.3 vs 2-hop 2358.8us (-68.0%) Round-trip 8n/4096: 1983us vs inc4a 4141us (-52%), ~= NCCL-EP.	2026-06-11 06:37:25 +00:00
qinghuazhou	1f7942a804	ep(ncclep): inc5 keep full rdma clean for combine under kEpDirect cached_notify serves both cached dispatch (shrunk ring, is_cached_dispatch=true) and combine (is_cached_dispatch=false). Combine still ships hidden through the rdma ring, so its clean must stay full-slot; only shrink the dispatch clean. Fixes combine desync/timeout under MSCCLPP_EP_DIRECT in the bench loop. Combine A/B (SKIP_COMBINE=0, correctness PASS): 4n/4096 inc5 vs inc4a combine 1423 vs 1435us; 8n/4096 2281 vs 2233us (combine path unchanged).	2026-06-11 04:07:58 +00:00
qinghuazhou	ca829f6e8f	ep(ncclep): inc5 ring-slot shrink + read-once direct write Under kEpDirect the rdma ring no longer carries hidden (ring_hidden_int4=0); hidden goes straight to the dest pool. Clean meta (notify_dispatch + cached_notify) matches the small slot so the bench loop does not desync. Sender reads x once and broadcasts to all destination pools (gather-pointer st_pool_broadcast) instead of re-reading per destination. Dispatch A/B (4 nodes, SKIP_COMBINE): 2048 557.8 vs 586.9us (-5.0%); 4096 925.2 vs 1033.4us (-10.5%).	2026-06-11 02:31:19 +00:00
qinghuazhou	b6140b0229	ep(ncclep): increment 5 - sender direct-write dispatch (kEpDirect, dispatch-only WIP) On the NVL72 single NVLink domain, eliminate the 2nd NVLink hop in HT dispatch: the RDMA sender writes each token's hidden straight into the destination GPU's domain-wide recv_x pool over fabric VA, instead of staging through the rdma_channel ring + forwarder transpose. Gated by env MSCCLPP_EP_DIRECT (default off => byte-for-byte inc4a; verified 584.6us == inc4a 583.3us with the gate off). - buffer.{cc,hpp}: domain-wide recv-pool base exchange (recv_pool_global_ptrs, registerMemory all_transport + send/recvMemory, env-gated). Validated the cuMem pool is fabric-mappable cross-node (all 16 peer bases imported at 4-node). - kernels/internode_ncclep.cuh: kEpDirect gate. Per-destination-GPU base index = peer-read recv_gbl_rank_prefix_sum[rank-1] + channel start; seq-locked per-GPU running count (matches the receiver drain order). Sender writes hidden direct to recv_pool_global_ptrs[dst_global]; forwarder skips the hidden copy (fuse + xnode) under kEpDirect. __threadfence_system at the sender epilogue for cross-node visibility. - kernels/api.cuh, kernels/internode.cu: thread recv_pool_global_ptrs through the dispatch signature/macro/launch + kEpDirect upload. Dispatch correctness validated at 4-node (recv_x per-source block check PASS, recv 13465, MSCCLPP_EP_DIRECT=1 SKIP_COMBINE=1). WIP / not yet a perf win: the coordinator still RDMAs the full ring slot (including the now-unused hidden region), so hidden is shipped twice. Pending: shrink the ring slot to metadata-only under kEpDirect (the perf win), and the parallel combine rework (combine still expects the 2-hop breadcrumbs, so inc5 is dispatch-only for now -> run with SKIP_COMBINE).	2026-06-11 00:33:36 +00:00
qinghuazhou	31c930d8c5	ep(ncclep): increment 4a - VMM unicast recv pool (TMA-eligible peer mapping) Replace the increment-3 cudaMalloc-carve + cudaIpc recv-output pool with a VMM pool (mscclpp::detail::gpuCallocPhysical, cuMem FABRIC/POSIX-FD) exchanged via registerMemory(ipc)+sendMemory/recvMemory so peers import directly dereferenceable, TMA-eligible fabric VAs (recv_pool_ptrs[]) instead of the cudaIpc carve. Kernel param recv_pool_offset(int64 into buffer_ptrs) -> recv_pool_ptrs(void** peer bases). Falls back to the inc3 inline carve when fabric-IPC is unsupported. PASS, dispatch 1052us (== inc3, perf-neutral); removes the ~1GB inline pool from the buffer_ptrs allocation. Also lands a gated (EP_NCCLEP_TMA, default OFF) cross-GPU forwarder TMA write path targeting the VMM peer pool. NOTE: enabling it currently faults (cudaErrorLaunchFailure) in-kernel even to local VMM; single-process VMM+TMA probe was a false positive. Kept gated off for future investigation; the default build is unaffected.	2026-06-08 22:16:05 +00:00
qinghuazhou	ac25cf18b6	ep(ncclep): increment 3 - cross-GPU peer-map direct-write (eliminate receiver hidden drain) Carve a fixed-capacity recv-output pool onto the END of the NVL cudaMalloc (covered by the same IPC handle as buffer_ptrs, NOT added to num_nvl_bytes so it dodges the INT_MAX registered-size cap). recv_x becomes a zero-copy from_blob view of the local pool; the pool header holds this rank recv_gbl_rank_prefix_sum (peer-readable). The cross-GPU forwarder computes the destination final recv_x index from the peer-read prefix + channel-local start_sum + per-src count and writes hidden STRAIGHT to the destination peer pool; the receiver skips its hidden drain (keeps scales/topk/meta). Same-GPU path (inc1) unchanged (local pool == recv_x). Falls back to torch::empty + receiver when num_recv_tokens > pool cap. Files: config.hpp (pool sizing static helpers), buffer.hpp (recv_pool_off_), buffer.cc (enlarge ctor alloc, from_blob recv_x, publish prefix header), api.cuh + internode.cu (recv_pool_offset param, guarded EP_DISPATCH_EXTRA_ARGS), internode_ncclep.cuh (xnode_direct forwarder + receiver hidden-skip). Clean interleaved 4-node A/B {38,41,59,75}: inc2 ~1132us -> inc3 ~1065us (-5.9%), order-independent, FULL PASS (recv 26809, combine correct). Captures ~the DRAIN_NOOP ceiling (1048us). Cumulative vs baseline 1257us: -15.4%. Gap to NCCL-EP 902us now ~156us (was 355us).	2026-06-08 16:49:11 +00:00
qinghuazhou	2ebf81aa35	ep(ncclep): increment-3 de-risk - DRAIN_NOOP probe + SKIP_VERIFY test gate internode_ncclep.cuh: add EP_NCCLEP_DRAIN_NOOP compile gate (default 0, inert) - when 1 the NVL receiver keeps all control flow but skips the data copies, to measure the dispatch-time upper bound of eliminating the cross-GPU receiver drain. Probe result (4-node {38,41,59,75}): dispatch inc2 1124us -> DRAIN_NOOP 1048us (-6.8%), agg_bw 836->896 GB/s => confirms real headroom for the cross-GPU peer-map direct-write rework (ceiling ~-16.6% cumulative vs baseline). test_internode_multirank.py: gate the dispatch range-assert and combine assert behind MSCCLPP_EP_SKIP_VERIFY env so dispatch timing can be measured when recv_x is intentionally incomplete (perf probing).	2026-06-08 16:25:53 +00:00
qinghuazhou	fa07b496ae	ep(ncclep): increment 2 - deepen drain-copy MLP unroll 5->28 Full-token outstanding int4 loads (hidden_int4=896 => 28*32) on the 3 dest-side drain copies: same-GPU direct-write, forwarder cross-GPU staging write, receiver staging->recv_x drain. Sender packing copy left at 5. Hides HBM/fabric store-completion latency (more memory-level parallelism). This is the proven Lever-C MLP lever; its prior net-regression came from combine register inflation in the SHARED translation unit, which cannot happen here (dispatch_ncclep is a separate TU; combine untouched). Clean interleaved 4-node A/B {38,41,59,75}: inc1 ~1176us -> inc2 ~1124us (-4.4%), full PASS. Cumulative vs baseline 1257us: -10.6%.	2026-06-08 15:51:12 +00:00
qinghuazhou	dbef7a5f31	ep(ncclep): increment 1 - same-GPU fused direct-write to recv_x Forwarder warps whose dst_nvl==nvl_rank (token arrived on its destination GPU, no NVLink transpose) write straight to final recv_x, computing the same index the receiver would (recv_gbl_rank_prefix_sum base + channel-local start_sum + per-src running count). The matching receiver warp (src_nvl==nvl_rank) is skipped. Eliminates the dest-side HBM double-bounce (nvl_channel_x staging + receiver copy) for the same-GPU token fraction (~25%). recv_x contents + send_nvl_head breadcrumb preserved bit-identical => combine unaffected. 4-node {38,41,59,75}: dispatch 1257->1174-1185us (-6%), full PASS (recv 26809).	2026-06-08 15:43:53 +00:00
qinghuazhou	3ad6b70d7d	ep: scaffold guarded dispatch_ncclep kernel (NCCL-EP port baseline) + launch-site selection internode_ncclep.cuh: dispatch_ncclep<> = production dispatch<> kernel under EP_DISPATCH_NCCLEP, to be restructured into NCCL-EP warp-specialized overlap (concurrent N2N fabric-put + intra-node G2S/S2G drain via MemoryChannel put/signal). internode.cu: guarded include + EP_DISPATCH_KERNEL alias selects dispatch_ncclep when guard ON, else production dispatch (byte-identical). Builds clean with -DMSCCLPP_EP_DISPATCH_NCCLEP=ON.	2026-06-08 07:08:55 +00:00
qinghuazhou	3a9ca157f3	ep: add MSCCLPP_EP_DISPATCH_NCCLEP build guard (EP_DISPATCH_NCCLEP) for NCCL-EP-ported warp-specialized HT dispatch (default OFF)	2026-06-08 06:58:13 +00:00
qinghuazhou	ef4f8421a2	ep/internode: P4b - tune kNumDispatchRDMASenderWarps 7->6 (HT 2n dispatch 728->689us on GB200)	2026-05-27 03:46:14 +00:00
qinghuazhou	c58e9b7408	ext/ep/buffer: P2 - shard proxy on Blackwell multi-node by default (num_proxies = local_world_size when num_rdma_ranks > 1)	2026-05-27 02:31:19 +00:00
qinghuazhou	2e73318c89	Revert "Clean up completed communicator receives (#804 )" - thread-safety regression causing cudaErrorIllegalAddress on HT internode GB200	2026-05-27 02:15:39 +00:00
copilot-swe-agent[bot]	04ea24da8d	Fix python lint formatting in internode multirank test Agent-Logs-Url: https://github.com/microsoft/mscclpp/sessions/f5220581-e26c-49d8-98fa-e1b8ab011898 Co-authored-by: seagater <7475084+seagater@users.noreply.github.com>	2026-05-20 18:04:00 +00:00
Qinghua Zhou	757c5ec831	Merge qinghuazhou/expert_parallel_gb200	2026-05-20 01:56:34 +00:00
Qinghua Zhou	394b36cd37	Merge branch 'qinghuazhou/expert_parallel_gb200' into qinghuazhou/expert_parallel	2026-05-20 01:50:13 +00:00
Qinghua Zhou	cb93dd585b	tests/ep: Unify the name of EP benchmark variables	2026-05-20 01:48:29 +00:00
Qinghua Zhou	009929174f	Merge remote-tracking branch 'origin/main' into qinghuazhou/expert_parallel_merge_main_test # Conflicts: # src/core/connection.cc # test/mp_unit/port_channel_tests.cu	2026-05-18 22:02:50 +00:00
Qinghua Zhou	20bd1ec55b	ext/ep: fix CUDA 11.8 build + apply clang-format/black - src/core/atomicadd_kernel.cu: restore the legacy 3-arg cuCtxCreate(&proxyAtomicCtx_, 0, cuDevice) in the '#else' branch of the CUDA_VERSION >= 12050 guard. A prior edit had corrupted it to 'cuCtxCreate(&proxyAtomicCtx_vice)', which broke the CUDA 11.8 build (CodeQL CUDA cuda11.8 and MSCCLPPLang cuda11.8 jobs). - Apply clang-format to src/ext/ep/* (no logic changes, fixes the cpplint CI job). - Apply black to test/python/ext/ep/test_internode_multirank.py and test_intranode_multirank.py (no logic changes, fixes the pylint CI job).	2026-05-18 21:44:20 +00:00
Binyang Li	60a6d7219f	Clean up completed communicator receives (#804 ) ## Summary - Release the reference after last requests are ready. - Keep ordered receive chaining for repeated rank/tag operations while cleaning up completed receive bookkeeping. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 21:06:50 +00:00
Changho Hwang	252a422030	Handle PortChannel flush asynchronously from the host proxy (#802 ) When a `PortChannel` requests `flush()`, the host-side proxy was being blocked, which may cause head-of-line blocking of other parallel `PortChannel`s' requests. Now the proxy handles `flush()` requests asynchronously. This feature especially helps performance when we need multiple IB QPs and need to flush QPs.	2026-05-15 11:50:43 -07:00
Qinghua Zhou	98f6b1a936	Merge branch 'qinghuazhou/expert_parallel_fix_h100' into qinghuazhou/expert_parallel	2026-05-14 22:28:11 +00:00
Qinghua Zhou	1f0948c8e4	Merge branch 'qinghuazhou/expert_parallel' into qinghuazhou/expert_parallel_fix_h100	2026-05-14 21:37:44 +00:00
Qinghua Zhou	5911998181	ext/ep: gate NVLS HT B2 on cross-host fabric IPC support (H100 fix) The NVLS HT B2 path introduced in `3ab2e43b` activated whenever isNvlsSupported() && num_rdma_ranks > 1. On H100 NDv5 / Azure CX-7 RoCE that is true (H100 has intra-node NVLink multicast), but there is no cross-host NVSwitch fabric. mscclpp's GpuIpcMem::create then falls back to CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR whose handle exchange routes through /tmp/mscclpp_bootstrap_<pid>.sock -- a master-rank-0 unix-domain socket worker ranks cannot reach. Symptom on every commit since `3ab2e43b`: RuntimeError: connect() failed for unix socket to /tmp/mscclpp_bootstrap_<pid>.sock MSCCLPP_EP_FABRIC_IPC=0 was being silently ignored. src/ext/ep/buffer.cc: add resolve_fabric_ipc_supported() helper. Resolution: 1. MSCCLPP_EP_FABRIC_IPC env var (0/off/false/no => off, 1/on/true/yes/force => on, otherwise auto). 2. Auto-detect: requires both - CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED == 1 - device compute capability >= sm_100 (Blackwell+). Gate both use_fabric_ipc_alloc (RDMA buffer allocator) and nvls_ht_enabled (HT B2 multicast region) on fabric_ipc_supported. On H100 both fall back to cudaMalloc + legacy PortChannel; on GB200 NVL72 both remain enabled. Diagnostic prints now show fabric_ipc=. test/python/ext/ep/test_internode_multirank.py: replace hardcoded NUM_MAX_NVL_PEERS=4 with a runtime _detect_local_world_size() helper that reads MSCCLPP_EP_LOCAL_WORLD_SIZE / LOCAL_WORLD_SIZE / OMPI_COMM_WORLD_LOCAL_SIZE, falling back to torch.cuda.device_count(). Makes the test correct on both H100 (8 GPUs/node) and GB200 (4 GPUs/node) without code changes. src/core/atomicadd_kernel.cu: use cuCtxCreate_v4 for CUDA >= 12.5 (the underlying symbol was renamed); preserve legacy 3-arg cuCtxCreate for older toolkits. Verified on 2x H100 NDv5 at HEAD: LL intranode (8 GPUs) PASS LL internode (16 GPUs, 2 nodes) PASS HT intranode (8 GPUs) PASS HT internode (16 GPUs, 2 nodes) PASS Diagnostic on H100: [mscclpp_ep] rdma_buffer allocator: cudaMalloc (low_latency=0, nvls=1, fabric_ipc=0) [mscclpp_ep] NVLS HT multicast: disabled (low_latency=0, num_rdma_ranks=2, nvls_supported=1, fabric_ipc=0)	2026-05-14 21:29:10 +00:00
Changho Hwang	5d608feaa5	Enhance cross-node CudaIpc availability check (#803 )	2026-05-14 14:06:12 -07:00
Caio Rocha	40295df4c4	Adding Support to bf16 Executor Tests (#801 ) This pull request adds support for the `bfloat16` (bf16) data type to the test executor, including both Python and CUDA components. The changes ensure that `bfloat16` is handled consistently across argument parsing, data type conversion, and test kernel implementations. Additionally, the CUDA verification kernels are refactored to use parameterized tolerances for improved numerical accuracy checks. Support for bfloat16 data type: * Added handling for `bfloat16`/`bf16` in the Python test executor's argument parsing, data type conversion (`parse_dtype`, `dtype_to_mscclpp_dtype`), and help text. [[1]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcR27-R28) [[2]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL122-R135) [[3]](diffhunk://#diff-e643968a8622d1603868a8ecf4b2fcd8108be1e404a3420bb7e2a6d51dc23fdcL246-R251) * Updated output to display the correct data type string for `bfloat16`. CUDA kernel and test improvements: * Included `bfloat16` headers and defined test data fill and gather kernels for `bfloat16` on both CUDA and HIP platforms. [[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R8-R11) [[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R35) [[3]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R54-R59) [[4]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88R133) * Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to use an explicit tolerance parameter (`Eps`) and added correct tolerances for each data type, including `bfloat16`. [[1]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L69-R85) [[2]](diffhunk://#diff-e18b8becff1c3b234733f5ca3250a76ffdc5edddb302c2da098b64b00ba7cf88L94-R113) These changes ensure full support for `bfloat16` in the test executor and improve the accuracy and maintainability of the CUDA test kernels. --------- Co-authored-by: Caio Rocha <caiorocha@microsof.com>	2026-05-14 09:56:11 -07:00
qinghuazhou	7650e699a0	Merge qinghuazhou/expert_parallel_gb200 (Migration Phase 5 GB200)	2026-05-13 22:48:22 +00:00
qinghuazhou	f7cbf1fe7a	ext/ep: README - add Migration Phase 5 (Azure GB200 NVL72 port)	2026-05-13 22:48:16 +00:00
qinghuazhou	399745f4c2	Merge qinghuazhou/expert_parallel_gb200 (LL NVLS table)	2026-05-13 22:41:14 +00:00
qinghuazhou	4c13937474	ext/ep: README - document GB200 NVLS variants in LL transport table + GB200 validation rows	2026-05-13 22:41:08 +00:00

1 2 3 4 5 ...

1071 Commits