Commit Graph

987 Commits

Author SHA1 Message Date
Qinghua Zhou
00ca2a2051 ext/ep: lift cached_notify nc cap, strided-warp head fixup (HT 442/399 GB/s)
Two latent bugs in internode cached_notify blocked num_channels scaling:

1) Launch thread count exceeded per-block limit.
   num_threads = max(128, 32 * num_channels) produced >1024 at nc>=33,
   above GB200 cudaDevAttrMaxThreadsPerBlock=1024. cudaLaunchKernelEx
   returned cudaErrorInvalidValue, which silently corrupted the
   buffer-clean path at nc<=64 and hard-failed at nc>=68. Cap launch
   at 1024 and decouple sizing from work distribution.

2) Hardcoded warp_id < num_channels in sm_id>=2 branches.
   The head-fixup work required num_warps >= num_channels. Replaced
   with a strided per-warp loop (for ch = warp_id; ch < num_channels;
   ch += num_warps) so any thread count covers any channel count.
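
The two fixes above can be modelled on the host. This is a minimal Python sketch, not the kernel code: the clamp `min(1024, ...)` is an assumption about how the cap is applied, and `launch_threads`, `channels_for_warp`, and `covered_channels` are hypothetical helper names.

```python
MAX_THREADS_PER_BLOCK = 1024  # per-block limit quoted in the commit message
WARP_SIZE = 32


def launch_threads(num_channels):
    """Old sizing formula, clamped to the per-block limit (assumed form of the cap)."""
    return min(MAX_THREADS_PER_BLOCK, max(128, WARP_SIZE * num_channels))


def channels_for_warp(warp_id, num_warps, num_channels):
    """Channels one warp visits in the strided loop: ch = warp_id; ch += num_warps."""
    return list(range(warp_id, num_channels, num_warps))


def covered_channels(num_channels):
    """Union of the channels visited by every warp in the capped launch."""
    num_warps = launch_threads(num_channels) // WARP_SIZE
    seen = []
    for warp_id in range(num_warps):
        seen.extend(channels_for_warp(warp_id, num_warps, num_channels))
    return sorted(seen)
```

With the cap in place, 32 warps still visit every one of 152 channels exactly once, which is the property the old `warp_id < num_channels` guard could not provide once `num_warps < num_channels`.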

With both fixes, num_channels scales from 20 to 152 (= SM count, the
cooperative-grid ceiling at 2x SMs). HT throughput grows from 78/94
GB/s (dispatch/combine) at nc=20 to 442/399 GB/s at nc=152, a 5.7x /
4.2x speedup. Gap to LL closes from 9.0x / 7.6x down to 1.6x / 1.8x.

No effect on correctness at nc<=32 (path was always exercising
num_warps >= num_channels there); changes are purely additive at low
nc and remove the silent-failure ceiling at high nc.
2026-05-11 05:32:57 +00:00
Qinghua Zhou
ce1554bd1f ext/ep: fix kRDMASender epilogue tail-write race (unblocks chunk_send>16)
The kRDMASender warp loop processes tokens in a warp-stride pattern under
a per-channel sequential lock. In-loop tail writes are monotonic by
construction (each warp owns a strictly-increasing slot range while it
holds the lock) so a plain st_release_cta is correct and avoids the L2-
serialized atomicMax codepath that the compiler emits for global-space
atomics.

The epilogue tail write, however, sits outside that monotonicity
contract: when multiple sender warps reach the epilogue out of order, a
later-exiting warp owning a smaller last_rdma_tail_idx can clobber the
tail with a smaller value, leaving the kRDMASenderCoordinator wedged
waiting on a slot that is already produced.

This is invisible at deep receive windows (chunk_send=16,
num_chunked_recv=128 -> depth 8) but deterministically hangs at shallow
ones (chunk_send=32 -> depth 4). Diagnostic instrumentation in
kRDMASenderCoordinator caught peer=1 last_issued_tail=384
processed_tail=408 to_send=25, i.e. exactly one missing token at the
boundary.

Fix: replace the epilogue st_release_cta with atomicMax. Confirmed
chunk_send=32 PASS (previously deterministic hang) and chunk_send=16
unaffected (no perf regression vs the previous lazy-head fix).
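
The race can be illustrated with a small host-side model. This is a sketch only: `publish_tail` is a hypothetical helper, the tail values 96 and 64 are made up for illustration, and a sequential loop stands in for the order in which the warps' writes land.

```python
def publish_tail(writes, use_max):
    """Apply epilogue tail writes in arrival order and return the final tail.

    use_max=False models a plain store (last writer wins);
    use_max=True models atomicMax (the tail never moves backwards).
    """
    tail = 0
    for value in writes:
        tail = max(tail, value) if use_max else value
    return tail


# Hypothetical arrival order: the warp owning the larger tail index exits first.
arrival_order = [96, 64]
plain_store_tail = publish_tail(arrival_order, use_max=False)
atomic_max_tail = publish_tail(arrival_order, use_max=True)
```

With a plain store the later, smaller write wins and the tail regresses to 64; with the max-based update it stays at 96 regardless of exit order.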
2026-05-11 03:40:55 +00:00
Qinghua Zhou
01a10e00de ext/ep: HT perf - lower lazy head-feedback threshold to chunk/4
Problem: kForwarderCoordinator only published min_head when
min_head >= last_head + num_max_rdma_chunked_send_tokens. With
4096 tokens / 10 channels / 2 peers ~= 205 tokens per (channel,peer),
the receive-buffer-space window advanced only every full chunk, and
the last partial chunk never triggered an update, serializing
handshakes and capping HT throughput.

Fix: lower threshold to max(1, chunk/4); when any forwarder channel
has retired, drop to 1 so partial tail chunks always publish.
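
A minimal Python sketch of the threshold rule described above. The function names are hypothetical and integer division for `chunk/4` is an assumption; the kernel's actual expression may differ.

```python
def head_publish_threshold(chunk_tokens, any_channel_retired):
    """Tokens that must be consumed before min_head is published again."""
    if any_channel_retired:
        return 1  # partial tail chunks must always publish
    return max(1, chunk_tokens // 4)


def should_publish(min_head, last_head, chunk_tokens, any_channel_retired=False):
    """Publish once min_head has advanced by at least the threshold."""
    threshold = head_publish_threshold(chunk_tokens, any_channel_retired)
    return min_head >= last_head + threshold
```

For a chunk of 16 tokens the head is now published every 4 tokens rather than every 16, and a retired channel lets a single leftover token trigger an update.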

Result on 2-node GB200 NVL72 (cfg=20,8,256,16,128, 4096 tok / 7168 hidden):
  dispatch agg: 36 -> 78 GB/s  (2.16x)
  combine  agg: 45 -> 95 GB/s  (2.11x)
  PASS, max diff = 0

Also strips Phase 4 diagnostic printfs.
2026-05-10 19:21:25 +00:00
Qinghua Zhou
e0a1bb2c42 ext/ep: WIP Phase 4 fix NVLS self-overcount + cached_notify NVLS barrier
Two related root causes prevented Phase 4 (internode dispatch+combine) from
completing on Azure GB200 NVL72 with the IB control-plane disabled.

1) NVLS self-loop over-count
   The sender/forwarder counter publishes used multimem.red.add, which
   multicasts to every NVL peer that has the buffer bound. When a single
   logical writer issues an add, every peer adds it to its own slot, so
   self-loop counters (where one rank is both writer and reader on the
   same (P,C) pair) over-count by N = number of NVL peers.

   Fix: replace all NVLS-based self-counter sites in dispatch+combine
   with plain local mscclpp::atomicFetchAdd. Cross-node visibility was
   already handled separately via direct fabric-VA st.release.sys.global
   on peer_rdma_bases.

2) cached_notify barrier hang on Azure CX-7 RoCE
   The two port_channel.signal/wait pairs in cached_notify hang on this
   platform because RoCE control-plane traffic is broken.

   Fix: add an NVLS multimem.red barrier path (barrier slots +24 / +32)
   that mirrors the working notify_dispatch pattern. Threaded the
   nvls_mc_ptr / nvls_dev_ptr / nvls_off_barrier / nvls_epoch params
   through api.cuh + buffer.cc; introduced a separate
   nvls_ht_cached_epoch member because slots +24 / +32 are only touched
   when the cached path is taken — sharing the global nvls_ht_epoch
   would mismatch slot increments and wait expectations.

End-to-end test_internode.py: dispatch + combine PASS with
max|got - expected| = 0.0 across all ranks.
2026-05-10 07:23:36 +00:00
Qinghua Zhou
f2228b07bb ext/ep: WIP Phase 4 fabric-VA cross-node tail/head counter (bypass NVLS multimem.red)
2026-05-10 06:07:46 +00:00
Qinghua Zhou
bf0a7e788a ext/ep: WIP Phase 4 normalize fence/sync ordering in combine writer
2026-05-10 03:58:20 +00:00
Qinghua Zhou
28f1d722e1 ext/ep: WIP Phase 4 NVLS HT B2 cooperative-copy + relaxed multimem.red
- Switch nvls_ctr_add from multimem.red.release to multimem.red.relaxed
  with an explicit __threadfence_system() at the call site to publish
  data writes before bumping the counter. Empirically on Azure GB200,
  release semantics on multimem.red.add issued concurrently from many
  warps triggers unspecified launch failure.

- Add fabric-VA cooperative copy for the small notify-dispatch meta
  payload in kRDMASender (replaces port_channel.put which is unreliable
  cross-node on Azure CX-7 RoCE without flush).
2026-05-10 01:33:14 +00:00
Qinghua Zhou
591fe8272b ext/ep: WIP Phase 4 NVLS HT B2 third multimem barrier in notify_dispatch
Replace the cross-node port_channel signal/wait at the tail of
notify_dispatch (the "B2" barrier — third NVLS barrier after b0 and
b1) with an epoch-monotonic NVLS multimem.red.add.u64 at slot
+16 on the per-rank multicast region.

Root cause this fixes: with port_channel signal/wait still in place,
exactly one CTA thread (the one with `barrier_thread_id != rdma_rank`)
hangs in IB wait() because Azure CX-7 RoCE has the same broken
semaphore path that failed in Phase 1-3. All other threads then
deadlock at the subsequent `if constexpr (!kLowLatencyMode)
__syncthreads()` waiting for the IB thread, so notify_dispatch never
reaches the second barrier_device or the dispatch kernel launch.
Diagnostic prints confirmed that all ranks reached the "pass
final-barrier" print but never the "DONE" print.

Validated cross-node on 2x Azure GB200:
  * All 8 ranks pass NVLS b0, b1, AND b2 barriers.
  * `notify_dispatch DONE` prints for all ranks.
  * Dispatch kernel enters (`[ph4-K] dispatch entry rank=0..3`).
  * Sender hot path runs; cooperative-copy `[ph4-d]` print fires
    showing valid fabric VA dst_p (0xbf80..0xbfe0 range) and
    local src_p — confirming the int4 cooperative store at least
    starts executing.

Next failure: dispatch kernel hits "CUDA error: unspecified launch
failure" shortly after the first `[ph4-d]` cooperative copy. Almost
certainly an illegal memory access in `dst_p[k] = src_p[k]` (wrong
n_int4, wrong slot offset, or stale peer base). To be debugged in
next iteration.
2026-05-09 21:39:46 +00:00
Qinghua Zhou
46701d4161 ext/ep: WIP Phase 4 NVLS HT B2 fabric-IPC data path
Extend NVLS HT B2 plan to the dispatch/combine token data payload:
the IB RDMA WRITE that PortChannel uses for the cross-node payload
hangs on Azure GB200 NVL72 (same root cause as the LL signal path),
so route the payload over the NVL72 fabric VA via cuMem fabric IPC
peer pointers and a warp-cooperative int4 store. Producer-consumer
ordering is signalled via the existing NVLS Phase 3 multimem.red.add
counter (no extra port-channel / IB hops).

Changes:
- src/ext/ep/buffer.cc:
  * Broaden use_fabric_ipc_alloc to (LL || num_rdma_ranks > 1) so the
    rdma_buffer is gpuCallocPhysical'd (cuMem fabric handle) for HT
    internode too.
  * Lift the LL fabric-IPC bring-up (sendMemory/recvMemory + connect
    + RegisteredMemory.data() peer pointers) out of the LL-only block;
    populate peer_rdma_bases[num_ranks] for HT and copy to GPU.
  * Plumb peer_rdma_bases_gpu into both internode::dispatch and
    internode::combine call sites.
- src/ext/ep/kernels/api.cuh: thread `void* const* peer_rdma_bases`
  through dispatch + combine signatures.
- src/ext/ep/kernels/internode.cu:
  * Sender coordinator (kRDMASenderCoordinator, dispatch + combine):
    replace handle.put + handle.flush with an int4 cooperative copy
    where lane[dst_rdma_rank] computes n_issue/issue_tail and shfl-
    broadcasts to the warp; all 32 lanes do dst_p[k] = src_p[k] in
    int4 strides; __threadfence_system + __syncwarp; lane 0 fires
    nvls_ctr_add to advance the consumer counter.
  * Diagnostic printfs at notify_dispatch tail / dispatch+combine
    entry / sender hot path (gated to channel 0, issue_tail==0).
- src/ext/ep/kernels/configs.cuh: NUM_TIMEOUT_CYCLES lowered to 20G
  (~10s on GB200) for debug.

Status: WIP. With these changes notify_dispatch hangs at the second
intra-node barrier_device<NUM_MAX_NVL_PEERS>(task_fifo_ptrs, head,
nvl_rank) call (the one after the final port_channel barrier and
moe_recv_counter store). Diagnostics confirm:
  * fabric-IPC alloc fires (low_latency=0, nvls=1).
  * peer_rdma_bases populated cross-node (fabric VA range).
  * NVLS notify_dispatch barriers b0 and b1 pass cross-node.
  * notify_dispatch reaches "pass final-barrier" on all 8 ranks.
  * All 8 ranks deadlock at the second barrier_device; "DONE"
    print never fires; dispatch kernel never starts.

The first barrier_device in the same kernel (line ~318) passes, so
basic NVL task_fifo access works. Suspect: the fabric-IPC alloc of
rdma_buffer or the all-to-all IPC bring-up perturbs task_fifo_ptrs
or the proxy/connection state used by the second barrier. Next step
is to bisect by reverting the use_fabric_ipc_alloc broadening while
keeping the IPC bring-up (or vice versa).
2026-05-09 20:59:34 +00:00
Qinghua Zhou
3ab2e43b79 ext/ep: NVLS HT B2 phases 1-3 (notify_dispatch barrier + counter fast path)
Phase 1: allocate + bind a per-rank NVLS multicast region for HT
internode counters/barriers/summary, gated by isNvlsSupported() &&
num_rdma_ranks>1 && !low_latency_mode. Layout in buffer.hpp:
{tail, head, barrier, data} sub-regions, 64 channels, NMP^2 slots
each. Falls back transparently when NVLS unavailable.

Phase 2: replace cross-node port_channel.signal/wait barriers in
notify_dispatch (b0,b1) with epoch-monotonic NVLS multimem.red.add.u64
barriers + leader ld.acquire spin. Broadcast per-sender summary via
multimem.st.release.sys.global.v4.b32. Validated cross-node on Azure
GB200 NVL72 (2 nodes x 4 GPUs): rank 0 reads v=8 expected=8 reliably,
confirming connectNvlsCollective binds a fabric-wide multicast object
on this hardware.

Phase 3: route 4 cross-node tail/head counter sites in dispatch+combine
through NVLS multimem.red.add (with handle.flush() between data put
and counter add to preserve visibility ordering); add 3 device-inline
helpers (nvls_ctr_slot_index/add/load) and thread 4 base ptrs through
the kernel templates, LAUNCH_KERNEL macros, host wrappers, and the
buffer.cc call sites. Each NVLS branch is gated by nvls_*_mc != nullptr;
the legacy IB code path is preserved verbatim under else for non-NVLS
hardware.

Deploy note: mscclpp_ep_cpp.so lives at site-packages/mscclpp_ep_cpp.so
(top-level), not inside mscclpp/. Multi-node deploys must scp it
explicitly in addition to rsyncing the package dir.

Status: phases 1-3 build and deploy clean. Phase 2 validated cross-node.
Phase 3 cannot be end-to-end validated on Azure CX-7 because dispatch/
combine still rely on legacy port_channel handle.put for the actual
token data payload, and that IB write path has the same cross-node
failure mode as signal/wait/putWithSignal on this RoCE config (see
debug history sec 10.6, 10.12). Phase 3 code is believed correct:
overriding the NVLS ptrs with nullptr reproduces the same legacy hang,
so the code should activate cleanly on hardware with working IB.
2026-05-09 19:25:29 +00:00
Qinghua Zhou
8f2c4e7d98 ext/ep: route LL internode atomics over NVL72 fabric (Proposal A)
Azure CX-7 RoCE has IBV_ATOMIC_NONE so the proxy emulated atomicAdd hangs LL internode tests. Bypass RDMA atomics for the LL path by routing peer pointers through cuMem fabric IPC over the NVL72 NVSwitch fabric (intra-node CUDA-IPC, cross-node fabric handles imported via nvidia-imex). LL kernels then perform direct st.global + atomicAdd on peer pointers.

- buffer.cc: allocate rdma_buffer_ptr via mscclpp::detail::gpuCallocPhysical (POSIX_FD|FABRIC handle types) so it is eligible for cuMem fabric IPC.

- buffer.cc: lift LL IPC fast-path gate from low_latency_mode && num_rdma_ranks==1 to low_latency_mode; drop cudaIpcGet/OpenMemHandle exchange and resolve peer bases from RegisteredMemory::data() (mscclpp CudaIpc transport handles fabric handle import).

- buffer.hpp: peer_rdma_bases is std::vector<void*> sized to num_ranks (was fixed-size NUM_MAX_NVL_PEERS); destructor relies on RegisteredMemory shared_ptrs for IPC mapping cleanup.

Validated on 2x Azure GB200 (8 ranks): LL dispatch/combine PASS with bit-exact results; LL bench at 128 tokens/h7168/topk=8 hits 39.9us dispatch, 37.7us combine (~3 TB/s aggregated).
2026-05-08 22:04:35 +00:00
Qinghua Zhou
5d16ac958e EP GB200 (4 GPUs/node) support
- configs.cuh: NUM_MAX_NVL_PEERS 8 -> 4
- internode.cu: introduce NvlPackT (uint64_t for 8 peers, uint32_t for 4)
  to handle packed-bool loads of is_token_in_rank; relax SourceMeta
  static_assert; replace 4 uint64_t-coupled sites
- buffer.hpp/buffer.cc: relax NUM_MAX_NVL_PEERS assert (4 || 8); read
  MSCCLPP_EP_LOCAL_WORLD_SIZE env to override rdma_rank/nvl_rank
  partitioning when local world size != NUM_MAX_NVL_PEERS
- CMakeLists.txt (ext/ep): rpath / install fix
- pyproject.toml: MSCCLPP_BUILD_EXT_EP=ON
- src/core/atomicadd_kernel.cu, kernels/buffer.cuh, kernels/utils.cuh:
  related EP fixes
- test_internode_multirank.py: NUM_MAX_NVL_PEERS=4, rank % 4
2026-05-08 01:42:21 +00:00
Qinghua Zhou
e87c66a85d ext/ep: apply clang-format and black to fix CI lint failures
Run `tools/lint.sh cpp` (clang-format 14) and `tools/lint.sh py`
(black) over the EP extension files added by this PR. No functional
changes; pure reformatting to satisfy the cpplint and pylint CI jobs.
2026-05-06 04:12:20 +00:00
Qinghua Zhou
01032fa167 core: TODO notes on CUDA-IPC atomicAdd context/flush caveats
Annotate the two known issues flagged by Copilot review on PR #796:

- atomicadd_kernel.cu: launching the atomicAdd kernel from a separate
  CUDA context while `dst` is a CUDA-IPC mapping registered in the
  primary context is technically UB; works in practice on current
  drivers but should be revisited.
- context.cc: `CudaIpcStream::sync()` deliberately skips
  `proxyAtomicStream_` to avoid deadlocking the proxy thread, with
  the side effect that `Connection::flush()` does not order pending
  remote atomicAdd ops on the CUDA-IPC transport.

Both behaviors were cherry-picked from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and should be revisited before this
lands on `main`.
2026-05-06 03:44:10 +00:00
Qinghua Zhou
23e8ce6dbe ext/ep: add pragma once to event.hpp and update validation docs
- Add #pragma once to src/ext/ep/event.hpp; including it in multiple TUs
  would otherwise redefine EventHandle.
- python/mscclpp/ext/ep/buffer.py: low-latency internode is now validated
  on 2x H100x8; remove the 'untested on multi-node H100' note.
- src/ext/ep/kernels/internode_ll.cu: replace the untested-on-multi-node
  WARNING with the current validated-on-2x-H100x8 status.

Addresses Copilot review comments on PR #796.
2026-05-06 03:24:34 +00:00
Qinghua Zhou
c641487c55 ext/ep: fix SWITCH_* macros and add missing standard headers
- Wrap SWITCH_* macros in launch.cuh in do { ... } while(false) so the
  trailing while(false) terminates the macro instead of dangling after
  the closing brace of the switch.
- Add #include <type_traits> to utils.cuh for std::remove_reference used
  in UNROLLED_WARP_COPY.
- Add #include <limits> to intranode_kernel.cu and internode.cu for
  std::numeric_limits.

Addresses Copilot review comments on PR #796.
2026-05-06 03:18:39 +00:00
Qinghua Zhou
b2880652ce ext/ep: remove unused mscclpp_ep CMake target
The mscclpp_ep SHARED library target globbed buffer.cc and bindings.cpp
(which depend on Torch and define PYBIND11_MODULE) but did not link
against Torch or torch_python, and would have produced duplicate
pybind11 module symbols alongside mscclpp_ep_cpp. The target was unused;
remove it.
2026-05-06 03:09:04 +00:00
Qinghua Zhou
075a43ade7 ext/ep: remove outdated single-rank smoke test
2026-05-06 03:09:04 +00:00
Qinghua Zhou
5178155be8 ext/ep: add MIT license headers to EP sources and tests
2026-05-06 02:42:49 +00:00
Qinghua Zhou
89f17dab5b Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-05 19:28:37 -07:00
Qinghua Zhou
89cb62d047 Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-05 19:24:28 -07:00
Qinghua Zhou
fdf7d579dc ext/ep: optional preallocated outputs for low_latency_dispatch
Add optional out_packed_recv_x / out_src_info / out_layout_range /
out_count parameters to Buffer::low_latency_dispatch so callers can
hoist the four recv-side allocations out of a hot loop, mirroring the
existing out= path on low_latency_combine.

The bench in test_low_latency_multirank.py preallocates these tensors
once and passes them on every iter so the timed loop reflects kernel
cost, not torch.empty + caching-allocator overhead.
2026-04-30 18:45:44 +00:00
Qinghua Zhou
2529774868 tests/ep: intranode send-side counts unique (token, dst_node) to match NCCL-EP
Previously total_send_tokens was the sum over dst_rank of num_tokens_per_rank,
which over-counts intra-node fan-out. NCCL-EP's ep_bench collapses
multiple destinations on the same node into one count; on a single-node
run that means total_send_tokens = number of tokens with at least one
valid expert. Switching to is_token_in_rank.any(dim=1).sum() makes the
send-side BW comparable to NCCL-EP's send: total_bw / nvl_bw line.
2026-04-29 23:31:47 +00:00
Qinghua Zhou
6ad82e8bbe tests/ep: disable NCCL HeartbeatMonitor to silence mpirun shutdown noise
Set TORCH_NCCL_ENABLE_MONITORING=0 before importing torch.distributed.
The barrier+destroy_process_group finally block (afbdcd6a) suffices
under torchrun, but under mpirun rank 0 (the TCPStore server) can exit
before non-zero ranks finish teardown, and the background heartbeat
thread polls the store and logs 'recvValue failed / Connection was
likely closed'. Disabling the monitor outright is safe for short-lived
bench runs.
2026-04-29 20:44:37 +00:00
Qinghua Zhou
e752dbaf97 tests/ep: add NCCL-EP six-metric BW breakdown (send/recv x total/nvl/rdma)
For HT intra/internode benches, compute per-rank avg total_send/rdma_send
and total_recv/rdma_recv token counts (matching NCCL-EP ep_bench
accounting) and print send-side and recv-side BW split into total / nvl
/ rdma columns. Combine reverses send<->recv. Byte-count line mirrors
NCCL-EP's '(per rank avg)' summary so numbers are directly comparable.
2026-04-29 20:44:10 +00:00
Qinghua Zhou
f2feb120b8 ext/ep: refresh README to reflect current LL, proxy sharding, and bench harness
2026-04-29 18:26:36 +00:00
Qinghua Zhou
9213587ffe ep tests: report dispatch/combine min, avg, max time and use avg for BW
Aligns with NCCL-EP's ep_bench convention (BW computed from average time
across ranks). Previously we reported only the max time and computed BW
per-rank, which made our numbers more pessimistic than NCCL-EP's.
2026-04-29 16:50:33 +00:00
Qinghua Zhou
afbdcd6a3d ep tests: clean shutdown to silence TCPStore/HeartbeatMonitor noise
Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.

Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
2026-04-29 05:16:22 +00:00
Qinghua Zhou
0227626335 ep: env-tunable + arch-aware num_proxy_services (default 8 on Hopper, 1 on Blackwell)
Override at runtime with MSCCLPP_EP_NUM_PROXIES.
N=8 is the knee on H100+IB; N>=12 collapses from CPU oversubscription.
Intra-node LL is unchanged.
2026-04-27 21:28:29 +00:00
Qinghua Zhou
7e8415580f ep: shard PortChannels across multiple ProxyServices
Each mscclpp::ProxyService spawns one host-side proxy thread that
drains its FIFO and posts IB work requests. With LL combine pushing
~1k put + 60 atomicAdd FIFO entries per iter, that single thread is
the wall-clock bottleneck on cross-node runs.

Split the channel set across kNumProxyServices=4 separate services
so the host-side dispatch parallelism scales linearly. SemaphoreIds
and MemoryIds are scoped to a ProxyService, so:

- addMemory() is broadcast to every service in the same global order
  so a single MemoryId still identifies the memory everywhere.
- Each (peer_rank, channel_idx) is assigned to one proxy_idx via
  round-robin; the resulting PortChannel is built on that proxy and
  inherits its FIFO. The kernel is unchanged: the flat handle array
  routes the right way automatically.

No kernel-level changes, no tuning of QP count, no new env knobs.
2026-04-27 20:24:00 +00:00
Qinghua Zhou
48540bc11e tests/ep: align internode HT bench with NCCL-EP accounting
Same change as the intra-node bench (commit 4ed6f229), applied to the
cross-node test:

- Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can
  match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8).
- Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden,
  matching NCCL-EP's `RDMA_send` per-rank byte count.
2026-04-27 17:39:17 +00:00
Qinghua Zhou
4ed6f229f2 tests/ep: align intranode HT bench with NCCL-EP accounting
- Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can
  match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8). The
  functional check above continues to use the smaller (num_ranks*4
  experts, topk=4) configuration.

- Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden,
  matching NCCL-EP's `RDMA_send` per-rank byte count. The previous
  formula counted DeepEP's expanded recv layout (one row per
  (token,src_rank) pair), inflating reported GB/s ~5x and making
  cross-stack comparisons misleading.
2026-04-27 17:14:42 +00:00
Qinghua Zhou
9c129b8b5a ep/ll: keep (3,10) on PortChannel path
Cross-node LL regressed when (1, 32) was applied uniformly: dispatch
1031us -> 1570us, combine 2553us -> 3484us. Larger grid means more
concurrent putWithSignal calls onto the host-proxy FIFO and a costlier
cg::this_grid().sync() between phases, both of which dominate the IB
path even though more SMs help the recv-side compute.

Make (kNumWarpGroups, kNumWarpsPerGroup) path-dependent: (1, 32) when
use_ipc_path, (3, 10) otherwise. Restores cross-node performance and
keeps the intra-node win.
2026-04-25 01:00:35 +00:00
Qinghua Zhou
7d6efee18b ep/ll: use 1 expert per SM with 32 warps per block
NCCL-EP's LL dispatch/combine kernel uses (numWarpGroups=1,
numWarpsPerGroup=32) when num_experts <= device_num_sms, giving each
SM ownership of a single expert and 32 warps to cooperate on its
recv-side per-(expert, src_rank) work. We were using (3, 10) — 3
experts per SM, 10 warps per (expert, rank) pair — which left a
significant amount of recv-side parallelism on the table because each
warp had to walk ~3x more tokens sequentially.

Switching to (1, 32) for both dispatch and combine matches NCCL-EP's
structure for typical EP sizes (num_experts in {32, 64, 256}) where
num_experts <= 132 SMs.

The static_assert kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup
still holds (9 <= 32) and the wider block also lets the staging loop
process the hidden-dim with one int4 per thread (hidden_bf16_int4=896
fits easily in 992 working threads).
2026-04-25 01:00:35 +00:00
Qinghua Zhou
1600074f09 tests/ep: hoist combine output tensor out of the timed loop
The LL combine benchmark was cloning the ~58 MB dispatch recv buffer
('recv_x.clone()') on every timed iteration, adding ~20 us of D2D
memcpy per sample and masking kernel-level changes. It also called
torch.empty() for the output inside the loop. Both now live outside
the timed region; the kernel is invoked against a persistent bench_out
and the recv_x produced by the most recent dispatch.
2026-04-24 21:06:49 +00:00
Qinghua Zhou
6d0f99f084 ep/ll: gate SM-count grid bump behind IPC path
On the PortChannel (cross-node) path the extra blocks don't help: the
dispatch recv loop strides tokens per-warp-group (not per-SM), and the
additional blocks instead add cooperative-grid sync overhead and
increase concurrent host-proxy FIFO traffic. Measured cross-node
dispatch regressed from 1013us to 3063us when the unconditional grid
bump was active.

Keep the scaled grid for the IPC path (intra-node), where combine-recv
and dispatch token striding scale with sm_id and the 1.2-1.3x speedup
reproduces.
2026-04-24 00:02:33 +00:00
Qinghua Zhou
85316b1863 ep/ll: bump grid size to better utilize SMs
LL dispatch/combine are latency-bound at typical problem sizes: for
num_experts=32 the previous grid was cell_div(32,3)=11 blocks, i.e. 8%
of a 132-SM H100. The recv-side bodies already stride tokens by sm_id,
so extra blocks parallelize token work linearly. Extra blocks past
num_experts are gated out of the send/count phases by the existing
'responsible_expert_idx < num_experts' check.

Cap at the device's SM count (cooperative launch + launch_bounds(960,1)
allow one block per SM).
2026-04-23 23:38:23 +00:00
Qinghua Zhou
9840853c69 tests/ep: HT benches also print per_rank_bw
Same alignment with NCCL-EP ep_bench as the LL test: report both
per-rank (agg/num_ranks) and aggregate throughput.
2026-04-23 22:58:23 +00:00
Qinghua Zhou
10cd0012f1 tests/ep: LL bench prints per_rank_bw and accepts size env vars
- Report both per-rank and aggregate BW to align with NCCL-EP's ep_bench
  (which reports per-rank GB/s).
- Accept MSCCLPP_EP_LL_TOKENS/HIDDEN/TOPK/EXPERTS_PER_RANK env overrides
  so we can match external benchmark problem sizes (NCCL-EP LL defaults
  are num_tokens=128, hidden=7168, top_k=8).
2026-04-23 22:20:40 +00:00
Qinghua Zhou
63afb25ab3 tests/ep: LL bench combine uses recv_tokens×hidden for payload bytes
Each local expert sends one copy per dispatched token back to its owner,
so the bytes actually on the wire during combine match dispatch. The
previous num_tokens×hidden under-counted by ~num_topk×, making combine
BW look artificially low next to dispatch.
2026-04-23 21:53:34 +00:00
Qinghua Zhou
b0eb5da53d ext/ep: LL intra-node fast path via CUDA IPC + MemoryChannel
When all ranks live on the same host (num_rdma_ranks == 1), the LL
kernels now bypass PortChannel/IB-loopback entirely. In Buffer::sync()
we additionally:
  - allGather IPC handles for each rank's rdma_buffer_ptr and
    cudaIpcOpenMemHandle them into peer_rdma_bases[]
  - build per-peer MemoryChannels over CUDA IPC connections (tag=2)
    used only for the LL barrier ring

The three LL kernels (clean / dispatch / combine) gain a kIpcPath
template parameter and two extra args (peer_rdma_bases,
memory_channel_handles). At each peer op:
  - put -> peer-mapped warp copy over NVLink
  - atomicAdd-like flag store -> single-writer st_na_release on peer ptr
  - signal/wait barrier -> MemoryChannel signal/wait

Cross-node LL (num_rdma_ranks > 1) is untouched; the IPC setup block is
a no-op. The host launch wrappers select the variant via use_ipc_path.
2026-04-23 21:10:39 +00:00
Qinghua Zhou
906fa3c48f tests/ep: size HT buffers for bench hidden so bench phase fits
2026-04-23 17:13:09 +00:00
Qinghua Zhou
441bfa5265 ext/ep: keep self slots in LL sema+port-channel loops (fixes cross-node)
The prior commit skipped r==rank in the semaphore and port-channel
build loops on the theory that the self-slot handshake skew was the
cause of LL direction asymmetry. That was wrong (the real bug was
int32 atomic alignment), and skipping self breaks other code paths
that assume every rank slot is represented -- cross-node HT and LL
failed with cudaErrorInvalidResourceHandle at the first barrier after
Buffer init. Restore the self-inclusive loop.
2026-04-23 15:58:57 +00:00
Qinghua Zhou
a073ca7bef ext/ep: restore self CUDA-IPC connection (was needed by HT/LL paths)
Dropping the self ipc_cfg connection caused cudaErrorInvalidResourceHandle
on multi-node launches. Keep the self connection (needed by other code
paths that assume every rank is in the connections map) but continue to
skip the self slot in the semaphore + port-channel construction loops so
the kernel's [local_expert*num_ranks + dst_rank] indexing hits only peer
handles; the self slot is a zero-initialized placeholder since the
kernel's same-rank branch uses a direct warp copy.
2026-04-23 06:42:34 +00:00
Qinghua Zhou
1e430874ce ext/ep: fix LL IB atomicAdd alignment by widening signaling buffers to int64
The low-latency dispatch/combine kernels signal recv counts via MSCCL++
PortChannel.atomicAdd, which lowers to IB IBV_WR_ATOMIC_FETCH_AND_ADD.
That opcode requires the remote address to be 8-byte aligned, but
LowLatencyLayout packed the per-expert signaling slots as int32. Odd
slots landed at offset %8 == 4; the NIC silently dropped those atomics
and the target rank spun forever in recv_hook (observed: even->odd
direction works, odd->even does not, across all tested topologies
including 2-rank intra-node, 8-rank intra-node, and 2-node 1-GPU-each).

Widen dispatch_rdma_recv_count_buffer / combine_rdma_recv_flag_buffer to
int64_t, update clean kernel + kernel signatures + next_clean pointers
accordingly, and add int64_t overloads for st_na_release /
ld_acquire_sys_global in utils.cuh.
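
The alignment problem can be reproduced with plain offset arithmetic. This sketch assumes the signaling buffer's base address is itself 8-byte aligned; `slot_offsets` and `misaligned_slots` are hypothetical helpers.

```python
def slot_offsets(num_slots, slot_bytes):
    """Byte offset of each signaling slot from an (assumed) 8-byte-aligned base."""
    return [i * slot_bytes for i in range(num_slots)]


def misaligned_slots(num_slots, slot_bytes, required_alignment=8):
    """Indices of slots whose offset violates the atomic's alignment requirement."""
    offsets = slot_offsets(num_slots, slot_bytes)
    return [i for i, off in enumerate(offsets) if off % required_alignment != 0]


int32_bad = misaligned_slots(4, slot_bytes=4)  # packed int32 slots
int64_bad = misaligned_slots(4, slot_bytes=8)  # widened int64 slots
```

With 4-byte slots every odd index sits at offset % 8 == 4, matching the odd-slot failures described above; with 8-byte slots no index is misaligned.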

Also drop the bogus self CUDA-IPC connection in Buffer::sync() that was
previously skewing the cross-rank buildAndAddSemaphore handshake order;
the kernel's same-rank branch uses a direct warp copy and never touches
the self port-channel slot (filled with a zero-initialized placeholder
so the [local_expert*num_ranks + dst_rank] indexing still holds).
2026-04-23 06:31:18 +00:00
Qinghua Zhou
c51a8a5305 ext/ep tests: time dispatch and combine separately in MSCCLPP_EP_BENCH
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).

Applies to intranode HT, internode HT, and the (currently unreachable on
intra-node 8-GPU) LL test. Internode HT keeps the sync+barrier guard
between dispatch and combine but excludes it from either phase's timing.
2026-04-22 23:11:04 +00:00
Qinghua Zhou
2391ce1de7 ext/ep tests: add optional HT benchmark pass
Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports
per-iter latency (max across ranks, CUDA-event timed) and aggregate
effective bandwidth (sum across ranks, dispatch+combine payload bytes).
Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN.

Bench reuses the Buffer allocated for the correctness phase and
self-skips if the requested hidden exceeds the per-peer NVL/RDMA budget.
2026-04-22 19:03:09 +00:00
Qinghua Zhou
f0a72263c8 ext/ep: unfilter LL sync + add LL multirank test (intra-node WIP)
- Buffer::sync no longer drops peers with a different GPU id in low_latency_mode.
  DeepEP's original filter was safe because its LL path used NVSHMEM; this
  port drives LL via PortChannel so the kernel indexes
  port_channel_handles[local_expert*num_ranks + dst_rank] for every
  dst_rank. All peers now get a real memory/connection/semaphore/port
  channel entry.
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
  functional round-trip, BF16 only). Works cross-node in DeepEP's
  1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
  intra-node 8-GPU LL currently hangs for two separate reasons. First, every
  peer transfer routes through the CPU proxy over IB loopback between
  distinct HCAs on the same host. Second, CudaIpcConnection::atomicAdd is a
  64-bit op, which misaligns the 32-bit rdma_recv_count slots when used for
  same-node peers. A proper fix needs a mixed-transport LL variant
  (MemoryChannel for same-node, PortChannel for cross-node) or 64-bit counters.
2026-04-22 06:11:30 +00:00
Qinghua Zhou
9e96bf3b5d ep: document internode HT validation on 2x H100x8
Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).

- src/ext/ep/README.md: consolidate the previously duplicated README
  into a single document; mark intranode and internode HT dispatch and
  combine as validated in the status table; add a 'Running the tests'
  section with torchrun examples for both the intranode and the 2x8
  internode setups; record the dispatch->combine
  torch.cuda.synchronize() + dist.barrier() requirement under Known
  limitations; mark Phase 2 DONE and keep Phase 3 (LL) as structural
  port, untested.

- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
  Buffer constructor docstring to say internode HT is validated and
  clarify that LL mode is untested on multi-node hardware.

- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported'
  and 'low-latency paths still stubbed' comments. mscclpp_ep does not
  use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
  paths are a structural port that is present but untested, not stubbed.
  Note validation on 2x H100x8 in the internode section header.
2026-04-22 03:56:16 +00:00
Qinghua Zhou
c351b871a1 ep: fix internode combine in multirank test
Two issues prevented internode HT combine from completing on 2x8 H100:

1. Wrong prefix matrices passed to internode_combine. Combine runs in the
   reverse direction of dispatch, so it must consume the receiver-side
   matrices returned by dispatch (recv_rdma_channel_prefix_matrix,
   recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the
   sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix.
   This matches DeepEP's deep_ep/buffer.py::internode_combine handle
   unpacking. Without the fix the NVL forwarder's 'NVL check' timed out
   because token_start_idx/token_end_idx were computed against the wrong
   per-channel layout.

2. Cross-rank race between dispatch and combine. Even with the correct
   matrices, launching combine immediately after dispatch deadlocked the
   forwarder NVL check (tail stuck one short of expected_head) because
   peers still had in-flight dispatch proxy traffic while fast ranks had
   already started combine. A torch.cuda.synchronize() + dist.barrier()
   between the two calls makes the test pass deterministically on 16
   ranks (combine diff == 0, max|expected| up to 60.0).

The barrier in the test is a workaround; the real fix belongs in
Buffer::internode_dispatch / Buffer::internode_combine so the
dispatch->combine handoff fully fences outstanding proxy work across
ranks. Marked with an XXX comment in the test.
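
The ordering of the workaround can be sketched as below. The wrapper and
its parameter names are hypothetical; in the actual test the two fences
would be torch.cuda.synchronize and torch.distributed.barrier, and the
dispatch and combine callables would wrap the Buffer calls. Stubs are
used here so the sketch runs without GPUs.

```python
def dispatch_then_combine(dispatch, combine, device_sync, barrier):
    """Run dispatch, fence locally and across ranks, then run combine."""
    handle = dispatch()
    device_sync()  # wait for this rank's outstanding device work
    barrier()      # wait until every rank has reached this point
    return combine(handle)


# Stub callables that only record the order in which they were invoked.
calls = []


def _dispatch():
    calls.append("dispatch")
    return "handle"


def _combine(handle):
    calls.append("combine")
    return handle


result = dispatch_then_combine(
    _dispatch,
    _combine,
    device_sync=lambda: calls.append("sync"),
    barrier=lambda: calls.append("barrier"),
)
print(calls, result)
```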
2026-04-22 02:21:29 +00:00