Inside the IBGDA template branch, runtime-check whether the host has
opened a CUDA IPC peer pointer for the destination rank. If yes, do
the send via NVLink (warp copy / st_na_release on the peer-mapped
pointer); else fall through to the existing port_put / rdma_write_inl8.
Host: in sync(), when low_latency_mode && num_rdma_ranks > 1 && IBGDA
is up, allgather rdma_buffer_ptr IPC handles and cudaIpcOpenMemHandle
only for same-node peers. Sparse pointer table is mirrored to GPU and
threaded into the launchers as peer_bases.
Kernel: per-peer branch added at all four RDMA send sites (dispatch
send-data, dispatch send-count, combine send-data, combine send-flag).
Recv-side polling is transport-agnostic and unchanged.
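A minimal sketch of the per-peer branch at one send site (peer_bases as described above; the function shape and warp-copy loop are illustrative, not the actual kernel code):

```cuda
// Hypothetical shape of the per-peer branch; peer_bases is the sparse
// pointer table mirrored to the GPU (nullptr for cross-node peers).
__device__ void send_payload(int dst_rank, void* const* peer_bases,
                             const int4* src, size_t dst_offset, int num_int4) {
  if (peer_bases[dst_rank] != nullptr) {
    // Same-node peer: warp copy straight over NVLink via the IPC mapping.
    int4* dst = reinterpret_cast<int4*>(
        static_cast<char*>(peer_bases[dst_rank]) + dst_offset);
    for (int i = threadIdx.x % 32; i < num_int4; i += 32) dst[i] = src[i];
  } else {
    // Cross-node peer: unchanged IBGDA path (port_put / rdma_write_inl8).
  }
}
```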
Result on 16-rank/2-node LL bench:
baseline (IBGDA only): 38.7 / 39.4 GB/s
Phase 11 hybrid: 65.9 / 67.0 GB/s (+70%)
Now matches nccl-ep default-mode numbers (63-71 / 62-72 GB/s).
Validation max diff = 0.
Gated by MSCCLPP_EP_HYBRID_LL env (default on). Single-node LL is
untouched (num_rdma_ranks>1 gate).
Side-by-side at TOKENS=128 hidden=7168 topk=8 BF16 16-rank/2-node:
mscclpp_ep LL : 38.7 / 39.4 GB/s
nccl_ep LL (default) : 63-71 / 62-72 GB/s (NVLink+RDMA mixed)
nccl_ep LL --disable-nvlink (RDMA-only): 42.3 / 42.5 GB/s
The default nccl-ep number routes roughly half of the expert selections
through NVLink at NVLink line rate; the weighted average inflates
per-rank throughput above the per-NIC ceiling. Apples-to-apples (RDMA
only), nccl-ep is at 42.3 GB/s and we are at 38.7, a real but small
~9% gap.
Architectural cause: our internode_ll.cu uses a whole-kernel kIpcPath
template (all peers via NVLink OR all via RDMA). nccl-ep selects per
peer, so each rank's 7 intranode peers (of 15) go via NVLink while its
8 cross-node peers go via RDMA.
Implication: Phases 1-9 were optimizing against an unreachable target.
The kernel is at 94% of the actual single-NIC RDMA ceiling. Further
productive work is hybrid NVLink+RDMA dispatch, not RDMA tuning.
- Refactor IbgdaSetup to support N NICs per rank: each rank picks
nic[i] = numa_base + (effective_base + i) % 4 (NUMA-aware).
- Channel c routes via ib_ctxs[c % num_nics]. d_local_mrs becomes a
vector of size num_nics; d_remote_mrs becomes num_nics * num_ranks.
- New env MSCCLPP_EP_NUM_NICS (default 1 = identical to Phase 8 baseline).
- Debug env MSCCLPP_EP_NIC_DUP forces all ctxs onto the same NIC
to isolate multi-IbCtx overhead from real multi-NIC routing.
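A sketch of the pick and routing logic described above (variable names follow the commit; the d_remote_mrs indexing order and surrounding code are assumptions):

```cpp
// NUMA-aware pick: rank-local context i lands on one of the 4 NICs
// behind this GPU's NUMA domain.
int pick_nic(int numa_base, int effective_base, int i) {
  return numa_base + (effective_base + i) % 4;
}

// Routing: channel c posts through ib_ctxs[c % num_nics]; the matching
// remote MR would live at d_remote_mrs[(c % num_nics) * num_ranks + peer].
```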
Empirical findings on 2x8xH100 NDv5 with TOKENS=128 / TOPK=8:
- N=1: 38.7/39.4 GB/s (baseline preserved).
- N=2: 15.4/17.0 GB/s; N=4: 9.0/9.6 GB/s. Strict monotonic regression.
- NIC_DUP=1, N=2: 38.6/39.0 GB/s — multi-PD overhead is zero.
Conclusion: regression is purely from PCIe topology — H100 has a single
efficient P2P path to its NUMA-affine NIC. Posting WRs to any other NIC
forces cross-PCIe-switch hops that dominate any bandwidth gain.
Multi-NIC plumbing is left in place behind opt-in env so the path can
be re-evaluated on different hardware. Single-NIC ceiling ~41 GB/s
stands.
Profiled the combine kernel (TOKENS=128, TOPK=8, BF16 hidden=7168, 16
ranks, IBGDA, grid (1, 32) -> 64 blocks):
send=17us wait=220us grid_sync=112us reduce=9us total=357us
Reduce (the only place TMA/cp.async could help) is 9us / 357us = 2.5%
of total kernel time. Halving it best-case yields <1% end-to-end perf.
The 220us wait is the same NIC-bandwidth ceiling dispatch hits: combine
sends ~14 MB/rank/iter back from each (le, src_rank) to its source
rank. The Phase 7 speculation that combine had a non-NIC bottleneck
profile was wrong: both paths are NIC-bound at the same ceiling.
The 112us grid_sync interval is the same within-rank wait skew from
Phase 7.3 (per-(le, src_rank) NIC contention). Not addressable in
software.
Decision: not implementing the cp.async pipeline. The 50+ LoC of
shared-memory plumbing and __pipeline synchronization is not justified
by the <1% upper bound.
Updated synthesis: both LL paths are single-NIC bandwidth bound at this
problem size. Kernel-side software work is essentially done at 38.9 /
39.6 GB/s = 94-97% of the 41 GB/s practical ceiling. Remaining options
are multi-NIC striping, smaller payload (FP8 / smaller hidden), or NDR2
hardware uplift.
After Phase 6 grid (1, 32) tuning landed at 38.7 / 39.6 GB/s, explored
three more avenues; none produced material gains. All three confirm the
same underlying constraint: single-NIC bandwidth ceiling.
Multi-SGE WR coalesce (kMultiSge template flag, max_send_sge bumped to 4):
WR count per QP fell 16 -> 4 (4x reduction) but wait dropped only 7%
(212 -> 197us). Net dispatch +0.3 GB/s. 1008 WRs/iter * 14336 B = 14.4 MB
per rank per iter; at NDR 50 GB/s NIC ceiling that is 288us min wire time.
Current 38.7 GB/s payload + 7% IB overhead = 41 GB/s on wire = 82% of
single-NIC ceiling. Reverted (stashed); +0.3 GB/s does not justify the
kernel restructure or ABI change.
CUDA Graph capture: incompatible with the IBGDA QP state model.
reserve_wqe_slots does atomicAdd on device-side resv_head per call;
graph replay re-executes WQE writes verbatim while the QP state has
already advanced past the captured snapshot, so subsequent replays write
into stale / wrapped SQ slots and the doorbell index is wrong.
Cross-rank skew: per-rank wait_avg ranges 188-226us (38us spread); the
slowest ranks are r2 and r7 (node 1) and r8 (node 2). Within-rank
variance dominates: a single rank's wait spans a 200us average to a
321us max across its 64 blocks, reflecting per-(le, src_rank) NIC
contention that is not addressable in software. Best-case skew
elimination would save ~25us = ~7%; not pursued.
Synthesis: practical single-NIC LL ceiling at this problem size is
~41 GB/s payload. Phase 6's 38.7 / 39.6 sits at 94-97% of that ceiling.
Further gains require multi-NIC striping, a different problem size, or
NDR2 hardware. Combine path TMA / cp.async pipeline is the remaining
software option since it has a different (non-NIC) bottleneck profile.
Sweep on 2x8xH100 NDR (TOKENS=128, TOPK=8, num_experts=64, num_ranks=16,
BF16 hidden=7168) shows the IBGDA RDMA path benefits from the same
(kNumWarpGroups=1, kNumWarpsPerGroup=32) topology as the IPC path, not
the previous (3, 10).
config            blocks  dispatch GB/s  combine GB/s
(3, 10) baseline      22          36.32         37.43
(2, 16)               32          ~36.6         ~36.9
(4, 8)                16          34.21           n/a
(1, 32) adopted       64          38.65         39.43
The previous (3, 10) comment justified itself via host-proxy FIFO
contention, but that rationale only applies to the PortChannel path;
on the IBGDA path there is no host proxy in the data path. With
num_sms_base = ceil(num_experts/kNumWarpGroups), (1, 32) grows the grid
22 -> 64 blocks (1 expert per SM) and the recv-side unpack body
(strided by sub_warp_id) runs ~3x faster with 32 warps per group
instead of 10. Send WR count is unchanged so wait stays NIC-bound.
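The grid arithmetic, spelled out (constants per the sweep above):

```cpp
constexpr int kNumWarpGroups = 1, kNumWarpsPerGroup = 32;
constexpr int num_experts = 64;
// num_sms_base = ceil(num_experts / kNumWarpGroups): 64 blocks with (1, 32),
// one expert per SM; the old (3, 10) gave ceil(64 / 3) = 22 blocks.
constexpr int num_sms_base =
    (num_experts + kNumWarpGroups - 1) / kNumWarpGroups;
static_assert(num_sms_base == 64);
```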
Profile delta: send 55->28us, wait ~190us (unchanged), unpack ~7->5us,
kernel total ~310->282us. Per-rank BW now 77% of NDR HCA ceiling.
Also adds LL_OPTIMIZATION_HISTORY.md documenting the full Phase 1-6
investigation (K-shard, eager DB, grid_sync, TOPK, grid sweep).
Make the channel count tunable (default still 16) so QP count can be
swept for diagnostic purposes.
Sweep results at TOKENS=128/TOPK=8 BF16, 16 ranks across 2 nodes:
channels=4 dispatch=37.16 combine=36.57 agg=1180 GB/s
channels=8 dispatch=36.95 combine=36.64 agg=1177 GB/s
channels=16 dispatch=36.18 combine=37.36 agg=1177 GB/s
channels=32 dispatch=36.45 combine=37.67 agg=1186 GB/s
QP count is essentially flat across 4..32 (within ±2% measurement
noise), confirming SQ/doorbell coalescing is not the bottleneck. The
~200us 'wait' window in both dispatch and combine is genuine RDMA
arrival jitter, not back-pressure from over-provisioned QPs.
Add a per-block timestamp slot just after cg::this_grid().sync() in the
combine kernel so the previous 'reduce' window is decomposed into:
[b*8 + 0] send entry
[b*8 + 1] send done
[b*8 + 2] recv-flag spinwait done
[b*8 + 3] grid_sync done
[b*8 + 4] kernel done
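A minimal sketch of the stamping behind the slots above (helper name and exact placement are illustrative):

```cuda
// One thread per block records a clock64() tick into its slot.
__device__ __forceinline__ void stamp(unsigned long long* prof_buf, int slot) {
  if (threadIdx.x == 0) prof_buf[blockIdx.x * 8 + slot] = clock64();
}
// e.g. stamp(prof_buf, 3) immediately after cg::this_grid().sync(), so
// slot 3 minus slot 2 isolates the grid_sync window from the reduce pass.
```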
At TOKENS=128/TOPK=8 IBGDA this reveals the breakdown is
send=23us wait=200us grid_sync=110us reduce=22us total=355us
not the previously assumed 'reduce=130us' (which lumped grid_sync into
the arithmetic). The actual int4-load + bf16 FMA pass is only ~22us, so
a TMA-pipelined receive cannot meaningfully recover the gap vs nccl-ep:
the difference is on the sender / RDMA arrival side, not the reducer.
Print the resolved `mlx5_ibN` device name during build_ibgda_setup so
NIC affinity is visible at startup, and let MSCCLPP_EP_IB_DEVICE_OVERRIDE
force a specific IB transport index (0..7) for diagnostic NIC swaps.
Confirmed on NDv5 (InfiniBand, 8 NICs/node, 8 GPUs/node) that the
default `device_id == local_rank` mapping already yields the correct
NUMA-affine NIC for every rank, leaving no NIC-side headroom in this
configuration.
Extend the LL dispatch profile from 4 to 16 per-block uint64_t slots so
we can observe per-warp_group wait and unpack times separately:
[b*16 + 0] send phase entry
[b*16 + 1] send phase done
[b*16 + 2] kernel done
[b*16 + 4 + wg] warp_group wg unpack-start (recv-count ack received)
[b*16 + 8 + wg] warp_group wg unpack-end (after for-loop)
This decomposes the previous lumped 'unpack' (kernel_exit minus
last_wg_ack) into the actual per-warp_group copy time and the per-
warp_group network wait, exposing that the copy itself is ~6 us / wg
and the bulk of the time is network arrival jitter.
Move the combine prof_buf base from offset 65536 to 196608 so it sits
past the dispatch prof_buf (1024 + 1024*16*8 = 132096), preserving the
no-collision guarantee with dispatch's atomic counters.
Add MSCCLPP_EP_LL_PROFILE_PRINT_EVERY=N (default 1) so readback /
cudaMemcpy / cudaMemset only run every Nth call. With EVERY=29 over a
30-iter bench (10 warmup + 20 timed) the readback fires once and the
PROFILE-ON BW penalty drops from ~10% to ~3% vs PROFILE-OFF, making
the instrumentation usable during real benchmarks.
Add an off-by-default CMake option MSCCLPP_EP_LL_PROFILE that compiles
in per-block clock64() timestamps in the low-latency dispatch and
combine kernels, plus host-side readback in buffer.cc that prints a
min/avg/max breakdown when MSCCLPP_EP_LL_PROFILE_PRINT=1.
Per-block slots written to workspace tail (layout [block][4] uint64_t):
[0]=phase entry [1]=send done [2]=wait done [3]=kernel exit
Workspace offsets are chosen to coexist with the dispatch atomic
counters (atomic_counter_per_expert + atomic_finish_counter_per_expert
occupy [0..2*num_experts*sizeof(int)]). Dispatch prof_buf at offset
1024; combine prof_buf at offset 65536. The earlier draft placed
combine prof at offset 256, which silently corrupted dispatch's
atomic_finish_counter_per_expert between iterations and caused iter
1+ to hang in dispatch's FINISHED_SUM_TAG spinwait.
Profile breakdown (us, 16 ranks, TOKENS=128/TOPK=8) confirms combine
is wait-dominated:
combine send~10 wait~50 reduce~40
dispatch send~30 wait~100 unpack~50 (FP8 cast in send)
OFF by default (zero overhead). Enable with:
cmake -DMSCCLPP_EP_LL_PROFILE=ON .
MSCCLPP_EP_LL_PROFILE_PRINT=1 ./run_ll_mpirun.sh
Add a GPU-initiated RDMA WRITE path for the LL dispatch/combine kernels
based on mlx5dv direct verbs, alongside the existing IPC and host-FIFO
PortChannel paths. Selected at runtime via MSCCLPP_EP_USE_IBGDA when
num_rdma_ranks > 1.
Core (src/core, include/mscclpp):
- New ibgda module (ibgda.{hpp,cc}, ibgda_device.cuh): per-peer mlx5
QP/MR/CQ setup, device-side WQE writers (write_rdma_wqe,
write_rdma_write_inl_wqe for 4B/8B), submit_requests / submit_no_db
ring helpers, and a poller thread for send CQs.
- ibgda_port_channel_device.{hpp,cuh}: thin port_put() wrapper over
rdma_write with signal_cqe / ring_db flags so callers can issue
UNSIGNALED batched WRs and ring the doorbell once at the tail.
- mlx5dv_wrapper: expose extra symbols needed for direct WQE
construction; minor connection.cc / proxy.cc / port_channel.cc
plumbing to surface QP / MR handles and rkeys to the EP layer.
EP layer (src/ext/ep):
- ibgda_setup.{hpp,cc}: build per-(local_expert, peer_rank) GpuQp
handles, exchange remote MR addr/rkey via the bootstrap, own the
CQ poller. h.dst is set to the per-peer remote_mrs index.
- buffer.{hpp,cc}: gate IBGDA path with use_ibgda_path_ &&
ibgda_setup_ != nullptr && !use_ipc; pass device_handles to the
kernel launchers.
- kernels/internode_ll.cu: 3-way DISPATCH_LAUNCH_CASE /
COMBINE_LAUNCH_CASE (IPC / IBGDA / port-FIFO), templated on
kIbgdaPath. Data PUTs are issued UNSIGNALED with ring_db=false;
the trailing per-QP count write (dispatch) and flag write
(combine) keep the defaults so each QP gets a single signaled
WR that advances prod_idx past all queued data WRs and rings
the doorbell once (sketched after this list).
Test (test/python/ext/ep): extend test_low_latency_multirank.py with
env-driven config knobs (MSCCLPP_EP_LL_TOKENS / _HIDDEN / _TOPK /
_EXPERTS_PER_RANK) for sweeping the new path.
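The UNSIGNALED-batch / single-doorbell pattern from the internode_ll.cu bullet, in miniature (hypothetical Ring type; the real write_rdma_wqe / submit_* helpers differ in shape):

```cuda
// Hypothetical ring; the real GpuQp tracks SQ slots, CQ, and a doorbell record.
struct Ring { unsigned prod_idx; };

__device__ void post(Ring* ring, bool signal_cqe, bool ring_db) {
  unsigned slot = ring->prod_idx++;  // WQE is built into this SQ slot
  (void)slot; (void)signal_cqe;      // CQE-request bit set iff signal_cqe
  if (ring_db) { /* MMIO doorbell write publishing prod_idx to the NIC */ }
}

__device__ void send_batch(Ring* ring, int n_data) {
  for (int i = 0; i < n_data; ++i)
    post(ring, /*signal_cqe=*/false, /*ring_db=*/false);  // queued only
  // Tail count/flag write: one signaled WR plus one doorbell; the doorbell
  // publishes prod_idx past every queued data WQE, so the NIC fetches the
  // whole batch and a single CQE confirms it.
  post(ring, /*signal_cqe=*/true, /*ring_db=*/true);
}
```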
Run `tools/lint.sh cpp` (clang-format 14) and `tools/lint.sh py`
(black) over the EP extension files added by this PR. No functional
changes; pure reformatting to satisfy the cpplint and pylint CI jobs.
Annotate the two known issues flagged by Copilot review on PR #796:
- atomicadd_kernel.cu: launching the atomicAdd kernel from a separate
CUDA context while `dst` is a CUDA-IPC mapping registered in the
primary context is technically UB; works in practice on current
drivers but should be revisited.
- context.cc: `CudaIpcStream::sync()` deliberately skips
`proxyAtomicStream_` to avoid deadlocking the proxy thread, with
the side effect that `Connection::flush()` does not order pending
remote atomicAdd ops on the CUDA-IPC transport.
Both behaviors were cherry-picked from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and should be revisited before this
lands on `main`.
- Add #pragma once to src/ext/ep/event.hpp; including it more than once
in a TU would otherwise redefine EventHandle.
- python/mscclpp/ext/ep/buffer.py: low-latency internode is now validated
on 2x H100x8; remove the 'untested on multi-node H100' note.
- src/ext/ep/kernels/internode_ll.cu: replace the untested-on-multi-node
WARNING with the current validated-on-2x-H100x8 status.
Addresses Copilot review comments on PR #796.
- Wrap SWITCH_* macros in launch.cuh in do { ... } while(false) so the
trailing while(false) terminates the macro instead of dangling after
the closing brace of the switch (reduced example after this list).
- Add #include <type_traits> to utils.cuh for std::remove_reference used
in UNROLLED_WARP_COPY.
- Add #include <limits> to intranode_kernel.cu and internode.cu for
std::numeric_limits.
Addresses Copilot review comments on PR #796.
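Reduced example of the fix (illustrative macro, not the real SWITCH_*): the do { ... } while(false) wrapper makes the macro a single statement, so the call site's trailing semicolon and any surrounding if/else parse as intended.

```cpp
#define SWITCH_DEMO(v, stmt)             \
  do {                                   \
    switch (v) { default: stmt; break; } \
  } while (false)

// Parses correctly even inside an if/else chain:
//   if (cond) SWITCH_DEMO(x, run()); else fallback();
```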
The mscclpp_ep SHARED library target globbed buffer.cc and bindings.cpp
(which depend on Torch and define PYBIND11_MODULE) but did not link
against Torch or torch_python, and would have produced duplicate
pybind11 module symbols alongside mscclpp_ep_cpp. The target was unused;
remove it.
Add optional out_packed_recv_x / out_src_info / out_layout_range /
out_count parameters to Buffer::low_latency_dispatch so callers can
hoist the four recv-side allocations out of a hot loop, mirroring the
existing out= path on low_latency_combine.
The bench in test_low_latency_multirank.py preallocates these tensors
once and passes them on every iter so the timed loop reflects kernel
cost, not torch.empty + caching-allocator overhead.
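A sketch of the reuse-or-allocate pattern this enables inside the dispatch (helper name and shapes are hypothetical):

```cpp
#include <torch/torch.h>
#include <optional>

// Honor a caller-provided out tensor; otherwise fall back to the old
// per-call allocation that the hot loop previously paid for.
torch::Tensor pick_or_alloc(std::optional<torch::Tensor> out, int64_t rows,
                            int64_t hidden, const torch::TensorOptions& opts) {
  return out.has_value() ? *out : torch::empty({rows, hidden}, opts);
}
```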
Previously total_send_tokens was the sum over dst_rank of
num_tokens_per_rank, which over-counts intra-node fan-out. NCCL-EP's
ep_bench collapses multiple destinations on the same node into one
count; on a single-node run that means total_send_tokens equals the
number of tokens with at least one valid expert. Switching to
is_token_in_rank.any(dim=1).sum() makes the send-side BW comparable to
NCCL-EP's `send: total_bw / nvl_bw` line.
Set TORCH_NCCL_ENABLE_MONITORING=0 before importing torch.distributed.
The barrier+destroy_process_group finally block (afbdcd6a) suffices
under torchrun, but under mpirun rank 0 (the TCPStore server) can exit
before non-zero ranks finish teardown, and the background heartbeat
thread polls the store and logs 'recvValue failed / Connection was
likely closed'. Disabling the monitor outright is safe for short-lived
bench runs.
Aligns with NCCL-EP's ep_bench convention (BW computed from average time
across ranks). Previously we reported only the max time and computed BW
per-rank, which made our numbers more pessimistic than NCCL-EP's.
Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.
Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
Each mscclpp::ProxyService spawns one host-side proxy thread that
drains its FIFO and posts IB work requests. With LL combine pushing
~1k put + 60 atomicAdd FIFO entries per iter, that single thread is
the wall-clock bottleneck on cross-node runs.
Split the channel set across kNumProxyServices=4 separate services
so the host-side dispatch parallelism scales linearly. SemaphoreIds
and MemoryIds are scoped to a ProxyService, so:
- addMemory() is broadcast to every service in the same global order
so a single MemoryId still identifies the memory everywhere.
- Each (peer_rank, channel_idx) is assigned to one proxy_idx via
round-robin; the resulting PortChannel is built on that proxy and
inherits its FIFO. The kernel is unchanged: the flat handle array
routes the right way automatically.
No kernel-level changes, no tuning of QP count, no new env knobs.
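A sketch of the round-robin assignment described above (kNumProxyServices per the commit; the exact key and container names are assumptions):

```cpp
constexpr int kNumProxyServices = 4;

// Map each (peer_rank, channel_idx) pair onto one proxy service.
int proxy_for(int peer_rank, int channel_idx, int num_channels) {
  return (peer_rank * num_channels + channel_idx) % kNumProxyServices;
}
// The PortChannel for the pair is built on services[proxy_for(...)] and
// inherits that service's FIFO; MemoryIds stay consistent because
// addMemory() is broadcast to every service in the same global order.
```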
Same change as the intra-node bench (commit 4ed6f229), applied to the
cross-node test:
- Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can
match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8).
- Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden,
matching NCCL-EP's `RDMA_send` per-rank byte count.
- Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can
match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8). The
functional check above continues to use the smaller (num_ranks*4
experts, topk=4) configuration.
- Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden,
matching NCCL-EP's `RDMA_send` per-rank byte count. The previous
formula counted DeepEP's expanded recv layout (one row per
(token,src_rank) pair), inflating reported GB/s ~5x and making
cross-stack comparisons misleading.
Cross-node LL regressed when (1, 32) was applied uniformly: dispatch
1031us -> 1570us, combine 2553us -> 3484us. Larger grid means more
concurrent putWithSignal calls onto the host-proxy FIFO and a costlier
cg::this_grid().sync() between phases, both of which dominate the IB
path even though more SMs help the recv-side compute.
Make (kNumWarpGroups, kNumWarpsPerGroup) path-dependent: (1, 32) when
use_ipc_path, (3, 10) otherwise. Restores cross-node performance and
keeps the intra-node win.
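Illustrative selection (launch_ll is a stand-in for the real DISPATCH/COMBINE launch-case macros, which switch between template instantiations):

```cpp
template <int kNumWarpGroups, int kNumWarpsPerGroup>
void launch_ll(/* kernel args */) { /* ... */ }

void launch(bool use_ipc_path) {
  if (use_ipc_path)
    launch_ll<1, 32>();  // intra-node IPC path keeps the wide grid
  else
    launch_ll<3, 10>();  // PortChannel path restores the old topology
}
```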
NCCL-EP's LL dispatch/combine kernel uses (numWarpGroups=1,
numWarpsPerGroup=32) when num_experts <= device_num_sms, giving each
SM ownership of a single expert and 32 warps to cooperate on its
recv-side per-(expert, src_rank) work. We were using (3, 10) — 3
experts per SM, 10 warps per (expert, rank) pair — which left a
significant amount of recv-side parallelism on the table because each
warp had to walk ~3x more tokens sequentially.
Switching to (1, 32) for both dispatch and combine matches NCCL-EP's
structure for typical EP sizes (num_experts in {32, 64, 256}) where
num_experts <= 132 SMs.
The static_assert kNumMaxTopK + 1 <= kNumWarpGroups * kNumWarpsPerGroup
still holds (9 <= 32) and the wider block also lets the staging loop
process the hidden-dim with one int4 per thread (hidden_bf16_int4=896
fits easily in 992 working threads).
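The arithmetic behind the one-int4-per-thread staging claim:

```cuda
#include <cuda_bf16.h>

constexpr int kHidden = 7168;
constexpr int kHiddenBf16Int4 =
    kHidden * sizeof(__nv_bfloat16) / sizeof(int4);  // 14336 B / 16 B
static_assert(kHiddenBf16Int4 == 896, "one 16 B chunk per working thread");
```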
The LL combine benchmark was cloning the ~58 MB dispatch recv buffer
('recv_x.clone()') on every timed iteration, adding ~20 us of D2D
memcpy per sample and masking kernel-level changes. It also called
torch.empty() for the output inside the loop. Both now live outside
the timed region; the kernel is invoked against a persistent bench_out
and the recv_x produced by the most recent dispatch.
On the PortChannel (cross-node) path the extra blocks don't help: the
dispatch recv loop strides tokens per-warp-group (not per-SM), and the
additional blocks instead add cooperative-grid sync overhead and
increase concurrent host-proxy FIFO traffic. Measured cross-node
dispatch regressed from 1013us to 3063us when the unconditional grid
bump was active.
Keep the scaled grid for the IPC path (intra-node), where combine-recv
and dispatch token striding scale with sm_id and the 1.2-1.3x speedup
reproduces.
LL dispatch/combine are latency-bound at typical problem sizes: for
num_experts=32 the previous grid was cell_div(32,3)=11 blocks, i.e. 8%
of a 132-SM H100. The recv-side bodies already stride tokens by sm_id,
so extra blocks parallelize token work linearly. Extra blocks past
num_experts are gated out of the send/count phases by the existing
'responsible_expert_idx < num_experts' check.
Cap at the device's SM count (cooperative launch + launch_bounds(960,1)
allow one block per SM).
- Report both per-rank and aggregate BW to align with NCCL-EP's ep_bench
(which reports per-rank GB/s).
- Accept MSCCLPP_EP_LL_TOKENS/HIDDEN/TOPK/EXPERTS_PER_RANK env overrides
so we can match external benchmark problem sizes (NCCL-EP LL defaults
are num_tokens=128, hidden=7168, top_k=8).
Each local expert sends one copy per dispatched token back to its owner,
so the bytes actually on the wire during combine match dispatch. The
previous num_tokens×hidden under-counted by ~num_topk×, making combine
BW look artificially low next to dispatch.
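Accounting sketch (hypothetical names): one hidden-sized bf16 row goes on the wire per dispatched (token, expert) copy, mirroring dispatch's byte count.

```cpp
#include <cstddef>

size_t combine_wire_bytes(size_t dispatched_copies, size_t hidden) {
  return dispatched_copies * hidden * 2;  // 2 B per bf16 value
}
// The old size_t(num_tokens) * hidden * 2 is low by roughly num_topk x.
```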
When all ranks live on the same host (num_rdma_ranks == 1), the LL
kernels now bypass PortChannel/IB-loopback entirely. In Buffer::sync()
we additionally:
- allGather IPC handles for each rank's rdma_buffer_ptr and
cudaIpcOpenMemHandle them into peer_rdma_bases[]
- build per-peer MemoryChannels over CUDA IPC connections (tag=2)
used only for the LL barrier ring
The three LL kernels (clean / dispatch / combine) gain a kIpcPath
template parameter and two extra args (peer_rdma_bases,
memory_channel_handles). At each peer op:
- put -> peer-mapped warp copy over NVLink
- atomicAdd-like flag store -> single-writer st_na_release on peer ptr
- signal/wait barrier -> MemoryChannel signal/wait
Cross-node LL (num_rdma_ranks > 1) is untouched; the IPC setup block is
a no-op. The host launch wrappers select the variant via use_ipc_path.
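The flag store in miniature: a plain release store suffices because each slot has a single writer. The real st_na_release helper may add non-allocating cache hints, so the exact PTX here is an assumption.

```cuda
// Release-store the flag through the peer-mapped pointer: the payload
// written before it becomes visible once the peer's acquire-load poll
// observes the flag.
__device__ __forceinline__ void store_flag_release(int* peer_ptr, int value) {
  asm volatile("st.release.sys.global.b32 [%0], %1;"
               :: "l"(peer_ptr), "r"(value)
               : "memory");
}
```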
The prior commit skipped r==rank in the semaphore and port-channel
build loops on the theory that the self-slot handshake skew was the
cause of LL direction asymmetry. That was wrong (the real bug was
int32 atomic alignment), and skipping self breaks other code paths
that assume every rank slot is represented: cross-node HT and LL
failed with cudaErrorInvalidResourceHandle at the first barrier after
Buffer init. Restore the self-inclusive loop.
Dropping the self ipc_cfg connection caused cudaErrorInvalidResourceHandle
on multi-node launches. Keep the self connection (needed by other code
paths that assume every rank is in the connections map) but continue to
skip the self slot in the semaphore + port-channel construction loops so
the kernel's [local_expert*num_ranks + dst_rank] indexing hits only peer
handles; the self slot is a zero-initialized placeholder since the
kernel's same-rank branch uses a direct warp copy.
The low-latency dispatch/combine kernels signal recv counts via MSCCL++
PortChannel.atomicAdd, which lowers to IB IBV_WR_ATOMIC_FETCH_AND_ADD.
That opcode requires the remote address to be 8-byte aligned, but
LowLatencyLayout packed the per-expert signaling slots as int32. Odd
slots landed at offset %8 == 4; the NIC silently dropped those atomics
and the target rank spun forever in recv_hook (observed: even->odd
direction works, odd->even does not, across all tested topologies
including 2-rank intra-node, 8-rank intra-node, and 2-node 1-GPU-each).
Widen dispatch_rdma_recv_count_buffer / combine_rdma_recv_flag_buffer to
int64_t, update clean kernel + kernel signatures + next_clean pointers
accordingly, and add int64_t overloads for st_na_release /
ld_acquire_sys_global in utils.cuh.
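The alignment arithmetic behind the fix:

```cpp
#include <cstdint>

// IB FETCH_AND_ADD requires remote_addr % 8 == 0. With int32 slots,
// slot i sits at base + 4*i, so every odd slot has addr % 8 == 4 and the
// NIC silently drops the atomic. int64_t slots put every slot at
// base + 8*i, which is 8-aligned for any i (given an 8-aligned base).
static_assert(sizeof(int64_t) == 8 && sizeof(int32_t) == 4, "slot widths");
```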
Also drop the bogus self CUDA-IPC connection in Buffer::sync() that was
previously skewing the cross-rank buildAndAddSemaphore handshake order;
the kernel's same-rank branch uses a direct warp copy and never touches
the self port-channel slot (filled with a zero-initialized placeholder
so the [local_expert*num_ranks + dst_rank] indexing still holds).
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).
Applies to the intranode HT, internode HT, and LL tests (the last
currently unreachable on intra-node 8-GPU). Internode HT keeps the
sync+barrier guard between dispatch and combine but excludes it from
either phase's timing.
Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports
per-iter latency (max across ranks, CUDA-event timed) and aggregate
effective bandwidth (sum across ranks, dispatch+combine payload bytes).
Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN.
Bench reuses the Buffer allocated for the correctness phase and
self-skips if the requested hidden exceeds the per-peer NVL/RDMA budget.
- Buffer::sync no longer drops non-same-GPU-id peers in low_latency_mode.
DeepEP's original filter was safe because its LL path used NVSHMEM; this
port drives LL via PortChannel so the kernel indexes
port_channel_handles[local_expert*num_ranks + dst_rank] for every
dst_rank. All peers now get a real memory/connection/semaphore/port
channel entry.
- Add test/python/ext/ep/test_low_latency_multirank.py (LL dispatch+combine
functional round-trip, BF16 only). Works cross-node in DeepEP's
1-GPU-per-node topology.
- Known limitation documented in src/ext/ep/README.md and the test docstring:
intra-node 8-GPU LL currently hangs because every peer transfer routes
through the CPU proxy over IB loopback between distinct HCAs on the same
host, and (separately) CudaIpcConnection::atomicAdd is a 64-bit op which
mis-aligns the 32-bit rdma_recv_count slots when used for same-node
peers. Proper fix needs a mixed-transport LL variant (MemoryChannel for
same-node, PortChannel for cross-node) or 64-bit counters.
Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).
- src/ext/ep/README.md: consolidate the previously duplicated README
into a single document; mark intranode and internode HT dispatch and
combine as validated in the status table; add a 'Running the tests'
section with torchrun examples for both the intranode and the 2x8
internode setups; record the dispatch->combine
torch.cuda.synchronize() + dist.barrier() requirement under Known
limitations; mark Phase 2 DONE and keep Phase 3 (LL) as structural
port, untested.
- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
Buffer constructor docstring to say internode HT is validated and
clarify that LL mode is untested on multi-node hardware.
- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported'
and 'low-latency paths still stubbed' comments. mscclpp_ep does not
use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
paths are a structural port that is present but untested, not stubbed.
Note validation on 2x H100x8 in the internode section header.
Two issues prevented internode HT combine from completing on 2x8 H100:
1. Wrong prefix matrices passed to internode_combine. Combine runs in the
reverse direction of dispatch, so it must consume the receiver-side
matrices returned by dispatch (recv_rdma_channel_prefix_matrix,
recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the
sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix.
This matches DeepEP's deep_ep/buffer.py::internode_combine handle
unpacking. Without the fix the NVL forwarder's 'NVL check' timed out
because token_start_idx/token_end_idx were computed against the wrong
per-channel layout.
2. Cross-rank race between dispatch and combine. Even with the correct
matrices, launching combine immediately after dispatch deadlocked the
forwarder NVL check (tail stuck one short of expected_head) because
peers still had in-flight dispatch proxy traffic while fast ranks had
already started combine. A torch.cuda.synchronize() + dist.barrier()
between the two calls makes the test pass deterministically on 16
ranks (combine diff == 0, max|expected| up to 60.0).
The barrier in the test is a workaround; the real fix belongs in
Buffer::internode_dispatch / Buffer::internode_combine so the
dispatch->combine handoff fully fences outstanding proxy work across
ranks. Marked with an XXX comment in the test.