Aligns with NCCL-EP's ep_bench convention (BW computed from the average
time across ranks). Previously we reported only the max time and computed
BW per rank, which made our numbers more pessimistic than NCCL-EP's.
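The accounting change can be sketched in miniature (an illustrative helper, not the benchmark's actual code; the function name is invented):

```python
def aggregate_bw_gbps(bytes_per_rank, iter_time_s_per_rank):
    """Aggregate effective bandwidth across ranks, NCCL-EP ep_bench style.

    Old scheme: each rank divided its own byte count by the max (slowest)
    time, so the straggler dragged every reported number down.
    New scheme: divide total bytes moved by the *average* per-rank time.
    """
    avg_time = sum(iter_time_s_per_rank) / len(iter_time_s_per_rank)
    return sum(bytes_per_rank) / avg_time / 1e9

# Example: 4 ranks each moving 1 GB per iter, with one straggler at 1.3 s.
bw = aggregate_bw_gbps([1e9] * 4, [1.0, 1.0, 1.0, 1.3])
```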
Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.
Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
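The teardown pattern looks roughly like this (a sketch of the shape, not the test's actual code; the torch imports are deferred so the snippet stays importable without torch):

```python
def run_test(rank: int, world_size: int):
    import torch
    import torch.distributed as dist

    # Passing device_id pins the NCCL communicator to a concrete device and
    # silences the 'Guessing device ID based on global rank' warning.
    dist.init_process_group(
        "nccl",
        rank=rank,
        world_size=world_size,
        device_id=torch.device(f"cuda:{rank % torch.cuda.device_count()}"),
    )
    try:
        pass  # ... correctness + optional bench phases ...
    finally:
        # Barrier first so every rank is done with the TCPStore before
        # rank 0 (which hosts the store server) exits; otherwise non-zero
        # ranks see 'recvValue failed / Connection was likely closed'
        # stack traces from ProcessGroupNCCL's HeartbeatMonitor.
        dist.barrier()
        dist.destroy_process_group()
```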
Same change as the intra-node bench (commit 4ed6f229), applied to the
cross-node test:
- Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can
match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8).
- Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden,
matching NCCL-EP's `RDMA_send` per-rank byte count.
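The knobs and the byte accounting can be sketched as follows (the env-var names come from this change; the parsing and helper are illustrative):

```python
import os

# Defaults chosen to match NCCL-EP's `ep_bench -a ht` (256 experts, top-8).
NUM_EXPERTS = int(os.environ.get("MSCCLPP_EP_BENCH_EXPERTS", "256"))
TOPK = int(os.environ.get("MSCCLPP_EP_BENCH_TOPK", "8"))

def bench_bytes(bench_tokens: int, hidden: int, elem_size: int = 2) -> int:
    # BW accounting now uses the tokens each rank sends in the bench phase
    # (bench_tokens * hidden), mirroring NCCL-EP's `RDMA_send` per-rank
    # byte count, rather than the tokens it happened to receive.
    return bench_tokens * hidden * elem_size
```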
Previously the optional benchmark measured full round-trip latency. Split
it into two timed phases: dispatch alone (N iters) and combine alone
(N iters, reusing one dispatch output), reporting per-phase latency (max
across ranks) and aggregate effective bandwidth (sum across ranks).
Applies to the intranode HT, internode HT, and LL tests (the LL test is
currently unreachable on an intra-node 8-GPU setup). Internode HT keeps
the sync+barrier guard between dispatch and combine but excludes it from
either phase's timing.
Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports
per-iter latency (max across ranks, CUDA-event timed) and aggregate
effective bandwidth (sum across ranks, dispatch+combine payload bytes).
Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN.
Bench reuses the Buffer allocated for the correctness phase and
self-skips if the requested hidden exceeds the per-peer NVL/RDMA budget.
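The per-phase timing loop can be sketched like this, using `time.perf_counter` as a stand-in for the CUDA events the real bench uses (the defaults shown are illustrative, not the bench's actual defaults):

```python
import os
import time

WARMUP = int(os.environ.get("MSCCLPP_EP_BENCH_WARMUP", "5"))
ITERS = int(os.environ.get("MSCCLPP_EP_BENCH_ITERS", "20"))

def time_phase(phase_fn, warmup: int = WARMUP, iters: int = ITERS) -> float:
    """Return the average per-iteration latency of one phase, in seconds."""
    for _ in range(warmup):
        phase_fn()
    start = time.perf_counter()
    for _ in range(iters):
        phase_fn()
    return (time.perf_counter() - start) / iters

# In the real bench, dispatch and combine are timed as separate phases;
# per-iter latency is max-reduced across ranks, while effective BW sums
# the dispatch+combine payload bytes across ranks.
```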
Two issues prevented internode HT combine from completing on 2x8 H100:
1. Wrong prefix matrices passed to internode_combine. Combine runs in the
reverse direction of dispatch, so it must consume the receiver-side
matrices returned by dispatch (recv_rdma_channel_prefix_matrix,
recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the
sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix.
This matches DeepEP's deep_ep/buffer.py::internode_combine handle
unpacking. Without the fix the NVL forwarder's 'NVL check' timed out
because token_start_idx/token_end_idx were computed against the wrong
per-channel layout.
2. Cross-rank race between dispatch and combine. Even with the correct
matrices, launching combine immediately after dispatch deadlocked the
forwarder NVL check (tail stuck one short of expected_head) because
peers still had in-flight dispatch proxy traffic while fast ranks had
already started combine. A torch.cuda.synchronize() + dist.barrier()
between the two calls makes the test pass deterministically on 16
ranks (combine diff == 0, max|expected| up to 60.0).
The barrier in the test is a workaround; the real fix belongs in
Buffer::internode_dispatch / Buffer::internode_combine so the
dispatch->combine handoff fully fences outstanding proxy work across
ranks. Marked with an XXX comment in the test.
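Putting the two fixes together, the test-side call sequence looks roughly like this (a hedged sketch: the handle field names come from the commit, but the exact Python signatures of `internode_dispatch`/`internode_combine` are assumptions):

```python
def dispatch_then_combine(buffer, x, dispatch_args):
    import torch
    import torch.distributed as dist

    recv_x, handle = buffer.internode_dispatch(x, **dispatch_args)

    # XXX workaround: drain in-flight dispatch proxy traffic on every rank
    # before any rank starts combine; otherwise the NVL forwarder's
    # 'NVL check' can deadlock with tail one short of expected_head.
    torch.cuda.synchronize()
    dist.barrier()

    # Combine runs in the reverse direction of dispatch, so it must consume
    # the receiver-side prefix matrices returned by dispatch, not the
    # sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix.
    return buffer.internode_combine(
        recv_x,
        handle.recv_rdma_channel_prefix_matrix,
        handle.recv_rdma_rank_prefix_sum,
        handle.recv_gbl_channel_prefix_matrix,
    )
```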
The `internode` kernels index device-side port channel handles as
`port_channel_handles[channel_id * num_ranks + peer_rank]`, where
`peer_rank` is a global rank in [0, num_ranks). `Buffer::sync` was
building that table by iterating `std::unordered_map<int, MemoryId>`
(and similarly for connections/semaphores), which yields hash order
rather than ascending rank order. Once the cross-node fan-out grew
beyond a single peer, a local rank's trigger for peer `r` landed on
the semaphore/memory pair of a different peer, so RDMA puts and
atomic tail updates went to the wrong destination and the forwarder
spun on a tail counter that never advanced.
Changes:
- Build `sema_ids` and `port_channel_handles` by iterating
`for (int r = 0; r < num_ranks; ++r)` and looking up the
connection / memory id for rank `r`, skipping ranks excluded by
low-latency mode (inserting a placeholder handle so the stride
stays `num_ranks`).
- Tag the RDMA-phase `sendMemory`/`recvMemory`/`connect` calls with
`kRdmaTag = 1` so they do not collide with NVL-phase tag-0
traffic between the same pair of ranks.
- Drop an unused `r` local in the NVL setup loop.
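The ordering bug and its fix can be illustrated in miniature (a Python stand-in for the C++ `Buffer::sync` logic; the real code iterates a `std::unordered_map`, whose hash order plays the role of the arbitrary-order map here):

```python
PLACEHOLDER = object()  # stride filler for ranks excluded by LL mode

def build_handle_table(handles, num_ranks, num_channels):
    """Flat table indexed as table[channel_id * num_ranks + peer_rank].

    Iterating the map in its own (hash) order would scatter handles, so
    the trigger for peer r could land on another peer's semaphore/memory
    pair. Iterating ranks 0..num_ranks-1 explicitly keeps slot r bound to
    rank r, and inserting a placeholder for missing ranks keeps the
    stride at num_ranks.
    """
    table = []
    for channel_id in range(num_channels):
        for r in range(num_ranks):
            table.append(handles.get((channel_id, r), PLACEHOLDER))
    return table
```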
With this fix and a matched `libmscclpp.so` on both nodes, the
2-node x 8-GPU internode HT dispatch path completes successfully
(`[dispatch] OK`). Combine is still under investigation.
Also adds `test/python/ext/ep/test_internode_multirank.py`, a
torchrun-based 2-node functional test that exercises
`get_dispatch_layout` -> `internode_dispatch` -> `internode_combine`
and validates per-source-rank token values end-to-end.