Commit Graph

10 Commits

Qinghua Zhou
e752dbaf97 tests/ep: add NCCL-EP six-metric BW breakdown (send/recv x total/nvl/rdma)
For HT intra/internode benches, compute per-rank avg total_send/rdma_send
and total_recv/rdma_recv token counts (matching NCCL-EP ep_bench
accounting) and print send-side and recv-side BW split into total / nvl
/ rdma columns. Combine reverses send<->recv. Byte-count line mirrors
NCCL-EP's '(per rank avg)' summary so numbers are directly comparable.
2026-04-29 20:44:10 +00:00
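
A minimal sketch of that accounting (function and parameter names are illustrative, not the test's actual code): average the four per-rank token counts across ranks, then derive the six BW columns, with nvl = total - rdma on each side. For combine, the same helper applies with the send and recv counts swapped.

```python
import torch
import torch.distributed as dist

def print_bw_breakdown(total_send, rdma_send, total_recv, rdma_recv,
                       hidden, elem_size, time_s, num_ranks):
    # '(per rank avg)' accounting: sum each token count across ranks,
    # then divide by the number of ranks.
    counts = torch.tensor([total_send, rdma_send, total_recv, rdma_recv],
                          dtype=torch.float64, device="cuda")
    dist.all_reduce(counts)
    counts /= num_ranks
    avg_ts, avg_rs, avg_tr, avg_rr = counts.tolist()

    def bw(tokens):  # GB/s from an average token count and the phase time
        return tokens * hidden * elem_size / time_s / 1e9

    # NVL traffic is the part that did not go over RDMA.
    print(f"send BW: total={bw(avg_ts):.1f} nvl={bw(avg_ts - avg_rs):.1f} "
          f"rdma={bw(avg_rs):.1f} GB/s")
    print(f"recv BW: total={bw(avg_tr):.1f} nvl={bw(avg_tr - avg_rr):.1f} "
          f"rdma={bw(avg_rr):.1f} GB/s")
```
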
Qinghua Zhou
9213587ffe ep tests: report dispatch/combine min, avg, max time and use avg for BW
Aligns with NCCL-EP's ep_bench convention (BW computed from average time
across ranks). Previously we reported only the max time and computed BW
per-rank, which made our numbers more pessimistic than NCCL-EP's.
2026-04-29 16:50:33 +00:00
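
A sketch of that convention, assuming each rank has already measured its own per-iteration time in seconds (names are illustrative):

```python
import torch
import torch.distributed as dist

def reduce_times(my_time_s, bytes_per_rank, num_ranks):
    t = torch.tensor([my_time_s], dtype=torch.float64, device="cuda")
    t_min, t_max, t_sum = t.clone(), t.clone(), t.clone()
    dist.all_reduce(t_min, op=dist.ReduceOp.MIN)
    dist.all_reduce(t_max, op=dist.ReduceOp.MAX)
    dist.all_reduce(t_sum)  # default SUM
    t_avg = t_sum.item() / num_ranks
    # ep_bench convention: BW from the average time across ranks, not from
    # each rank's own (typically slowest-rank-dominated) time.
    agg_bw_gbps = num_ranks * bytes_per_rank / t_avg / 1e9
    return t_min.item(), t_avg, t_max.item(), agg_bw_gbps
```
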
Qinghua Zhou
afbdcd6a3d ep tests: clean shutdown to silence TCPStore/HeartbeatMonitor noise
Add dist.barrier() + dist.destroy_process_group() in a finally block so
non-zero ranks don't poll the TCPStore after rank 0 (the store server)
exits, which produced noisy 'recvValue failed / Connection was likely
closed' stack traces from ProcessGroupNCCL's HeartbeatMonitor.

Also pass device_id to init_process_group in the internode test to
silence 'Guessing device ID based on global rank' warnings.
2026-04-29 05:16:22 +00:00
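
The shape of the fix, as a hedged sketch (the wrapper name and the test body are stand-ins):

```python
import os
import torch
import torch.distributed as dist

def run_with_clean_shutdown(test_fn):
    local_rank = int(os.environ["LOCAL_RANK"])
    # device_id silences 'Guessing device ID based on global rank'.
    dist.init_process_group(backend="nccl",
                            device_id=torch.device("cuda", local_rank))
    try:
        test_fn()
    finally:
        # Rank 0 hosts the TCPStore; rendezvous before teardown so no rank
        # keeps polling the store after the server has exited.
        dist.barrier()
        dist.destroy_process_group()
```
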
Qinghua Zhou
48540bc11e tests/ep: align internode HT bench with NCCL-EP accounting
Same change as the intra-node bench (commit 4ed6f229), applied to the
cross-node test:

- Add MSCCLPP_EP_BENCH_EXPERTS / _TOPK env knobs so the bench phase can
  match NCCL-EP's `ep_bench -a ht` defaults (256 experts, top-8).
- Switch BW accounting from recv_tokens*hidden to bench_tokens*hidden,
  matching NCCL-EP's `RDMA_send` per-rank byte count.
2026-04-27 17:39:17 +00:00
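
A sketch of the knobs and the byte accounting (the helper and its parameters are illustrative; only the env var names and the 256-expert / top-8 defaults come from the commit):

```python
import os

def bench_config(bench_tokens, hidden, elem_size):
    # Defaults match NCCL-EP's `ep_bench -a ht`: 256 experts, top-8.
    num_experts = int(os.environ.get("MSCCLPP_EP_BENCH_EXPERTS", "256"))
    topk = int(os.environ.get("MSCCLPP_EP_BENCH_TOPK", "8"))
    # Count bytes from what this rank sends in the bench phase
    # (bench_tokens * hidden), matching NCCL-EP's `RDMA_send` per-rank byte
    # count, rather than from recv_tokens, which varies with routing.
    bytes_per_rank = bench_tokens * hidden * elem_size
    return num_experts, topk, bytes_per_rank
```
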
Qinghua Zhou
9840853c69 tests/ep: HT benches also print per_rank_bw
Same alignment with NCCL-EP ep_bench as the LL test: report both
per-rank (agg/num_ranks) and aggregate throughput.
2026-04-23 22:58:23 +00:00
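
In formula form (a trivial helper with illustrative names):

```python
def report_throughput(agg_bw_gbps, num_ranks):
    # per_rank_bw is simply the aggregate divided by the rank count.
    return agg_bw_gbps, agg_bw_gbps / num_ranks
```
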
Qinghua Zhou
906fa3c48f tests/ep: size HT buffers for bench hidden so bench phase fits
2026-04-23 17:13:09 +00:00
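
One plausible reading of the sizing change, as a sketch (the actual buffer-size computation in the test is assumed, not quoted):

```python
import os

def ht_buffer_hidden(correctness_hidden):
    # Size the HT buffers for the larger of the correctness hidden and the
    # requested bench hidden, so the bench phase fits in the same Buffer.
    bench_hidden = int(os.environ.get("MSCCLPP_EP_BENCH_HIDDEN", "0"))
    return max(correctness_hidden, bench_hidden)
```
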
Qinghua Zhou
c51a8a5305 ext/ep tests: time dispatch and combine separately in MSCCLPP_EP_BENCH
Previously the optional benchmark measured full round-trip latency. Split
it to time dispatch alone (N iters) and combine alone (N iters reusing
one dispatch output), reporting per-phase latency (max across ranks) and
aggregate effective bandwidth (sum across ranks).

Applies to intra-node HT, internode HT, and the LL test (currently
unreachable on an intra-node 8-GPU run). Internode HT keeps the
sync+barrier guard between dispatch and combine but excludes it from
either phase's timing.
2026-04-22 23:11:04 +00:00
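
A sketch of the split timing with CUDA events (the dispatch/combine callables and iteration counts are stand-ins):

```python
import torch

def time_phase(fn, iters):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters / 1e3  # avg seconds per iter

# Dispatch alone, then combine alone reusing one dispatch output:
#   dispatch_s = time_phase(lambda: dispatch(x), iters)
#   out = dispatch(x)          # one output, reused across combine iters
#   combine_s  = time_phase(lambda: combine(out), iters)
```
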
Qinghua Zhou
2391ce1de7 ext/ep tests: add optional HT benchmark pass
Gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs fast. Reports
per-iter latency (max across ranks, CUDA-event timed) and aggregate
effective bandwidth (sum across ranks, dispatch+combine payload bytes).
Tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS / _HIDDEN.

Bench reuses the Buffer allocated for the correctness phase and
self-skips if the requested hidden exceeds the per-peer NVL/RDMA budget.
2026-04-22 19:03:09 +00:00
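
Sketch of the gate, tunables, and self-skip (env var names are from the commit; the default values and the budget check here are placeholders):

```python
import os

def maybe_bench_config(per_peer_budget_bytes, elem_size):
    if os.environ.get("MSCCLPP_EP_BENCH", "0") != "1":
        return None  # bench is opt-in; keep correctness runs fast
    # Placeholder defaults; the test's actual defaults are not quoted here.
    warmup = int(os.environ.get("MSCCLPP_EP_BENCH_WARMUP", "10"))
    iters = int(os.environ.get("MSCCLPP_EP_BENCH_ITERS", "50"))
    tokens = int(os.environ.get("MSCCLPP_EP_BENCH_TOKENS", "4096"))
    hidden = int(os.environ.get("MSCCLPP_EP_BENCH_HIDDEN", "7168"))
    # Self-skip if the requested hidden exceeds the per-peer NVL/RDMA
    # budget of the Buffer allocated for the correctness phase.
    if hidden * elem_size > per_peer_budget_bytes:
        return None
    return warmup, iters, tokens, hidden
```
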
Qinghua Zhou
c351b871a1 ep: fix internode combine in multirank test
Two issues prevented internode HT combine from completing on 2x8 H100:

1. Wrong prefix matrices passed to internode_combine. Combine runs in the
   reverse direction of dispatch, so it must consume the receiver-side
   matrices returned by dispatch (recv_rdma_channel_prefix_matrix,
   recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the
   sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix.
   This matches DeepEP's deep_ep/buffer.py::internode_combine handle
   unpacking. Without the fix the NVL forwarder's 'NVL check' timed out
   because token_start_idx/token_end_idx were computed against the wrong
   per-channel layout.

2. Cross-rank race between dispatch and combine. Even with the correct
   matrices, launching combine immediately after dispatch deadlocked the
   forwarder NVL check (tail stuck one short of expected_head) because
   peers still had in-flight dispatch proxy traffic while fast ranks had
   already started combine. A torch.cuda.synchronize() + dist.barrier()
   between the two calls makes the test pass deterministically on 16
   ranks (combine diff == 0, max|expected| up to 60.0).

The barrier in the test is a workaround; the real fix belongs in
Buffer::internode_dispatch / Buffer::internode_combine so the
dispatch->combine handoff fully fences outstanding proxy work across
ranks. Marked with an XXX comment in the test.
2026-04-22 02:21:29 +00:00
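
A schematic of both fixes together (the handle layout and the internode_combine signature are assumptions, following the DeepEP-style unpacking the commit cites):

```python
import torch
import torch.distributed as dist

def combine_after_dispatch(buffer, x, dispatch_handle):
    # Fix 1: combine consumes the *receiver-side* matrices returned by
    # internode_dispatch, not the sender-side ones.
    recv_rdma_channel_prefix_matrix = dispatch_handle["recv_rdma_channel_prefix_matrix"]
    recv_rdma_rank_prefix_sum = dispatch_handle["recv_rdma_rank_prefix_sum"]
    recv_gbl_channel_prefix_matrix = dispatch_handle["recv_gbl_channel_prefix_matrix"]

    # Fix 2 (XXX workaround): fence outstanding dispatch proxy traffic on
    # every rank before any rank starts combine; the real fix belongs in
    # Buffer::internode_dispatch / Buffer::internode_combine.
    torch.cuda.synchronize()
    dist.barrier()

    # Combine runs in the reverse direction of dispatch, so it is driven by
    # the receiver-side layout (signature is schematic).
    return buffer.internode_combine(
        x,
        recv_rdma_channel_prefix_matrix,
        recv_rdma_rank_prefix_sum,
        recv_gbl_channel_prefix_matrix,
    )
```
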
Qinghua Zhou
393d6e2673 ep: fix port-channel rank ordering for internode HT dispatch
The `internode` kernels index device-side port channel handles as
`port_channel_handles[channel_id * num_ranks + peer_rank]`, where
`peer_rank` is a global rank in [0, num_ranks). `Buffer::sync` was
building that table by iterating `std::unordered_map<int, MemoryId>`
(and similarly for connections/semaphores), which yields hash order
rather than ascending rank order. Once the cross-node fan-out grew
beyond a single peer, a local rank's trigger for peer `r` landed on
the semaphore/memory pair of a different peer, so RDMA puts and
atomic tail updates went to the wrong destination and the forwarder
spun on a tail counter that never advanced.

Changes:
  - Build `sema_ids` and `port_channel_handles` by iterating
    `for (int r = 0; r < num_ranks; ++r)` and looking up the
    connection / memory id for rank `r`, skipping ranks excluded by
    low-latency mode (inserting a placeholder handle so the stride
    stays `num_ranks`).
  - Tag the RDMA-phase `sendMemory`/`recvMemory`/`connect` calls with
    `kRdmaTag = 1` so they do not collide with NVL-phase tag-0
    traffic between the same pair of ranks.
  - Drop an unused `r` local in the NVL setup loop.

With this fix and a matched `libmscclpp.so` on both nodes, the
2-node x 8-GPU internode HT dispatch path completes successfully
(`[dispatch] OK`). Combine is still under investigation.

Also adds `test/python/ext/ep/test_internode_multirank.py`, a
torchrun-based 2-node functional test that exercises
`get_dispatch_layout` -> `internode_dispatch` -> `internode_combine`
and validates per-source-rank token values end-to-end.
2026-04-21 19:12:25 +00:00
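
The ordering invariant the fix restores, sketched in Python (the real change is in the C++ `Buffer::sync`; helper names here are illustrative):

```python
PLACEHOLDER = None  # keeps the stride at num_ranks for excluded ranks

def build_handle_table(num_channels, num_ranks, handle_for_rank, rank_excluded):
    table = []
    for channel_id in range(num_channels):
        # Iterate ranks in ascending order -- not in hash order, which is
        # what iterating a std::unordered_map effectively produced.
        for r in range(num_ranks):
            if rank_excluded(r):  # e.g. skipped by low-latency mode
                table.append(PLACEHOLDER)
            else:
                table.append(handle_for_rank(channel_id, r))
    return table

def lookup(table, channel_id, num_ranks, peer_rank):
    # Matches the device-side indexing quoted in the commit message.
    return table[channel_id * num_ranks + peer_rank]
```
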