Previously the optional benchmark measured full round-trip latency only.
Split it into two timed phases: dispatch alone (N iters) and combine alone
(N iters reusing one dispatch output). Each phase reports per-iter latency
(max across ranks, CUDA-event timed) and aggregate effective bandwidth
(sum across ranks, dispatch+combine payload bytes).

This applies to the intranode HT, internode HT, and LL tests (the last is
currently unreachable in the intra-node 8-GPU setup). Internode HT keeps
the sync+barrier guard between dispatch and combine but excludes it from
either phase's timing.

The benchmark is gated behind MSCCLPP_EP_BENCH=1 to keep correctness runs
fast, and is tunable via MSCCLPP_EP_BENCH_WARMUP / _ITERS / _TOKENS /
_HIDDEN. It reuses the Buffer allocated for the correctness phase and
skips itself if the requested hidden size exceeds the per-peer NVL/RDMA
budget.
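The tunable parsing and the per-phase reporting math can be sketched as
follows; the default values and function names are illustrative
placeholders, not the ones in the actual benchmark:

```python
import os

def bench_env_int(name: str, fallback: int) -> int:
    # Read an integer tunable such as MSCCLPP_EP_BENCH_ITERS; the fallback
    # values used by the real benchmark are not shown here.
    value = os.environ.get(name)
    return int(value) if value is not None else fallback

def report_phase(per_rank_latency_us, per_rank_payload_bytes):
    # Latency: the slowest rank bounds the phase (max across ranks).
    latency_us = max(per_rank_latency_us)
    # Effective bandwidth: each rank's payload over its own latency, summed
    # across ranks (bytes per second).
    bandwidth_bps = sum(
        b / (l * 1e-6)
        for b, l in zip(per_rank_payload_bytes, per_rank_latency_us)
    )
    return latency_us, bandwidth_bps
```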
Three issues blocked end-to-end intranode validation across multiple
ranks. This commit fixes them and adds a 2/4/8-rank functional test.
1. Combine receiver: OOB __shared__ read
In the combine receiver warp, the wait loop evaluated
`channel_tail_idx[recv_lane_id] <= expected_head` before the
`expected_head >= 0` guard. `channel_tail_idx` is a shared array
of size `kNumRanks`, but the loop runs on all 32 lanes of a warp,
so lanes with `recv_lane_id >= kNumRanks` indexed out of bounds.
compute-sanitizer reported "Invalid __shared__ read of size 4
bytes" at combine<bf16,2,768>+0xdd0, surfaced asynchronously as
cudaErrorIllegalAddress at the kernel launch site. Swap the
operands so the rank-bounds check short-circuits the shared read.
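The fixed predicate order can be illustrated with a host-side Python
sketch (the real code is CUDA C++, but `and` and `&&` both short-circuit
left to right): per the commit, lanes with `recv_lane_id >= kNumRanks`
carry a negative `expected_head`, so evaluating the bounds guard first
means the out-of-range read never executes.

```python
K_NUM_RANKS = 2   # size of the __shared__ channel_tail_idx array
WARP_SIZE = 32    # the wait loop runs on every lane of the warp

channel_tail_idx = [0] * K_NUM_RANKS

def lane_must_wait(recv_lane_id: int, expected_head: int) -> bool:
    # Fixed order: the expected_head >= 0 guard short-circuits before the
    # array read, so lanes beyond K_NUM_RANKS never index out of bounds.
    # (The buggy order read channel_tail_idx[recv_lane_id] first.)
    return expected_head >= 0 and channel_tail_idx[recv_lane_id] <= expected_head
```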
2. Python bindings: UniqueId ABI
`mscclpp::UniqueId` is a `std::array<uint8_t, N>` which pybind11
auto-converts to a Python `list`, silently overriding any
`py::class_<UniqueId>` wrapper. Expose `create_unique_id` /
`connect` as lambdas that produce/consume `py::bytes` and memcpy
into a local `UniqueId`. Also coerce `bytes` -> `bytearray` at the
Python call site for `sync()`, whose pybind11 signature expects
`pybind11::bytearray`.
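The Python-side contract after the fix can be sketched as below; the
helper and the id length are hypothetical stand-ins (the real length is
the size of mscclpp's `UniqueId` array), shown only to make the
`bytes` -> `bytearray` coercion explicit:

```python
UNIQUE_ID_LEN = 128  # placeholder, not the real UniqueId size

def coerce_unique_id_for_sync(unique_id: bytes) -> bytearray:
    # create_unique_id() returns raw bytes; sync()'s pybind11 signature
    # takes a bytearray, so the call site coerces explicitly.
    if len(unique_id) != UNIQUE_ID_LEN:
        raise ValueError("unique id has unexpected length")
    return bytearray(unique_id)
```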
3. Python frontend: communicator required for NVL-only sync
`Buffer::sync()` uses `communicator->connect(ipc_config, ...)` on
the pure-NVLink path, so the communicator must be initialized
even when `num_rdma_ranks == 1` and `low_latency_mode == False`.
Always broadcast the unique id and call `runtime.connect()`
before `sync()`.
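A minimal sketch of the fixed ordering, with `runtime` and `buffer` as
hypothetical stand-ins for the real frontend objects; the point is only
the call order, which now holds unconditionally rather than just in the
RDMA/low-latency modes:

```python
def init_ep(rank: int, runtime, buffer, broadcast):
    # Rank 0 creates the unique id; everyone receives it via broadcast,
    # even when num_rdma_ranks == 1 and low_latency_mode is False.
    uid = runtime.create_unique_id() if rank == 0 else None
    uid = broadcast(uid, src=0)
    runtime.connect(uid)  # communicator must exist before sync():
    buffer.sync()         # the NVLink path calls communicator->connect(...)
```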
Validation on a single H100x8 node via torchrun:
- 2 ranks: dispatch 195 tokens, combine diff=0
- 4 ranks: dispatch 371 tokens, combine diff=0
- 8 ranks: dispatch 456 tokens, combine diff=0
Test harness added at test/python/ext/ep/test_intranode_multirank.py.