The `internode` kernels index device-side port channel handles as
`port_channel_handles[channel_id * num_ranks + peer_rank]`, where
`peer_rank` is a global rank in [0, num_ranks). `Buffer::sync` was
building that table by iterating a `std::unordered_map<int, MemoryId>`
(and similarly for connections and semaphores), which visits entries in
unspecified hash order rather than ascending rank order. Once the
cross-node fan-out grew
beyond a single peer, a local rank's trigger for peer `r` landed on
the semaphore/memory pair of a different peer, so RDMA puts and
atomic tail updates went to the wrong destination and the forwarder
spun on a tail counter that never advanced.

Changes:
- Build `sema_ids` and `port_channel_handles` by iterating
`for (int r = 0; r < num_ranks; ++r)` and looking up the
connection / memory id for rank `r`, skipping ranks excluded by
low-latency mode (inserting a placeholder handle so the stride
stays `num_ranks`).
- Tag the RDMA-phase `sendMemory`/`recvMemory`/`connect` calls with
`kRdmaTag = 1` so they do not collide with NVL-phase tag-0
traffic between the same pair of ranks.
- Drop an unused `r` local in the NVL setup loop.
With this fix and a matched `libmscclpp.so` on both nodes, the
2-node x 8-GPU internode HT dispatch path completes successfully
(`[dispatch] OK`). Combine is still under investigation.

Also adds `test/python/ext/ep/test_internode_multirank.py`, a
torchrun-based 2-node functional test that exercises
`get_dispatch_layout` -> `internode_dispatch` -> `internode_combine`
and validates per-source-rank token values end-to-end.