The `internode` kernels index device-side port channel handles as
`port_channel_handles[channel_id * num_ranks + peer_rank]`, where
`peer_rank` is a global rank in [0, num_ranks). `Buffer::sync` was
building that table by iterating a `std::unordered_map<int, MemoryId>`
(and similarly for connections and semaphores), which visits entries in
unspecified hash order rather than ascending rank order. Once the
cross-node fan-out grew
beyond a single peer, a local rank's trigger for peer `r` landed on
the semaphore/memory pair of a different peer, so RDMA puts and
atomic tail updates went to the wrong destination and the forwarder
spun on a tail counter that never advanced.

Changes:
- Build `sema_ids` and `port_channel_handles` by iterating
`for (int r = 0; r < num_ranks; ++r)` and looking up the
connection / memory id for rank `r`, skipping ranks excluded by
low-latency mode (inserting a placeholder handle so the stride
stays `num_ranks`).
- Tag the RDMA-phase `sendMemory`/`recvMemory`/`connect` calls with
`kRdmaTag = 1` so they do not collide with NVL-phase tag-0
traffic between the same pair of ranks.
- Drop an unused `r` local in the NVL setup loop.
With this fix and a matched `libmscclpp.so` on both nodes, the
2-node x 8-GPU internode HT dispatch path completes successfully
(`[dispatch] OK`). Combine is still under investigation.

Also adds `test/python/ext/ep/test_internode_multirank.py`, a
torchrun-based 2-node functional test that exercises
`get_dispatch_layout` -> `internode_dispatch` -> `internode_combine`
and validates per-source-rank token values end-to-end.