Mirror of https://github.com/microsoft/mscclpp.git (synced 2026-05-12 09:17:06 +00:00)
Two issues prevented internode HT combine from completing on 2x8 H100:

1. Wrong prefix matrices passed to internode_combine. Combine runs in the reverse direction of dispatch, so it must consume the receiver-side matrices returned by dispatch (recv_rdma_channel_prefix_matrix, recv_rdma_rank_prefix_sum, recv_gbl_channel_prefix_matrix), not the sender-side rdma_channel_prefix_matrix / gbl_channel_prefix_matrix. This matches DeepEP's deep_ep/buffer.py::internode_combine handle unpacking. Without the fix, the NVL forwarder's 'NVL check' timed out because token_start_idx/token_end_idx were computed against the wrong per-channel layout.

2. Cross-rank race between dispatch and combine. Even with the correct matrices, launching combine immediately after dispatch deadlocked the forwarder's NVL check (tail stuck one short of expected_head) because peers still had in-flight dispatch proxy traffic while fast ranks had already started combine. Adding torch.cuda.synchronize() + dist.barrier() between the two calls makes the test pass deterministically on 16 ranks (combine diff == 0, max|expected| up to 60.0).

The barrier in the test is a workaround; the real fix belongs in Buffer::internode_dispatch / Buffer::internode_combine so that the dispatch->combine handoff fully fences outstanding proxy work across ranks. Marked with an XXX comment in the test.
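The first fix can be sketched as a handle-selection helper. This is a hypothetical illustration, not the verbatim DeepEP API: the field names come from the commit message, but the DispatchHandle container and combine_prefix_args helper are invented here to show which side of the handle combine must read.

```python
from typing import NamedTuple, List

class DispatchHandle(NamedTuple):
    """Hypothetical container for the matrices dispatch returns.

    The sender-side matrices describe the layout dispatch wrote from; the
    recv_* matrices describe the layout as seen by the receiving ranks.
    """
    # Sender-side layout (what the buggy code was passing to combine)
    rdma_channel_prefix_matrix: List[int]
    gbl_channel_prefix_matrix: List[int]
    # Receiver-side layout (what combine, running in the reverse
    # direction of dispatch, must actually consume)
    recv_rdma_channel_prefix_matrix: List[int]
    recv_rdma_rank_prefix_sum: List[int]
    recv_gbl_channel_prefix_matrix: List[int]

def combine_prefix_args(handle: DispatchHandle):
    """Select the receiver-side matrices for internode_combine.

    Passing the sender-side pair instead makes the NVL forwarder compute
    token_start_idx/token_end_idx against the wrong per-channel layout.
    """
    return (handle.recv_rdma_channel_prefix_matrix,
            handle.recv_rdma_rank_prefix_sum,
            handle.recv_gbl_channel_prefix_matrix)

if __name__ == "__main__":
    h = DispatchHandle([1], [2], [3], [4], [5])
    print(combine_prefix_args(h))  # the three recv_* fields
```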
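The second issue is easier to see in a toy model. The sketch below is not DeepEP/mscclpp code: it replaces GPU ranks with threads and stands in for torch.cuda.synchronize() + dist.barrier() with a threading.Barrier, to show why combine must not start on any rank until every rank's dispatch-side work has drained.

```python
import threading
import time

NUM_RANKS = 4  # toy stand-in for the 16 ranks in the real test
phase_barrier = threading.Barrier(NUM_RANKS)
dispatch_done = [False] * NUM_RANKS
violations = []  # ranks that entered combine while a peer was still dispatching

def rank_main(rank: int, use_barrier: bool) -> None:
    # Simulate skew: higher ranks finish their dispatch later, like peers
    # with in-flight dispatch proxy traffic while fast ranks race ahead.
    time.sleep(0.01 * rank)
    dispatch_done[rank] = True
    if use_barrier:
        # Fence standing in for torch.cuda.synchronize() + dist.barrier():
        # no rank proceeds to combine until all ranks have dispatched.
        phase_barrier.wait()
    # Combine is only safe if every peer's dispatch has completed;
    # otherwise the forwarder's tail gets stuck short of expected_head.
    if not all(dispatch_done):
        violations.append(rank)

def run(use_barrier: bool) -> int:
    violations.clear()
    for i in range(NUM_RANKS):
        dispatch_done[i] = False
    threads = [threading.Thread(target=rank_main, args=(r, use_barrier))
               for r in range(NUM_RANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(violations)

if __name__ == "__main__":
    print(run(use_barrier=True))  # fenced: no rank combines early
```

As the commit notes, the barrier in the test is only a workaround; the equivalent fence belongs inside Buffer::internode_dispatch / Buffer::internode_combine so callers get the dispatch->combine handoff for free.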