test/ext/ep: HT tests — env-driven cfg + allgather bookkeeping

- Internode HT test: accept MSCCLPP_EP_HT_{TOKENS,HIDDEN,TOPK,EXPERTS} env vars to override the functional-check problem size (previously hardcoded to num_tokens=128, hidden=1024, num_topk=min(4,num_ranks), num_experts=num_ranks*4).
- Both intranode + internode HT tests: replace the dist.all_to_all_single bookkeeping (the per-(src,dst) recv-count matrix used for the six-metric NVL/RDMA BW breakdown) with dist.all_gather_into_tensor + transpose. Functionally identical (gathered[:, rank] gives the same recv-from-src column) but works on socket-NCCL with NCCL_IB_DISABLE=1, which is required on rigs where NCCL IB cannot coexist with mscclpp RDMA. Each rank now gathers num_ranks^2 int64 values instead of receiving num_ranks; the overhead is negligible (64 ints at 8 ranks).
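
A minimal sketch of the two changes above, in plain Python so it runs without a process group. The helper names (`ht_problem_size`, `all_to_all_recv`, `all_gather_recv`) are hypothetical and not taken from the test files; the collectives are simulated with lists, assuming the gathered buffer is laid out rank-major as in the diff.

```python
import os


def ht_problem_size(num_ranks, env=None):
    """Resolve the problem size: env vars override the old hardcoded defaults.

    Hypothetical helper; the real test may parse these variables differently.
    """
    env = os.environ if env is None else env
    defaults = {
        "TOKENS": ("num_tokens", 128),
        "HIDDEN": ("hidden", 1024),
        "TOPK": ("num_topk", min(4, num_ranks)),
        "EXPERTS": ("num_experts", num_ranks * 4),
    }
    return {
        name: int(env.get(f"MSCCLPP_EP_HT_{suffix}", default))
        for suffix, (name, default) in defaults.items()
    }


def all_to_all_recv(send_rows, rank):
    """Simulate all_to_all_single with one element per peer.

    send_rows[src][dst] is the token count src sends to dst, so `rank`
    receives element `rank` of every source row.
    """
    return [row[rank] for row in send_rows]


def all_gather_recv(send_rows, rank):
    """Simulate all_gather_into_tensor followed by view(n, n)[:, rank]."""
    num_ranks = len(send_rows)
    # Flat buffer of num_ranks * num_ranks values, one row per source rank.
    gathered = [count for row in send_rows for count in row]
    # Column `rank` of the (num_ranks, num_ranks) view.
    return [gathered[src * num_ranks + rank] for src in range(num_ranks)]
```

Both receive paths read the same column of the per-(src,dst) matrix, so the downstream sums (for example total_recv_tokens_local) are unchanged; only the collective that moves the data differs.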
2026-05-24 14:54:51 +00:00 · 2026-05-12 02:53:40 +00:00
parent 7b06a60786
commit 3f459a995d
2 changed files with 21 additions and 16 deletions
--- a/test/python/ext/ep/test_intranode_multirank.py
+++ b/test/python/ext/ep/test_intranode_multirank.py
@@ -354,12 +354,13 @@ def main():
    bytes_per_token = bench_hidden * x_b.element_size()
    total_send_tokens_local = int(is_token_in_rank_b.any(dim=1).sum().item())
    rdma_send_tokens_local = 0  # intranode: no remote nodes
-    recv_from_src = torch.empty(num_ranks, dtype=torch.int64, device="cuda")
-    dist.all_to_all_single(
-        recv_from_src,
-        num_tokens_per_rank_b.to(torch.int64),
-        group=group,
-    )
+    # Replaced dist.all_to_all_single (NCCL socket transport fails with
+    # NCCL_IB_DISABLE=1 internode) with all_gather_into_tensor + transpose,
+    # which works on the same socket-NCCL setup the LL test uses.
+    _send_row = num_tokens_per_rank_b.to(torch.int64).contiguous()
+    _gathered = torch.empty(num_ranks * num_ranks, dtype=torch.int64, device="cuda")
+    dist.all_gather_into_tensor(_gathered, _send_row, group=group)
+    recv_from_src = _gathered.view(num_ranks, num_ranks)[:, rank].contiguous()
    total_recv_tokens_local = int(recv_from_src.sum().item())
    rdma_recv_tokens_local = 0  # intranode