mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-12 09:17:06 +00:00
ext/ep: GPU-initiated IBGDA path for low-latency dispatch/combine
Add a GPU-initiated RDMA WRITE path for the LL dispatch/combine kernels
based on mlx5dv direct verbs, alongside the existing IPC and host-FIFO
PortChannel paths. The IBGDA path is selected at runtime via
MSCCLPP_EP_USE_IBGDA and takes effect only when num_rdma_ranks > 1.
Core (src/core, include/mscclpp):
- New ibgda module (ibgda.{hpp,cc}, ibgda_device.cuh): per-peer mlx5
QP/MR/CQ setup, device-side WQE writers (write_rdma_wqe,
write_rdma_write_inl_wqe for 4B/8B), submit_requests / submit_no_db
ring helpers, and a poller thread for send CQs.
- ibgda_port_channel_device.{hpp,cuh}: thin port_put() wrapper over
rdma_write with signal_cqe / ring_db flags so callers can issue
UNSIGNALED batched WRs and ring the doorbell once at the tail.
- mlx5dv_wrapper: expose extra symbols needed for direct WQE
construction; minor connection.cc / proxy.cc / port_channel.cc
plumbing to surface QP / MR handles and rkeys to the EP layer.
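The signal_cqe / ring_db split is the usual mlx5 batching idiom: queue many unsignaled WQEs, then let one tail WQE request a CQE and publish the producer index with a single doorbell write. A toy host-side model of that bookkeeping (all names here are hypothetical, not the mscclpp API; real WQEs live in registered memory and the doorbell is an MMIO register):

```python
class QpModel:
    """Toy send-queue model: prod_idx counts queued WQEs; doorbell is the
    last producer index made visible to 'hardware'."""

    def __init__(self):
        self.prod_idx = 0
        self.doorbell = 0
        self.doorbell_rings = 0
        self.signaled = []  # per-WQE: does it request a CQE?

    def put(self, signal_cqe: bool, ring_db: bool):
        self.signaled.append(signal_cqe)
        self.prod_idx += 1
        if ring_db:
            # One doorbell write publishes every WQE queued so far.
            self.doorbell = self.prod_idx
            self.doorbell_rings += 1

qp = QpModel()
for _ in range(8):                     # data PUTs: UNSIGNALED, no doorbell
    qp.put(signal_cqe=False, ring_db=False)
qp.put(signal_cqe=True, ring_db=True)  # trailing count/flag write

assert qp.doorbell_rings == 1          # one MMIO write covers all 9 WQEs
assert qp.doorbell == 9                # prod_idx advanced past the data WRs
assert qp.signaled.count(True) == 1    # a single CQE to poll per batch
```

Besides saving MMIO writes, batching like this means the CQ poller only ever sees one completion per batch rather than one per data WR.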
EP layer (src/ext/ep):
- ibgda_setup.{hpp,cc}: build per-(local_expert, peer_rank) GpuQp
handles, exchange remote MR addr/rkey via the bootstrap, own the
CQ poller. h.dst is set to the per-peer remote_mrs index.
- buffer.{hpp,cc}: gate IBGDA path with use_ibgda_path_ &&
ibgda_setup_ != nullptr && !use_ipc; pass device_handles to the
kernel launchers.
- kernels/internode_ll.cu: 3-way DISPATCH_LAUNCH_CASE /
COMBINE_LAUNCH_CASE (IPC / IBGDA / port-FIFO), templated on
kIbgdaPath. Data PUTs are issued UNSIGNALED with ring_db=false;
the trailing per-QP count write (dispatch) and flag write
(combine) keep the defaults so each QP gets a single signaled
WR that advances prod_idx past all queued data WRs and rings
the doorbell once.
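Putting the selection pieces together — the env switch, the num_rdma_ranks > 1 condition, and the buffer.cc gate — the 3-way choice can be modeled as below. This is a sketch: the real logic lives in C++ across buffer.cc and the launch macros, and the exact accepted value of MSCCLPP_EP_USE_IBGDA is an assumption here.

```python
import os

def select_path(num_rdma_ranks: int, use_ipc: bool, ibgda_setup_ok: bool) -> str:
    """Model of the 3-way path choice (IPC / IBGDA / port-FIFO)."""
    # Assumes "1" enables the switch; it only matters with >1 RDMA rank.
    use_ibgda_path = (
        os.environ.get("MSCCLPP_EP_USE_IBGDA", "0") == "1" and num_rdma_ranks > 1
    )
    if use_ipc:
        return "ipc"
    # Mirrors the buffer.cc gate: use_ibgda_path_ && ibgda_setup_ != nullptr.
    if use_ibgda_path and ibgda_setup_ok:
        return "ibgda"
    return "port-fifo"

os.environ["MSCCLPP_EP_USE_IBGDA"] = "1"
print(select_path(num_rdma_ranks=2, use_ipc=False, ibgda_setup_ok=True))  # → ibgda
print(select_path(num_rdma_ranks=1, use_ipc=False, ibgda_setup_ok=True))  # → port-fifo
print(select_path(num_rdma_ranks=2, use_ipc=True, ibgda_setup_ok=True))   # → ipc
```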
Test (test/python/ext/ep): extend test_low_latency_multirank.py with
env-driven config knobs (MSCCLPP_EP_LL_TOKENS / _HIDDEN / _TOPK /
_EXPERTS_PER_RANK) for sweeping the new path.
@@ -43,14 +43,45 @@ import sys
# noisy 'recvValue failed / Connection was likely closed' stack traces.
os.environ.setdefault("TORCH_NCCL_ENABLE_MONITORING", "0")

import ctypes

import psutil
import torch
import torch.distributed as dist


# Load libnuma for NUMA-aware memory binding (mirrors DeepEP/tests/utils.py).
try:
    _libnuma = ctypes.CDLL("libnuma.so")
    _libnuma.numa_available.restype = ctypes.c_int
    _libnuma.numa_run_on_node.argtypes = [ctypes.c_int]
    _libnuma.numa_set_preferred.argtypes = [ctypes.c_int]
except OSError:
    _libnuma = None


def set_numa_affinity(local_rank: int):
    cores_per_rank = 12
    numa_node = local_rank // 4
    core_start = local_rank * cores_per_rank
    core_end = core_start + cores_per_rank
    p = psutil.Process(os.getpid())
    p.cpu_affinity(list(range(core_start, core_end)))
    print(f"Rank {local_rank} numa node {numa_node} bound to cores {core_start}-{core_end - 1}")

    # Bind memory to NUMA node
    if _libnuma is not None and _libnuma.numa_available() != -1:
        _libnuma.numa_set_preferred(numa_node)
        print(f"Rank {local_rank}: CPU affinity → cores {core_start}-{core_end - 1}, memory NUMA → node {numa_node}")
    else:
        print(f"Rank {local_rank}: libnuma not available")


def init_dist():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", rank))
    set_numa_affinity(local_rank)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
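The affinity arithmetic in set_numa_affinity above hard-codes 12 cores per rank and 4 ranks per NUMA node (an 8-GPU, 2-socket layout). A quick standalone check of the resulting ranges — pure arithmetic, no psutil or libnuma needed:

```python
def numa_plan(local_rank: int, cores_per_rank: int = 12, ranks_per_node: int = 4):
    """Mirror the core/NUMA arithmetic from set_numa_affinity."""
    numa_node = local_rank // ranks_per_node
    core_start = local_rank * cores_per_rank
    return numa_node, range(core_start, core_start + cores_per_rank)

# Ranks 0-3 share NUMA node 0, ranks 4-7 share node 1; core ranges tile 0-95.
plans = [numa_plan(r) for r in range(8)]
assert [n for n, _ in plans] == [0, 0, 0, 0, 1, 1, 1, 1]
all_cores = [c for _, cores in plans for c in cores]
assert all_cores == list(range(96))  # 8 ranks x 12 cores, contiguous, no overlap
print(plans[5])  # → (1, range(60, 72))
```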