ext/ep: GPU-initiated IBGDA path for low-latency dispatch/combine

Add a GPU-initiated RDMA WRITE path for the LL dispatch/combine kernels,
based on mlx5dv direct verbs, alongside the existing IPC and host-FIFO
PortChannel paths. The path is selected at runtime via
MSCCLPP_EP_USE_IBGDA when num_rdma_ranks > 1.
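
A minimal sketch of that gate, assuming a plain getenv() check (the
helper name and the "non-zero means on" parsing are illustrative, not
the in-tree logic):

    #include <cstdlib>

    // Hypothetical gate mirroring the description above: the env var
    // opts in, and the path is only meaningful with at least one
    // remote RDMA peer.
    static bool useIbgdaPath(int numRdmaRanks) {
      const char* v = std::getenv("MSCCLPP_EP_USE_IBGDA");
      return v != nullptr && v[0] != '\0' && v[0] != '0' && numRdmaRanks > 1;
    }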

Core (src/core, include/mscclpp):
  - New ibgda module (ibgda.{hpp,cc}, ibgda_device.cuh): per-peer mlx5
    QP/MR/CQ setup, device-side WQE writers (write_rdma_wqe,
    write_rdma_write_inl_wqe for 4B/8B), submit_requests / submit_no_db
    ring helpers, and a poller thread for send CQs.
  - ibgda_port_channel_device.{hpp,cuh}: thin port_put() wrapper over
    rdma_write with signal_cqe / ring_db flags, so callers can issue
    UNSIGNALED batched WRs and ring the doorbell once at the tail (see
    the batching sketch after this list).
  - mlx5dv_wrapper: expose extra symbols needed for direct WQE
    construction; minor connection.cc / proxy.cc / port_channel.cc
    plumbing to surface QP / MR handles and rkeys to the EP layer.
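
The UNSIGNALED batching contract looks roughly as below. GpuQp and the
flag names come from this change; the Put struct, the put_batch()
helper, and the exact port_put() signature are assumptions for
illustration:

    #include <cstddef>
    #include <cstdint>

    struct GpuQp;  // per-peer handle built by the EP setup layer

    // Assumed shape of the wrapper in ibgda_port_channel_device.{hpp,cuh}.
    __device__ void port_put(GpuQp& qp, uint64_t dst, const void* src,
                             size_t bytes, bool signal_cqe, bool ring_db);

    struct Put { uint64_t dst; const void* src; size_t bytes; };

    __device__ void put_batch(GpuQp& qp, const Put* p, int n) {
      if (n == 0) return;
      // Stage all but the last WQE without a CQE and without touching
      // the doorbell; the NIC sees nothing yet.
      for (int i = 0; i + 1 < n; ++i)
        port_put(qp, p[i].dst, p[i].src, p[i].bytes,
                 /*signal_cqe=*/false, /*ring_db=*/false);
      // The tail WR is signaled and rings the doorbell once, publishing
      // prod_idx past every staged WQE in a single MMIO write.
      port_put(qp, p[n - 1].dst, p[n - 1].src, p[n - 1].bytes,
               /*signal_cqe=*/true, /*ring_db=*/true);
    }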

EP layer (src/ext/ep):
  - ibgda_setup.{hpp,cc}: build per-(local_expert, peer_rank) GpuQp
    handles, exchange remote MR addr/rkey via the bootstrap, own the
    CQ poller. h.dst is set to the per-peer remote_mrs index.
  - buffer.{hpp,cc}: gate the IBGDA path with use_ibgda_path_ &&
    ibgda_setup_ != nullptr && !use_ipc; pass device_handles to the
    kernel launchers.
  - kernels/internode_ll.cu: 3-way DISPATCH_LAUNCH_CASE /
    COMBINE_LAUNCH_CASE (IPC / IBGDA / port-FIFO), templated on
    kIbgdaPath. Data PUTs are issued UNSIGNALED with ring_db=false;
    the trailing per-QP count write (dispatch) and flag write
    (combine) keep the defaults, so each QP gets a single signaled
    WR that advances prod_idx past all queued data WRs and rings
    the doorbell once (sketched after this list).
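
Roughly, the dispatch-side IBGDA tail then has the shape below, reusing
the GpuQp / port_put() declarations from the earlier sketch; the
block-to-peer mapping, token addressing, and kernel parameters are
hypothetical simplifications of the real internode_ll.cu kernels:

    // One block per peer, one QP per peer (illustrative mapping only).
    template <bool kIbgdaPath>
    __global__ void dispatch_ll(GpuQp* qps, const uint8_t* tokens,
                                uint64_t remote_base, uint64_t count_dst,
                                int num_tokens, size_t token_bytes,
                                int* count) {
      if constexpr (kIbgdaPath) {
        // A single thread posts WQEs in this simplified sketch.
        if (threadIdx.x != 0) return;
        GpuQp& qp = qps[blockIdx.x];
        // Data PUTs: UNSIGNALED and ring_db=false, so WQEs only
        // accumulate in the SQ ring.
        for (int t = 0; t < num_tokens; ++t)
          port_put(qp, remote_base + t * token_bytes,
                   tokens + t * token_bytes, token_bytes,
                   /*signal_cqe=*/false, /*ring_db=*/false);
        // Trailing per-QP count write keeps the defaults: one signaled
        // WR that advances prod_idx past all queued data WRs and rings
        // the doorbell once.
        port_put(qp, count_dst, count, sizeof(int),
                 /*signal_cqe=*/true, /*ring_db=*/true);
      }
    }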

Test (test/python/ext/ep):
  - Extend test_low_latency_multirank.py with env-driven config knobs
    (MSCCLPP_EP_LL_TOKENS / _HIDDEN / _TOPK / _EXPERTS_PER_RANK) for
    sweeping the new path.
Author: Qinghua Zhou
Date:   2026-05-07 05:14:15 +00:00
Parent: e87c66a85d
Commit: 04ebba7563
22 changed files, 1851 insertions(+), 80 deletions(-)

test/python/ext/ep/test_low_latency_multirank.py

@@ -43,14 +43,45 @@ import sys
# noisy 'recvValue failed / Connection was likely closed' stack traces.
os.environ.setdefault("TORCH_NCCL_ENABLE_MONITORING", "0")

import ctypes
import psutil
import torch
import torch.distributed as dist


# Load libnuma for NUMA-aware memory binding (mirrors DeepEP/tests/utils.py).
try:
    _libnuma = ctypes.CDLL("libnuma.so")
    _libnuma.numa_available.restype = ctypes.c_int
    _libnuma.numa_run_on_node.argtypes = [ctypes.c_int]
    _libnuma.numa_set_preferred.argtypes = [ctypes.c_int]
except OSError:
    _libnuma = None


def set_numa_affinity(local_rank: int):
    cores_per_rank = 12
    numa_node = local_rank // 4
    core_start = local_rank * cores_per_rank
    core_end = core_start + cores_per_rank
    p = psutil.Process(os.getpid())
    p.cpu_affinity(list(range(core_start, core_end)))
    print(f"Rank {local_rank} numa node {numa_node} bound to cores {core_start}-{core_end - 1}")
    # Bind memory to NUMA node
    if _libnuma is not None and _libnuma.numa_available() != -1:
        _libnuma.numa_set_preferred(numa_node)
        print(f"Rank {local_rank}: CPU affinity → cores {core_start}-{core_end - 1}, memory NUMA → node {numa_node}")
    else:
        print(f"Rank {local_rank}: libnuma not available")


def init_dist():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", rank))
    set_numa_affinity(local_rank)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",