mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-12 09:17:06 +00:00
ext/ep: GPU-initiated IBGDA path for low-latency dispatch/combine
Add a GPU-initiated RDMA WRITE path for the LL dispatch/combine kernels
based on mlx5dv direct verbs, alongside the existing IPC and host-FIFO
PortChannel paths. The IBGDA path is selected at runtime via
MSCCLPP_EP_USE_IBGDA and takes effect only when num_rdma_ranks > 1.
Core (src/core, include/mscclpp):
- New ibgda module (ibgda.{hpp,cc}, ibgda_device.cuh): per-peer mlx5
QP/MR/CQ setup, device-side WQE writers (write_rdma_wqe,
write_rdma_write_inl_wqe for 4B/8B), submit_requests / submit_no_db
ring helpers, and a poller thread for send CQs.
- ibgda_port_channel_device.{hpp,cuh}: thin port_put() wrapper over
rdma_write with signal_cqe / ring_db flags so callers can issue
UNSIGNALED batched WRs and ring the doorbell once at the tail.
- mlx5dv_wrapper: expose extra symbols needed for direct WQE
construction; minor connection.cc / proxy.cc / port_channel.cc
plumbing to surface QP / MR handles and rkeys to the EP layer.
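The signal_cqe / ring_db split is the usual mlx5 batching idiom: queue many unsignaled WQEs, then let one tail WQE request a CQE and publish the producer index with a single doorbell write. A toy host-side model of that bookkeeping (all names here are hypothetical, not the mscclpp API; real WQEs live in registered memory and the doorbell is an MMIO register):

```python
class QpModel:
    """Toy send-queue model: prod_idx counts queued WQEs; doorbell is the
    last producer index made visible to 'hardware'."""

    def __init__(self):
        self.prod_idx = 0
        self.doorbell = 0
        self.doorbell_rings = 0
        self.signaled = []  # per-WQE: does it request a CQE?

    def put(self, signal_cqe: bool, ring_db: bool):
        self.signaled.append(signal_cqe)
        self.prod_idx += 1
        if ring_db:
            # One doorbell write publishes every WQE queued so far.
            self.doorbell = self.prod_idx
            self.doorbell_rings += 1

qp = QpModel()
for _ in range(8):                     # data PUTs: UNSIGNALED, no doorbell
    qp.put(signal_cqe=False, ring_db=False)
qp.put(signal_cqe=True, ring_db=True)  # trailing count/flag write

assert qp.doorbell_rings == 1          # one MMIO write covers all 9 WQEs
assert qp.doorbell == 9                # prod_idx advanced past the data WRs
assert qp.signaled.count(True) == 1    # a single CQE to poll per batch
```

Besides saving MMIO writes, batching like this means the CQ poller only ever sees one completion per batch rather than one per data WR.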
EP layer (src/ext/ep):
- ibgda_setup.{hpp,cc}: build per-(local_expert, peer_rank) GpuQp
handles, exchange remote MR addr/rkey via the bootstrap, own the
CQ poller. h.dst is set to the per-peer remote_mrs index.
- buffer.{hpp,cc}: gate IBGDA path with use_ibgda_path_ &&
ibgda_setup_ != nullptr && !use_ipc; pass device_handles to the
kernel launchers.
- kernels/internode_ll.cu: 3-way DISPATCH_LAUNCH_CASE /
COMBINE_LAUNCH_CASE (IPC / IBGDA / port-FIFO), templated on
kIbgdaPath. Data PUTs are issued UNSIGNALED with ring_db=false;
the trailing per-QP count write (dispatch) and flag write
(combine) keep the defaults so each QP gets a single signaled
WR that advances prod_idx past all queued data WRs and rings
the doorbell once.
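Putting the selection pieces together — the env switch, the num_rdma_ranks > 1 condition, and the buffer.cc gate — the 3-way choice can be modeled as below. This is a sketch: the real logic lives in C++ across buffer.cc and the launch macros, and the exact accepted value of MSCCLPP_EP_USE_IBGDA is an assumption here.

```python
import os

def select_path(num_rdma_ranks: int, use_ipc: bool, ibgda_setup_ok: bool) -> str:
    """Model of the 3-way path choice (IPC / IBGDA / port-FIFO)."""
    # Assumes "1" enables the switch; it only matters with >1 RDMA rank.
    use_ibgda_path = (
        os.environ.get("MSCCLPP_EP_USE_IBGDA", "0") == "1" and num_rdma_ranks > 1
    )
    if use_ipc:
        return "ipc"
    # Mirrors the buffer.cc gate: use_ibgda_path_ && ibgda_setup_ != nullptr.
    if use_ibgda_path and ibgda_setup_ok:
        return "ibgda"
    return "port-fifo"

os.environ["MSCCLPP_EP_USE_IBGDA"] = "1"
print(select_path(num_rdma_ranks=2, use_ipc=False, ibgda_setup_ok=True))  # → ibgda
print(select_path(num_rdma_ranks=1, use_ipc=False, ibgda_setup_ok=True))  # → port-fifo
print(select_path(num_rdma_ranks=2, use_ipc=True, ibgda_setup_ok=True))   # → ipc
```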
Test (test/python/ext/ep): extend test_low_latency_multirank.py with
env-driven config knobs (MSCCLPP_EP_LL_TOKENS / _HIDDEN / _TOPK /
_EXPERTS_PER_RANK) for sweeping the new path.
@@ -43,14 +43,45 @@ import sys
# noisy 'recvValue failed / Connection was likely closed' stack traces.
os.environ.setdefault("TORCH_NCCL_ENABLE_MONITORING", "0")

import ctypes

import psutil
import torch
import torch.distributed as dist


# Load libnuma for NUMA-aware memory binding (mirrors DeepEP/tests/utils.py).
try:
    _libnuma = ctypes.CDLL("libnuma.so")
    _libnuma.numa_available.restype = ctypes.c_int
    _libnuma.numa_run_on_node.argtypes = [ctypes.c_int]
    _libnuma.numa_set_preferred.argtypes = [ctypes.c_int]
except OSError:
    _libnuma = None


def set_numa_affinity(local_rank: int):
    cores_per_rank = 12
    numa_node = local_rank // 4
    core_start = local_rank * cores_per_rank
    core_end = core_start + cores_per_rank
    p = psutil.Process(os.getpid())
    p.cpu_affinity(list(range(core_start, core_end)))
    print(f"Rank {local_rank} numa node {numa_node} bound to cores {core_start}-{core_end - 1}")

    # Bind memory to NUMA node
    if _libnuma is not None and _libnuma.numa_available() != -1:
        _libnuma.numa_set_preferred(numa_node)
        print(f"Rank {local_rank}: CPU affinity → cores {core_start}-{core_end - 1}, memory NUMA → node {numa_node}")
    else:
        print(f"Rank {local_rank}: libnuma not available")


def init_dist():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", rank))
    set_numa_affinity(local_rank)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
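The affinity arithmetic in set_numa_affinity above hard-codes 12 cores per rank and 4 ranks per NUMA node (an 8-GPU, 2-socket layout). A quick standalone check of the resulting ranges — pure arithmetic, no psutil or libnuma needed:

```python
def numa_plan(local_rank: int, cores_per_rank: int = 12, ranks_per_node: int = 4):
    """Mirror the core/NUMA arithmetic from set_numa_affinity."""
    numa_node = local_rank // ranks_per_node
    core_start = local_rank * cores_per_rank
    return numa_node, range(core_start, core_start + cores_per_rank)

# Ranks 0-3 share NUMA node 0, ranks 4-7 share node 1; core ranges tile 0-95.
plans = [numa_plan(r) for r in range(8)]
assert [n for n, _ in plans] == [0, 0, 0, 0, 1, 1, 1, 1]
all_cores = [c for _, cores in plans for c in cores]
assert all_cores == list(range(96))  # 8 ranks x 12 cores, contiguous, no overlap
print(plans[5])  # → (1, range(60, 72))
```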