Add a GPU-initiated RDMA WRITE path for the LL dispatch/combine kernels
based on mlx5dv direct verbs, alongside the existing IPC and host-FIFO
PortChannel paths. The new path is selected at runtime via
MSCCLPP_EP_USE_IBGDA when num_rdma_ranks > 1.
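
A minimal sketch of the runtime gate, assuming a plain getenv() knob
(the helper name and the "0 disables" convention are illustrative; the
real gate lives in the EP buffer, see the EP layer notes below):

    #include <cstdlib>

    // Hypothetical helper: only the env-var name and the
    // num_rdma_ranks > 1 condition come from this change.
    static bool useIbgdaPath(int num_rdma_ranks) {
      const char* v = std::getenv("MSCCLPP_EP_USE_IBGDA");
      bool enabled = (v != nullptr && v[0] != '\0' && v[0] != '0');
      return enabled && num_rdma_ranks > 1;
    }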
Core (src/core, include/mscclpp):
- New ibgda module (ibgda.{hpp,cc}, ibgda_device.cuh): per-peer mlx5
QP/MR/CQ setup, device-side WQE writers (write_rdma_wqe,
write_rdma_write_inl_wqe for 4B/8B), submit_requests / submit_no_db
ring helpers, and a poller thread for send CQs.
- ibgda_port_channel_device.{hpp,cuh}: thin port_put() wrapper over
rdma_write with signal_cqe / ring_db flags so callers can issue
UNSIGNALED batched WRs and ring the doorbell once at the tail (see
the WQE sketch after this list).
- mlx5dv_wrapper: expose extra symbols needed for direct WQE
construction; minor connection.cc / proxy.cc / port_channel.cc
plumbing to surface QP / MR handles and rkeys to the EP layer.
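
For reference, a hedged sketch of what a device-side WQE writer and the
port_put() wrapper look like. The segment layout (ctrl + raddr + data,
16 bytes each) and the constants follow the public mlx5 WQE format; the
GpuQp struct, its field names, and the flag plumbing are assumptions
standing in for the code in this change:

    #include <cstdint>

    __device__ __forceinline__ uint32_t bswap32(uint32_t x) {
      return __byte_perm(x, 0, 0x0123);  // byte-reverse for big-endian fields
    }
    __device__ __forceinline__ uint64_t bswap64(uint64_t x) {
      return ((uint64_t)bswap32((uint32_t)x) << 32) | bswap32((uint32_t)(x >> 32));
    }

    struct GpuQp {            // hypothetical device-visible QP handle
      uint8_t* sq_buf;        // send queue ring of 64B WQE basic blocks
      uint32_t sq_wqe_cnt;    // ring size (power of two)
      uint32_t qpn;           // QP number
      uint32_t* dbrec;        // send half of the doorbell record
      uint64_t* bf_reg;       // BlueFlame / doorbell register
      uint32_t prod_idx;      // producer index; single issuing thread assumed
    };

    enum { kOpcRdmaWrite = 0x08, kCtrlCqUpdate = 0x08 };  // mlx5 constants

    __device__ void rdma_write(GpuQp* qp, uint64_t laddr, uint32_t lkey,
                               uint64_t raddr, uint32_t rkey, uint32_t bytes,
                               bool signal_cqe, bool ring_db) {
      uint32_t idx = qp->prod_idx++;
      uint32_t* wqe =
          (uint32_t*)(qp->sq_buf + (size_t)(idx & (qp->sq_wqe_cnt - 1)) * 64);
      // ctrl segment: WQE index + opcode, QPN + 3x16B segment count, CE flags
      wqe[0] = bswap32(((idx & 0xffff) << 8) | kOpcRdmaWrite);
      wqe[1] = bswap32((qp->qpn << 8) | 3);
      wqe[2] = signal_cqe ? (kCtrlCqUpdate << 24) : 0;  // fm_ce_se is the
      wqe[3] = 0;                                       // last byte; no imm
      // raddr segment
      *(uint64_t*)&wqe[4] = bswap64(raddr);
      wqe[6] = bswap32(rkey);
      wqe[7] = 0;
      // data segment
      wqe[8] = bswap32(bytes);
      wqe[9] = bswap32(lkey);
      *(uint64_t*)&wqe[10] = bswap64(laddr);
      if (ring_db) {
        __threadfence_system();  // WQE must be visible before the doorbell
        *(volatile uint32_t*)qp->dbrec = bswap32(qp->prod_idx & 0xffff);
        __threadfence_system();
        *(volatile uint64_t*)qp->bf_reg = *(uint64_t*)wqe;  // ctrl dword pair
      }
    }

    // port_put() is then just a thin forwarding wrapper whose defaults give
    // a signaled, doorbell-ringing WR:
    __device__ __forceinline__ void port_put(GpuQp* qp, uint64_t laddr,
                                             uint32_t lkey, uint64_t raddr,
                                             uint32_t rkey, uint32_t bytes,
                                             bool signal_cqe = true,
                                             bool ring_db = true) {
      rdma_write(qp, laddr, lkey, raddr, rkey, bytes, signal_cqe, ring_db);
    }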
EP layer (src/ext/ep):
- ibgda_setup.{hpp,cc}: build per-(local_expert, peer_rank) GpuQp
handles, exchange remote MR addr/rkey via the bootstrap, and own the
CQ poller. h.dst is set to the per-peer remote_mrs index (see the
exchange sketch after this list).
- buffer.{hpp,cc}: gate the IBGDA path on use_ibgda_path_ &&
ibgda_setup_ != nullptr && !use_ipc; pass device_handles to the
kernel launchers.
- kernels/internode_ll.cu: 3-way DISPATCH_LAUNCH_CASE /
COMBINE_LAUNCH_CASE (IPC / IBGDA / port-FIFO), templated on
kIbgdaPath. Data PUTs are issued UNSIGNALED with ring_db=false;
the trailing per-QP count write (dispatch) and flag write
(combine) keep the defaults, so each QP gets a single signaled
WR that advances prod_idx past all queued data WRs and rings
the doorbell once (see the batching sketch after this list).
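
A sketch of the MR exchange in ibgda_setup, assuming mscclpp's
Bootstrap::allGather(void*, int) collective where each rank fills its
own slot; the RemoteMr wire format is a hypothetical stand-in:

    #include <cstdint>
    #include <memory>
    #include <vector>
    #include <mscclpp/core.hpp>

    struct RemoteMr {   // hypothetical wire format for the exchange
      uint64_t addr;    // remote buffer VA
      uint32_t rkey;    // remote MR rkey
    };

    std::vector<RemoteMr> exchangeMrs(std::shared_ptr<mscclpp::Bootstrap> boot,
                                      uint64_t localAddr, uint32_t localRkey) {
      std::vector<RemoteMr> all(boot->getNranks());
      all[boot->getRank()] = {localAddr, localRkey};
      boot->allGather(all.data(), sizeof(RemoteMr));  // fill every peer's slot
      return all;  // remote_mrs: h.dst for peer p is simply p
    }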
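
And the issue pattern in the IBGDA kernels, building on the port_put()
sketch above; the address arrays stand in for the kernels' real buffer
layout math:

    // All data PUTs are unsignaled and leave the doorbell alone; the one
    // trailing control write keeps the defaults, so a single doorbell ring
    // publishes prod_idx past the whole batch and a single CQE retires it.
    __device__ void put_batch(GpuQp* qp, const uint64_t* laddrs, uint32_t lkey,
                              const uint64_t* raddrs, uint32_t rkey,
                              uint32_t bytes, int n,
                              uint64_t tail_laddr, uint64_t tail_raddr) {
      for (int t = 0; t < n; ++t)
        port_put(qp, laddrs[t], lkey, raddrs[t], rkey, bytes,
                 /*signal_cqe=*/false, /*ring_db=*/false);
      // dispatch uses a per-QP count write here; combine uses a flag write
      port_put(qp, tail_laddr, lkey, tail_raddr, rkey, sizeof(uint32_t));
    }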
Test (test/python/ext/ep): extend test_low_latency_multirank.py with
env-driven config knobs (MSCCLPP_EP_LL_TOKENS / _HIDDEN / _TOPK /
_EXPERTS_PER_RANK) for sweeping the new path.