Commit Graph

9 Commits

Author SHA1 Message Date
Qinghua Zhou
4569c4e751 Phase 11: hybrid NVLink + RDMA LL dispatch (+70% throughput)
Inside the IBGDA template branch, runtime-check whether the host has
opened a CUDA IPC peer pointer for the destination rank. If yes, do
the send via NVLink (warp copy / st_na_release on the peer-mapped
pointer); else fall through to the existing port_put / rdma_write_inl8.
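The selection logic boils down to a sparse table lookup. A host-testable sketch (the `pick_transport` name and `peer_bases` layout are illustrative, not the real kernel code):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the per-destination transport pick: the sparse
// peer_bases table holds an IPC-mapped base pointer for same-node ranks
// and nullptr for everyone else.
enum class Transport { NvlinkWarpCopy, RdmaPortPut };

inline Transport pick_transport(void* const* peer_bases, int dst_rank) {
  // Non-null entry => the host opened a CUDA IPC handle for this rank,
  // so the send can be a direct warp copy / release store over NVLink.
  return peer_bases[dst_rank] != nullptr ? Transport::NvlinkWarpCopy
                                         : Transport::RdmaPortPut;
}
```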

Host: in sync(), when low_latency_mode && num_rdma_ranks > 1 && IBGDA
is up, allgather rdma_buffer_ptr IPC handles and cudaIpcOpenMemHandle
only for same-node peers. Sparse pointer table is mirrored to GPU and
threaded into the launchers as peer_bases.
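Roughly, the sync()-time table build looks like this sketch (the `PeerInfo` exchange and stand-in handle values are hypothetical; the real code opens `cudaIpcMemHandle_t`s with `cudaIpcOpenMemHandle`):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: every rank allgathers (hostname, IPC handle); we
// "open" only same-node peers and leave remote slots null, yielding the
// sparse peer_bases table later mirrored to the GPU.
struct PeerInfo {
  std::string hostname;
};

std::vector<void*> build_peer_bases(const std::vector<PeerInfo>& peers,
                                    int my_rank) {
  std::vector<void*> bases(peers.size(), nullptr);
  for (std::size_t r = 0; r < peers.size(); ++r) {
    if (static_cast<int>(r) == my_rank) continue;
    if (peers[r].hostname == peers[my_rank].hostname) {
      // Real code: cudaIpcOpenMemHandle(&bases[r], handle, ...) here.
      bases[r] = reinterpret_cast<void*>(static_cast<std::uintptr_t>(r + 1));
    }
  }
  return bases;
}
```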

Kernel: per-peer branch added at all four RDMA send sites (dispatch
send-data, dispatch send-count, combine send-data, combine send-flag).
Recv-side polling is transport-agnostic and unchanged.

Result on 16-rank/2-node LL bench:
  baseline (IBGDA only):   38.7 / 39.4 GB/s
  Phase 11 hybrid:         65.9 / 67.0 GB/s   (+70%)
Now matches nccl-ep default-mode numbers (63-71 / 62-72 GB/s).
Validation max diff = 0.

Gated by MSCCLPP_EP_HYBRID_LL env (default on). Single-node LL is
untouched (num_rdma_ranks>1 gate).
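A plausible shape for the default-on env gate (the exact accepted values are an assumption; here anything but "0" enables it):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical parse of the MSCCLPP_EP_HYBRID_LL gate: unset, or any
// value other than "0", enables the hybrid path (default on).
bool hybrid_ll_enabled() {
  const char* v = std::getenv("MSCCLPP_EP_HYBRID_LL");
  return v == nullptr || std::strcmp(v, "0") != 0;
}
```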
2026-05-09 23:04:15 +00:00
Qinghua Zhou
04ebba7563 ext/ep: GPU-initiated IBGDA path for low-latency dispatch/combine
Add a GPU-initiated RDMA WRITE path for the LL dispatch/combine kernels
based on mlx5dv direct verbs, alongside the existing IPC and host-FIFO
PortChannel paths. Selected at runtime via MSCCLPP_EP_USE_IBGDA when
num_rdma_ranks > 1.

Core (src/core, include/mscclpp):
  - New ibgda module (ibgda.{hpp,cc}, ibgda_device.cuh): per-peer mlx5
    QP/MR/CQ setup, device-side WQE writers (write_rdma_wqe,
    write_rdma_write_inl_wqe for 4B/8B), submit_requests / submit_no_db
    ring helpers, and a poller thread for send CQs.
  - ibgda_port_channel_device.{hpp,cuh}: thin port_put() wrapper over
    rdma_write with signal_cqe / ring_db flags so callers can issue
    UNSIGNALED batched WRs and ring the doorbell once at the tail.
  - mlx5dv_wrapper: expose extra symbols needed for direct WQE
    construction; minor connection.cc / proxy.cc / port_channel.cc
    plumbing to surface QP / MR handles and rkeys to the EP layer.
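The signal_cqe / ring_db contract can be modeled on the host in a few lines (the `QpModel` type is illustrative): queue N-1 writes UNSIGNALED with ring_db=false, then let the tail write keep the defaults so the producer index jumps past the whole batch and the doorbell is rung exactly once per QP.

```cpp
#include <cassert>

// Hypothetical model of port_put() batching: WQEs accumulate until a
// ring_db=true call publishes them all with one doorbell.
struct QpModel {
  int queued = 0;     // WQEs written but not yet visible to the NIC
  int prod_idx = 0;   // producer index the NIC sees
  int doorbells = 0;
  int signaled = 0;   // WRs that will generate a CQE

  void port_put(bool signal_cqe, bool ring_db) {
    ++queued;
    if (signal_cqe) ++signaled;
    if (ring_db) {
      prod_idx += queued;  // one doorbell advances past the whole batch
      queued = 0;
      ++doorbells;
    }
  }
};
```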

EP layer (src/ext/ep):
  - ibgda_setup.{hpp,cc}: build per-(local_expert, peer_rank) GpuQp
    handles, exchange remote MR addr/rkey via the bootstrap, own the
    CQ poller. h.dst is set to the per-peer remote_mrs index.
  - buffer.{hpp,cc}: gate IBGDA path with use_ibgda_path_ &&
    ibgda_setup_ != nullptr && !use_ipc; pass device_handles to the
    kernel launchers.
  - kernels/internode_ll.cu: 3-way DISPATCH_LAUNCH_CASE /
    COMBINE_LAUNCH_CASE (IPC / IBGDA / port-FIFO), templated on
    kIbgdaPath. Data PUTs are issued UNSIGNALED with ring_db=false;
    the trailing per-QP count write (dispatch) and flag write
    (combine) keep the defaults so each QP gets a single signaled
    WR that advances prod_idx past all queued data WRs and rings
    the doorbell once.
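The 3-way selection and templating can be modeled as follows (enum and function names are hypothetical; the precedence mirrors the `!use_ipc` gate in buffer.cc, where IPC wins on one node, then IBGDA, then the host-FIFO fallback):

```cpp
#include <cassert>
#include <cstring>

enum class LlPath { Ipc, Ibgda, PortFifo };

// Compile-time path choice: each instantiation carries no runtime
// branching in the hot loop, like the kIbgdaPath template parameter.
template <LlPath kPath>
const char* path_name() {
  if constexpr (kPath == LlPath::Ipc) return "ipc";
  else if constexpr (kPath == LlPath::Ibgda) return "ibgda";
  else return "port-fifo";
}

// Host-side gate deciding which instantiation to launch.
inline LlPath select_ll_path(bool use_ipc, bool use_ibgda) {
  if (use_ipc) return LlPath::Ipc;
  if (use_ibgda) return LlPath::Ibgda;
  return LlPath::PortFifo;
}
```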

Test (test/python/ext/ep): extend test_low_latency_multirank.py with
env-driven config knobs (MSCCLPP_EP_LL_TOKENS / _HIDDEN / _TOPK /
_EXPERTS_PER_RANK) for sweeping the new path.
2026-05-07 05:14:15 +00:00
Qinghua Zhou
e87c66a85d ext/ep: apply clang-format and black to fix CI lint failures
Run `tools/lint.sh cpp` (clang-format 14) and `tools/lint.sh py`
(black) over the EP extension files added by this PR. No functional
changes; pure reformatting to satisfy the cpplint and pylint CI jobs.
2026-05-06 04:12:20 +00:00
Qinghua Zhou
89cb62d047 Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-05 19:24:28 -07:00
Qinghua Zhou
fdf7d579dc ext/ep: optional preallocated outputs for low_latency_dispatch
Add optional out_packed_recv_x / out_src_info / out_layout_range /
out_count parameters to Buffer::low_latency_dispatch so callers can
hoist the four recv-side allocations out of a hot loop, mirroring the
existing out= path on low_latency_combine.

The bench in test_low_latency_multirank.py preallocates these tensors
once and passes them on every iter so the timed loop reflects kernel
cost, not torch.empty + caching-allocator overhead.
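The out= contract reduces to "allocate only when the caller didn't". A self-contained sketch (names hypothetical; `torch::Tensor` swapped for `std::vector` to keep it runnable):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical shape of the optional-output pattern: a preallocated
// buffer is reused; an empty optional triggers the one-time slow path.
std::vector<float>& recv_buffer(std::optional<std::vector<float>>& out,
                                std::size_t n, int& allocations) {
  if (!out.has_value()) {
    out.emplace(n);   // stands in for torch::empty(...) in the real code
    ++allocations;
  }
  return *out;
}
```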
2026-04-30 18:45:44 +00:00
Qinghua Zhou
0227626335 ep: env-tunable + arch-aware num_proxy_services (default 8 on Hopper, 1 on Blackwell)
Override at runtime with MSCCLPP_EP_NUM_PROXIES.
N=8 is the knee on H100+IB; N>=12 collapses under CPU oversubscription.
Intra-node LL is unchanged.
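A sketch of the resolution order (env override first, then the arch default; treating every non-Hopper arch as the Blackwell default of 1 is an assumption):

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical resolution of num_proxy_services: MSCCLPP_EP_NUM_PROXIES
// wins if set; otherwise default 8 on Hopper (sm_9x), else 1.
int resolve_num_proxies(int sm_major) {
  if (const char* v = std::getenv("MSCCLPP_EP_NUM_PROXIES"))
    return std::atoi(v);
  return sm_major == 9 ? 8 : 1;
}
```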
2026-04-27 21:28:29 +00:00
Qinghua Zhou
7e8415580f ep: shard PortChannels across multiple ProxyServices
Each mscclpp::ProxyService spawns one host-side proxy thread that
drains its FIFO and posts IB work requests. With LL combine pushing
~1k put + 60 atomicAdd FIFO entries per iter, that single thread is
the wall-clock bottleneck on cross-node runs.

Split the channel set across kNumProxyServices=4 separate services
so the host-side dispatch parallelism scales linearly. SemaphoreIds
and MemoryIds are scoped to a ProxyService, so:

- addMemory() is broadcast to every service in the same global order
  so a single MemoryId still identifies the memory everywhere.
- Each (peer_rank, channel_idx) is assigned to one proxy_idx via
  round-robin; the resulting PortChannel is built on that proxy and
  inherits its FIFO. The kernel is unchanged: the flat handle array
  routes each trigger to the correct proxy's FIFO automatically.
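The assignment can be sketched as below (the exact formula is an assumption; any stable mapping works as long as host setup and handle layout agree):

```cpp
#include <cassert>

// Hypothetical round-robin mapping of (peer_rank, channel_idx) onto
// kNumProxyServices proxies; each channel is built on its proxy and
// inherits that proxy's FIFO.
constexpr int kNumProxyServices = 4;

int proxy_for(int peer_rank, int channel_idx, int num_channels) {
  return (peer_rank * num_channels + channel_idx) % kNumProxyServices;
}
```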

No kernel-level changes, no tuning of QP count, no new env knobs.
2026-04-27 20:24:00 +00:00
Qinghua Zhou
b0eb5da53d ext/ep: LL intra-node fast path via CUDA IPC + MemoryChannel
When all ranks live on the same host (num_rdma_ranks == 1), the LL
kernels now bypass PortChannel/IB-loopback entirely. In Buffer::sync()
we additionally:
  - allGather IPC handles for each rank's rdma_buffer_ptr and
    cudaIpcOpenMemHandle them into peer_rdma_bases[]
  - build per-peer MemoryChannels over CUDA IPC connections (tag=2)
    used only for the LL barrier ring

The three LL kernels (clean / dispatch / combine) gain a kIpcPath
template parameter and two extra args (peer_rdma_bases,
memory_channel_handles). At each peer op:
  - put -> peer-mapped warp copy over NVLink
  - atomicAdd-like flag store -> single-writer st_na_release on peer ptr
  - signal/wait barrier -> MemoryChannel signal/wait
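The flag-store rewrite rests on single-writer release/acquire semantics: with exactly one writer per flag, an atomic read-modify-write can become a plain release store paired with an acquire load on the polling side. A host-side model (names hypothetical; the device code uses st_na_release on the peer-mapped pointer):

```cpp
#include <atomic>
#include <cassert>

// Single writer: a release store is enough to publish the flag.
void post_flag(std::atomic<int>& flag, int value) {
  flag.store(value, std::memory_order_release);
}

// Polling side: an acquire load observes the data written before the flag.
bool flag_ready(const std::atomic<int>& flag, int expected) {
  return flag.load(std::memory_order_acquire) == expected;
}
```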

Cross-node LL (num_rdma_ranks > 1) is untouched; the IPC setup block is
a no-op. The host launch wrappers select the variant via use_ipc_path.
2026-04-23 21:10:39 +00:00
Qinghua Zhou
88425a6771 Add Expert-Parallel (MoE dispatch/combine) extension under src/ext/ep
Port DeepEP's high-throughput MoE dispatch/combine kernels onto MSCCL++
as an optional build target `mscclpp_ep_cpp`, gated by -DMSCCLPP_BUILD_EXT_EP
(OFF by default). Sources are lifted from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and rebased onto upstream MSCCL++ APIs;
the NVSHMEM / IBGDA dependencies are replaced with `PortChannel` +
`MemoryChannel` + the new `Connection::atomicAdd` primitive.

Scope
-----
Intranode (NVLink-only):
  * `Buffer` ctor/dtor: cudaMalloc nvl workspace, export IPC handle,
    allocate FIFO + peer-pointer tables, start `ProxyService`.
  * `sync()`: import peer IPC handles, upload peer pointer table,
    build `MemoryDevice2DeviceSemaphore` + `MemoryChannel` per peer.
  * `get_dispatch_layout`, `intranode_dispatch`, `intranode_combine`
    ported verbatim (torch::Tensor ABI preserved).

Internode HT (NVLink + RDMA):
  * `sync()` RDMA branch: cudaMalloc RDMA buffer + `bootstrap->barrier()`
    (replacing NVSHMEM symmetric-heap allocation); register with
    `all_transport`, exchange via `sendMemory`/`recvMemory`, build 12 IB
    QPs/peer + 16 semaphores/peer + 16 port channels/peer.
  * Full `internode.cu` port (notify_dispatch / dispatch / cached_notify
    / combine / get_dispatch_layout). The 4 raw `ChannelTrigger` atomic
    sites are rewritten to call the new
    `PortChannelDeviceHandle::atomicAdd(offset, value)` API; the single
    `nvshmem_fence()` is replaced with `__threadfence_system()` (remote
    visibility guaranteed by the subsequent port-channel barrier).
  * `internode_dispatch` / `internode_combine` host code ported, with
    the torch tensor marshalling and CPU spin-wait on mapped counters.
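A host-side model of the new atomicAdd handle semantics (illustrative only; the real `PortChannelDeviceHandle::atomicAdd(offset, value)` posts the add through the port channel so it lands in the peer's registered buffer):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: a fetch-add into a remote counter identified by an
// index into the registered buffer, standing in for the device handle.
struct PortChannelModel {
  std::vector<std::atomic<int>>& counters;  // stands in for the remote buffer
  void atomicAdd(std::size_t offset, int value) {
    counters[offset].fetch_add(value, std::memory_order_relaxed);
  }
};
```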

Low-latency (pure RDMA):
  * Not ported. `low_latency_dispatch`, `low_latency_combine`,
    `clean_low_latency_buffer`, `get_next_low_latency_combine_buffer`
    throw `std::runtime_error`; the Python frontend refuses to
    construct a Buffer with `low_latency_mode=True`.

Python layer
------------
* New pybind11 + libtorch Python extension `mscclpp_ep_cpp` (separate
  from the nanobind `_mscclpp` because the EP ABI carries
  `torch::Tensor` / `at::cuda::CUDAStream`).
* `mscclpp.ext.ep.Buffer` mirrors `deep_ep.Buffer`; exchanges device
  IDs, IPC handles and the bootstrap UniqueId over the user's
  `torch.distributed` process group before calling `sync()`.
* `mscclpp.ext` auto-imports `ep` if the extension is built.

Build
-----
* `src/ext/ep/CMakeLists.txt`: finds Python + Torch; warns and skips if
  `CMAKE_PREFIX_PATH` doesn't point at `torch.utils.cmake_prefix_path`.
  Falls back to Torch's bundled pybind11 if a standalone pybind11 is not
  installed. Links `libtorch_python` explicitly (without it, `import
  mscclpp_ep_cpp` fails with `undefined symbol: THPDtypeType`).
* Top-level `CMakeLists.txt` exposes the `MSCCLPP_BUILD_EXT_EP` option
  (default OFF).
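A minimal sketch of the gating described above (option name and the THPDtypeType symptom are from this commit; the surrounding structure and messages are assumed):

```cmake
# Hypothetical top-level fragment: EP extension is opt-in and skipped
# cleanly when Torch cannot be found.
option(MSCCLPP_BUILD_EXT_EP "Build the EP (MoE dispatch/combine) extension" OFF)

if(MSCCLPP_BUILD_EXT_EP)
  find_package(Python COMPONENTS Interpreter Development REQUIRED)
  find_package(Torch QUIET)
  if(NOT Torch_FOUND)
    message(WARNING "Torch not found; point CMAKE_PREFIX_PATH at "
                    "torch.utils.cmake_prefix_path. Skipping mscclpp_ep_cpp.")
  else()
    add_subdirectory(src/ext/ep)
    # Link libtorch_python explicitly; without it `import mscclpp_ep_cpp`
    # fails with `undefined symbol: THPDtypeType`.
  endif()
endif()
```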

Tests
-----
* `test/python/ext/ep/test_ep_smoke.py`: skipped if the extension isn't
  built. Covers Config round-trip, low-latency size hint, and the LL
  construction guard. Multi-rank functional tests remain to be run on H100.

Notes
-----
* Builds against the preceding "atomic add" commit which adds
  `Connection::atomicAdd` and `PortChannelDeviceHandle::atomicAdd` to
  upstream MSCCL++.
* Intranode path verified end-to-end (build + import + smoke tests).
* Internode HT is code-complete but requires real IB hardware to
  validate; see `src/ext/ep/README.md` for the detailed port plan and
  remaining LL migration.
2026-04-20 20:15:23 +00:00