From 9e96bf3b5d17c2ce4c77d3780299639df9f0827c Mon Sep 17 00:00:00 2001 From: Qinghua Zhou Date: Wed, 22 Apr 2026 03:56:16 +0000 Subject: [PATCH] ep: document internode HT validation on 2x H100x8 Refresh status docs and comments now that internode HT dispatch and combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover their per-rank token payloads with zero diff). - src/ext/ep/README.md: consolidate the previously duplicated README into a single document; mark intranode and internode HT dispatch and combine as validated in the status table; add a 'Running the tests' section with torchrun examples for both the intranode and the 2x8 internode setups; record the dispatch->combine torch.cuda.synchronize() + dist.barrier() requirement under Known limitations; mark Phase 2 DONE and keep Phase 3 (LL) as structural port, untested. - python/mscclpp/ext/ep/buffer.py: update the module docstring and the Buffer constructor docstring to say internode HT is validated and clarify that LL mode is untested on multi-node hardware. - src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported' and 'low-latency paths still stubbed' comments. mscclpp_ep does not use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL paths are a structural port that is present but untested, not stubbed. Note validation on 2x H100x8 in the internode section header. --- python/mscclpp/ext/ep/buffer.py | 15 ++- src/ext/ep/README.md | 231 ++++++++++++-------------------- src/ext/ep/buffer.cc | 16 ++- 3 files changed, 103 insertions(+), 159 deletions(-) diff --git a/python/mscclpp/ext/ep/buffer.py b/python/mscclpp/ext/ep/buffer.py index 238a95ef..c8010383 100644 --- a/python/mscclpp/ext/ep/buffer.py +++ b/python/mscclpp/ext/ep/buffer.py @@ -11,11 +11,14 @@ DeepEP users can port with minimal changes. 
Current status (see ``src/ext/ep/README.md``): -* Intranode (NVLink-only) dispatch and combine are fully ported. -* ``get_dispatch_layout`` is ported. -* Internode HT (MSCCL++ PortChannel + MemoryChannel) is ported. -* Internode low-latency kernels are ported structurally (NVSHMEM/IBGDA -> - MSCCL++ PortChannel) but **untested on multi-node H100**. +* Intranode (NVLink-only) dispatch and combine: ported and validated on + one node with 8 GPUs. +* ``get_dispatch_layout``: ported. +* Internode HT (MSCCL++ PortChannel + MemoryChannel) dispatch and combine: + ported and validated on 2 nodes x 8 H100 GPUs with + ``test/python/ext/ep/test_internode_multirank.py``. +* Internode low-latency kernels: structural port (NVSHMEM/IBGDA -> + MSCCL++ PortChannel), **untested on multi-node H100**. """ from __future__ import annotations @@ -54,7 +57,7 @@ class Buffer: low-latency modes. low_latency_mode: Enable the low-latency dispatch/combine path (structural port, - untested). + untested on multi-node hardware). num_qps_per_rank: Ignored for intranode mode. """ diff --git a/src/ext/ep/README.md b/src/ext/ep/README.md index bb6fad0a..c2328cbb 100644 --- a/src/ext/ep/README.md +++ b/src/ext/ep/README.md @@ -1,54 +1,61 @@ # MSCCL++ Expert-Parallel (EP) extension -A port of DeepEP's MoE `dispatch`/`combine` primitives into MSCCL++, targeting: +A port of DeepEP's MoE `dispatch` / `combine` primitives into MSCCL++, +targeting: - **High-Throughput (HT) mode** from [DeepEP](https://github.com/deepseek-ai/DeepEP), branch `chhwang/dev-atomic-add-cleanup` — which already swaps NVSHMEM for - `mscclpp::PortChannel`/`MemoryChannel`. + `mscclpp::PortChannel` / `mscclpp::MemoryChannel`. - **Low-Latency (LL) mode** from [`nccl/contrib/nccl_ep`](https://github.com/NVIDIA/nccl/tree/master/contrib/nccl_ep), which implements pure-RDMA dispatch/combine on top of the NCCL Device API. 
## Status -| Feature | Status | -|------------------------------------|---------------------------------| -| `Buffer` construction + IPC + sync | ✅ ported (NVLink + RDMA) | -| `get_dispatch_layout` | ✅ ported | -| `intranode_dispatch` (NVLink) | ✅ ported | -| `intranode_combine` (NVLink) | ✅ ported | -| `internode_dispatch` (NVLink+RDMA) | ✅ ported (pending H100 test) | -| `internode_combine` (NVLink+RDMA) | ✅ ported (pending H100 test) | -| `low_latency_dispatch` (pure RDMA) | ⚠️ structural port, untested | -| `low_latency_combine` (pure RDMA) | ⚠️ structural port, untested | -| `Connection::atomicAdd` API | ✅ cherry-picked into mscclpp | -| Python frontend `mscclpp.ext.ep` | ✅ wraps HT + LL paths | -| pybind11 module `mscclpp_ep_cpp` | ✅ builds conditionally | +| Feature | Status | +|------------------------------------|---------------------------------------------| +| `Buffer` construction + IPC + sync | ✅ ported (NVLink + RDMA) | +| `get_dispatch_layout` | ✅ ported | +| `intranode_dispatch` (NVLink) | ✅ ported, validated (8 ranks, 1 node) | +| `intranode_combine` (NVLink) | ✅ ported, validated (8 ranks, 1 node) | +| `internode_dispatch` (NVLink+RDMA) | ✅ ported, validated (16 ranks, 2×H100×8) | +| `internode_combine` (NVLink+RDMA) | ✅ ported, validated (16 ranks, 2×H100×8) | +| `low_latency_dispatch` (pure RDMA) | ⚠️ structural port, untested | +| `low_latency_combine` (pure RDMA) | ⚠️ structural port, untested | +| `Connection::atomicAdd` API | ✅ cherry-picked into mscclpp | +| Python frontend `mscclpp.ext.ep` | ✅ wraps HT + LL paths | +| pybind11 module `mscclpp_ep_cpp` | ✅ builds conditionally | -Internode HT is code-complete but unverified on real hardware — the -`sync()` path replaces DeepEP's NVSHMEM symmetric-heap allocation with -`cudaMalloc` + `bootstrap->barrier()`, and the kernels use the new -`PortChannelDeviceHandle::atomicAdd` instead of the old raw-trigger -pattern. 
+Internode HT was validated end-to-end on two H100×8 nodes connected over +InfiniBand using [`test/python/ext/ep/test_internode_multirank.py`](../../../test/python/ext/ep/test_internode_multirank.py). +All 16 ranks complete dispatch followed by combine with exact (zero-diff) +recovery of the per-rank token payloads. The low-latency port is **structural**: the DeepEP LL kernels (pure IBGDA) have been mechanically translated to MSCCL++ port-channel ops. Semantic mapping: -| DeepEP / IBGDA | MSCCL++ replacement | -|------------------------------------------|--------------------------------------------------------------| -| `nvshmemx_barrier_all_block()` | signal+wait ring across `port_channel_handles[peer_rank]` | +| DeepEP / IBGDA | MSCCL++ replacement | +|------------------------------------------|------------------------------------------------------------------| +| `nvshmemx_barrier_all_block()` | signal + wait ring across `port_channel_handles[peer_rank]` | | `nvshmemi_ibgda_put_nbi_warp(...)` | lane-0 `port_channel_handles[qp*N+dst].put(dst_off, src_off, n)` | -| `nvshmemi_ibgda_amo_nonfetch_add(...)` | lane-0 `port_channel_handles[qp*N+dst].atomicAdd(off, int64)` | +| `nvshmemi_ibgda_amo_nonfetch_add(...)` | lane-0 `port_channel_handles[qp*N+dst].atomicAdd(off, int64)` | -**Known limitations**: +### Known limitations -* LL performance will NOT match IBGDA — the MSCCL++ port channel uses a +- LL performance will NOT match IBGDA — the MSCCL++ port channel uses a CPU proxy. The port is for functional parity, not latency. -* `Buffer::sync()` in `low_latency_mode=True` only connects peers sharing +- `Buffer::sync()` in `low_latency_mode=True` only connects peers sharing the same local GPU ID (DeepEP convention). LL kernels therefore assume one-GPU-per-node topology, i.e. `num_ranks == num_rdma_ranks`. Running with >1 GPU per node in LL mode will fail to reach cross-GPU peers. -* Multi-node H100 validation is still pending.
+- Multi-node H100 validation of LL mode is still pending. +- The internode HT functional test inserts an explicit + `torch.cuda.synchronize()` + `dist.barrier()` between dispatch and + combine. Without it, fast ranks can launch combine while peers still + have in-flight dispatch proxy traffic, deadlocking the combine NVL + forwarder. Folding this barrier into + `Buffer::internode_dispatch` / `Buffer::internode_combine` (or + `cached_notify`) is tracked in the test's `XXX` comment. ## Build @@ -67,7 +74,7 @@ The Python frontend picks it up automatically: ```python from mscclpp.ext import ep -buf = ep.Buffer(group, num_nvl_bytes=..., num_rdma_bytes=0) +buf = ep.Buffer(group, num_nvl_bytes=..., num_rdma_bytes=...) ``` ## Layout @@ -95,7 +102,34 @@ python/mscclpp/ext/ep/ └── buffer.py — torch.distributed-aware frontend test/python/ext/ep/ -└── test_ep_smoke.py — size-hint + rejection smoke test +├── test_ep_smoke.py — size-hint + rejection smoke test +├── test_intranode_multirank.py — NVLink dispatch+combine, 8 ranks +└── test_internode_multirank.py — HT dispatch+combine, 16 ranks (2×8) +``` + +## Running the tests + +Intranode (single node, 8 GPUs): + +```bash +torchrun --nnodes=1 --nproc_per_node=8 \ + test/python/ext/ep/test_intranode_multirank.py +``` + +Internode HT (2 nodes × 8 GPUs): + +```bash +# node 0 (master) +NCCL_SOCKET_IFNAME=eth0 MSCCLPP_SOCKET_IFNAME=eth0 GLOO_SOCKET_IFNAME=eth0 \ +torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \ + --master_addr= --master_port=29600 \ + test/python/ext/ep/test_internode_multirank.py + +# node 1 (worker) +NCCL_SOCKET_IFNAME=eth0 MSCCLPP_SOCKET_IFNAME=eth0 GLOO_SOCKET_IFNAME=eth0 \ +torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \ + --master_addr= --master_port=29600 \ + test/python/ext/ep/test_internode_multirank.py ``` ## Migration plan @@ -105,29 +139,26 @@ test/python/ext/ep/ - [x] Copy DeepEP kernel headers (configs / buffer / utils / launch / exception). 
- [x] Port intranode kernels + runtime (NVLink only). - [x] Port `get_dispatch_layout` (host-safe subset of internode kernels). -- [x] Port host Buffer: ctor, sync, get_dispatch_layout, intranode dispatch/combine. +- [x] Port host Buffer: ctor, sync, get_dispatch_layout, intranode + dispatch/combine. - [x] pybind11 `mscclpp_ep_cpp` module + Python frontend. -### Phase 2 — internode HT (NVLink + RDMA) +### Phase 2 — internode HT (NVLink + RDMA) — DONE -Port the rest of `DeepEP/csrc/kernels/internode.cu` (`notify_dispatch`, -`dispatch`, `cached_notify`, `combine`). Because we are starting from the -`chhwang/dev-atomic-add-cleanup` branch, the NVSHMEM -> MSCCL++ substitution -is already done upstream — just copy the kernel bodies and wire them through -`api.cuh`. The launchers need `PortChannelDeviceHandle*` / -`MemoryChannelDeviceHandle*` arguments that `Buffer::sync()` already builds -(see the `num_rdma_bytes > 0` branch — currently throws, but the code -populating `port_channel_handles_device_ptr` and -`memory_channel_handles_device_ptr` is ready). Finally replace the stubs in -`buffer.cc` (`internode_dispatch`, `internode_combine`) with the real bodies -from DeepEP. +- [x] Port `notify_dispatch`, `dispatch`, `cached_notify`, `combine` kernels. +- [x] Wire `Buffer::internode_dispatch` / `Buffer::internode_combine` host + orchestration. +- [x] `Buffer::sync()` builds `port_channel_handles_device_ptr` and + `memory_channel_handles_device_ptr`, with port channels ordered by + peer rank (so kernel-side indexing by `peer_rank` is consistent). +- [x] Validated on 2×H100×8 with `test_internode_multirank.py`. -### Phase 3 — Low-Latency (pure RDMA) +### Phase 3 — Low-Latency (pure RDMA) — STRUCTURAL PORT Port `DeepEP/csrc/kernels/internode_ll.cu` and cross-reference `nccl/contrib/nccl_ep/device/low_latency.cu`. 
The nccl_ep reference is -modular (see `device_primitives.cuh`, `hybrid_ep.cuh`) and uses NCCL Device -API; the translation table is: +modular (see `device_primitives.cuh`, `hybrid_ep.cuh`) and uses NCCL +Device API; the translation table is: | nccl_ep / DeepEP primitive | MSCCL++ replacement | |-----------------------------------------|-------------------------------------------------| @@ -140,107 +171,13 @@ API; the translation table is: | NVSHMEM symmetric heap | `cudaMalloc` + `ProxyService::addMemory` | | NVSHMEM barrier | `bootstrap->barrier()` or `intranode::barrier` | -Finally fill in `buffer.cc::low_latency_dispatch` / `low_latency_combine` -from the DeepEP bodies (already translated on the `chhwang/...` branch). +Bodies of `Buffer::low_latency_dispatch` / `low_latency_combine` have been +translated from DeepEP but are **untested on real hardware**. ### Phase 4 — Validation -- Port `DeepEP/tests/test_{intranode,internode,low_latency}.py` into - `test/python/ext/ep/`. -- Run on the same H100/H800 reference rig DeepEP uses; compare throughput. -# MSCCL++ Expert-Parallel (EP) extension — migration plan - -This directory is a **scaffolding-only** port of the Mixture-of-Experts (MoE) -`dispatch` / `combine` primitives from: - -- **High-Throughput (HT) mode** — [DeepEP](https://github.com/deepseek-ai/DeepEP), - branch `chhwang/dev-atomic-add-cleanup`. That branch has already replaced - NVSHMEM / IBGDA primitives with `mscclpp::PortChannel` and - `mscclpp::MemoryChannel`, so the port is largely mechanical. -- **Low-Latency (LL) mode** — [`nccl/contrib/nccl_ep`](https://github.com/NVIDIA/nccl), - which implements a pure-RDMA dispatch/combine on top of the NCCL Device API - (GIN put/signal + LSA load/store). The kernels need to be re-expressed in - terms of MSCCL++ device handles. 
- -## Layout - -| Path | Purpose | -|------|---------| -| [`include/mscclpp/ext/ep/config.hpp`](../../../include/mscclpp/ext/ep/config.hpp) | Public host-side config + size hints (`EpConfig`). | -| [`include/mscclpp/ext/ep/event.hpp`](../../../include/mscclpp/ext/ep/event.hpp) | RAII wrapper around `cudaEvent_t`. | -| [`include/mscclpp/ext/ep/buffer.hpp`](../../../include/mscclpp/ext/ep/buffer.hpp) | Public `Buffer` class; dispatch/combine entry points. | -| [`include/mscclpp/ext/ep/api.hpp`](../../../include/mscclpp/ext/ep/api.hpp) | Umbrella include. | -| [`src/ext/ep/buffer.cc`](buffer.cc) | Host-side orchestration. Constructor + proxy service wired up; `sync()` / kernel stubs `TODO`. | -| [`src/ext/ep/config.cc`](config.cc) | `EpConfig` method bodies. | -| [`src/ext/ep/event.cc`](event.cc) | `EventHandle` implementation. | -| [`src/ext/ep/intranode.cu`](intranode.cu) | **STUB** — HT NVLink-only dispatch/combine. | -| [`src/ext/ep/internode.cu`](internode.cu) | **STUB** — HT NVLink+RDMA dispatch/combine. | -| [`src/ext/ep/internode_ll.cu`](internode_ll.cu) | **STUB** — LL pure-RDMA dispatch/combine. | -| [`src/ext/ep/kernels/api.cuh`](kernels/api.cuh) | Private kernel-facing API (prototypes only for now). | -| [`src/ext/ep/kernels/exception.cuh`](kernels/exception.cuh) | `EP_HOST_ASSERT` / `EP_DEVICE_ASSERT` / `EP_CUDA_CHECK`. | -| [`python/csrc/ext/ep/ep_py.cpp`](../../../python/csrc/ext/ep/ep_py.cpp) | nanobind bindings (submodule `mscclpp._mscclpp.ep`). | -| [`python/mscclpp/ext/ep/`](../../../python/mscclpp/ext/ep/) | Python frontend (`ep.Buffer`). | -| [`test/python/ext/ep/test_ep_skeleton.py`](../../../test/python/ext/ep/test_ep_skeleton.py) | Unit test placeholder. | - -## Build - -The extension is **off by default**. Enable it with: - -```bash -cmake -S . -B build -DMSCCLPP_BUILD_EXT_EP=ON -cmake --build build -j -``` - -This produces `libmscclpp_ep.so` and, when Python bindings are built, exposes -`mscclpp._mscclpp.ep` and `mscclpp.ext.ep`. 
- -## Migration plan (in order) - -1. **HT intranode.** Port `DeepEP/csrc/kernels/intranode.cu` into - [`intranode.cu`](intranode.cu). All communication is via peer IPC pointers, - so only `include` paths and `torch::Tensor` -> `TensorRef` marshalling need - to change. Flesh out `Buffer::sync()` so that - `nvlBufferPeers_[peer] = cudaIpcOpenMemHandle(...)` is populated and the - table is uploaded to `nvlBufferPeersDevice_`. -2. **HT internode.** Port `DeepEP/csrc/kernels/internode.cu` into - [`internode.cu`](internode.cu). Most of the heavy lifting (NVSHMEM -> - MSCCL++) is already done on the DeepEP `chhwang/dev-atomic-add-cleanup` - branch; copy the kernel bodies as-is and add the launchers. Ensure the - custom trigger type `0x0` atomicAdd path in `EpProxyService` (see - [`buffer.cc`](buffer.cc)) is in place. -3. **LL mode.** Port from `nccl/contrib/nccl_ep/device/low_latency.cu` (or - DeepEP `internode_ll.cu`) into [`internode_ll.cu`](internode_ll.cu). The - translation table lives in the file header; the critical substitution is - `ncclGinPutSignal` / `nvshmemi_ibgda_*` -> `PortChannelDeviceHandle::put` - + `signal` + `wait`, and `ncclGetPeerPointer` -> the `nvlBufferPeersDevice_` - offset table. -4. **TensorRef marshalling.** Extend [`ep_py.cpp`](../../../python/csrc/ext/ep/ep_py.cpp) - to accept DLPack / `torch.Tensor` for the dispatch/combine entry points. - The `TensorRef` type in [`buffer.hpp`](../../../include/mscclpp/ext/ep/buffer.hpp) - is intentionally Torch-free so the C++ core can be reused from - non-PyTorch callers. -5. **Tests.** Grow - [`test/python/ext/ep/`](../../../test/python/ext/ep/) by porting the - scenarios from `DeepEP/tests/test_{intranode,internode,low_latency}.py`. 
- -## API mapping cheatsheet - -| DeepEP / nccl_ep primitive | MSCCL++ replacement | -|----------------------------------------------------|-----------------------------------------------------------| -| `nvshmemi_ibgda_put_nbi_warp` | `PortChannelDeviceHandle::put` + `signal` | -| `nvshmem_signal_wait_until` | `PortChannelDeviceHandle::wait` | -| `ncclGinPutSignal` | `PortChannelDeviceHandle::put` + `signal` | -| `ncclGinWaitSignal` | `PortChannelDeviceHandle::wait` | -| `ncclGetPeerPointer` / `cudaIpcOpenMemHandle` | `Buffer::nvlBufferPeersDevice_[peer]` + byte offset | -| `ncclTeamLsa` locality check | `Buffer::numNvlRanks_` per-rdma-rank group | -| NVSHMEM symmetric heap allocation | `cudaMalloc` + proxy-registered memory (`ProxyService`) | -| NVSHMEM barrier | `bootstrap_->barrier()` or `intranode::barrier` kernel | - -## Status - -- Headers, CMake targets, Python bindings, and the frontend compile (build - verification has not been run in this session). -- All dispatch/combine entry points throw from C++ or raise `NotImplementedError` - from Python. Constructor, proxy-service startup, and buffer-size hints are - real; `sync()` only flips the `available_` flag and does **not** yet open - peer IPC handles or build MSCCL++ connections. +- [x] `test_intranode_multirank.py` — NVLink round-trip validated. +- [x] `test_internode_multirank.py` — HT round-trip validated on 2×H100×8. +- [ ] `test_low_latency.py` — port from `DeepEP/tests/test_low_latency.py` + and validate on real hardware. +- [ ] Throughput benchmarks against DeepEP upstream. diff --git a/src/ext/ep/buffer.cc b/src/ext/ep/buffer.cc index a299d582..99a081b1 100644 --- a/src/ext/ep/buffer.cc +++ b/src/ext/ep/buffer.cc @@ -161,8 +161,10 @@ pybind11::bytearray Buffer::get_local_ipc_handle() const { } pybind11::bytearray Buffer::get_local_nvshmem_unique_id() const { - // NVSHMEM support is not yet ported; see stub at bottom of this file. 
- throw std::runtime_error("mscclpp::ep::Buffer::get_local_nvshmem_unique_id: NVSHMEM support not yet ported"); + // The MSCCL++ EP port replaces NVSHMEM with PortChannel/MemoryChannel, + // so there is no NVSHMEM unique id to expose. Kept for API parity with + // DeepEP's Python frontend; callers should use the MSCCL++ bootstrap. + throw std::runtime_error("mscclpp::ep::Buffer::get_local_nvshmem_unique_id: not applicable (NVSHMEM is not used in mscclpp_ep)"); } torch::Tensor Buffer::get_local_buffer_tensor(const pybind11::object& dtype, int64_t offset, bool use_rdma_buffer) const { @@ -761,10 +763,12 @@ Buffer::intranode_combine(const torch::Tensor& x, const std::optional, std::optional, std::optional, std::vector, torch::Tensor, torch::Tensor, std::optional, torch::Tensor, std::optional, torch::Tensor, std::optional, std::optional, std::optional, std::optional>