mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
ep: document internode HT validation on 2x H100x8
Refresh status docs and comments now that internode HT dispatch and combine
have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover their
per-rank token payloads with zero diff).

- src/ext/ep/README.md: consolidate the previously duplicated README into a
  single document; mark intranode and internode HT dispatch and combine as
  validated in the status table; add a 'Running the tests' section with
  torchrun examples for both the intranode and the 2x8 internode setups;
  record the dispatch->combine torch.cuda.synchronize() + dist.barrier()
  requirement under Known limitations; mark Phase 2 DONE and keep Phase 3
  (LL) as a structural port, untested.
- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
  Buffer constructor docstring to say internode HT is validated, and
  clarify that LL mode is untested on multi-node hardware.
- src/ext/ep/buffer.cc: drop the stale 'NVSHMEM support not yet ported' and
  'low-latency paths still stubbed' comments. mscclpp_ep does not use
  NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL paths
  are a structural port that is present but untested, not stubbed. Note
  validation on 2x H100x8 in the internode section header.
@@ -11,11 +11,14 @@ DeepEP users can port with minimal changes.

 Current status (see ``src/ext/ep/README.md``):

-* Intranode (NVLink-only) dispatch and combine are fully ported.
-* ``get_dispatch_layout`` is ported.
-* Internode HT (MSCCL++ PortChannel + MemoryChannel) is ported.
-* Internode low-latency kernels are ported structurally (NVSHMEM/IBGDA ->
-  MSCCL++ PortChannel) but **untested on multi-node H100**.
+* Intranode (NVLink-only) dispatch and combine: ported and validated on
+  one node with 8 GPUs.
+* ``get_dispatch_layout``: ported.
+* Internode HT (MSCCL++ PortChannel + MemoryChannel) dispatch and combine:
+  ported and validated on 2 nodes x 8 H100 GPUs with
+  ``test/python/ext/ep/test_internode_multirank.py``.
+* Internode low-latency kernels: structural port (NVSHMEM/IBGDA ->
+  MSCCL++ PortChannel), **untested on multi-node H100**.
 """

 from __future__ import annotations
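The `get_dispatch_layout` step referenced in the status bullets can be illustrated with a small pure-Python model (a sketch only — the shipped implementation is a CUDA kernel, and the helper name `dispatch_layout` here is hypothetical): experts are partitioned evenly across ranks, and a token is routed to every rank that owns at least one of its top-k experts.

```python
# Illustrative pure-Python model of the dispatch-layout computation.
# NOT the real kernel: it only mirrors the semantics of counting, per
# token, which ranks the token must be sent to and per-expert totals.

def dispatch_layout(topk_idx, num_experts, num_ranks):
    experts_per_rank = num_experts // num_ranks
    num_tokens_per_expert = [0] * num_experts
    num_tokens_per_rank = [0] * num_ranks
    is_token_in_rank = []
    for experts in topk_idx:  # top-k expert ids for one token
        in_rank = [False] * num_ranks
        for e in experts:
            num_tokens_per_expert[e] += 1
            in_rank[e // experts_per_rank] = True  # rank owning expert e
        for r, hit in enumerate(in_rank):
            num_tokens_per_rank[r] += hit
        is_token_in_rank.append(in_rank)
    return num_tokens_per_rank, num_tokens_per_expert, is_token_in_rank

# 2 tokens, 4 experts over 2 ranks, top-2 routing
per_rank, per_expert, in_rank = dispatch_layout([[0, 3], [1, 0]], 4, 2)
print(per_rank)    # [2, 1]: both tokens touch rank 0, only token 0 touches rank 1
print(per_expert)  # [2, 1, 0, 1]
```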
@@ -54,7 +57,7 @@ class Buffer:
     low-latency modes.
 low_latency_mode:
     Enable the low-latency dispatch/combine path (structural port,
-    untested).
+    untested on multi-node hardware).
 num_qps_per_rank:
     Ignored for intranode mode.
 """
@@ -1,54 +1,61 @@
 # MSCCL++ Expert-Parallel (EP) extension

-A port of DeepEP's MoE `dispatch`/`combine` primitives into MSCCL++, targeting:
+A port of DeepEP's MoE `dispatch` / `combine` primitives into MSCCL++,
+targeting:

 - **High-Throughput (HT) mode** from [DeepEP](https://github.com/deepseek-ai/DeepEP),
   branch `chhwang/dev-atomic-add-cleanup` — which already swaps NVSHMEM for
-  `mscclpp::PortChannel`/`MemoryChannel`.
+  `mscclpp::PortChannel` / `mscclpp::MemoryChannel`.
 - **Low-Latency (LL) mode** from [`nccl/contrib/nccl_ep`](https://github.com/NVIDIA/nccl/tree/master/contrib/nccl_ep),
   which implements pure-RDMA dispatch/combine on top of the NCCL Device API.

 ## Status

-| Feature                            | Status                          |
-|------------------------------------|---------------------------------|
-| `Buffer` construction + IPC + sync | ✅ ported (NVLink + RDMA)       |
-| `get_dispatch_layout`              | ✅ ported                       |
-| `intranode_dispatch` (NVLink)      | ✅ ported                       |
-| `intranode_combine` (NVLink)       | ✅ ported                       |
-| `internode_dispatch` (NVLink+RDMA) | ✅ ported (pending H100 test)   |
-| `internode_combine` (NVLink+RDMA)  | ✅ ported (pending H100 test)   |
-| `low_latency_dispatch` (pure RDMA) | ⚠️ structural port, untested    |
-| `low_latency_combine` (pure RDMA)  | ⚠️ structural port, untested    |
-| `Connection::atomicAdd` API        | ✅ cherry-picked into mscclpp   |
-| Python frontend `mscclpp.ext.ep`   | ✅ wraps HT + LL paths          |
-| pybind11 module `mscclpp_ep_cpp`   | ✅ builds conditionally         |
+| Feature                            | Status                                       |
+|------------------------------------|----------------------------------------------|
+| `Buffer` construction + IPC + sync | ✅ ported (NVLink + RDMA)                    |
+| `get_dispatch_layout`              | ✅ ported                                    |
+| `intranode_dispatch` (NVLink)      | ✅ ported, validated (8 ranks, 1 node)       |
+| `intranode_combine` (NVLink)       | ✅ ported, validated (8 ranks, 1 node)       |
+| `internode_dispatch` (NVLink+RDMA) | ✅ ported, validated (16 ranks, 2×H100×8)    |
+| `internode_combine` (NVLink+RDMA)  | ✅ ported, validated (16 ranks, 2×H100×8)    |
+| `low_latency_dispatch` (pure RDMA) | ⚠️ structural port, untested                 |
+| `low_latency_combine` (pure RDMA)  | ⚠️ structural port, untested                 |
+| `Connection::atomicAdd` API        | ✅ cherry-picked into mscclpp                |
+| Python frontend `mscclpp.ext.ep`   | ✅ wraps HT + LL paths                       |
+| pybind11 module `mscclpp_ep_cpp`   | ✅ builds conditionally                      |

-Internode HT is code-complete but unverified on real hardware — the
-`sync()` path replaces DeepEP's NVSHMEM symmetric-heap allocation with
-`cudaMalloc` + `bootstrap->barrier()`, and the kernels use the new
-`PortChannelDeviceHandle::atomicAdd` instead of the old raw-trigger
-pattern.
+Internode HT was validated end-to-end on two H100×8 nodes connected over
+InfiniBand using [`test/python/ext/ep/test_internode_multirank.py`](../../../test/python/ext/ep/test_internode_multirank.py).
+All 16 ranks complete dispatch followed by combine with exact (zero-diff)
+recovery of the per-rank token payloads.

 The low-latency port is **structural**: the DeepEP LL kernels (pure
 IBGDA) have been mechanically translated to MSCCL++ port-channel ops.
 Semantic mapping:

-| DeepEP / IBGDA                            | MSCCL++ replacement                                            |
-|-------------------------------------------|----------------------------------------------------------------|
-| `nvshmemx_barrier_all_block()`            | signal+wait ring across `port_channel_handles[peer_rank]`      |
-| `nvshmemi_ibgda_put_nbi_warp(...)`        | lane-0 `port_channel_handles[qp*N+dst].put(dst_off, src_off, n)` |
-| `nvshmemi_ibgda_amo_nonfetch_add(...)`    | lane-0 `port_channel_handles[qp*N+dst].atomicAdd(off, int64)`  |
+| DeepEP / IBGDA                            | MSCCL++ replacement                                                |
+|-------------------------------------------|--------------------------------------------------------------------|
+| `nvshmemx_barrier_all_block()`            | signal + wait ring across `port_channel_handles[peer_rank]`        |
+| `nvshmemi_ibgda_put_nbi_warp(...)`        | lane-0 `port_channel_handles[qp*N+dst].put(dst_off, src_off, n)`   |
+| `nvshmemi_ibgda_amo_nonfetch_add(...)`    | lane-0 `port_channel_handles[qp*N+dst].atomicAdd(off, int64)`      |
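The `qp*N+dst` expression in the mapping above flattens a (queue-pair, destination-rank) pair into the port-channel array, QP-major. A trivial sketch of that indexing (the helper name `channel_index` is hypothetical, added here for illustration):

```python
# Sketch of the `qp*N+dst` flattening used to index port channels:
# channels are laid out QP-major, so the N destination ranks for QP 0
# occupy slots 0..N-1, the N ranks for QP 1 occupy slots N..2N-1, etc.

def channel_index(qp: int, dst_rank: int, num_ranks: int) -> int:
    return qp * num_ranks + dst_rank

N = 8  # ranks
assert channel_index(0, 7, N) == 7    # last slot of QP 0
assert channel_index(1, 0, N) == 8    # first slot of QP 1
assert channel_index(2, 3, N) == 19
```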

-**Known limitations**:
+### Known limitations

-* LL performance will NOT match IBGDA — the MSCCL++ port channel uses a
+- LL performance will NOT match IBGDA — the MSCCL++ port channel uses a
   CPU proxy. The port is for functional parity, not latency.
-* `Buffer::sync()` in `low_latency_mode=True` only connects peers sharing
+- `Buffer::sync()` in `low_latency_mode=True` only connects peers sharing
   the same local GPU ID (DeepEP convention). LL kernels therefore assume
   one-GPU-per-node topology, i.e. `num_ranks == num_rdma_ranks`. Running
   with >1 GPU per node in LL mode will fail to reach cross-GPU peers.
-* Multi-node H100 validation is still pending.
+- Multi-node H100 validation of LL mode is still pending.
+- The internode HT functional test inserts an explicit
+  `torch.cuda.synchronize()` + `dist.barrier()` between dispatch and
+  combine. Without it, fast ranks can launch combine while peers still
+  have in-flight dispatch proxy traffic, deadlocking the combine NVL
+  forwarder. Folding this barrier into
+  `Buffer::internode_dispatch` / `Buffer::internode_combine` (or
+  `cached_notify`) is tracked in the test's `XXX` comment.
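The dispatch→combine rendezvous described in that last limitation can be illustrated with a host-side analogy (a sketch using Python threads, not the EP code: `threading.Barrier` stands in for the `torch.cuda.synchronize()` + `dist.barrier()` pair, and "dispatch"/"combine" are just labeled phases):

```python
import threading

NUM_RANKS = 4
barrier = threading.Barrier(NUM_RANKS)  # plays the role of dist.barrier()
log = []
log_lock = threading.Lock()

def rank_main(rank: int) -> None:
    # Phase 1: "dispatch" — ranks finish at different times in real runs.
    with log_lock:
        log.append(("dispatch_done", rank))
    # Without this rendezvous, a fast rank could enter "combine" while a
    # peer still has dispatch work in flight — the deadlock the bullet
    # above describes for the combine NVL forwarder.
    barrier.wait()
    # Phase 2: "combine" — entered only once every rank finished phase 1.
    with log_lock:
        log.append(("combine_start", rank))

threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every dispatch_done precedes every combine_start.
first_combine = min(i for i, (ev, _) in enumerate(log) if ev == "combine_start")
assert all(ev == "dispatch_done" for ev, _ in log[:first_combine])
print("all dispatches complete before any combine starts")
```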

 ## Build

@@ -67,7 +74,7 @@ The Python frontend picks it up automatically:

 ```python
 from mscclpp.ext import ep
-buf = ep.Buffer(group, num_nvl_bytes=..., num_rdma_bytes=0)
+buf = ep.Buffer(group, num_nvl_bytes=..., num_rdma_bytes=...)
 ```

 ## Layout

@@ -95,7 +102,34 @@ python/mscclpp/ext/ep/
 └── buffer.py — torch.distributed-aware frontend

 test/python/ext/ep/
-└── test_ep_smoke.py — size-hint + rejection smoke test
+├── test_ep_smoke.py — size-hint + rejection smoke test
+├── test_intranode_multirank.py — NVLink dispatch+combine, 8 ranks
+└── test_internode_multirank.py — HT dispatch+combine, 16 ranks (2×8)
 ```
+
+## Running the tests
+
+Intranode (single node, 8 GPUs):
+
+```bash
+torchrun --nnodes=1 --nproc_per_node=8 \
+  test/python/ext/ep/test_intranode_multirank.py
+```
+
+Internode HT (2 nodes × 8 GPUs):
+
+```bash
+# node 0 (master)
+NCCL_SOCKET_IFNAME=eth0 MSCCLPP_SOCKET_IFNAME=eth0 GLOO_SOCKET_IFNAME=eth0 \
+torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
+  --master_addr=<master_ip> --master_port=29600 \
+  test/python/ext/ep/test_internode_multirank.py
+
+# node 1 (worker)
+NCCL_SOCKET_IFNAME=eth0 MSCCLPP_SOCKET_IFNAME=eth0 GLOO_SOCKET_IFNAME=eth0 \
+torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
+  --master_addr=<master_ip> --master_port=29600 \
+  test/python/ext/ep/test_internode_multirank.py
+```

 ## Migration plan

@@ -105,29 +139,26 @@ test/python/ext/ep/
 - [x] Copy DeepEP kernel headers (configs / buffer / utils / launch / exception).
 - [x] Port intranode kernels + runtime (NVLink only).
 - [x] Port `get_dispatch_layout` (host-safe subset of internode kernels).
-- [x] Port host Buffer: ctor, sync, get_dispatch_layout, intranode dispatch/combine.
+- [x] Port host Buffer: ctor, sync, get_dispatch_layout, intranode
+      dispatch/combine.
 - [x] pybind11 `mscclpp_ep_cpp` module + Python frontend.

-### Phase 2 — internode HT (NVLink + RDMA)
+### Phase 2 — internode HT (NVLink + RDMA) — DONE

-Port the rest of `DeepEP/csrc/kernels/internode.cu` (`notify_dispatch`,
-`dispatch`, `cached_notify`, `combine`). Because we are starting from the
-`chhwang/dev-atomic-add-cleanup` branch, the NVSHMEM -> MSCCL++ substitution
-is already done upstream — just copy the kernel bodies and wire them through
-`api.cuh`. The launchers need `PortChannelDeviceHandle*` /
-`MemoryChannelDeviceHandle*` arguments that `Buffer::sync()` already builds
-(see the `num_rdma_bytes > 0` branch — currently throws, but the code
-populating `port_channel_handles_device_ptr` and
-`memory_channel_handles_device_ptr` is ready). Finally replace the stubs in
-`buffer.cc` (`internode_dispatch`, `internode_combine`) with the real bodies
-from DeepEP.
+- [x] Port `notify_dispatch`, `dispatch`, `cached_notify`, `combine` kernels.
+- [x] Wire `Buffer::internode_dispatch` / `Buffer::internode_combine` host
+      orchestration.
+- [x] `Buffer::sync()` builds `port_channel_handles_device_ptr` and
+      `memory_channel_handles_device_ptr`, with port channels ordered by
+      peer rank (so kernel-side indexing by `peer_rank` is consistent).
+- [x] Validated on 2×H100×8 with `test_internode_multirank.py`.

-### Phase 3 — Low-Latency (pure RDMA)
+### Phase 3 — Low-Latency (pure RDMA) — STRUCTURAL PORT

 Port `DeepEP/csrc/kernels/internode_ll.cu` and cross-reference
 `nccl/contrib/nccl_ep/device/low_latency.cu`. The nccl_ep reference is
-modular (see `device_primitives.cuh`, `hybrid_ep.cuh`) and uses NCCL Device
-API; the translation table is:
+modular (see `device_primitives.cuh`, `hybrid_ep.cuh`) and uses NCCL
+Device API; the translation table is:

 | nccl_ep / DeepEP primitive               | MSCCL++ replacement                              |
 |-----------------------------------------|--------------------------------------------------|
@@ -140,107 +171,13 @@ API; the translation table is:
 | NVSHMEM symmetric heap                   | `cudaMalloc` + `ProxyService::addMemory`         |
 | NVSHMEM barrier                          | `bootstrap->barrier()` or `intranode::barrier`   |

-Finally fill in `buffer.cc::low_latency_dispatch` / `low_latency_combine`
-from the DeepEP bodies (already translated on the `chhwang/...` branch).
+Bodies of `Buffer::low_latency_dispatch` / `low_latency_combine` have been
+translated from DeepEP but are **untested on real hardware**.

 ### Phase 4 — Validation

-- Port `DeepEP/tests/test_{intranode,internode,low_latency}.py` into
-  `test/python/ext/ep/`.
-- Run on the same H100/H800 reference rig DeepEP uses; compare throughput.
-
-# MSCCL++ Expert-Parallel (EP) extension — migration plan
-
-This directory is a **scaffolding-only** port of the Mixture-of-Experts (MoE)
-`dispatch` / `combine` primitives from:
-
-- **High-Throughput (HT) mode** — [DeepEP](https://github.com/deepseek-ai/DeepEP),
-  branch `chhwang/dev-atomic-add-cleanup`. That branch has already replaced
-  NVSHMEM / IBGDA primitives with `mscclpp::PortChannel` and
-  `mscclpp::MemoryChannel`, so the port is largely mechanical.
-- **Low-Latency (LL) mode** — [`nccl/contrib/nccl_ep`](https://github.com/NVIDIA/nccl),
-  which implements a pure-RDMA dispatch/combine on top of the NCCL Device API
-  (GIN put/signal + LSA load/store). The kernels need to be re-expressed in
-  terms of MSCCL++ device handles.
-
-## Layout
-
-| Path | Purpose |
-|------|---------|
-| [`include/mscclpp/ext/ep/config.hpp`](../../../include/mscclpp/ext/ep/config.hpp) | Public host-side config + size hints (`EpConfig`). |
-| [`include/mscclpp/ext/ep/event.hpp`](../../../include/mscclpp/ext/ep/event.hpp) | RAII wrapper around `cudaEvent_t`. |
-| [`include/mscclpp/ext/ep/buffer.hpp`](../../../include/mscclpp/ext/ep/buffer.hpp) | Public `Buffer` class; dispatch/combine entry points. |
-| [`include/mscclpp/ext/ep/api.hpp`](../../../include/mscclpp/ext/ep/api.hpp) | Umbrella include. |
-| [`src/ext/ep/buffer.cc`](buffer.cc) | Host-side orchestration. Constructor + proxy service wired up; `sync()` / kernel stubs `TODO`. |
-| [`src/ext/ep/config.cc`](config.cc) | `EpConfig` method bodies. |
-| [`src/ext/ep/event.cc`](event.cc) | `EventHandle` implementation. |
-| [`src/ext/ep/intranode.cu`](intranode.cu) | **STUB** — HT NVLink-only dispatch/combine. |
-| [`src/ext/ep/internode.cu`](internode.cu) | **STUB** — HT NVLink+RDMA dispatch/combine. |
-| [`src/ext/ep/internode_ll.cu`](internode_ll.cu) | **STUB** — LL pure-RDMA dispatch/combine. |
-| [`src/ext/ep/kernels/api.cuh`](kernels/api.cuh) | Private kernel-facing API (prototypes only for now). |
-| [`src/ext/ep/kernels/exception.cuh`](kernels/exception.cuh) | `EP_HOST_ASSERT` / `EP_DEVICE_ASSERT` / `EP_CUDA_CHECK`. |
-| [`python/csrc/ext/ep/ep_py.cpp`](../../../python/csrc/ext/ep/ep_py.cpp) | nanobind bindings (submodule `mscclpp._mscclpp.ep`). |
-| [`python/mscclpp/ext/ep/`](../../../python/mscclpp/ext/ep/) | Python frontend (`ep.Buffer`). |
-| [`test/python/ext/ep/test_ep_skeleton.py`](../../../test/python/ext/ep/test_ep_skeleton.py) | Unit test placeholder. |
-
-## Build
-
-The extension is **off by default**. Enable it with:
-
-```bash
-cmake -S . -B build -DMSCCLPP_BUILD_EXT_EP=ON
-cmake --build build -j
-```
-
-This produces `libmscclpp_ep.so` and, when Python bindings are built, exposes
-`mscclpp._mscclpp.ep` and `mscclpp.ext.ep`.
-
-## Migration plan (in order)
-
-1. **HT intranode.** Port `DeepEP/csrc/kernels/intranode.cu` into
-   [`intranode.cu`](intranode.cu). All communication is via peer IPC pointers,
-   so only `include` paths and `torch::Tensor` -> `TensorRef` marshalling need
-   to change. Flesh out `Buffer::sync()` so that
-   `nvlBufferPeers_[peer] = cudaIpcOpenMemHandle(...)` is populated and the
-   table is uploaded to `nvlBufferPeersDevice_`.
-2. **HT internode.** Port `DeepEP/csrc/kernels/internode.cu` into
-   [`internode.cu`](internode.cu). Most of the heavy lifting (NVSHMEM ->
-   MSCCL++) is already done on the DeepEP `chhwang/dev-atomic-add-cleanup`
-   branch; copy the kernel bodies as-is and add the launchers. Ensure the
-   custom trigger type `0x0` atomicAdd path in `EpProxyService` (see
-   [`buffer.cc`](buffer.cc)) is in place.
-3. **LL mode.** Port from `nccl/contrib/nccl_ep/device/low_latency.cu` (or
-   DeepEP `internode_ll.cu`) into [`internode_ll.cu`](internode_ll.cu). The
-   translation table lives in the file header; the critical substitution is
-   `ncclGinPutSignal` / `nvshmemi_ibgda_*` -> `PortChannelDeviceHandle::put`
-   + `signal` + `wait`, and `ncclGetPeerPointer` -> the `nvlBufferPeersDevice_`
-   offset table.
-4. **TensorRef marshalling.** Extend [`ep_py.cpp`](../../../python/csrc/ext/ep/ep_py.cpp)
-   to accept DLPack / `torch.Tensor` for the dispatch/combine entry points.
-   The `TensorRef` type in [`buffer.hpp`](../../../include/mscclpp/ext/ep/buffer.hpp)
-   is intentionally Torch-free so the C++ core can be reused from
-   non-PyTorch callers.
-5. **Tests.** Grow
-   [`test/python/ext/ep/`](../../../test/python/ext/ep/) by porting the
-   scenarios from `DeepEP/tests/test_{intranode,internode,low_latency}.py`.
-
-## API mapping cheatsheet
-
-| DeepEP / nccl_ep primitive | MSCCL++ replacement |
-|----------------------------------------------------|-----------------------------------------------------------|
-| `nvshmemi_ibgda_put_nbi_warp` | `PortChannelDeviceHandle::put` + `signal` |
-| `nvshmem_signal_wait_until` | `PortChannelDeviceHandle::wait` |
-| `ncclGinPutSignal` | `PortChannelDeviceHandle::put` + `signal` |
-| `ncclGinWaitSignal` | `PortChannelDeviceHandle::wait` |
-| `ncclGetPeerPointer` / `cudaIpcOpenMemHandle` | `Buffer::nvlBufferPeersDevice_[peer]` + byte offset |
-| `ncclTeamLsa` locality check | `Buffer::numNvlRanks_` per-rdma-rank group |
-| NVSHMEM symmetric heap allocation | `cudaMalloc` + proxy-registered memory (`ProxyService`) |
-| NVSHMEM barrier | `bootstrap_->barrier()` or `intranode::barrier` kernel |
-
-## Status
-
-- Headers, CMake targets, Python bindings, and the frontend compile (build
-  verification has not been run in this session).
-- All dispatch/combine entry points throw from C++ or raise `NotImplementedError`
-  from Python. Constructor, proxy-service startup, and buffer-size hints are
-  real; `sync()` only flips the `available_` flag and does **not** yet open
-  peer IPC handles or build MSCCL++ connections.
+- [x] `test_intranode_multirank.py` — NVLink round-trip validated.
+- [x] `test_internode_multirank.py` — HT round-trip validated on 2×H100×8.
+- [ ] `test_low_latency.py` — port from `DeepEP/tests/test_low_latency.py`
+      and validate on real hardware.
+- [ ] Throughput benchmarks against DeepEP upstream.

@@ -161,8 +161,10 @@ pybind11::bytearray Buffer::get_local_ipc_handle() const {
 }

 pybind11::bytearray Buffer::get_local_nvshmem_unique_id() const {
-  // NVSHMEM support is not yet ported; see stub at bottom of this file.
-  throw std::runtime_error("mscclpp::ep::Buffer::get_local_nvshmem_unique_id: NVSHMEM support not yet ported");
+  // The MSCCL++ EP port replaces NVSHMEM with PortChannel/MemoryChannel,
+  // so there is no NVSHMEM unique id to expose. Kept for ABI parity with
+  // DeepEP's Python frontend; callers should use the MSCCL++ bootstrap.
+  throw std::runtime_error("mscclpp::ep::Buffer::get_local_nvshmem_unique_id: not applicable (NVSHMEM is not used in mscclpp_ep)");
 }

 torch::Tensor Buffer::get_local_buffer_tensor(const pybind11::object& dtype, int64_t offset, bool use_rdma_buffer) const {

@@ -761,10 +763,12 @@ Buffer::intranode_combine(const torch::Tensor& x, const std::optional<torch::Ten
 }

 // -----------------------------------------------------------------------------
-// Internode (NVLink + RDMA) high-throughput path. Ported verbatim from
-// DeepEP `csrc/deep_ep.cpp`; the kernels it drives are in
-// `src/ext/ep/kernels/internode.cu`. Low-latency (pure RDMA) paths below
-// are still stubbed.
+// Internode (NVLink + RDMA) high-throughput path. Ported from DeepEP
+// `csrc/deep_ep.cpp`; the kernels it drives are in
+// `src/ext/ep/kernels/internode.cu`. Validated end-to-end on 2 x H100 x 8
+// via `test/python/ext/ep/test_internode_multirank.py`. The low-latency
+// (pure RDMA) paths further below are a structural port and have not
+// been validated on real hardware.
 // -----------------------------------------------------------------------------

 std::tuple<torch::Tensor, std::optional<torch::Tensor>, std::optional<torch::Tensor>, std::optional<torch::Tensor>, std::vector<int>, torch::Tensor, torch::Tensor, std::optional<torch::Tensor>, torch::Tensor, std::optional<torch::Tensor>, torch::Tensor, std::optional<torch::Tensor>, std::optional<torch::Tensor>, std::optional<torch::Tensor>, std::optional<EventHandle>>