mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
Port DeepEP's high-throughput MoE dispatch/combine kernels onto MSCCL++
as an optional build target `mscclpp_ep_cpp`, gated by -DMSCCLPP_BUILD_EXT_EP
(OFF by default). Sources are lifted from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and rebased onto upstream MSCCL++ APIs;
the NVSHMEM / IBGDA dependencies are replaced with `PortChannel` +
`MemoryChannel` + the new `Connection::atomicAdd` primitive.
Scope
-----
Intranode (NVLink-only):
* `Buffer` ctor/dtor: cudaMalloc nvl workspace, export IPC handle,
allocate FIFO + peer-pointer tables, start `ProxyService`.
* `sync()`: import peer IPC handles, upload peer pointer table,
build `MemoryDevice2DeviceSemaphore` + `MemoryChannel` per peer.
* `get_dispatch_layout`, `intranode_dispatch`, `intranode_combine`
ported verbatim (torch::Tensor ABI preserved).
Internode HT (NVLink + RDMA):
* `sync()` RDMA branch: cudaMalloc RDMA buffer + `bootstrap->barrier()`
(replacing NVSHMEM symmetric-heap allocation); register with
`all_transport`, exchange via `sendMemory`/`recvMemory`, build 12 IB
QPs/peer + 16 semaphores/peer + 16 port channels/peer.
* Full `internode.cu` port (notify_dispatch / dispatch / cached_notify
/ combine / get_dispatch_layout). The 4 raw `ChannelTrigger` atomic
sites are rewritten to call the new
`PortChannelDeviceHandle::atomicAdd(offset, value)` API; the single
`nvshmem_fence()` is replaced with `__threadfence_system()` (remote
visibility guaranteed by the subsequent port-channel barrier).
* `internode_dispatch` / `internode_combine` host code ported, with
the torch tensor marshalling and CPU spin-wait on mapped counters.
Low-latency (pure RDMA):
* Not ported. `low_latency_dispatch`, `low_latency_combine`,
`clean_low_latency_buffer`, `get_next_low_latency_combine_buffer`
throw `std::runtime_error`; the Python frontend refuses to
construct a Buffer with `low_latency_mode=True`.
Python layer
------------
* New pybind11 + libtorch Python extension `mscclpp_ep_cpp` (separate
from the nanobind `_mscclpp` because the EP ABI carries
`torch::Tensor` / `at::cuda::CUDAStream`).
* `mscclpp.ext.ep.Buffer` mirrors `deep_ep.Buffer`; exchanges device
IDs, IPC handles and the bootstrap UniqueId over the user's
`torch.distributed` process group before calling `sync()`.
* `mscclpp.ext` auto-imports `ep` if the extension is built.
Build
-----
* `src/ext/ep/CMakeLists.txt`: finds Python + Torch; warns and skips if
`CMAKE_PREFIX_PATH` doesn't point at `torch.utils.cmake_prefix_path`.
Falls back to Torch's bundled pybind11 if a standalone pybind11 is not
installed. Links `libtorch_python` explicitly (without it, `import
mscclpp_ep_cpp` fails with `undefined symbol: THPDtypeType`).
* Top-level `CMakeLists.txt` exposes the `MSCCLPP_BUILD_EXT_EP` option
(default OFF).
Tests
-----
* `test/python/ext/ep/test_ep_smoke.py`: skipped if the extension isn't
built. Covers Config round-trip, low-latency size hint, and the LL
construction guard. Multi-rank functional tests still to do on H100.
Notes
-----
* Builds against the preceding "atomic add" commit which adds
`Connection::atomicAdd` and `PortChannelDeviceHandle::atomicAdd` to
upstream MSCCL++.
* Intranode path verified end-to-end (build + import + smoke tests).
* Internode HT is code-complete but requires real IB hardware to
validate; see `src/ext/ep/README.md` for the detailed port plan and
remaining LL migration.
52 lines
1.9 KiB
Python
52 lines
1.9 KiB
Python
# Copyright (c) Microsoft Corporation.
|
|
# Licensed under the MIT License.
|
|
"""Smoke tests for the EP extension.
|
|
|
|
These tests only exercise single-rank / pure-Python code paths so they can
|
|
run in CI without multi-GPU resources. Multi-rank dispatch/combine tests
|
|
belong in ``test/python/ext/ep/test_intranode.py`` and are left as TODO
|
|
until the Python frontend is validated on H100.
|
|
|
|
Run with::
|
|
|
|
pytest -xvs test/python/ext/ep/test_ep_smoke.py
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
import pytest
|
|
|
|
try:
|
|
import mscclpp_ep_cpp as _cpp # type: ignore[import-not-found]
|
|
except ImportError: # pragma: no cover
|
|
pytest.skip("mscclpp_ep_cpp is not built (set -DMSCCLPP_BUILD_EXT_EP=ON)", allow_module_level=True)
|
|
|
|
|
|
def test_config_roundtrip():
|
|
cfg = _cpp.Config(num_sms=20, num_max_nvl_chunked_send_tokens=6, num_max_nvl_chunked_recv_tokens=256,
|
|
num_max_rdma_chunked_send_tokens=6, num_max_rdma_chunked_recv_tokens=256)
|
|
hint = cfg.get_nvl_buffer_size_hint(7168 * 2, 8)
|
|
assert hint > 0
|
|
|
|
|
|
def test_low_latency_size_hint():
|
|
assert _cpp.get_low_latency_rdma_size_hint(128, 7168, 8, 256) > 0
|
|
|
|
|
|
def test_low_latency_rejected():
|
|
# Low-latency (pure RDMA) path is not ported yet; Python frontend must
|
|
# refuse to construct a Buffer with low_latency_mode=True. We test the
|
|
# underlying C++ constructor directly so this does not depend on the
|
|
# full `mscclpp` Python package being installed.
|
|
import torch
|
|
|
|
if not torch.cuda.is_available():
|
|
pytest.skip("CUDA not available")
|
|
|
|
# The C++ Buffer allows low_latency_mode at construction; the enforcement
|
|
# lives in the Python frontend (`mscclpp.ext.ep.buffer.Buffer.__init__`).
|
|
# Verify the C++ side does NOT reject it, so the guarantee sits at the
|
|
# Python layer where it belongs.
|
|
buf = _cpp.Buffer(rank=0, num_ranks=1, num_nvl_bytes=0, num_rdma_bytes=0, low_latency_mode=True)
|
|
assert not buf.is_available()
|