mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
Port DeepEP's pure-RDMA low-latency (LL) MoE kernels from
csrc/kernels/internode_ll.cu (branch chhwang/dev-atomic-add-cleanup)
into the MSCCL++ EP extension. NVSHMEM / IBGDA device primitives are
replaced with MSCCL++ PortChannelDeviceHandle operations:
nvshmemx_barrier_all_block() -> port-channel signal+wait ring
nvshmemi_ibgda_put_nbi_warp(...) -> lane-0 PortChannel.put(...)
nvshmemi_ibgda_amo_nonfetch_add(...) -> lane-0 PortChannel.atomicAdd(...)
The atomicAdd path relies on the MSCCL++ Connection::atomicAdd /
PortChannelDeviceHandle::atomicAdd API cherry-picked from branch
chhwang/new-atomic-add; the LL dispatch path uses a signed delta
(-num_tokens_sent - 1) which the new int64_t signature supports.
Changes:
* New file src/ext/ep/kernels/internode_ll.cu (~530 lines) with the
three kernels clean_low_latency_buffer, dispatch<kUseFP8,...>,
combine<...> plus their launchers. rdma_buffer_ptr is threaded
through the launchers so the kernel can translate virtual addresses
into registered-memory offsets expected by MSCCL++.
* kernels/api.cuh: replace the single stub signature with full LL
launcher prototypes.
* buffer.cc: replace the four LL throw-stubs
(clean_low_latency_buffer, low_latency_dispatch,
low_latency_combine, get_next_low_latency_combine_buffer) with
torch-Tensor implementations ported from DeepEP/csrc/deep_ep.cpp.
* Drop src/ext/ep/internode_stub.cc and its CMake entry.
* python/mscclpp/ext/ep/buffer.py: remove the low_latency_mode=True
NotImplementedError guard; update docstring.
* test/python/ext/ep/test_ep_smoke.py: rename
test_low_latency_rejected -> test_low_latency_buffer_construct
to reflect that LL construction is now accepted.
* src/ext/ep/README.md: update status matrix, document the
NVSHMEM -> MSCCL++ translation table, and list the known
limitations.
This is a structural port: the kernels compile, link, and pass the
single-rank smoke tests, but end-to-end behaviour on multi-node H100
is not yet validated. Two known caveats:
1. Performance will NOT match IBGDA because MSCCL++ port channels
use a CPU proxy; this port is for functional parity, not latency.
2. Buffer::sync() in LL mode only connects peers that share the
same local GPU id (DeepEP convention), so the LL kernels assume
a one-GPU-per-node topology (num_ranks == num_rdma_ranks).
Multi-GPU-per-node LL layouts will need a follow-up in sync().
Tested:
cmake --build build -j --target mscclpp_ep_cpp # builds clean
pytest test/python/ext/ep/test_ep_smoke.py # 3 passed