Mirror of https://github.com/microsoft/mscclpp.git, synced 2026-05-11 17:00:22 +00:00
ep: document internode HT validation on 2x H100x8

Refresh status docs and comments now that internode HT dispatch and
combine have been validated end-to-end on 2 nodes x 8 H100 GPUs via
test/python/ext/ep/test_internode_multirank.py (all 16 ranks recover
their per-rank token payloads with zero diff).

- src/ext/ep/README.md: consolidate the previously duplicated README
  into a single document; mark intranode and internode HT dispatch and
  combine as validated in the status table; add a "Running the tests"
  section with torchrun examples for both the intranode and the 2x8
  internode setups; record the dispatch->combine
  torch.cuda.synchronize() + dist.barrier() requirement under Known
  limitations; mark Phase 2 DONE and keep Phase 3 (LL) as a structural
  port, untested.
- python/mscclpp/ext/ep/buffer.py: update the module docstring and the
  Buffer constructor docstring to say internode HT is validated and to
  clarify that LL mode is untested on multi-node hardware.
- src/ext/ep/buffer.cc: drop the stale "NVSHMEM support not yet ported"
  and "low-latency paths still stubbed" comments. mscclpp_ep does not
  use NVSHMEM at all (PortChannel/MemoryChannel replace it), and the LL
  paths are a structural port that is present but untested, not stubbed.
  Note validation on 2x H100x8 in the internode section header.
@@ -11,11 +11,14 @@ DeepEP users can port with minimal changes.
 
 Current status (see ``src/ext/ep/README.md``):
 
-* Intranode (NVLink-only) dispatch and combine are fully ported.
-* ``get_dispatch_layout`` is ported.
-* Internode HT (MSCCL++ PortChannel + MemoryChannel) is ported.
-* Internode low-latency kernels are ported structurally (NVSHMEM/IBGDA ->
-  MSCCL++ PortChannel) but **untested on multi-node H100**.
+* Intranode (NVLink-only) dispatch and combine: ported and validated on
+  one node with 8 GPUs.
+* ``get_dispatch_layout``: ported.
+* Internode HT (MSCCL++ PortChannel + MemoryChannel) dispatch and combine:
+  ported and validated on 2 nodes x 8 H100 GPUs with
+  ``test/python/ext/ep/test_internode_multirank.py``.
+* Internode low-latency kernels: structural port (NVSHMEM/IBGDA ->
+  MSCCL++ PortChannel), **untested on multi-node H100**.
 """
 
 from __future__ import annotations
|
||||
@@ -54,7 +57,7 @@ class Buffer:
             low-latency modes.
         low_latency_mode:
             Enable the low-latency dispatch/combine path (structural port,
-            untested).
+            untested on multi-node hardware).
         num_qps_per_rank:
             Ignored for intranode mode.
         """
|
||||
|
||||