From f2feb120b8e5249d3a7b285bb45aafd846adcd30 Mon Sep 17 00:00:00 2001
From: Qinghua Zhou
Date: Wed, 29 Apr 2026 18:26:36 +0000
Subject: [PATCH] ext/ep: refresh README to reflect current LL, proxy sharding, and bench harness

---
 src/ext/ep/README.md | 143 ++++++++++++++++++++++++++++++-------------
 1 file changed, 99 insertions(+), 44 deletions(-)

diff --git a/src/ext/ep/README.md b/src/ext/ep/README.md
index 1458f0fe..b5184ebd 100644
--- a/src/ext/ep/README.md
+++ b/src/ext/ep/README.md
@@ -13,53 +13,61 @@ targeting:
 
 | Feature                             | Status                                       |
 |------------------------------------|---------------------------------------------|
-| `Buffer` construction + IPC + sync | ✅ ported (NVLink + RDMA)                    |
-| `get_dispatch_layout`              | ✅ ported                                    |
-| `intranode_dispatch` (NVLink)      | ✅ ported, validated (8 ranks, 1 node)       |
-| `intranode_combine` (NVLink)       | ✅ ported, validated (8 ranks, 1 node)       |
-| `internode_dispatch` (NVLink+RDMA) | ✅ ported, validated (16 ranks, 2×H100×8)    |
-| `internode_combine` (NVLink+RDMA)  | ✅ ported, validated (16 ranks, 2×H100×8)    |
-| `low_latency_dispatch` (pure RDMA) | ⚠️ structural port, untested                 |
-| `low_latency_combine` (pure RDMA)  | ⚠️ structural port, untested                 |
-| `Connection::atomicAdd` API        | ✅ cherry-picked into mscclpp                |
-| Python frontend `mscclpp.ext.ep`   | ✅ wraps HT + LL paths                       |
-| pybind11 module `mscclpp_ep_cpp`   | ✅ builds conditionally                      |
+| `Buffer` construction + IPC + sync | ✅ ported (NVLink + RDMA)                             |
+| `get_dispatch_layout`              | ✅ ported                                             |
+| `intranode_dispatch` (NVLink)      | ✅ validated (8 ranks, 1 node)                        |
+| `intranode_combine` (NVLink)       | ✅ validated (8 ranks, 1 node)                        |
+| `internode_dispatch` (NVLink+RDMA) | ✅ validated (16 ranks, 2×H100×8)                     |
+| `internode_combine` (NVLink+RDMA)  | ✅ validated (16 ranks, 2×H100×8)                     |
+| `low_latency_dispatch` (RDMA+IPC)  | ✅ validated (8 ranks intra-node; 16 ranks, 2×H100×8) |
+| `low_latency_combine` (RDMA+IPC)   | ✅ validated (8 ranks intra-node; 16 ranks, 2×H100×8) |
+| Multi-`ProxyService` sharding      | ✅ env-tunable, arch-aware default                    |
+| `Connection::atomicAdd` API        | ✅ cherry-picked into mscclpp                         |
+| Python frontend `mscclpp.ext.ep`   | ✅ wraps HT + LL paths                                |
+| pybind11 module `mscclpp_ep_cpp`   | ✅ builds conditionally                               |
 
 Internode HT was validated end-to-end on two H100×8 nodes connected
 over Infiniband using
 [`test/python/ext/ep/test_internode_multirank.py`](../../../test/python/ext/ep/test_internode_multirank.py).
 All 16 ranks complete dispatch followed by combine with exact
 (zero-diff) recovery of the per-rank token payloads.
 
-The low-latency port is **structural**: the DeepEP LL kernels (pure
-IBGDA) have been mechanically translated to MSCCL++ port-channel ops.
-Semantic mapping:
+The low-latency (LL) path uses a mixed transport: `MemoryChannel` (CUDA
+IPC) for same-node peers and `PortChannel` (CPU proxy + IB verbs) for
+remote peers.
+The DeepEP LL kernels were translated as follows:
 
 | DeepEP / IBGDA                                   | MSCCL++ replacement                                               |
 |------------------------------------------|------------------------------------------------------------------|
-| `nvshmemx_barrier_all_block()`           | signal + wait ring across `port_channel_handles[peer_rank]`       |
-| `nvshmemi_ibgda_put_nbi_warp(...)`       | lane-0 `port_channel_handles[qp*N+dst].put(dst_off, src_off, n)`  |
-| `nvshmemi_ibgda_amo_nonfetch_add(...)`   | lane-0 `port_channel_handles[qp*N+dst].atomicAdd(off, int64)`     |
+| `nvshmemx_barrier_all_block()`                   | signal + wait ring across per-peer channel handles                |
+| `nvshmemi_ibgda_put_nbi_warp(...)` (intra-node)  | `MemoryChannelDeviceHandle::put` (CUDA IPC, no proxy)             |
+| `nvshmemi_ibgda_put_nbi_warp(...)` (inter-node)  | lane-0 `PortChannelDeviceHandle::put(dst_off, src_off, n)`        |
+| `nvshmemi_ibgda_amo_nonfetch_add(...)`           | lane-0 `atomicAdd` on the corresponding channel handle            |
+
+LL was validated on:
+- 8 ranks × 1 H100 node (NVLink + CUDA-IPC fast path).
+- 16 ranks × 2 H100×8 nodes (mixed CUDA-IPC intra-node + IB inter-node).
+
+### `num_proxy_services` / proxy sharding
+
+A single `mscclpp::ProxyService` is one CPU host thread driving one
+FIFO. With 8 GPUs per node sharing one proxy, the host thread becomes
+the bottleneck for cross-node LL traffic. `Buffer` therefore allocates
+`N` ProxyServices and shards `PortChannel`s across them by `(qp_idx,
+dst_rank)`.
+
+- Default: `8` on Hopper (sm_90), `1` on Blackwell / sm_100+ (NVSwitch).
+- Override at runtime: `MSCCLPP_EP_NUM_PROXIES=<N>` (clamped to ≥1); see
+  the launch example below. Rank 0 prints the resolved value at
+  construction.
+- Sweet spot on 2×H100×8 is `N=8`; `N=12` over-subscribes the host CPUs
+  and collapses throughput.
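+
+For example, the override can be applied on the launch command of a
+cross-node LL run (illustrative invocation; `<master-ip>` is a
+placeholder, and the same value must be set on every node; see
+"Running the tests" below for the full launch recipes):
+
+```bash
+# Node 0 of a 2×8 job; repeat on node 1 with --node_rank=1.
+export MSCCLPP_EP_NUM_PROXIES=4   # overrides the arch-aware default (8 on Hopper)
+torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
+  --master_addr=<master-ip> --master_port=29500 \
+  test/python/ext/ep/test_low_latency_multirank.py
+```
+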
 ### Known limitations
 
-- LL performance will NOT match IBGDA — the MSCCL++ port channel uses a
-  CPU proxy. The port is for functional parity, not latency.
-- Unlike DeepEP, this port drives LL dispatch/combine through
-  `PortChannel` rather than NVSHMEM, so `Buffer::sync()` connects every
-  peer (not just same-GPU-ID peers) even in `low_latency_mode=True`.
-- **LL dispatch/combine hangs for intra-node 8-GPU (single host)
-  configurations** with the current `PortChannel`-over-IB setup: with
-  `num_nvl_bytes=0` every peer-to-peer transfer goes through the CPU
-  proxy's IB verbs path, and IB loopback between two distinct HCAs on
-  the same host does not deliver atomics reliably. Using `CudaIpc` for
-  same-node peers instead surfaces a 64-bit `atomicAdd` vs. 32-bit
-  counter alignment mismatch in `CudaIpcConnection::atomicAdd` which
-  corrupts adjacent counter slots. A proper fix requires either (a) a
-  mixed-transport LL variant that uses `MemoryChannel` (IPC, no proxy)
-  for same-node peers like HT does, or (b) widening `rdma_recv_count`
-  slots to 64 bits. See
-  [`test/python/ext/ep/test_low_latency_multirank.py`](../../../test/python/ext/ep/test_low_latency_multirank.py).
-- H100 cross-node validation of LL mode (1 GPU per node, DeepEP's
-  recommended topology) is still pending.
+- LL performance will NOT match IBGDA for cross-node traffic — remote
+  peers go through a CPU proxy. The port is for functional parity, not
+  latency. (Intra-node LL traffic uses CUDA IPC and is competitive.)
+- Unlike DeepEP, this port drives LL through `PortChannel` /
+  `MemoryChannel` rather than NVSHMEM, so `Buffer::sync()` connects
+  every peer even in `low_latency_mode=True`.
 - The internode HT functional test inserts an explicit
   `torch.cuda.synchronize()` + `dist.barrier()` between dispatch and
   combine. Without it, fast ranks can launch combine while peers still
@@ -114,21 +122,28 @@ python/mscclpp/ext/ep/
 test/python/ext/ep/
 ├── test_ep_smoke.py              — size-hint + rejection smoke test
-├── test_intranode_multirank.py   — NVLink dispatch+combine, 8 ranks
+├── test_intranode_multirank.py   — NVLink HT dispatch+combine, 8 ranks
 ├── test_internode_multirank.py   — HT dispatch+combine, 16 ranks (2×8)
-└── test_low_latency_multirank.py — LL dispatch+combine (intra-node hang; see limitations)
+└── test_low_latency_multirank.py — LL dispatch+combine (intra-node + cross-node)
 ```
 
 ## Running the tests
 
-Intranode (single node, 8 GPUs):
+Intranode (single node, 8 GPUs) — HT:
 
 ```bash
 torchrun --nnodes=1 --nproc_per_node=8 \
   test/python/ext/ep/test_intranode_multirank.py
 ```
 
-Internode HT (2 nodes × 8 GPUs):
+Intranode LL (single node, 8 GPUs):
+
+```bash
+torchrun --nnodes=1 --nproc_per_node=8 \
+  test/python/ext/ep/test_low_latency_multirank.py
+```
+
+Internode HT (2 nodes × 8 GPUs), torchrun:
 
 ```bash
 # node 0 (master)
@@ -144,6 +159,43 @@ torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
   test/python/ext/ep/test_internode_multirank.py
 ```
 
+Internode HT/LL via mpirun (matches the NCCL-EP launch convention with
+NUMA binding and an explicit topology file):
+
+```bash
+mpirun -np 16 --allow-run-as-root --hostfile <hostfile> \
+  --mca pml ob1 --mca btl tcp,vader,self --mca btl_tcp_if_include eth0 \
+  --bind-to numa \
+  -x NCCL_SOCKET_IFNAME=eth0 -x MSCCLPP_SOCKET_IFNAME=eth0 -x GLOO_SOCKET_IFNAME=eth0 \
+  -x NCCL_IB_DISABLE=0 -x NCCL_TOPO_FILE=<topo.xml> \
+  -x MASTER_ADDR=<master-ip> -x MASTER_PORT=29600 \
+  bash -c 'export RANK=$OMPI_COMM_WORLD_RANK \
+                  WORLD_SIZE=$OMPI_COMM_WORLD_SIZE \
+                  LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK; \
+           exec python3 test/python/ext/ep/test_internode_multirank.py'
+```
+
+### Benchmark mode
+
+All three multirank tests double as micro-benchmarks when
+`MSCCLPP_EP_BENCH=1` is set. Dispatch and combine are timed separately
+with CUDA events; per-rank times are reduced across ranks and reported
+as `min` / `avg` / `max`, with bandwidth computed from the average time
+(matching NCCL-EP's `ep_bench` convention).
+
+Env knobs:
+
+| Variable                   | Meaning                                | Default                    |
+|----------------------------|----------------------------------------|----------------------------|
+| `MSCCLPP_EP_BENCH`         | Enable benchmark pass                  | `0`                        |
+| `MSCCLPP_EP_BENCH_WARMUP`  | Warmup iterations                      | `10`                       |
+| `MSCCLPP_EP_BENCH_ITERS`   | Timed iterations                       | `50`                       |
+| `MSCCLPP_EP_BENCH_TOKENS`  | Tokens per rank                        | test-specific              |
+| `MSCCLPP_EP_BENCH_HIDDEN`  | Hidden dimension                       | `7168`                     |
+| `MSCCLPP_EP_BENCH_EXPERTS` | Total experts                          | test-specific              |
+| `MSCCLPP_EP_BENCH_TOPK`    | Top-k routing width                    | `8`                        |
+| `MSCCLPP_EP_NUM_PROXIES`   | Number of `ProxyService`s in `Buffer`  | 8 (Hopper) / 1 (Blackwell) |
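+
+For example (illustrative values), a single-node LL benchmark pass over
+the test from "Running the tests" above:
+
+```bash
+# 20 warmups, 100 timed iterations, 4096 tokens per rank; torchrun
+# propagates the environment to all 8 worker ranks.
+MSCCLPP_EP_BENCH=1 MSCCLPP_EP_BENCH_WARMUP=20 \
+MSCCLPP_EP_BENCH_ITERS=100 MSCCLPP_EP_BENCH_TOKENS=4096 \
+torchrun --nnodes=1 --nproc_per_node=8 \
+  test/python/ext/ep/test_low_latency_multirank.py
+```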
+
 ## Migration plan
 
 ### Phase 1 — DONE
@@ -165,7 +217,7 @@ torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
   peer rank (so kernel-side indexing by `peer_rank` is consistent).
 - [x] Validated on 2×H100×8 with `test_internode_multirank.py`.
 
-### Phase 3 — Low-Latency (pure RDMA) — STRUCTURAL PORT
+### Phase 3 — Low-Latency (RDMA + CUDA-IPC) — DONE
 
 Port `DeepEP/csrc/kernels/internode_ll.cu` and cross-reference
 `nccl/contrib/nccl_ep/device/low_latency.cu`. The nccl_ep reference is
@@ -183,12 +235,15 @@ Device API; the translation table is:
 | NVSHMEM symmetric heap | `cudaMalloc` + `ProxyService::addMemory`       |
 | NVSHMEM barrier        | `bootstrap->barrier()` or `intranode::barrier` |
 
-Bodies of `Buffer::low_latency_dispatch` / `low_latency_combine` have been
-translated from DeepEP but are **untested on real hardware**.
+`Buffer::low_latency_dispatch` / `low_latency_combine` are validated
+intra-node (8 ranks, CUDA IPC fast path) and cross-node (16 ranks on
+2×H100×8, IPC + IB). Results are bit-exact against the reference
+dispatch/combine.
 
 ### Phase 4 — Validation
 
-- [x] `test_intranode_multirank.py` — NVLink round-trip validated.
+- [x] `test_intranode_multirank.py` — NVLink HT round-trip validated.
 - [x] `test_internode_multirank.py` — HT round-trip validated on 2×H100×8.
-- [ ] `test_low_latency_multirank.py` — LL round-trip port in place; intra-node 8-GPU hangs (see Known limitations), cross-node (1 GPU / node) pending hardware validation.
+- [x] `test_low_latency_multirank.py` — LL round-trip validated intra-node (8 ranks) and cross-node (2×H100×8).
+- [x] In-tree micro-benchmark harness (`MSCCLPP_EP_BENCH=1`) reporting min/avg/max + BW@avg, aligned with NCCL-EP `ep_bench`.
 - [ ] Throughput benchmarks against DeepEP upstream.
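+
+One way to drive the remaining item is the in-tree harness: run the
+bench pass on both nodes and compare the reported BW@avg against
+DeepEP's own numbers (illustrative launch; `<master-ip>` and the
+iteration count are placeholders):
+
+```bash
+# Node 0 shown; repeat on node 1 with --node_rank=1, same env on both nodes.
+MSCCLPP_EP_BENCH=1 MSCCLPP_EP_BENCH_ITERS=100 \
+torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
+  --master_addr=<master-ip> --master_port=29600 \
+  test/python/ext/ep/test_internode_multirank.py
+```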