Phase 10: validate vs nccl-ep — prior 30 GB/s gap was apples-to-oranges
Side-by-side at TOKENS=128 hidden=7168 topk=8 BF16 16-rank/2-node:

  mscclpp_ep LL                           : 38.7 / 39.4 GB/s
  nccl_ep LL (default)                    : 63-71 / 62-72 GB/s (NVLink+RDMA mixed)
  nccl_ep LL --disable-nvlink (RDMA-only) : 42.3 / 42.5 GB/s

The default nccl-ep number routes ~half of selections through NVLink at
NVLink line rate; the weighted average inflates per-rank throughput above
the per-NIC ceiling. Apples-to-apples (RDMA only), nccl-ep is at 42.3 GB/s
and we are at 38.7 — a real but small ~9% gap.

Architectural cause: our internode_ll.cu uses a whole-kernel kIpcPath
template (all peers via NVLink OR all via RDMA). nccl-ep selects per peer,
so 7/15 intranode peers go via NVLink while 8/15 go via RDMA.

Implication: Phases 1-9 were optimizing against an unreachable target.
The kernel is at 94% of the actual single-NIC RDMA ceiling. Further
productive work is hybrid NVLink+RDMA dispatch, not RDMA tuning.
@@ -676,3 +676,76 @@ not work on NDv5 due to GPU↔NIC PCIe affinity.
(94-97% of the ~41 GB/s practical single-NIC ceiling).** Further wins
require hardware uplift (NDR2 / 100 GB/s NIC, or platforms with multi-NIC
P2P like Grace-Hopper SuperChip).

---

## Phase 10 — Validation vs nccl-ep (CRITICAL FINDING: prior gap was apples-to-oranges)

### 10.1 Setup
Side-by-side at TOKENS=128, hidden=7168, topk=8, experts=64, BF16,
2× 8×H100 NDv5, 16 ranks. Compared:
- `mscclpp_ep` LL: `MSCCLPP_EP_USE_IBGDA=1` via `run_ll_mpirun.sh`.
- `nccl_ep` LL: `/home/qinghuazhou/nccl/build/test/nccl_ep/ep_bench -a ll`
  in two modes: default (NVLink + RDMA) and `--disable-nvlink` (RDMA only).

Both stacks compute throughput identically: `bytes_per_rank / avg_time_us`
where bytes = `recv_tokens × hidden × 2` (BF16 dispatch / combine).
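
As a back-of-envelope check of that formula (a minimal sketch; the token
count and timing below are illustrative placeholders, not measured values):

```cuda
// Per-rank throughput as both benches report it: bytes / time.
// recv_tokens, hidden and avg_time_us are placeholder inputs for illustration.
#include <cstdio>

int main() {
  const double recv_tokens = 1024;    // e.g. 128 tokens x topk 8 landing on one rank (assumed)
  const double hidden      = 7168;
  const double avg_time_us = 380.0;   // hypothetical measured dispatch time
  const double bytes       = recv_tokens * hidden * 2.0;  // BF16 = 2 bytes per element
  const double gb_per_s    = bytes / avg_time_us / 1e3;   // bytes/us -> GB/s
  std::printf("%.1f GB/s\n", gb_per_s);                   // ~38.6 GB/s with these placeholders
  return 0;
}
```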
### 10.2 Result
| Stack                          | dispatch (GB/s) | combine (GB/s) |
|--------------------------------|----------------:|---------------:|
| mscclpp_ep LL                  |            38.7 |           39.4 |
| nccl_ep LL (default)           |           63-71 |          62-72 |
| nccl_ep LL `--disable-nvlink`  |            42.3 |           42.5 |

**The "30 GB/s gap" we have been chasing all session was an
apples-to-oranges artifact.** Default `nccl_ep` LL routes intranode
selections (~half of all 1024 selections at 16 ranks / 2 nodes) through
NVLink at NVLink line rate, and only ~half of selections hit RDMA. The
weighted average inflates the reported per-rank throughput well above
the per-NIC ceiling.
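
A back-of-envelope bound makes the inflation mechanism concrete. Only the
42.3 GB/s RDMA-only figure and the ~50% intranode fraction come from the
measurements above; the rest is arithmetic on the stated assumptions:

```cuda
// If ~half of the received bytes ride NVLink (much faster than the NIC) and
// the rest ride RDMA, elapsed time is dominated by the RDMA half, so the
// reported bytes/time can approach twice the RDMA-only throughput.
#include <cstdio>

int main() {
  const double rdma_only_gbps  = 42.3;  // measured --disable-nvlink throughput
  const double nvlink_fraction = 0.5;   // ~half of the 1024 selections stay intranode
  // Upper bound: treat the NVLink half as free and fully overlapped with RDMA.
  const double upper_bound = rdma_only_gbps / (1.0 - nvlink_fraction);
  std::printf("mixed-mode upper bound ~%.0f GB/s\n", upper_bound);  // ~85 GB/s
  // The observed 63-71 GB/s sits between 42.3 and this bound, i.e. it reflects
  // the transport mix, not a faster RDMA path.
  return 0;
}
```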
When `nccl_ep` is restricted to RDMA-only (apples-to-apples with our
LL path), it lands at **42.3/42.5 GB/s** — only **9% above us** at
38.7/39.4. Phase 7's "single-NIC ceiling ~41 GB/s" was correct: nccl-ep
sits at 42.3 GB/s (103% of it), we sit at 38.7 GB/s (94%). The remaining
~9% (~3.6 GB/s) gap is real and likely comes from the residual cross-rank
wait skew (Phase 7.3) plus QP / WR scheduling efficiency.
### 10.3 The actual architectural gap
Our `internode_ll.cu` uses a single template parameter `kIpcPath` that
is whole-kernel (see the sketch after the list below):
- `kIpcPath = true` (single-node) — all peers via NVLink/IPC.
- `kIpcPath = false` (multi-node) — all peers via RDMA, including the
  ~7/15 intranode peers that could ride NVLink.
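
The shape of that whole-kernel switch, as a hedged sketch (type and function
names here are illustrative stubs, not the actual `internode_ll.cu` interface):

```cuda
// Whole-kernel transport choice: kIpcPath is a compile-time template
// parameter, so one instantiation sends to every peer over IPC/NVLink and the
// other sends to every peer over RDMA -- there is no per-peer mixing.
struct Peer { int rank; /* transport handles would live here */ };

__device__ void put_ipc(const Peer&)  { /* NVLink/IPC write path (stub) */ }
__device__ void put_rdma(const Peer&) { /* IBGDA/RDMA write path (stub) */ }

template <bool kIpcPath>
__global__ void dispatch_ll(const Peer* peers, int num_peers) {
  int peer = blockIdx.x % num_peers;   // one peer handled per block (simplified)
  if constexpr (kIpcPath) {
    put_ipc(peers[peer]);              // single-node build: every peer via NVLink
  } else {
    put_rdma(peers[peer]);             // multi-node build: every peer via RDMA,
  }                                    // including the 7/15 intranode peers
}

// Explicit instantiations mirror the all-or-nothing choice.
template __global__ void dispatch_ll<true>(const Peer*, int);
template __global__ void dispatch_ll<false>(const Peer*, int);
```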
`nccl_ep` makes the choice **per-peer** at the call site: intranode
peers go through NVLink, internode peers through RDMA. On a 2-node
16-rank job, each rank has 7 intranode peers (NVLink) and 8 internode
peers (RDMA). With NVLink ≫ NIC line rate, the recv-side throughput
math becomes a weighted average dominated by the cheaper hops.
### 10.4 Implication for the optimization story
- All Phases 1-9 measured the LL path against a target (~65 GB/s) that
  was not actually achievable on the RDMA path alone. The kernel-side
  software work (grid tune, multi-SGE, K-shard, multi-NIC) was bound by
  a real ceiling at ~41 GB/s and we are within 6% of it.
- **The remaining productive work is NOT on the RDMA path.** It is on
  introducing a hybrid NVLink + RDMA dispatch mode that selects the
  transport per peer, matching nccl-ep's architecture (sketched below).
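
A minimal sketch of that per-peer flavor, using the same illustrative stub
types as the sketch above (a proposal sketch, not existing mscclpp_ep code):

```cuda
// Per-peer transport choice inside one kernel launch: intranode peers take
// the IPC/NVLink path, internode peers take the RDMA path.
struct Peer { int rank; int node_id; };

__device__ void put_ipc(const Peer&)  { /* NVLink/IPC write path (stub) */ }
__device__ void put_rdma(const Peer&) { /* IBGDA/RDMA write path (stub) */ }

__global__ void dispatch_ll_hybrid(const Peer* peers, int num_peers, int my_node) {
  int peer = blockIdx.x % num_peers;            // one peer handled per block (simplified)
  if (peers[peer].node_id == my_node) {
    put_ipc(peers[peer]);                       // 7/15 intranode peers ride NVLink
  } else {
    put_rdma(peers[peer]);                      // 8/15 internode peers ride the NIC
  }
}
```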
### 10.5 Status
- Documented. No code changes in this phase — purely a measurement /
  validation result.
- Comparison harness saved at `nccl-tests/sweep_ll_compare.sh` for
  future regression checks.
- Future work flagged: hybrid NVLink + RDMA LL dispatch (would close
  the remaining gap to nccl-ep's default-mode numbers and is
  architecturally well-defined).