ext/ep: document Phase 8 - combine TMA/cp.async declined after profiling

Profiled the combine kernel (TOKENS=128, TOPK=8, BF16 hidden=7168, 16
ranks, IBGDA, grid (1, 32) -> 64 blocks):

  send=17us  wait=220us  grid_sync=112us  reduce=9us  total=357us

Reduce (the only place TMA/cp.async could help) is 9us / 357us = 2.5%
of total kernel time; realistic pipelining savings (~3us) are <1%
end-to-end.

The 220us wait is the same NIC-bandwidth ceiling dispatch hits: combine
sends ~14 MB/rank/iter back from each (le, src_rank) to its source
rank. The Phase 7 speculation that combine had a non-NIC bottleneck
profile was wrong - both paths are NIC-bound at the same ceiling.

The 112us grid_sync interval is the same within-rank wait skew from
Phase 7.3 (per-(le, src_rank) NIC contention). Not addressable in
software.

Decision: not implementing the cp.async pipeline. The 50+ LoC of
shared-memory plumbing and __pipeline synchronization is not justified
by the <1% upper bound.

Updated synthesis: both LL paths are single-NIC bandwidth bound at this
problem size. Kernel-side software work is essentially done at 38.9 /
39.6 GB/s = 94-97% of the 41 GB/s practical ceiling. Remaining options
are multi-NIC striping, smaller payload (FP8 / smaller hidden), or NDR2
hardware uplift.
qinghuazhou
2026-05-09 17:56:54 +00:00
parent f1808501e9
commit 9d729d795e


@@ -445,3 +445,71 @@ To meaningfully exceed this requires either:
Combine path may still have non-NIC headroom (recv-side reduce-add could
be pipelined with TMA / cp.async); leaving as a future independent attack.
---
## Phase 8 — Combine TMA / cp.async pipeline (DECLINED after profiling)
The Phase 7 synthesis flagged combine as a candidate "non-NIC software
option" worth pursuing. Profiled the combine kernel (TOKENS=128, TOPK=8,
BF16 hidden=7168, 16 ranks, IBGDA, grid (1, 32) → 64 blocks):
```
[ep-prof combine #10 r0] blocks=64
send=9.5/16.6/19.1us
wait=1.3/220.2/330.8us
grid_sync=1.0/111.7/335.3us
reduce=7.9/8.9/9.6us
total=356.3/357.2/358.0us (min/avg/max)
```
Breakdown of average per-block time (357 µs total):

| phase | time | % | notes |
|-----------|-------:|-------:|---------------------------------------|
| send | 17 µs | 5% | RDMA WRITEs back to source ranks |
| **wait** | 220 µs | **62%**| NIC bandwidth bound (same as dispatch)|
| grid_sync | 112 µs | 31% | per-block skew slack (free CTA wait) |
| **reduce** | **9 µs** | **2.5%** | weighted reduce-add over topk inputs |
The reduce arithmetic — the *only* place TMA / `cp.async` could help — is
**9 µs out of 357 µs = 2.5%** of the total kernel time. Even halving it
would save only ~4.5 µs, and halving is aggressive: H100 HBM3 already runs
the loop at ~50% of peak BW, so realistic savings are ~3 µs, i.e. **<1%
end-to-end perf**.
The 220 µs wait is the same NIC-bandwidth ceiling that dispatch hits:
combine sends the same ~14 MB/rank/iter back from each (le, src_rank)
to its source rank. The Phase 7 speculation that combine had a
"non-NIC bottleneck profile" was wrong — both paths are NIC-bound.
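The ~14 MB figure is the raw return payload for the benchmark shape above; a back-of-envelope check (assuming one BF16 hidden vector travels back per (token, expert) pair, which is what TOKENS × TOPK implies):

```python
# Combine return payload per rank per iteration, from the benchmark
# shape above. Assumption: one BF16 hidden vector per (token, expert).
TOKENS, TOPK, HIDDEN = 128, 8, 7168
BF16_BYTES = 2

payload = TOKENS * TOPK * HIDDEN * BF16_BYTES
print(f"{payload} B = {payload / 2**20:.1f} MiB")  # 14680064 B = 14.0 MiB
```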
The 112 µs `grid_sync` interval is the same within-rank wait skew
documented in Phase 7.3 — blocks that finished waiting early sit in
`cg::this_grid().sync()` until the slowest block catches up. It is not
addressable by software (it tracks per-(le, src_rank) NIC contention).
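The "free CTA wait" reading is visible directly in the min/avg/max spread: since every block exits the grid barrier at the same instant, each block's wait + grid_sync should sum to roughly the slowest block's wait. Checking that against the numbers above (pairing the extremes this way is an assumption, but the natural one for skew):

```python
# wait + grid_sync per block should be ~constant: all blocks leave
# cg::this_grid().sync() together, so fast waiters burn the difference
# at the barrier. Pair the profiled extremes accordingly
# (largest wait with smallest sync, and vice versa).
pairs = [(330.8, 1.0), (220.2, 111.7), (1.3, 335.3)]  # (wait_us, sync_us)
sums = [round(w + s, 1) for w, s in pairs]
print(sums)  # [331.8, 331.9, 336.6] -- a ~5 us band, not 330 us of slack
```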
**Decision: not implementing.** The cp.async pipeline would add 50+ LoC
of shared-memory plumbing and `__pipeline_*` synchronization for ≤1%
perf. Profile is the rebuttal.
## Updated synthesis (post Phase 8)
Both LL paths (dispatch + combine) are **single-NIC bandwidth bound** at
this problem size. Kernel-side software work is essentially done: the
achieved 38.9 and 39.6 GB/s are 94-97% of the ~41 GB/s practical ceiling.
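The ceiling fraction is a one-line ratio (38.9 and 39.6 GB/s are the achieved rates quoted above; ~41 GB/s is the practical single-NIC ceiling):

```python
CEILING = 41.0  # GB/s, practical single-NIC ceiling
for achieved in (38.9, 39.6):
    print(f"{achieved} GB/s -> {achieved / CEILING:.1%} of ceiling")
# 38.9 GB/s -> 94.9% of ceiling
# 39.6 GB/s -> 96.6% of ceiling
```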
Remaining categories of attack, all architectural / non-trivial:
1. **Multi-NIC striping** — fan WRs across `mlx5_ib0 + mlx5_ib1` per
rank. Requires QP-pool restructure, topology-aware peer mapping,
and recv-side aggregation. Largest predicted win but biggest change.
2. **Smaller per-token payload** — at hidden=2048 or FP8, NIC stops
being the binding constraint and kernel-side wins re-open. Useful
for application-level integration but doesn't move the LL benchmark
number on the current setup.
3. **NDR2 / 100 GB/s NIC** — hardware uplift, out of scope.
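For option 2, the same payload arithmetic as the ~14 MB figure shows how much the wire load shrinks (a sketch only; whether the NIC actually stops binding at these sizes is not measured here):

```python
# Wire payload per rank per iteration: TOKENS * TOPK * hidden * bytes.
TOKENS, TOPK = 128, 8

def payload_mib(hidden, elem_bytes):
    return TOKENS * TOPK * hidden * elem_bytes / 2**20

print(f"BF16 hidden=7168: {payload_mib(7168, 2):.1f} MiB")  # 14.0 (current)
print(f"BF16 hidden=2048: {payload_mib(2048, 2):.1f} MiB")  # 4.0
print(f"FP8  hidden=7168: {payload_mib(7168, 1):.1f} MiB")  # 7.0
```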
For the current 2×8×H100 NDR(50) setup with TOKENS=128 / TOPK=8 / BF16
hidden=7168, **the LL benchmark is closed**. Future commits should
either pivot to multi-NIC or to a different problem regime.