diff --git a/LL_OPTIMIZATION_HISTORY.md b/LL_OPTIMIZATION_HISTORY.md
index 9633ed42..7a328627 100644
--- a/LL_OPTIMIZATION_HISTORY.md
+++ b/LL_OPTIMIZATION_HISTORY.md
@@ -445,3 +445,71 @@ To meaningfully exceed this requires either:
 
 Combine path may still have non-NIC headroom (recv-side reduce-add could
 be pipelined with TMA / cp.async); leaving as a future independent attack.
+
+---
+
+## Phase 8 — Combine TMA / cp.async pipeline (DECLINED after profiling)
+
+The Phase 7 synthesis flagged combine as a candidate "non-NIC software
+option" worth pursuing. Profiling the combine kernel (TOKENS=128, TOPK=8,
+BF16 hidden=7168, 16 ranks, IBGDA, grid (1, 32) → 64 blocks) gives:
+
+```
+[ep-prof combine #10 r0] blocks=64
+  send=9.5/16.6/19.1us
+  wait=1.3/220.2/330.8us
+  grid_sync=1.0/111.7/335.3us
+  reduce=7.9/8.9/9.6us
+  total=356.3/357.2/358.0us (min/avg/max)
+```
+
+Breakdown of average per-block time (357 µs total):
+
+| phase     | time     | %        | notes                                  |
+|-----------|---------:|---------:|----------------------------------------|
+| send      | 17 µs    | 5%       | RDMA WRITEs back to source ranks       |
+| **wait**  | 220 µs   | **62%**  | NIC bandwidth bound (same as dispatch) |
+| grid_sync | 112 µs   | 31%      | per-block skew slack (free CTA wait)   |
+| **reduce**| **9 µs** | **2.5%** | weighted reduce-add over topk inputs   |
+
+The reduce arithmetic — the *only* place TMA / `cp.async` could help — is
+**9 µs out of 357 µs = 2.5%** of the total kernel time. Even halving it
+(an aggressive estimate; H100 HBM3 already runs the loop at ~50% of peak
+BW, so the realistic saving is ~3 µs) yields **<1% end-to-end gain**.
+
+The 220 µs wait is the same NIC-bandwidth ceiling that dispatch hits:
+combine sends the same ~14 MB/rank/iter back from each (le, src_rank)
+to its source rank. The Phase 7 speculation that combine had a
+"non-NIC bottleneck profile" was wrong — both paths are NIC-bound.
+
+The 112 µs `grid_sync` interval is the same within-rank wait skew
+documented in Phase 7.3 — blocks that finished waiting early sit in
+`cg::this_grid().sync()` until the slowest block catches up. It is not
+addressable by software (it tracks per-(le, src_rank) NIC contention).
+
+**Decision: not implementing.** The cp.async pipeline would add 50+ LoC
+of shared-memory plumbing and `__pipeline_*` synchronization for a
+≤1% gain. The profile is the rebuttal.
+
+## Updated synthesis (post Phase 8)
+
+Both LL paths (dispatch + combine) are **single-NIC bandwidth bound** at
+this problem size. The kernel-side software work is essentially done:
+38.9 / 39.6 GB/s is 94-97% of the ~41 GB/s practical ceiling.
+
+Remaining categories of attack, all architectural / non-trivial:
+
+1. **Multi-NIC striping** — fan WRs across `mlx5_ib0 + mlx5_ib1` per
+   rank. Requires QP-pool restructure, topology-aware peer mapping,
+   and recv-side aggregation. Largest predicted win but biggest change.
+
+2. **Smaller per-token payload** — at hidden=2048 or FP8, the NIC stops
+   being the binding constraint and kernel-side wins re-open. Useful
+   for application-level integration but doesn't move the LL benchmark
+   number on the current setup.
+
+3. **NDR2 / 100 GB/s NIC** — hardware uplift, out of scope.
+
+For the current 2×8×H100 NDR(50) setup with TOKENS=128 / TOPK=8 / BF16
+hidden=7168, **the LL benchmark is closed**. Future commits should
+either pivot to multi-NIC or to a different problem regime.
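+
+Sanity check on the single-NIC-bound claim (back-of-envelope; assumes the
+payload is just the BF16 activations, tokens × topk × hidden × 2 B, with
+no routing metadata):
+
+$$
+128 \times 8 \times 7168 \times 2\,\mathrm{B} \approx 14.7\ \mathrm{MB/rank/iter},
+\qquad
+\frac{14.7\ \mathrm{MB}}{41\ \mathrm{GB/s}} \approx 358\ \mu\mathrm{s}.
+$$
+
+Wire time at the practical ceiling alone accounts for essentially the
+whole 357 µs measured combine kernel, which is why wait + grid_sync
+dominate the per-block breakdown.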
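+
+Likewise, the Phase 8 "<1%" claim is just Amdahl's law on the measured
+breakdown (357.2 µs average total, ~3 µs realistic reduce saving):
+
+$$
+\text{best-case end-to-end gain} \approx \frac{3\ \mu\mathrm{s}}{357\ \mu\mathrm{s}} \approx 0.8\% < 1\%.
+$$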
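+
+For the record, a minimal sketch of the kind of pipeline that was
+declined (not code from this repo): double-buffer each expert's chunk
+through shared memory with `__pipeline_memcpy_async` so the next
+expert's loads overlap the current weighted reduce-add. The buffer
+layout, the names (`recv_buf`, `weights`, `CHUNK_VECS`), the fp32
+output accumulator and the one-token scope are illustrative
+assumptions, not the real kernel's structure.
+
+```cuda
+// Sketch of the declined cp.async double-buffered combine reduce.
+// Assumes recv_buf is [topk][hidden] BF16 for a single token, 16-byte
+// aligned; out is an fp32 accumulator; vecs_per_row % CHUNK_VECS == 0
+// (true for hidden=7168). Called per token from the combine kernel.
+#include <cuda_bf16.h>
+#include <cuda_pipeline.h>
+
+constexpr int CHUNK_VECS = 128;   // 128 x uint4 = 2 KB = 1024 BF16 elements per stage
+
+__device__ void combine_reduce_sketch(const __nv_bfloat16* __restrict__ recv_buf,
+                                      const float* __restrict__ weights,
+                                      float* __restrict__ out,
+                                      int topk, int hidden) {
+    __shared__ uint4 stage[2][CHUNK_VECS];       // double buffer in shared memory
+    const int vecs_per_row = hidden / 8;         // 8 BF16 per 16-byte vector
+    const uint4* src = reinterpret_cast<const uint4*>(recv_buf);
+
+    for (int vbase = 0; vbase < vecs_per_row; vbase += CHUNK_VECS) {
+        // Prime the pipeline with expert 0's chunk.
+        for (int v = threadIdx.x; v < CHUNK_VECS; v += blockDim.x)
+            __pipeline_memcpy_async(&stage[0][v], &src[vbase + v], sizeof(uint4));
+        __pipeline_commit();
+
+        for (int k = 0; k < topk; ++k) {
+            const int cur = k & 1, nxt = cur ^ 1;
+            // Issue expert k+1's copy before consuming expert k's chunk.
+            if (k + 1 < topk)
+                for (int v = threadIdx.x; v < CHUNK_VECS; v += blockDim.x)
+                    __pipeline_memcpy_async(&stage[nxt][v],
+                                            &src[(k + 1) * vecs_per_row + vbase + v],
+                                            sizeof(uint4));
+            __pipeline_commit();
+            __pipeline_wait_prior(1);            // expert k's chunk is now resident
+
+            // Weighted reduce-add. Each thread reads only the vectors it
+            // copied itself, so no __syncthreads() is needed between stages.
+            const float w = weights[k];
+            for (int v = threadIdx.x; v < CHUNK_VECS; v += blockDim.x) {
+                const __nv_bfloat16* e = reinterpret_cast<const __nv_bfloat16*>(&stage[cur][v]);
+                const int base = (vbase + v) * 8;
+                for (int j = 0; j < 8; ++j) {
+                    const float prev = (k == 0) ? 0.f : out[base + j];
+                    out[base + j] = prev + w * __bfloat162float(e[j]);
+                }
+            }
+        }
+    }
+}
+```
+
+Roughly the "50+ LoC of plumbing" the decision refers to, all to overlap
+a phase that is already only ~9 µs of a 357 µs kernel.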