diff --git a/example/ck_tile/50_sparse_attn/README.md b/example/ck_tile/50_sparse_attn/README.md
index 9fdad906de..0a7b513748 100644
--- a/example/ck_tile/50_sparse_attn/README.md
+++ b/example/ck_tile/50_sparse_attn/README.md
@@ -29,19 +29,7 @@ Not yet ported (upstream pinned to commit [`ae5b629`](https://github.com/thu-ml/
 
 *MI300X, b=2 h=16 s=8192 d=128 fp16, 5 seeds × 9 sparsity points. All three modes dispatch to the `kM0=64 padK=0` tile bucket at this shape.*
 
-On the canonical recipe shape, `none > warp > block` at every measured sparsity, with no crossover. The per-block guard adds +33..+35 VGPR (6..9 spills) on this tile configuration, depressing occupancy. `warp` is +0..+4 VGPR. The default is `-pv_mode=warp` (preserves R25 A1 behaviour); switch to `none` for the no-skip baseline or `block` to exercise the upstream-aligned variant. A shape sweep is needed before recommending `block` as default — the `kM0=128` path has Δ ≈ 0 VGPR for per-block and is a candidate.
-
-## Performance
-
-At b=2 h=32 s=16384 fp16, sparge (vsa backend) reaches **1.78× FMHA throughput at topk=0.4** and **5.04× at topk=0.1**, and stays above 1.0× across the full topk range.
-
-![Speedup vs sparsity](docs/speedup_vs_sparsity.png)
-
-*Speedup vs FMHA, b=2 h=32 s=16384 d=128 fp16. Shape chosen to match Fig. 10 of the SpargeAttn paper ([arXiv:2502.18137](https://arxiv.org/abs/2502.18137); Mochi-1, 22K context, head_dim=128); s=16384 is the closest grid point. Gray-outlined points have >30% inter-rep spread.*
-
-![Kernel breakdown](docs/kernel_breakdown.png)
-
-*BlockMap (`_pre`) stacked on attention (`_attn`), b=2 h=32 d=128 fp16 topk=0.4. BlockMap is roughly 17% of total at s=16384.*
+On the canonical recipe shape, `none > warp > block` at every measured sparsity, with no crossover. The per-block guard adds +33..+35 VGPR (6..9 spills) on this tile configuration, depressing occupancy. `warp` is +0..+4 VGPR. The default is `-pv_mode=warp`; switch to `none` for the no-skip baseline or `block` to exercise the upstream-aligned variant. A shape sweep is needed before recommending `block` as default — the `kM0=128` path has Δ ≈ 0 VGPR for per-block and is a candidate.
 
 ## Usage
 
diff --git a/example/ck_tile/50_sparse_attn/docs/kernel_breakdown.png b/example/ck_tile/50_sparse_attn/docs/kernel_breakdown.png
deleted file mode 100644
index 8704334155..0000000000
Binary files a/example/ck_tile/50_sparse_attn/docs/kernel_breakdown.png and /dev/null differ
diff --git a/example/ck_tile/50_sparse_attn/docs/speedup_vs_sparsity.png b/example/ck_tile/50_sparse_attn/docs/speedup_vs_sparsity.png
deleted file mode 100644
index 9a2f053b0b..0000000000
Binary files a/example/ck_tile/50_sparse_attn/docs/speedup_vs_sparsity.png and /dev/null differ