mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-21 05:19:20 +00:00

Files

Gino Lu 840b8a37d9 test(sparse_attn): CPU-ref cross-check + BLKQ cite

Wire SpargeAttn CPU reference into test_sparge: build the block_map on host via
sparge::build_block_map_meansim and cross-check against the GPU-produced map;
self-check the VSA delta-LUT (valid count + reachable kb indices); split PASS/FAIL
into separate block_map / LUT / attention-output lines for clearer diagnosis.

Set sparge_tool::SpargeParams::BLKQ default to 64 to match SpargeAttn SM90
convention (cite upstream qk_int_sv_f8_cuda_sm90.cu:143-144); tighten bf16
tolerance back to the dense FMHA baseline (4e-2 atol, 1e-2 rtol).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

2026-05-17 02:35:51 -04:00

codegen

sparse_attn: add bm0 dispatch for sparge blockmap compatibility

2026-04-24 05:13:51 -04:00

docs

sparse_attn: split KStats kernel, add README + perf charts

2026-05-05 03:13:24 -04:00

CMakeLists.txt

refactor to combine two kernel

2026-04-22 13:13:37 -04:00

fmha_fwd_trek.hpp

cleanup(sparse_attn): R-tag rename + clang-format sweep

2026-05-17 02:35:07 -04:00

generate.py

[CK_TILE][FMHA] Add sparse attention VSA (#3341)

2026-01-31 00:59:47 +08:00

jenga_sparse_attention.cpp

[CK_TILE][FMHA] Add sparse attention VSA (#3341)

2026-01-31 00:59:47 +08:00

jenga_sparse_attention.h

[CK_TILE][FMHA] Add sparse attention VSA (#3341)

2026-01-31 00:59:47 +08:00

README.md

sparse_attn: split KStats kernel, add README + perf charts

2026-05-05 03:13:24 -04:00

sparge_blockmap_inst.cpp

refactor(sparse_attn): caller-owned workspace + dtype-aware sizing

2026-05-17 02:34:23 -04:00

sparge_blockmap_trek.hpp

refactor(sparse_attn): caller-owned workspace + dtype-aware sizing

2026-05-17 02:34:23 -04:00

sparge_tool.hpp

test(sparse_attn): CPU-ref cross-check + BLKQ cite

2026-05-17 02:35:51 -04:00

test_jenga_sparse_attn.cpp

[CK_TILE][FMHA] Add sparse attention VSA (#3341)

2026-01-31 00:59:47 +08:00

test_sparge.cpp

test(sparse_attn): CPU-ref cross-check + BLKQ cite

2026-05-17 02:35:51 -04:00

test_vsa_sparse_attn.cpp

[CK_TILE][FMHA] Add sparse attention VSA (#3341)

2026-01-31 00:59:47 +08:00

vsa_sparse_attention.cpp

[CK_TILE][FMHA] Add sparse attention VSA (#3341)

2026-01-31 00:59:47 +08:00

README.md

Sparge Attention (Composable Kernel)

A Composable Kernel port of SpargeAttn for AMD GPUs. Both the block-map pipeline (mean-pool → cosine sim → pooled QK → top-k LUT) and the sparse FMHA stage run on the GPU. Two attention backends are exposed: -pipeline=vsa (default, faster) and -pipeline=jenga (async K/V load variant).
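The first three stages of the block-map pipeline can be illustrated with a small host-side sketch. This is not the kernel code in this directory: the names `mean_pool`, `block_self_similarity`, and `pooled_score` are hypothetical, and averaging the cosine similarity over all token pairs (including each token with itself) is an assumption about how the per-block similarity is defined.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One block of tokens, row-major: blk[token][channel].
using Block = std::vector<std::vector<float>>;

// Mean-pool: average all token vectors of a block into one vector.
std::vector<float> mean_pool(const Block& blk)
{
    assert(!blk.empty());
    std::vector<float> mean(blk[0].size(), 0.0f);
    for(const auto& tok : blk)
        for(std::size_t d = 0; d < mean.size(); ++d)
            mean[d] += tok[d];
    for(float& v : mean)
        v /= static_cast<float>(blk.size());
    return mean;
}

// Cosine similarity of two vectors; returns 0 if either has zero norm.
float cosine(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for(std::size_t d = 0; d < a.size(); ++d)
    {
        dot += a[d] * b[d];
        na += a[d] * a[d];
        nb += b[d] * b[d];
    }
    if(na == 0.0f || nb == 0.0f)
        return 0.0f;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Assumed definition: mean cosine similarity over all token pairs in the block.
// A block whose value falls below a threshold such as simthreshd1 would be
// treated as not self-similar.
float block_self_similarity(const Block& blk)
{
    assert(!blk.empty());
    float sum = 0.0f;
    for(const auto& a : blk)
        for(const auto& b : blk)
            sum += cosine(a, b);
    return sum / static_cast<float>(blk.size() * blk.size());
}

// Pooled QK: scaled dot product of a pooled Q block and a pooled K block.
float pooled_score(const std::vector<float>& q_mean,
                   const std::vector<float>& k_mean,
                   float scale)
{
    assert(q_mean.size() == k_mean.size());
    float dot = 0.0f;
    for(std::size_t d = 0; d < q_mean.size(); ++d)
        dot += q_mean[d] * k_mean[d];
    return dot * scale;
}
```

The pooled scores form one row per Q block; the selection step then decides which K blocks each row keeps.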

Status vs Upstream

Implemented:

per-block mean-pool, cosine similarity, pooled QK
top-k / cdfthreshd block selection, BlockMap LUT
sparse FMHA (both vsa and jenga backends)
per-head topk / simthreshd1 / cdfthreshd
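The two block-selection modes can be sketched on the host as follows. This is a minimal illustration, not the GPU kernel: it assumes each row of pooled scores has already been softmax-normalized into probabilities, the function names are hypothetical, and rounding the top-k count up is an assumption. Selection uses a repeated argmax, so each pick scans all N_k entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the largest not-yet-selected entry; ties go to the lowest index.
// Returns probs.size() if every entry is already selected.
std::size_t argmax_unselected(const std::vector<float>& probs,
                              const std::vector<bool>& mask)
{
    std::size_t best = probs.size();
    for(std::size_t j = 0; j < probs.size(); ++j)
        if(!mask[j] && (best == probs.size() || probs[j] > probs[best]))
            best = j;
    return best;
}

// top-k mode: keep a fixed fraction of the key blocks for this query block.
// Assumption: the count is ceil(topk * N_k), with at least one block kept.
std::vector<bool> select_topk(const std::vector<float>& probs, float topk)
{
    assert(topk > 0.0f && topk <= 1.0f);
    const std::size_t n = probs.size();
    auto keep = static_cast<std::size_t>(std::ceil(topk * static_cast<float>(n)));
    keep = std::min(n, std::max<std::size_t>(keep, 1));
    std::vector<bool> mask(n, false);
    for(std::size_t picked = 0; picked < keep; ++picked)
        mask[argmax_unselected(probs, mask)] = true;
    return mask;
}

// cdfthreshd mode: keep the most probable blocks until their cumulative
// probability reaches the threshold.
std::vector<bool> select_cdf(const std::vector<float>& probs, float cdfthreshd)
{
    assert(cdfthreshd >= 0.0f && cdfthreshd <= 1.0f);
    const std::size_t n = probs.size();
    std::vector<bool> mask(n, false);
    float cumulative = 0.0f;
    for(std::size_t picked = 0; picked < n && cumulative < cdfthreshd; ++picked)
    {
        const std::size_t best = argmax_unselected(probs, mask);
        mask[best] = true;
        cumulative += probs[best];
    }
    return mask;
}
```

For the row {0.1, 0.5, 0.3, 0.1}, both `select_topk(row, 0.5f)` and `select_cdf(row, 0.7f)` keep blocks 1 and 2. The resulting mask is what a LUT builder would then compact into a list of kept K-block indices.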

Not yet ported (upstream pinned to commit ae5b629):

K smoothing — pre-pool k -= km; required for diffusion / video checkpoints (CogVideoX, Mochi-1, Flux, OpenSora, SD 3.5) (spas_sage_attn/core.py:L53)
is_causal mask in pooled score — required for causal-LM prefill (Llama, Qwen) (spas_sage_attn/utils.py:L338)
attention_sink — column 0 forced ON; upstream is hard-wired to True at inference (spas_sage_attn/autotune.py:L355)
pv_threshold per-Q-tile skip in attn kernel — pure perf, ~5–15% on the dominant attention slice (spas_sage_attn/core.py:L265)
Sort-based top-k selection — replaces our O(N_k^2) iterative argmax; matters at long seqlen (s ≥ 16k) (spas_sage_attn/utils.py:L345)
Q/K int8 quant fusion in pool kernel — enables a downstream int8 GEMM0 in the attn kernel (spas_sage_attn/utils.py:L371)
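As an illustration of the first missing item, K smoothing can be sketched on the host as below. This assumes km is the per-channel mean of K over the sequence dimension; `smooth_k` and `softmax_scores` are hypothetical helper names, not functions from this directory or from upstream. Subtracting the same vector km from every key shifts all of a query's scores by the constant q·km, so the exact row softmax is unchanged while the pooled statistics see a centered K.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>; // [seqlen][head_dim]

// In-place K smoothing: k -= km, where km is assumed to be the per-channel
// mean of K over the sequence. Returns km.
std::vector<float> smooth_k(Matrix& k)
{
    assert(!k.empty());
    const std::size_t dim = k[0].size();
    std::vector<float> km(dim, 0.0f);
    for(const auto& row : k)
        for(std::size_t c = 0; c < dim; ++c)
            km[c] += row[c];
    for(float& v : km)
        v /= static_cast<float>(k.size());
    for(auto& row : k)
        for(std::size_t c = 0; c < dim; ++c)
            row[c] -= km[c];
    return km;
}

// Softmax over the scores q . k_j for a single query vector.
std::vector<float> softmax_scores(const std::vector<float>& q, const Matrix& k)
{
    assert(!k.empty());
    std::vector<float> s(k.size(), 0.0f);
    for(std::size_t j = 0; j < k.size(); ++j)
        for(std::size_t c = 0; c < q.size(); ++c)
            s[j] += q[c] * k[j][c];
    const float mx = *std::max_element(s.begin(), s.end());
    float sum = 0.0f;
    for(float& v : s)
    {
        v = std::exp(v - mx); // subtract the max for numerical stability
        sum += v;
    }
    for(float& v : s)
        v /= sum;
    return s;
}
```

Computing `softmax_scores` before and after `smooth_k` on the same inputs gives matching probabilities up to rounding, which is a convenient sanity check for a future port.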

Performance

At b=2 h=32 s=16384 fp16, sparge (vsa backend) reaches 1.78× FMHA throughput at topk=0.4 and 5.04× at topk=0.1, and stays above 1.0× across the full topk range.

Speedup vs FMHA, b=2 h=32 s=16384 d=128 fp16. Shape chosen to match Fig. 10 of the SpargeAttn paper (arXiv:2502.18137; Mochi-1, 22K context, head_dim=128); s=16384 is the closest grid point. Gray-outlined points have >30% inter-rep spread.

BlockMap (_pre) stacked on attention (_attn), b=2 h=32 d=128 fp16 topk=0.4. BlockMap is roughly 17% of total at s=16384.
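The two figures above can be combined into a rough estimate of the attention stage's own speedup. The sketch assumes that the 1.78× figure is end-to-end (block map plus attention) and that the 17% BlockMap share comes from the same configuration and backend; neither is stated explicitly here, and the helper names are hypothetical. If sparge takes T_fmha / S in total and a fraction f of that is BlockMap, the attention stage alone runs S / (1 - f) times faster than FMHA.

```cpp
#include <cassert>

// Speedup of the sparse attention stage alone over dense FMHA, given the
// end-to-end speedup and the fraction of sparge time spent on BlockMap.
double attn_only_speedup(double total_speedup, double blockmap_fraction)
{
    assert(total_speedup > 0.0);
    assert(blockmap_fraction >= 0.0 && blockmap_fraction < 1.0);
    return total_speedup / (1.0 - blockmap_fraction);
}

// Idealized bound if attention cost scaled linearly with the kept fraction
// of key blocks and BlockMap were free.
double ideal_speedup(double topk)
{
    assert(topk > 0.0 && topk <= 1.0);
    return 1.0 / topk;
}
```

With S = 1.78 and f = 0.17, `attn_only_speedup` gives about 2.14×, against an idealized 2.5× for topk=0.4 from `ideal_speedup`.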

Usage

ninja tile_example_sparge
./bin/tile_example_sparge -pipeline=vsa -b=2 -h=32 -s=16384 -d=128 -topk=0.4 -simthreshd1=0.001

Add -v=1 for CPU validation. Use a small shape (-b=1 -h=2 -s=512) when validating: the CPU reference scales as O(s²) and takes 30+ minutes at s=8k and hours at s=16k.

References

SpargeAttn upstream (pinned to ae5b629)
Paper — Zhang et al., arXiv:2502.18137
