composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-21 05:19:20 +00:00

Author	SHA1	Message	Date
Gino Lu	840b8a37d9	test(sparse_attn): CPU-ref cross-check + BLKQ cite Wire SpargeAttn CPU reference into test_sparge: build the block_map on host via sparge::build_block_map_meansim and cross-check against the GPU-produced map; self-check the VSA delta-LUT (valid count + reachable kb indices); split PASS/FAIL into separate block_map / LUT / attention-output lines for clearer diagnosis. Set sparge_tool::SpargeParams::BLKQ default to 64 to match SpargeAttn SM90 convention (cite upstream qk_int_sv_f8_cuda_sm90.cu:143-144); tighten bf16 tolerance back to the dense FMHA baseline (4e-2 atol, 1e-2 rtol). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-17 02:35:51 -04:00
Gino Lu	7103eacc99	refactor(sparse_attn): caller-owned workspace + dtype-aware sizing Replace process-lifetime lazy hipMalloc K-stats workspace with a caller-owned buffer; expose sparge_blockmap_get_workspace_size() / compute_workspace_layout() host helpers. Split the combined sparge_blockmap_fwd into stage launchers (sparge_kstats_fwd_oneshot + sparge_blockmap_only_fwd_oneshot) so the chained launch is timed end-to-end. Make pooled_k storage dtype follow KDataType (fp16/bf16) instead of fp32 to halve workspace footprint and match dense-FMHA precision. Tighten per-head superparam pointers to required (non-null) and assert N_k <= 256 in jenga MakeKargs to document the 256-bool LDS staging cap. Drop the obsolete VSA extra-LDS staging. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-17 02:34:23 -04:00
Gino Lu	b00e5449c8	sparse_attn: split KStats kernel, add README + perf charts - Split SpargeKStatsKernel/Pipeline out of BlockMap (Kernel A produces per-block K stats workspace consumed by Kernel B), removing redundant K-stat recomputation across Q-blocks. - Add example/ck_tile/50_sparse_attn/README.md (status vs upstream pinned to ae5b629, unported items, usage, references). - Add example/ck_tile/50_sparse_attn/docs/{speedup_vs_sparsity,kernel_breakdown}.png + reusable plot_sparge_perf.py (b=2 h=32 s=16384 d=128 fp16 perf snapshot). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-05 03:13:24 -04:00
Gino Lu	eca3cb3e0a	sparse_attn: add bm0 dispatch for sparge blockmap compatibility Add bm0 field to fmha_jenga_fwd_traits so callers can specify the preferred Q-tile size. Codegen now emits separate tile configs for bm0=64 (sparge blockmap) and bm0=128 (original), with CppConstraint guards to select the right kernel at runtime. End-to-end test passes for both jenga and vsa paths. Performance is known to be suboptimal at this stage; tile sizes and warp counts for the bm0=64 path have not been tuned. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-24 05:13:51 -04:00
Gino Lu	ab44b83566	refactor to combine two kernels	2026-04-22 13:13:37 -04:00

5 Commits
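
Commit 840b8a37d9 tightens the bf16 tolerance to 4e-2 atol and 1e-2 rtol. As a rough illustration of what such a pair of bounds means, the sketch below applies a combined absolute/relative check of the form |out - ref| <= atol + rtol * |ref|. That formula is an assumption (it is the convention used by numpy.isclose), and the commit log does not confirm that test_sparge uses exactly this rule. The helper names within_tolerance and count_mismatches are hypothetical and do not come from the repository.

```python
def within_tolerance(out, ref, atol, rtol):
    """Return True if |out - ref| <= atol + rtol * |ref|.

    Assumed convention (same shape as numpy.isclose); not taken from the repo.
    A NaN on either side makes the comparison False, so it counts as a mismatch.
    """
    return abs(out - ref) <= atol + rtol * abs(ref)


def count_mismatches(outs, refs, atol=4e-2, rtol=1e-2):
    """Count elements that fall outside the tolerance.

    The defaults mirror the bf16 values quoted in the commit message
    (4e-2 atol, 1e-2 rtol); everything else here is illustrative.
    """
    if len(outs) != len(refs):
        raise ValueError("output and reference must have the same length")
    return sum(
        1 for o, r in zip(outs, refs) if not within_tolerance(o, r, atol, rtol)
    )


if __name__ == "__main__":
    gpu_out = [1.04, 1.06, 0.0]   # made-up values, not real kernel output
    cpu_ref = [1.0, 1.0, 0.03]
    print("mismatches:", count_mismatches(gpu_out, cpu_ref))
```

With these bounds, a reference value of 1.0 accepts outputs up to 0.05 away, while values near zero are governed almost entirely by the absolute term.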