Wire SpargeAttn CPU reference into test_sparge: build the block_map on host via
sparge::build_block_map_meansim and cross-check against the GPU-produced map;
self-check the VSA delta-LUT (valid count + reachable kb indices); split PASS/FAIL
into separate block_map / LUT / attention-output lines for clearer diagnosis.
Set sparge_tool::SpargeParams::BLKQ default to 64 to match SpargeAttn SM90
convention (cite upstream qk_int_sv_f8_cuda_sm90.cu:143-144); tighten bf16
tolerance back to the dense FMHA baseline (4e-2 atol, 1e-2 rtol).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Replace process-lifetime lazy hipMalloc K-stats workspace with a caller-owned
buffer; expose sparge_blockmap_get_workspace_size() / compute_workspace_layout()
host helpers. Split the combined sparge_blockmap_fwd into stage launchers
(sparge_kstats_fwd_oneshot + sparge_blockmap_only_fwd_oneshot) so the chained
launch is timed end-to-end.
Make pooled_k storage dtype follow KDataType (fp16/bf16) instead of fp32 to halve
workspace footprint and match dense-FMHA precision. Tighten per-head superparam
pointers to required (non-null) and assert N_k <= 256 in jenga MakeKargs to
document the 256-bool LDS staging cap. Drop the obsolete VSA extra-LDS staging.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add bm0 field to fmha_jenga_fwd_traits so callers can specify the
preferred Q-tile size. Codegen now emits separate tile configs for
bm0=64 (sparge blockmap) and bm0=128 (original), with CppConstraint
guards to select the right kernel at runtime.
End-to-end test passes for both jenga and vsa paths. Performance is
known to be suboptimal at this stage; tile sizes and warp counts for
the bm0=64 path have not been tuned.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>