composable_kernel/example
msaffari-amd c8c0eaf982 Add BF16/FP16 vec_k_col_v batch_prefill kernel variants
Emit decode-aligned VEC_K_COL_V pipeline instances (5D vectorized K + 4D
ColumnMajor V) for the non-quant BF16/FP16 batch_prefill path. This mirrors
the FP8 PER_TOKEN_HEAD col-V variant (vlayout="col") but with qscale="no", so
the plain BF16/FP16 prefill path can ingest the decode KV cache layout directly
without an intermediate reshape. Variants cover logits/mask/sink/lse with
bias="no" and dropout="f".

The production hd=128 default tile (bk0=32, bk1=32) is retained unchanged. An
autotune sweep over 516 hd=128 tile instances was run but is not deployed: it
ranked candidates by latency only, with no correctness gate, and every candidate
with a meaningful uplift changed bk0 or bk1, which produces numerically
incorrect output for the vec_k_col_v access pattern (verified against the FP32
reference). Per-shape tile selection is therefore deferred to a
correctness-gated re-sweep.
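
For illustration only, a minimal sketch of what "correctness-gated" selection
could look like, as opposed to the latency-only ranking described above. It is
not the sweep tooling used for this commit: TileResult, pick_tile, the atol
threshold, and all numbers are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileResult:
    """One swept tile instance (hypothetical record layout)."""
    name: str
    bk0: int
    bk1: int
    latency_us: float   # measured kernel latency
    max_abs_err: float  # max abs error vs. the FP32 reference


def pick_tile(results, default, atol):
    """Return the fastest candidate whose error is within atol.

    The default tile is assumed to be known-correct, so it is always
    kept in the pool and wins if no faster candidate passes the gate.
    """
    passing = [r for r in results if r.max_abs_err <= atol]
    return min(passing + [default], key=lambda r: r.latency_us)


# Placeholder values chosen only to exercise both code paths.
default = TileResult("default", bk0=32, bk1=32, latency_us=100.0, max_abs_err=1e-3)
fast_wrong = TileResult("fast_wrong", bk0=64, bk1=32, latency_us=80.0, max_abs_err=5e-1)
slow_ok = TileResult("slow_ok", bk0=32, bk1=32, latency_us=110.0, max_abs_err=1e-3)
results = [default, fast_wrong, slow_ok]

latency_only = min(results, key=lambda r: r.latency_us)  # ignores correctness
gated = pick_tile(results, default, atol=1e-2)           # filters, then ranks

print(latency_only.name, gated.name)
```

With these placeholder inputs the latency-only ranking picks the numerically
wrong candidate, while the gated selection falls back to the default tile,
which is the situation this commit describes for hd=128.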
2026-06-04 16:31:54 +00:00