Emit decode-aligned VEC_K_COL_V pipeline instances (5D vectorized K + 4D
ColumnMajor V) for the non-quant BF16/FP16 batch_prefill path. This mirrors
the FP8 PER_TOKEN_HEAD col-V variant (vlayout="col") but with qscale="no", so
the plain BF16/FP16 prefill path can ingest the decode KV cache layout directly
without an intermediate reshape. Variants cover logits/mask/sink/lse with
bias="no" and dropout="f".
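
For illustration only, the variant set described above can be sketched as
a cross product. This is not the generator's actual code: the
PipelineVariant record, its field names, and the "t"/"f" encoding of
logits/mask/sink/lse are assumptions made for this sketch (the real mask
option in particular may take more than two values). Only vlayout="col",
qscale="no", bias="no" and dropout="f" come from the description.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PipelineVariant:
    """Hypothetical record for one VEC_K_COL_V instance (not the real API)."""
    dtype: str
    vlayout: str
    qscale: str
    bias: str
    dropout: str
    logits: str
    mask: str
    sink: str
    lse: str


def enumerate_variants():
    """Cross BF16/FP16 with the four toggles; pin everything else."""
    toggles = ("t", "f")  # assumed two-valued encoding for each toggle
    return [
        PipelineVariant(
            dtype=dtype,
            vlayout="col",  # ColumnMajor V, as in the FP8 col-V variant
            qscale="no",    # non-quantized path
            bias="no",
            dropout="f",
            logits=logits,
            mask=mask,
            sink=sink,
            lse=lse,
        )
        for dtype, logits, mask, sink, lse in product(
            ("bf16", "fp16"), toggles, toggles, toggles, toggles
        )
    ]


variants = enumerate_variants()
```

Under these assumptions the sketch yields 2 dtypes x 2^4 toggle
combinations; the real instance count depends on which combinations the
generator actually emits.
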

The production hd=128 default tile (bk0=32, bk1=32) is retained unchanged. An
autotune sweep over 516 hd=128 tile instances was run, but its results are not
deployed: the sweep ranked candidates by latency alone, with no correctness
gate, and every candidate with a meaningful speedup changed bk0 or bk1, which
produces numerically incorrect output for the vec_k_col_v access pattern
(verified against the FP32 reference). Per-shape tile selection is therefore
deferred to a correctness-gated re-sweep.
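
A minimal sketch of what "correctness-gated" means here, assuming each
candidate carries a measured latency and a max absolute error against the
FP32 reference. The Candidate record, the gated_ranking helper, the
tolerance, and all numbers below are hypothetical placeholders, not sweep
results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical sweep record for one tile configuration."""
    bk0: int
    bk1: int
    latency_us: float
    max_abs_err: float  # max |out - fp32_reference|


def gated_ranking(candidates, tol):
    """Drop candidates that fail the error gate, then rank by latency.

    A NaN error compares False against tol, so it is rejected as well.
    """
    passing = [c for c in candidates if c.max_abs_err <= tol]
    return sorted(passing, key=lambda c: c.latency_us)


# Placeholder values chosen only to show the two rankings disagreeing.
candidates = [
    Candidate(bk0=32, bk1=32, latency_us=100.0, max_abs_err=1e-3),
    Candidate(bk0=64, bk1=32, latency_us=80.0, max_abs_err=0.5),
    Candidate(bk0=32, bk1=64, latency_us=85.0, max_abs_err=0.7),
]

latency_only_best = min(candidates, key=lambda c: c.latency_us)
gated = gated_ranking(candidates, tol=1e-2)
```

With these placeholder inputs the latency-only pick is the bk0=64 tile,
while the gated ranking keeps only the bk0=32, bk1=32 tile. Filtering
before sorting means a fast but incorrect tile can never outrank a
correct one.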