mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 11:16:59 +00:00
The FP8 32x32x16 fmha_alu1 repack trades each lane's "bad" 4-byte pack
with the lane 32 away (QK-C vs PV-A layouts differ by an l^32 swap).
Since 3431615ff this used an LDS-crossbar ds_bpermute plus an is_sub_0
v_cndmask mux.
gfx950 exposes v_permlane32_swap_b32 (__builtin_amdgcn_permlane32_swap,
feature permlane32-swap) which does the l^32 exchange in a single VALU
op with no LDS round-trip. Verified on-device that permlane32_swap(
lo_pack, hi_pack) returns {out_lo, out_hi} for every lane, folding both
the cross-lane swap and the per-lane sub-block muxing into one
instruction. Guarded #if defined(__gfx950__); ds_bpermute kept as the
#else fallback (gfx942 lacks the feature).
ISA (prefill_d128 fp8 instance): 12 ds_bpermute -> 0, replaced by 12
v_permlane32_swap_b32; v_cndmask muxing removed. FP8 prefill + decode
PASS vs torch reference. Clean A/B (median of 3, b=4 FP8 prefill):
sq=sk 2048/5000/10000 -> 1.6% / 1.9% / 2.1% faster, scaling with the
per-iter repack count.
Co-authored-by: Cursor <cursoragent@cursor.com>