composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Files

juuso-oskari 9373fab553 CK-UA: replace FP8 repack ds_bpermute with v_permlane32_swap_b32 (gfx950)

The FP8 32x32x16 fmha_alu1 repack trades each lane's "bad" 4-byte pack
with the lane 32 away (QK-C vs PV-A layouts differ by an l^32 swap).
Since 3431615ff this used an LDS-crossbar ds_bpermute plus an is_sub_0
v_cndmask mux.

gfx950 exposes v_permlane32_swap_b32 (__builtin_amdgcn_permlane32_swap,
feature permlane32-swap) which does the l^32 exchange in a single VALU
op with no LDS round-trip. Verified on-device that permlane32_swap(
lo_pack, hi_pack) returns {out_lo, out_hi} for every lane, folding both
the cross-lane swap and the per-lane sub-block muxing into one
instruction. Guarded #if defined(__gfx950__); ds_bpermute kept as the
#else fallback (gfx942 lacks the feature).

ISA (prefill_d128 fp8 instance): 12 ds_bpermute -> 0, replaced by 12
v_permlane32_swap_b32; v_cndmask muxing removed. FP8 prefill + decode
PASS vs torch reference. Clean A/B (median of 3, b=4 FP8 prefill):
sq=sk 2048/5000/10000 -> 1.6% / 1.9% / 2.1% faster, scaling with the
per-iter repack count.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-06-01 11:41:52 +00:00

[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a)

2026-03-31 15:19:43 +00:00

ck_tile

CK-UA: replace FP8 repack ds_bpermute with v_permlane32_swap_b32 (gfx950)

2026-06-01 11:41:52 +00:00

rapidjson

…