composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Files

juuso-oskari d912139ca9 CK-UA: split fmha_alu0 into rowmax/shift lambdas (default-off pipelining hook)

Refactors fmha_alu0 into fmha_alu0_rowmax (max3 reduce + cross-lane swap +
packed shift-coefficient precompute) and fmha_alu0_shift (the v_pk_fma sweep),
and keeps a combined fmha_alu0 for the peeled/tail iterations. The refactor is
bit-identical: split-on and split-off produce byte-for-byte identical
standalone output, and both pass the fp8 accuracy check.
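
As a rough illustration of the split described above, here is a minimal scalar C++ sketch, not the kernel code. All names (`RowState`, `alu0_rowmax`, `alu0_shift`, `alu0`, `bytes_equal`) are hypothetical stand-ins, and the shift form `s * scale - max * scale` is an assumption about what the fma sweep computes. It shows why splitting one step into a reduce stage and a sweep stage can stay bit-identical: both paths run the same operations in the same order.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstring>
#include <vector>

// Hypothetical per-row state handed from the reduce stage to the sweep stage.
struct RowState {
    float row_max; // maximum score in the row
    float addend;  // precomputed shift addend: -scale * row_max
};

// Stage 1 (stand-in for fmha_alu0_rowmax): reduce the row maximum and
// precompute the addend that the sweep will use.
inline RowState alu0_rowmax(const std::vector<float>& s, float scale) {
    assert(!s.empty());
    float m = s[0];
    for (float v : s) m = std::max(m, v);
    return RowState{m, -scale * m};
}

// Stage 2 (stand-in for fmha_alu0_shift): one fused multiply-add per element,
// producing s * scale - row_max * scale in place.
inline void alu0_shift(std::vector<float>& s, float scale, const RowState& st) {
    for (float& v : s) v = std::fma(v, scale, st.addend);
}

// Combined form (stand-in for the fused fmha_alu0 used on tail iterations).
inline void alu0(std::vector<float>& s, float scale) {
    alu0_shift(s, scale, alu0_rowmax(s, scale));
}

// Byte-for-byte comparison, stricter than operator== on floats.
inline bool bytes_equal(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return false;
    if (a.empty()) return true;
    return std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}
```

Calling `alu0` on one copy of a row and `alu0_rowmax` followed by `alu0_shift` on another should leave the two copies equal under `bytes_equal`. In this sketch the only thing the split buys is the freedom to place other work between the two calls.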

UA_FA4_SPLIT_ROWMAX (default 0) issues the rowmax ahead of the PV gemm in the
steady-state ping-pong so the MFMA cluster can cover the reduce->shift-addend
chain. Measured neutral (1782 TF/s either way) on the canonical fp8 prefill
shape: the post-RA scheduler already groups instructions per the core-loop
sched_group_barrier hints, so source-order reordering does not move the emitted
schedule. Kept as a structural hook; the shift stall is the wait on the QK MFMA
result, not on the addend.

Context: ATT phase profiling shows softmax compute is largely hidden under the
ping-pong (matrix is only ~11% of the wave); the wall-time gate is the
barrier+memwait rendezvous (~30% of the wave) on K/V DRAM latency. The
Schraudolph exp2 approximation (UA_FA4_EXP2_APPROX=1) measures -11% here and
stays off. See ua-test-scripts/kv128_vgpr_findings.md for the full breakdown.
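
For readers unfamiliar with the Schraudolph trick mentioned above, here is a generic scalar C++ sketch of the idea. It is not the code behind UA_FA4_EXP2_APPROX: the function names are hypothetical and no correction constant is applied. The trick writes the scaled input directly into the bit pattern of an IEEE-754 float, so the integer part lands in the exponent field and the fractional part becomes a linear mantissa.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Schraudolph-style exp2 for float32 (hypothetical helper, no correction term).
// (x + 127) * 2^23 places floor(x) + 127 in the exponent field and the
// fractional part of x in the mantissa, giving 2^floor(x) * (1 + frac(x)).
inline float exp2_schraudolph(float x) {
    assert(x > -126.0f && x < 128.0f); // stay within the normal float range
    const float scaled = (x + 127.0f) * 8388608.0f; // 8388608 = 2^23
    const std::uint32_t bits = static_cast<std::uint32_t>(scaled);
    float out;
    std::memcpy(&out, &bits, sizeof(out)); // reinterpret the bits as a float
    return out;
}

// Largest relative error against std::exp2 on an evenly spaced grid.
inline float max_rel_error(float lo, float hi, int steps) {
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        const float x = lo + (hi - lo) * static_cast<float>(i) / static_cast<float>(steps);
        const float exact = std::exp2(x);
        worst = std::max(worst, std::fabs(exp2_schraudolph(x) - exact) / exact);
    }
    return worst;
}
```

This uncorrected form is exact at integer inputs and overestimates by up to roughly 6% in between, since `1 + f` is at least `2^f` for `f` in [0, 1). Practical variants subtract a tuned constant to centre the error. Whether such an approximation pays off depends on the surrounding pipeline, as the measurement above indicates.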

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-06-13 11:59:21 +00:00

[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a)

2026-03-31 15:19:43 +00:00

ck_tile

CK-UA: split fmha_alu0 into rowmax/shift lambdas (default-off pipelining hook)

2026-06-13 11:59:21 +00:00

rapidjson

Update pre-commit to fixed versions, run remod for ck_tile (#2895)

2025-10-16 15:29:17 -07:00