composable_kernel/include/ck_tile/ops at 3431615ff02ec51a987406ed1848a129fbbf0df4 - composable_kernel - Public git mirror

ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

juuso-oskari 3431615ff0 CK-UA: fuse FP8 cvt + cross-lane swap to hide ds_bpermute latency

Previously the 32x32x16 FP8 P-tile cvt and the QK-C -> PV-A cross-lane
swap ran in two separate static_for loops back-to-back inside fmha_alu1:
the whole tile was cvt'd into p.thread_buf_ first, then a second pass
issued one ds_bpermute_b32 per 8-fp8 K-chunk and read/wrote the same
buffer to swap the "bad" 4-byte halves between paired lanes.

The ds_bpermute has nontrivial LDS-DMA latency that the scheduler has
no way to hide when it lives alone in a tight serial loop with the
gather/scatter packs around it.

Fuse the two into one 8-fp8-per-iter loop:
  1. cvt 8 fp32 -> 2 packed uint32 (lo_pack=slot[0..3], hi_pack=slot[4..7])
     using the chained cvt_pk_fp8_f32 pattern matching cast_tile_pk_fp8_fp32.
  2. Pick own_bad = (sub==0 ? hi_pack : lo_pack) and issue ds_bpermute on it.
  3. Write back all 8 fp8 bytes; the "good" half lands first so its byte
     stores can overlap with the in-flight ds_bpermute, and the next
     iter's cvts can begin while the swap is still pending.
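
The data movement in the three steps above can be sketched on the host. This is only an illustrative model, not the kernel code: the wave size (64), the pairing rule (lane L swaps with lane L ^ 32), the placement of the received half in the "bad" slot, and all names (`fake_cvt`, `pack4`, `bpermute`, `fused_iter`, `make_demo_input`) are assumptions made for the sketch. `fake_cvt` is a stand-in byte quantizer, not a real e4m3 conversion, and `bpermute` only mimics the "each lane pulls a dword from the lane at addr/4" behaviour of a backward permute. A host loop cannot show the latency overlap itself; it only shows which bytes end up where.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWaveSize = 64; // assumption: wave64
constexpr int kPairXor  = 32; // assumption: lane L is paired with lane L ^ 32

using Wave   = std::array<std::uint32_t, kWaveSize>;
using Floats = std::array<std::array<float, 8>, kWaveSize>;

struct Chunk { std::uint32_t lo; std::uint32_t hi; };

// Stand-in for the fp32 -> fp8 conversion. NOT e4m3: clamps to [0, 255]
// and truncates, so the byte values are easy to follow by eye.
inline std::uint8_t fake_cvt(float x) {
    if (x < 0.0f) x = 0.0f;
    if (x > 255.0f) x = 255.0f;
    return static_cast<std::uint8_t>(x);
}

// Pack four converted values into one uint32, slot 0 in the lowest byte.
inline std::uint32_t pack4(const float* v) {
    std::uint32_t p = 0;
    for (int i = 0; i < 4; ++i)
        p |= static_cast<std::uint32_t>(fake_cvt(v[i])) << (8 * i);
    return p;
}

// Host model of a backward permute: every lane pulls the dword held by
// the lane whose index is byte_addr / 4.
inline Wave bpermute(const Wave& byte_addr, const Wave& src) {
    Wave dst{};
    for (int lane = 0; lane < kWaveSize; ++lane) {
        assert(byte_addr[lane] % 4 == 0);
        dst[lane] = src[(byte_addr[lane] / 4) % kWaveSize];
    }
    return dst;
}

// One fused iteration over a single 8-value K-chunk, for the whole wave.
inline std::array<Chunk, kWaveSize> fused_iter(const Floats& in) {
    Wave lo{}, hi{}, own_bad{}, addr{};
    for (int lane = 0; lane < kWaveSize; ++lane) {
        const int sub = (lane & kPairXor) ? 1 : 0;
        lo[lane] = pack4(&in[lane][0]);               // step 1: slots 0..3
        hi[lane] = pack4(&in[lane][4]);               // step 1: slots 4..7
        own_bad[lane] = (sub == 0) ? hi[lane] : lo[lane]; // step 2
        addr[lane] = static_cast<std::uint32_t>((lane ^ kPairXor) * 4);
    }
    const Wave recv = bpermute(addr, own_bad);        // step 2: the swap

    std::array<Chunk, kWaveSize> out{};
    for (int lane = 0; lane < kWaveSize; ++lane) {
        const int sub = (lane & kPairXor) ? 1 : 0;
        // Step 3: the "good" half needs no swap result, so on hardware it
        // could be written back first; the received half fills the other slot.
        out[lane] = (sub == 0) ? Chunk{lo[lane], recv[lane]}
                               : Chunk{recv[lane], hi[lane]};
    }
    return out;
}

// Demo input: slot i of lane L holds (3 * L + i) mod 200.
inline Floats make_demo_input() {
    Floats in{};
    for (int lane = 0; lane < kWaveSize; ++lane)
        for (int i = 0; i < 8; ++i)
            in[lane][i] = static_cast<float>((lane * 3 + i) % 200);
    return in;
}
```

Under these assumptions, calling `fused_iter(make_demo_input())` leaves a sub==0 lane holding its own low pack plus its partner's low pack, and a sub==1 lane holding its partner's high pack plus its own high pack. Each lane converts and packs only once, and only one dword per lane per K-chunk crosses lanes.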

The 16x16x32 LDS-roundtrip branch keeps the original separated cvt
loop (no swap latency to hide there since the relayout goes through
LDS, not ds_bpermute).

Single-shape FP8 perf on gfx950 GPU 2 (CUDA graph, 50 iters):
  decode d=128 b=4 sq=8 sk=4096:  0.2106 -> 0.1951 ms  (-7.4%)
  decode d=64  b=4 sq=8 sk=4096:  0.1464 -> 0.1208 ms  (-17.5%)
  prefill d=128 b=2 sq=512 sk=4k: 0.2558 -> 0.2220 ms  (-13.2%)

BF16 unchanged (0.2046 -> 0.2039 ms, within noise).

Correctness: pytest UA correctness suite 405 passed / 80 skipped
(245 BF16/FP16 + 160 FP8), unchanged from before.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-18 15:48:01 +00:00

add_rmsnorm2d_rdquant

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

batched_contraction

[CK Tile] batched contraction kernel generalizing (#3126)

2025-12-02 13:30:27 +01:00

batched_transpose

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)

2026-03-27 09:18:14 +00:00

[rocm-libraries] ROCm/rocm-libraries#5237 (commit ef10dc6)

2026-03-13 01:21:08 +00:00

[rocm-libraries] ROCm/rocm-libraries#5729 (commit 516c974)

2026-03-31 03:40:25 +00:00

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 21:55:14 +00:00

Fix fmha_fwd early-exit bug: seqlen_q <= min_seqlen_q should be <

2026-04-01 16:24:31 +00:00

[rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b)

2026-02-25 16:13:13 +00:00

CK-UA: enable FP8 (e4m3) for prefill/m128 and the 32x32x16 small-tile decode variants

2026-05-15 17:34:50 +00:00

[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2)

2026-03-27 20:37:23 +00:00

[rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e)

2026-03-17 18:58:56 +00:00

grouped_convolution

[rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690)

2026-03-31 08:03:41 +00:00

image_to_column

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

Shuffle fix for gfx950 (#3491)

2026-01-13 09:21:29 -08:00

[CK Tile] multi reduce improvements (#3607)

2026-01-27 12:56:09 -08:00

Fix redundant cast in model sensitive rmsnorm (#3681)

2026-01-30 10:52:19 +08:00

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

[rocm-libraries] ROCm/rocm-libraries#4274 (commit 7c380df)

2026-02-11 05:52:42 +00:00

[CK_TILE][FMHA] Add sparse attention VSA (#3341)

2026-01-31 00:59:47 +08:00

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

Shuffle fix for gfx950 (#3491)

2026-01-13 09:21:29 -08:00

unified_attention

CK-UA: fuse FP8 cvt + cross-lane swap to hide ds_bpermute latency

2026-05-18 15:48:01 +00:00

add_rmsnorm2d_rdquant.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

batched_contraction.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

batched_transpose.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

common.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

elementwise.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

epilogue.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

flatmm.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

fmha.hpp

[rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)

2026-03-11 10:00:52 +00:00

fused_moe.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

gemm_mx.hpp

[rocm-libraries] ROCm/rocm-libraries#5241 (commit 43daeac)

2026-03-12 08:27:49 +00:00

gemm_quant.hpp

[rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a)

2026-03-16 08:31:56 +00:00

gemm.hpp

[rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a)

2026-03-16 08:31:56 +00:00

grouped_convolution.hpp

[rocm-libraries] ROCm/rocm-libraries#5241 (commit 43daeac)

2026-03-12 08:27:49 +00:00

image_to_column.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

layernorm2d.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

moe_flatmm.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

norm_reduce.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

permute.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

pooling.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

reduce.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

rmsnorm2d.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

smoothquant.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

softmax.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

sparse_attn.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

topk_softmax.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

topk.hpp

[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)

2026-03-02 12:21:44 +00:00

unified_attention.hpp

Add unified attention (42_unified_attention)

2026-04-01 16:39:15 +00:00