GetWorkspaceDeviceSizeUpperBound was computing
max_batch * nhead_q * max_seqlen_q * hdim_q
in non-deterministic group mode, but PrepareWorkspaceHost actually returns
nhead_q * seqstart_q[batch] * hdim_q
i.e., it scales with the sum of the *padded* per-batch seqlen_q, not
max_batch times the *logical* max. When per-batch padding pushes
seqstart_q[batch] past max_batch * max_seqlen_q, the launcher
under-allocates dq_acc, the
kernel writes past the buffer, and tests see either ~42% wrong QGrad
values or a GPU page fault (e.g. test_ck_tile_fmha_bwd_bf16
QKVPadding/23,24,26 corrupt; /27 page-faults).
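To make the mismatch concrete, here is a minimal sketch with simplified
free-function signatures (the real CK entry points are methods with
different parameter lists; only the size arithmetic is taken from the
text above):

    #include <cstddef>
    #include <cstdint>

    // Old upper bound for non-deterministic group mode: a logical
    // rectangle over (max_batch, max_seqlen_q).
    std::size_t upper_bound_old(std::size_t max_batch, std::size_t nhead_q,
                                std::size_t max_seqlen_q, std::size_t hdim_q)
    {
        return max_batch * nhead_q * max_seqlen_q * hdim_q;
    }

    // What PrepareWorkspaceHost effectively needs: seqstart_q is the
    // padded per-batch prefix sum, so seqstart_q[batch] is the total
    // padded q-token count, which padding can push past
    // max_batch * max_seqlen_q.
    std::size_t actual_need(const std::int32_t* seqstart_q, std::size_t batch,
                            std::size_t nhead_q, std::size_t hdim_q)
    {
        return nhead_q * static_cast<std::size_t>(seqstart_q[batch]) * hdim_q;
    }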
Fix: replace the (max_batch, max_seqlen_q) pair with a single
total_seqlen_q_padded parameter holding the true number of padded q
tokens. The launcher derives it from the trait (group: t.seqlen_q is
already the padded total; batch: t.batch * t.seqlen_q). The four
per-mode formulas collapse to one:
size = nhead_q * nsplits_factor * total_seqlen_q_padded * hdim_q
where nsplits_factor is 1 for non-deterministic modes,
ceil(max_seqlen_k / kN0) for deterministic group mode, and the
persistent-worker computation for deterministic non-group mode (the only
branch that still needs max_batch).
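A minimal sketch of the collapsed computation; the function name and
parameter list are simplifications, and the deterministic non-group
(persistent-worker) branch is elided for brevity:

    #include <cstddef>

    std::size_t workspace_upper_bound(std::size_t nhead_q,
                                      std::size_t total_seqlen_q_padded,
                                      std::size_t hdim_q,
                                      std::size_t max_seqlen_k,
                                      std::size_t kN0,
                                      bool deterministic_group)
    {
        const std::size_t nsplits_factor =
            deterministic_group
                ? (max_seqlen_k + kN0 - 1) / kN0 // ceil(max_seqlen_k / kN0)
                : 1;                             // non-deterministic modes
        return nhead_q * nsplits_factor * total_seqlen_q_padded * hdim_q;
    }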
No caller-side API change: FA, AITER and the CK runner already pass
q.shape[0] (the padded total) as traits.seqlen_q in group mode.
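For illustration, the launcher-side derivation then reduces to something
like the following; the Traits layout and field names are assumed from
the description above, not taken from the headers:

    #include <cstddef>

    struct Traits { // hypothetical subset of the launcher trait
        bool        is_group_mode;
        std::size_t batch;
        std::size_t seqlen_q; // group: padded total; batch: per-batch seqlen
    };

    std::size_t total_seqlen_q_padded(const Traits& t)
    {
        return t.is_group_mode ? t.seqlen_q            // already the padded total
                               : t.batch * t.seqlen_q; // uniform per-batch seqlen
    }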
Verified on gfx1201: full test_ck_tile_fmha_bwd_{bf16,fp16} 672/672 PASS,
0 fail, 0 crash (was 27/28 QKVPadding fails + 1 GPU illegal access).