composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-13 17:55:48 +00:00

Author	SHA1	Message	Date
Damien Lejeune	b686143624	Adding SWA decode dispatcher to support GPT-OSS shape + update smoke test	2026-05-08 14:38:16 +00:00
Damien Lejeune	f438cef286	Add smoke tests for SWA edge cases and performance gating	2026-05-08 11:30:48 +00:00
Damien Lejeune	5afd97ff5b	Adding SWA implementation + instances	2026-05-08 08:52:25 +00:00
Damien Lejeune	c132e6fc18	Prepare the interface to support SWA	2026-05-07 13:52:56 +00:00
root	65a3f88ad8	Fix CK-UA mixed batch: use max_seqlen_q for tier selection. Decode grid (num_kv_heads, num_seqs) assumes each seq has <= kBlockQ tokens. For mixed batches (decode + prefill), avg_q is low but some seqs have hundreds of tokens, causing truncation. Added max_seqlen_q to args and check it in select_tile_tier to force medium tier (1D grid with Q tile iteration) for mixed batches. 362/362 no-window shapes now pass. Made-with: Cursor	2026-04-01 18:09:48 +00:00
root	4c5e290378	Add unified attention (42_unified_attention) and topk_softmax_decode. Squashed from aghamari/unified-attention-decode-opt branch. 42_unified_attention: CK tile paged-KV attention kernel optimized for decode with 4-tier dispatch (tiny/small/medium/large), 16x16 MFMA, 2D decode grid, head-group merging. Supports hdim=64 GQA-8 and hdim=128 MHA with block_size=32. topk_softmax_decode: fused topk + softmax kernel for M=1 MoE decode. Made-with: Cursor	2026-04-01 16:24:04 +00:00
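
The mixed-batch fix described in commit 65a3f88ad8 can be illustrated with a small host-side sketch. This is not the repository's code: the real `select_tile_tier` lives in the C++ kernel sources, and everything below apart from the four tier names and the `max_seqlen_q` check is an assumption made for illustration. In particular, the average-length cut-offs, the default `block_q=16`, and the guess that the tiny and small tiers are the ones using the 2D decode grid are hypothetical.

```python
from typing import Sequence

# Assumption: the tiny and small tiers use the 2D decode grid
# (num_kv_heads, num_seqs), which processes at most block_q query
# tokens per sequence. The commit log does not say which tiers do.
DECODE_GRID_TIERS = ("tiny", "small")


def tier_from_average(avg_q: float, block_q: int) -> str:
    """Pick a tier from the average query length alone.

    The cut-offs here are hypothetical placeholders, not the
    thresholds used by the actual kernel dispatcher.
    """
    if avg_q <= 1:
        return "tiny"
    if avg_q <= block_q:
        return "small"
    if avg_q <= 8 * block_q:
        return "medium"
    return "large"


def select_tile_tier(seqlens_q: Sequence[int], block_q: int = 16) -> str:
    """Select a tier, guarding against mixed decode + prefill batches.

    A batch of mostly 1-token decode sequences plus a few long prefill
    sequences has a low average query length, so average-based
    selection alone would choose a decode-grid tier and truncate the
    long sequences. Checking max_seqlen_q forces the medium tier
    (1D grid with Q tile iteration) in that case.
    """
    if not seqlens_q:
        raise ValueError("seqlens_q must not be empty")

    avg_q = sum(seqlens_q) / len(seqlens_q)
    max_seqlen_q = max(seqlens_q)

    tier = tier_from_average(avg_q, block_q)
    if tier in DECODE_GRID_TIERS and max_seqlen_q > block_q:
        return "medium"
    return tier
```

With these assumed thresholds, a pure decode batch such as `[1, 1, 1, 1]` stays on the tiny tier, while a batch of 31 one-token sequences plus one 300-token prefill is pushed to the medium tier even though its average query length is only about 10.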