composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-17 19:40:04 +00:00

Author	SHA1	Message	Date
Damien Lejeune	5afd97ff5b	Adding SWA implementation + instances	2026-05-08 08:52:25 +00:00
root	8506db8761	Fix int32 overflow in CK-UA pipeline via pointer rebasing. tensor_coordinate::get_offset() returns index_t (int32), causing overflow when page_idx * block_size * stride > 2^31 (~131K blocks for d64/GQA-8). Fix: rebase K/V data pointer for each page using int64 arithmetic instead of set_window_origin with large offsets. After rebasing p_data_ and buffer_size_, call init_raw() to refresh the AMD buffer resource descriptor, then set_window_origin({0,0}) to reset cached coordinates. Tested: num_blocks up to 2M with nkh=1/8, blk=32/64. All pass. Made-with: Cursor	2026-04-02 09:39:07 +00:00
root	87d16738bf	WIP: CK-UA KV-segment parallelism - kernel args and split range. Added split-KV fields to UnifiedAttentionVarlenKargs (num_splits, i_split, lse_acc_ptr, o_acc_ptr + strides). Modified operator() to compute per-split KV range using blocks_per_split. INCOMPLETE: The pipeline returns normalized o_acc but the split-KV combine kernel needs unnormalized o_acc + lse. Need to modify the pipeline to optionally return m and l values alongside o_acc. The kernel changes compile but the epilogue needs the split path (write to float accumulators instead of final output). Made-with: Cursor	2026-04-01 19:09:59 +00:00
root	4c5e290378	Add unified attention (42_unified_attention) and topk_softmax_decode. Squashed from aghamari/unified-attention-decode-opt branch. 42_unified_attention: CK tile paged-KV attention kernel optimized for decode with 4-tier dispatch (tiny/small/medium/large), 16x16 MFMA, 2D decode grid, head-group merging. Supports hdim=64 GQA-8 and hdim=128 MHA with block_size=32. topk_softmax_decode: fused topk + softmax kernel for M=1 MoE decode. Made-with: Cursor	2026-04-01 16:24:04 +00:00
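
The overflow threshold quoted in commit 8506db8761 can be checked with a small host-side sketch. It assumes block_size = 32 and a row stride of 8 * 64 = 512 elements for the d64/GQA-8 case; these values are inferred from the "~131K blocks" figure and are not taken from the kernel source. `page_offset`, `overflows_int32`, and `rebase_page` are hypothetical helpers for illustration, not CK APIs, and the sketch does not model `init_raw()` or the buffer resource descriptor.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: element offset of a page, computed in 64-bit arithmetic.
constexpr std::int64_t page_offset(std::int64_t page_idx,
                                   std::int64_t block_size,
                                   std::int64_t stride)
{
    return page_idx * block_size * stride;
}

// True when the offset no longer fits in a signed 32-bit index.
constexpr bool overflows_int32(std::int64_t offset)
{
    return offset > std::numeric_limits<std::int32_t>::max();
}

// Assumed d64/GQA-8 layout: 32 rows per page, 8 * 64 = 512 elements per row.
// 131072 * 32 * 512 == 2^31, the first page whose offset overflows int32.
static_assert(!overflows_int32(page_offset(131071, 32, 512)), "last safe page");
static_assert(overflows_int32(page_offset(131072, 32, 512)), "first overflowing page");

// Sketch of the rebasing idea: move the base pointer with 64-bit arithmetic
// so that all offsets inside the page start again from zero.
template <typename T>
const T* rebase_page(const T* base,
                     std::int64_t page_idx,
                     std::int64_t block_size,
                     std::int64_t stride)
{
    assert(page_idx >= 0 && block_size > 0 && stride > 0);
    return base + page_offset(page_idx, block_size, stride);
}
```

With the pointer rebased per page, the in-page offsets are bounded by block_size * stride, which stays far below the int32 limit regardless of the page index.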

4 Commits
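
Commit 87d16738bf notes that the split-KV combine step needs unnormalized per-split outputs together with the softmax statistics m and l. The sketch below shows the standard online-softmax merge under that assumption, on the host and for a single query row. `SplitPartial` and `combine_splits` are hypothetical names, and this is not the CK combine kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Per-split partial result for one query row (hypothetical layout).
struct SplitPartial {
    float m;               // maximum score seen in this split
    float l;               // sum of exp(score - m) over the split
    std::vector<float> o;  // unnormalized output: sum of exp(score - m) * v
};

// Merge partial results from all KV splits into the final normalized output.
inline std::vector<float> combine_splits(const std::vector<SplitPartial>& parts,
                                         std::size_t head_dim)
{
    // Global maximum across splits keeps every exponent non-positive.
    float m = -std::numeric_limits<float>::infinity();
    for (const SplitPartial& p : parts) {
        m = std::max(m, p.m);
    }

    std::vector<float> out(head_dim, 0.0f);
    float l = 0.0f;
    for (const SplitPartial& p : parts) {
        assert(p.o.size() == head_dim);
        if (p.l == 0.0f) {
            continue;  // empty split contributes nothing
        }
        const float scale = std::exp(p.m - m);  // rescale to the global max
        l += p.l * scale;
        for (std::size_t d = 0; d < head_dim; ++d) {
            out[d] += p.o[d] * scale;
        }
    }

    // Normalize once, after all splits are accumulated.
    if (l > 0.0f) {
        for (float& x : out) {
            x /= l;
        }
    }
    return out;
}
```

Normalizing only at the end is why each split has to hand back an unnormalized accumulator: an already-normalized output would need to be multiplied by its own l again before it could be rescaled and summed.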