Mirror of https://github.com/ROCm/composable_kernel.git (synced 2026-06-29 11:16:59 +00:00)

Cherry-picked all unified attention files from aghamari/unified-attention-decode-opt
onto CK develop (046d3ac27). Includes:
- Unified attention pipeline, kernel, and block masking
- All kernel tiers: large (8-warp), medium (4-warp), small (2-warp), tiny (1-warp)
- block_size=32 support with a bs32 narrow tier (2-warp, 16x16 MFMA, kBlockM=32)
- int32 overflow fix (long_index_t for KV cache strides)
- BlockSize_ template parameter for flexible page block sizes
- Example binary and 40 instance files
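
Note on the int32 overflow fix: the sketch below illustrates why a KV cache
offset is computed in 64 bits. It is not the code from this change. The helper
names, the example shapes, and the local index_t / long_index_t aliases
(assumed here to be 32-bit and 64-bit signed integers) are hypothetical
stand-ins for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins; assumed (not confirmed by this commit) to be
// 32-bit and 64-bit signed integer types respectively.
using index_t      = std::int32_t;
using long_index_t = std::int64_t;

// Hypothetical helper: element offset of a page block in a paged KV cache.
// Widening block_idx before the multiply keeps the product in 64 bits.
inline long_index_t block_offset_64(index_t block_idx, long_index_t block_stride)
{
    assert(block_idx >= 0 && block_stride >= 0);
    return static_cast<long_index_t>(block_idx) * block_stride;
}

// Hypothetical 32-bit variant, shown only to expose the wraparound.
// The multiply is done on unsigned values so the demo itself avoids
// signed-overflow undefined behavior; the result is truncated to 32 bits.
inline index_t block_offset_32(index_t block_idx, index_t block_stride)
{
    assert(block_idx >= 0 && block_stride >= 0);
    const std::uint32_t wrapped =
        static_cast<std::uint32_t>(block_idx) * static_cast<std::uint32_t>(block_stride);
    return static_cast<index_t>(wrapped);
}
```

With an assumed stride of 32 * 8 * 128 = 32768 elements per block, block index
70000 gives an offset of 2293760000, which is above the int32 maximum of
2147483647. The 64-bit helper returns that value, while the 32-bit variant
cannot represent it.
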
Made-with: Cursor