Cherry-pick all unified attention files from aghamari/unified-attention-decode-opt
onto CK develop (046d3ac27). Includes:
- Unified attention pipeline, kernel, and block masking
- All kernel tiers: large (8-warp), medium (4-warp), small (2-warp), tiny (1-warp)
- block_size=32 support with a bs32 narrow tier (2-warp, 16x16 MFMA, kBlockM=32)
- int32 overflow fix (long_index_t for KV cache strides)
- BlockSize_ template parameter for flexible page block sizes
- Example binary and 40 instance files
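
Note on the int32 overflow fix listed above: a minimal sketch of why the KV
cache stride arithmetic needs a 64-bit index. It assumes long_index_t is a
64-bit signed integer. The helper names and the example shape (8 KV heads,
head_dim 128) are hypothetical and are not taken from the CK sources:

```cpp
#include <cstdint>
#include <limits>

// Assumption: stand-in for CK's long_index_t, taken here to be 64-bit signed.
using long_index_t = std::int64_t;

// Hypothetical helper: element offset of a page block in the KV cache.
// Both operands are widened BEFORE the multiply, so the product is
// computed in 64 bits instead of overflowing in 32-bit arithmetic.
constexpr long_index_t kv_block_offset(std::int32_t block_idx,
                                       std::int32_t block_stride)
{
    return static_cast<long_index_t>(block_idx) *
           static_cast<long_index_t>(block_stride);
}

// Hypothetical helper: true when an offset does not fit in a signed int32.
constexpr bool exceeds_int32(long_index_t offset)
{
    return offset > std::numeric_limits<std::int32_t>::max();
}

// Example shape (assumed): block_size=32, 8 KV heads, head_dim=128,
// giving 32 * 8 * 128 = 32768 elements per page block.
constexpr std::int32_t kBlockStride = 32 * 8 * 128;

// 70000 blocks * 32768 elements = 2,293,760,000, above INT32_MAX.
static_assert(kv_block_offset(70000, kBlockStride) == 2293760000LL,
              "64-bit offset holds the full product");
static_assert(exceeds_int32(kv_block_offset(70000, kBlockStride)),
              "the same product would not fit in int32");
```

With this shape, a cache of roughly 65k page blocks or more already pushes the
element offset past the int32 range, which is the kind of case the wider
stride type guards against.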
Made-with: Cursor