composable_kernel/example
Amir Ghamarian bbc748defe
Add unified attention d64/GQA-8 kernel instances and fix BLOCK_SIZE for small head dims
The unified attention kernel previously only supported head_size=128
with MHA (NumQPerKV=1). This change adds support for head_size=64 with
GQA-8 (NumQPerKV=8), which is the configuration used by models like
DeepSeek-V3/R1 (64 query heads, 8 KV heads, head_dim=64).
Changes:
- Add 4 new kernel instance files for d64/GQA-8:
  unified_attention_d64_{bf16,fp16}_{nmask,mask}_gqa8.cpp
- Add d64/GQA-8 dispatch path in unified_attention.cpp
- Fix BLOCK_SIZE (kPageBlockSize) in unified_attention_kernel_traits:
  derive it from HEAD_SIZE instead of hardcoding 32. For HeadSize<=64,
  BLOCK_SIZE must be 64 to guarantee NumIssues>=1 on gfx950. With
  128-bit vector loads (KVector=8) and HeadSize=64,
  LaneGroups*NumWarps=128 exceeds kPageBlockSize=32, so NumIssues
  evaluates to 0 and the LDS tile descriptor hits a division by zero
  during constexpr evaluation.
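
The failure mode behind the BLOCK_SIZE fix can be sketched in isolation. This is a minimal illustration, not the code in unified_attention_kernel_traits: num_issues, block_size_for and kRowsPerPass are hypothetical names, the quotient formula is an assumption about how NumIssues is derived, and the rows-per-pass value is a placeholder rather than the kernel's actual LaneGroups*NumWarps product.

```cpp
// Hypothetical helpers for illustration only; none of these names are CK APIs.

// Assumed model: load issues per thread needed to cover one page block.
// Integer division truncates, so the result is 0 whenever rows_per_pass
// is larger than page_block_size.
constexpr int num_issues(int page_block_size, int rows_per_pass)
{
    return page_block_size / rows_per_pass;
}

// Assumed selection rule: 64 for head sizes up to 64 (as the commit states),
// 32 otherwise (an assumption; the commit only says 32 was the old constant).
constexpr int block_size_for(int head_size)
{
    return head_size <= 64 ? 64 : 32;
}

// Illustrative placeholder, not the kernel's real LaneGroups*NumWarps value.
constexpr int kRowsPerPass = 64;

// A hardcoded page block of 32 yields zero issues; any later division by
// that count inside a constant expression would fail to compile.
static_assert(num_issues(32, kRowsPerPass) == 0,
              "a 32-row page block truncates to zero issues");

// Deriving the page block size from the head size keeps the count >= 1.
static_assert(num_issues(block_size_for(64), kRowsPerPass) >= 1,
              "a 64-row page block gives at least one issue");
```

A static_assert of this kind placed next to the traits would turn the opaque constexpr division-by-zero into a readable compile-time message.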
2026-03-27 09:41:10 -05:00