The unified attention kernel previously only supported head_size=128
with MHA (NumQPerKV=1). This change adds support for head_size=64 with
GQA-8 (NumQPerKV=8), which is the configuration used by models like
DeepSeek-V3/R1 (64 query heads, 8 KV heads, head_dim=64).
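
For illustration, the GQA-8 grouping can be sketched as a mapping from
query heads to KV heads. This is a minimal sketch, not code from the
kernel: kv_head_for_query_head and the k* constants are hypothetical
names, and it assumes the common convention that consecutive query
heads share one KV head.

```cpp
// Hypothetical helper (not part of the kernel): maps a query head to the KV
// head it shares. Assumes consecutive query heads are grouped together.
constexpr int kv_head_for_query_head(int q_head, int num_q_per_kv)
{
    return q_head / num_q_per_kv;
}

// Configuration named in the description: 64 query heads, 8 KV heads.
constexpr int kNumQueryHeads = 64;
constexpr int kNumKVHeads    = 8;
constexpr int kNumQPerKV     = kNumQueryHeads / kNumKVHeads;

static_assert(kNumQPerKV == 8, "GQA-8: eight query heads per KV head");
static_assert(kv_head_for_query_head(0, kNumQPerKV) == 0, "first head of group 0");
static_assert(kv_head_for_query_head(7, kNumQPerKV) == 0, "last head of group 0");
static_assert(kv_head_for_query_head(8, kNumQPerKV) == 1, "first head of group 1");
static_assert(kv_head_for_query_head(63, kNumQPerKV) == 7, "last head of group 7");
```

With MHA (NumQPerKV=1) the same mapping is the identity, which is why
the previous d128 path needed no grouping.
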

Changes:
- Add 4 new kernel instance files for d64/GQA-8:
  unified_attention_d64_{bf16,fp16}_{nmask,mask}_gqa8.cpp
- Add a d64/GQA-8 dispatch path in unified_attention.cpp
- Fix BLOCK_SIZE (kPageBlockSize) in unified_attention_kernel_traits:
  compute it from HEAD_SIZE instead of hardcoding 32. For HeadSize<=64,
  BLOCK_SIZE must be 64 to guarantee NumIssues>=1 on gfx950. With
  128-bit vector loads (KVector=8), LaneGroups*NumWarps=128 exceeds
  kPageBlockSize=32 when HeadSize=64, so NumIssues evaluates to 0 and
  the LDS tile descriptor constexpr evaluation divides by zero.
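
The BLOCK_SIZE fix can be sketched roughly as follows. This is an
illustration under stated assumptions, not the actual trait code:
page_block_size_for and num_issues are hypothetical names, the value 32
for HeadSize>64 is assumed to be the previous default carried over, and
NumIssues is assumed to be the integer quotient of the page block size
by LaneGroups*NumWarps. The sketch does not model how LaneGroups and
NumWarps are derived from the head size.

```cpp
// Hypothetical stand-in for the trait computation: choose the page block
// size from the head size instead of hardcoding 32.
// Assumption: head sizes above 64 keep the previous value of 32.
constexpr int page_block_size_for(int head_size)
{
    return head_size <= 64 ? 64 : 32;
}

// Assumed shape of the NumIssues computation: plain integer division, so the
// quotient truncates to 0 whenever the divisor exceeds the page block size.
constexpr int num_issues(int page_block_size, int lane_groups_times_warps)
{
    return page_block_size / lane_groups_times_warps;
}

static_assert(page_block_size_for(128) == 32, "d128 keeps the previous value");
static_assert(page_block_size_for(64) == 64, "d64 uses the larger block");

// Failing case quoted in the description:
// kPageBlockSize=32 with LaneGroups*NumWarps=128.
static_assert(num_issues(32, 128) == 0,
              "quotient truncates to 0; a later division by it is ill-formed");
```

Because the division happens in a constant expression, a zero NumIssues
shows up as a compile-time error rather than a runtime fault.
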