mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-16 10:59:55 +00:00
When using large KV caches (>131K blocks for d64/GQA-8), the tensor
coordinate offset calculation overflows int32:

  offset = row_index * stride
  where stride = num_kv_heads * head_dim = 512

With 150K blocks at block_size=32:

  max_row = 4,799,968
  max_offset = 4,799,968 × 512 = 2,457,583,616 > 2^31

This caused 77.7% of output elements to be incorrect.

Solution: Pointer rebasing
- Add k_row_stride and v_row_stride parameters to pipeline
- Calculate int64 offset and rebase buffer pointer:
  base_ptr + (int64)offset
- Set window origin to 0 (small int32 relative to new base)
- Call init_raw() to update AMD buffer resource descriptor
- Enabled only for hdim <= 64 (hdim=128 has different buffer layout)
- Falls back to original set_window_origin when strides not provided

Test results:
- 150K blocks (overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 1K blocks (no overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 131K blocks (large): CK vs Triton max diff 1.2e-4 (PASS)

Made-with: Claude Code
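
Illustrative sketch of the arithmetic and the rebasing idea described above, in host-side C++ (not the actual CK pipeline code). The helper names row_offset_i64, overflows_int32, rebase and rebase_demo are hypothetical and exist only for this example. Reading max_row as (150000 - 1) * 32 is an assumption that happens to reproduce the 4,799,968 figure.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration only; none of these are CK APIs.

// Element offset of a row, computed in 64-bit so the product cannot wrap.
constexpr std::int64_t row_offset_i64(std::int64_t row_index, std::int64_t row_stride)
{
    return row_index * row_stride;
}

// True when the offset does not fit in a signed 32-bit coordinate.
constexpr bool overflows_int32(std::int64_t offset)
{
    return offset > std::numeric_limits<std::int32_t>::max();
}

// Rebase: advance the base pointer by the 64-bit offset, so that any
// coordinate relative to the returned pointer starts again at 0.
template <typename T>
const T* rebase(const T* base_ptr, std::int64_t row_index, std::int64_t row_stride)
{
    return base_ptr + row_offset_i64(row_index, row_stride);
}

// Numbers from the commit message.
constexpr std::int64_t kStride = 8 * 64;            // num_kv_heads * head_dim = 512
constexpr std::int64_t kMaxRow = (150000 - 1) * 32; // = 4,799,968 (assumed reading)

static_assert(row_offset_i64(kMaxRow, kStride) == 2457583616LL,
              "matches max_offset in the commit message");
static_assert(overflows_int32(row_offset_i64(kMaxRow, kStride)),
              "150K blocks exceed the int32 range");
static_assert(!overflows_int32(row_offset_i64((1000 - 1) * 32, kStride)),
              "1K blocks stay within the int32 range");

// Small-buffer demonstration: row 2 with a stride of 4 elements lands
// 8 elements past the original base.
inline bool rebase_demo()
{
    int buf[16] = {};
    const int* p = rebase(buf, 2, 4);
    return p == buf + 8;
}
```

In the real kernel the rebased pointer also has to be written back into the buffer resource descriptor (the init_raw() step above); this sketch covers only the offset arithmetic.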