With large KV caches (more than 131K blocks at block_size=32 for d64/GQA-8),
the tensor coordinate offset calculation overflows int32:
offset = row_index * stride
where stride = num_kv_heads * head_dim = 512
With 150K blocks at block_size=32 (max_row is the first row of the last block):
max_row = (150,000 - 1) × 32 = 4,799,968
max_offset = 4,799,968 × 512 = 2,457,583,616 > 2^31 = 2,147,483,648
The overflow caused 77.7% of output elements to be incorrect.
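
The arithmetic above can be checked with a small host-side sketch. This is illustrative only and not part of the kernel change; the helper names (kStride, row_offset_i64, overflows_int32, last_block_first_row) are hypothetical, and num_kv_heads=8 is inferred from GQA-8 with stride 512 at head_dim=64.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumed from the description: num_kv_heads (8) * head_dim (64) = 512.
constexpr std::int64_t kStride = 8 * 64;

// Row offset computed in 64-bit arithmetic, so it cannot wrap here.
constexpr std::int64_t row_offset_i64(std::int64_t row_index, std::int64_t stride)
{
    return row_index * stride;
}

// True when the 64-bit offset does not fit in a signed 32-bit integer.
constexpr bool overflows_int32(std::int64_t row_index, std::int64_t stride)
{
    return row_offset_i64(row_index, stride) >
           std::numeric_limits<std::int32_t>::max();
}

// First row of the last block, matching max_row in the description.
constexpr std::int64_t last_block_first_row(std::int64_t num_blocks,
                                            std::int64_t block_size)
{
    return (num_blocks - 1) * block_size;
}

// Compile-time check of the 150K-block case.
static_assert(row_offset_i64(last_block_first_row(150000, 32), kStride) ==
                  2457583616LL,
              "150K-block offset should match the value in the description");
```

Under these assumptions, overflows_int32(last_block_first_row(n, 32), kStride) is true for n = 150000 and false for n = 131072, which is consistent with the 131K threshold quoted at the top.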
Solution: Pointer rebasing
- Add k_row_stride and v_row_stride parameters to the pipeline
- Calculate the offset in int64 and rebase the buffer pointer: base_ptr + (int64)offset
- Set the window origin to 0 (remaining offsets are small int32 values relative to the new base)
- Call init_raw() to update AMD buffer resource descriptor
- Enabled only for hdim <= 64 (hdim=128 has different buffer layout)
- Falls back to the original set_window_origin when the strides are not provided
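
The rebasing idea in the list above can be sketched in plain C++. This is a simplified host-side illustration, not the CK implementation: RebasedView, rebase_offset, and rebase are hypothetical names, and the buffer resource descriptor update (the init_raw() step) is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tile window; not the CK type.
template <typename T>
struct RebasedView
{
    const T* base;       // buffer pointer after rebasing
    std::int32_t origin; // window origin relative to the new base
};

// Compute the element offset in 64-bit arithmetic so it cannot wrap
// at 2^31 the way an int32 row_index * stride product would.
inline std::int64_t rebase_offset(std::int64_t row_index, std::int64_t row_stride)
{
    assert(row_index >= 0 && row_stride > 0);
    return row_index * row_stride;
}

// Move the base pointer by the 64-bit offset and reset the origin to 0.
// Later in-window offsets stay small and fit comfortably in int32.
// In the real kernel the buffer descriptor would also need refreshing
// after the pointer changes; that step is not modeled here.
template <typename T>
RebasedView<T> rebase(const T* buffer, std::int64_t row_index, std::int64_t row_stride)
{
    return RebasedView<T>{buffer + rebase_offset(row_index, row_stride), 0};
}
```

For example, rebase(k_ptr, row_index, k_row_stride), with k_ptr a hypothetical pointer to the K cache, yields a view whose base already points at the target row, so the window origin can stay at 0.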
Test results:
- 150K blocks (overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 1K blocks (no overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 131K blocks (large): CK vs Triton max diff 1.2e-4 (PASS)
Made-with: Claude Code