composable_kernel/include/ck_tile/ops
juuso-oskari 62e8f73545 Fix int32 overflow in CK-UA via pointer rebasing
With large KV caches (more than 131,072 blocks for d64/GQA-8 at block_size=32),
the tensor coordinate offset calculation overflows int32:
  offset = row_index * stride
  where stride = num_kv_heads * head_dim = 512

With 150K blocks at block_size=32:
  max_row = 149,999 × 32 = 4,799,968 (first row of the last block)
  max_offset = 4,799,968 × 512 = 2,457,583,616 > 2^31 - 1 = 2,147,483,647
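
The arithmetic above can be checked with a minimal host-side sketch. The helper names (last_block_row, row_offset, fits_in_int32) are illustrative and not part of CK; only the block-start row of the last block is considered, as in the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Values taken from the commit message; helper names are hypothetical.
constexpr std::int64_t kBlockSize = 32;
constexpr std::int64_t kStride    = 512; // num_kv_heads * head_dim

// First row of the last block in a cache holding num_blocks blocks.
constexpr std::int64_t last_block_row(std::int64_t num_blocks)
{
    return (num_blocks - 1) * kBlockSize;
}

// Element offset of a row, computed in 64 bits so it cannot wrap.
constexpr std::int64_t row_offset(std::int64_t row_index)
{
    return row_index * kStride;
}

// True when the offset is representable as a signed 32-bit integer.
constexpr bool fits_in_int32(std::int64_t offset)
{
    return offset >= std::numeric_limits<std::int32_t>::min() &&
           offset <= std::numeric_limits<std::int32_t>::max();
}

static_assert(last_block_row(150000) == 4799968, "max_row for 150K blocks");
static_assert(row_offset(4799968) == 2457583616LL, "max_offset for 150K blocks");
static_assert(!fits_in_int32(row_offset(last_block_row(150000))), "150K blocks overflow");
static_assert(fits_in_int32(row_offset(last_block_row(131072))), "131,072 blocks still fit");
static_assert(!fits_in_int32(row_offset(last_block_row(131073))), "one more block overflows");
```

Under these assumptions the threshold falls at 2^31 / (512 × 32) = 131,072 blocks, which matches the ">131K blocks" figure.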

This caused 77.7% of output elements to be incorrect.

Solution: Pointer rebasing
- Add k_row_stride and v_row_stride parameters to the pipeline
- Compute the offset in int64 and rebase the buffer pointer: base_ptr + (int64)offset
- Set the window origin to 0 (a small int32 offset relative to the new base)
- Call init_raw() to update the AMD buffer resource descriptor
- Enable only for hdim <= 64 (hdim=128 has a different buffer layout)
- Fall back to the original set_window_origin when strides are not provided
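
The rebasing idea in the list above can be sketched on the host as follows. This is an illustration under stated assumptions, not the CK pipeline code: WindowSketch, element_offset, and move_to_row are hypothetical names, a row stride of 0 stands in for "strides not provided", and the buffer resource descriptor update (init_raw()) is not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tile window: a base pointer plus a 32-bit
// row origin. This is not the CK tile-window type.
struct WindowSketch
{
    const float* base;
    std::int32_t origin_row;
};

// Row offset in elements, widened to 64 bits before multiplying.
constexpr std::int64_t element_offset(std::int32_t row, std::int64_t row_stride)
{
    return static_cast<std::int64_t>(row) * row_stride;
}

// Move the window to `row`. With a stride, fold the 64-bit offset into the
// base pointer and reset the origin to 0. Without one (row_stride == 0),
// keep the base and store the row as a 32-bit origin, as before.
inline WindowSketch move_to_row(const float* buffer, std::int32_t row, std::int64_t row_stride)
{
    if(row_stride == 0)
    {
        return WindowSketch{buffer, row};
    }
    return WindowSketch{buffer + element_offset(row, row_stride), 0};
}

// Small self-check on a 4 x 8 buffer: the rebased window sees row 3 at origin 0.
inline bool rebasing_example()
{
    constexpr std::int64_t stride = 8;
    float buffer[4 * stride] = {};
    buffer[3 * stride + 2] = 1.5f;

    const WindowSketch rebased  = move_to_row(buffer, 3, stride);
    const WindowSketch fallback = move_to_row(buffer, 3, 0);

    return rebased.origin_row == 0 && rebased.base[2] == 1.5f &&
           fallback.base == buffer && fallback.origin_row == 3;
}
```

The point of the design is that the only large quantity, row * stride, is folded into the pointer in 64 bits, so every offset that remains relative to the new base is small enough for int32.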

Test results:
- 150K blocks (overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 1K blocks (no overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 131K blocks (large): CK vs Triton max diff 1.2e-4 (PASS)

Made-with: Claude Code
2026-05-06 12:16:30 +00:00