composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-16 10:59:55 +00:00

juuso-oskari 62e8f73545 Fix int32 overflow in CK-UA via pointer rebasing

When using large KV caches (more than 131,072 blocks at block_size=32
for d64/GQA-8), the tensor coordinate offset calculation overflows
int32:
  offset = row_index * stride
  where stride = num_kv_heads * head_dim = 512

With 150K blocks at block_size=32:
  max_row = (150,000 - 1) × 32 = 4,799,968 (first row of the last block)
  max_offset = 4,799,968 × 512 = 2,457,583,616 > 2^31 - 1 = 2,147,483,647

This caused 77.7% of output elements to be incorrect.
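
The arithmetic above can be checked with a small host-side sketch. It is not code from this commit: the helper names (`last_block_offset`, `overflows_int32`, `max_safe_blocks`) are hypothetical, and the constants are the stride and block size quoted in the message. It only models the offset of the first row of the last block, as the message does.

```cpp
#include <cstdint>
#include <limits>

// Values quoted in the commit message; the names are illustrative only.
constexpr std::int64_t kRowStride = 512; // num_kv_heads * head_dim
constexpr std::int64_t kBlockSize = 32;  // rows per KV block

// Element offset of the first row of the last block, computed in 64 bits.
constexpr std::int64_t last_block_offset(std::int64_t num_blocks) {
    return (num_blocks - 1) * kBlockSize * kRowStride;
}

// True when that offset no longer fits in a signed 32-bit integer.
constexpr bool overflows_int32(std::int64_t num_blocks) {
    return last_block_offset(num_blocks) >
           std::numeric_limits<std::int32_t>::max();
}

// Largest block count whose last-block offset still fits in int32.
constexpr std::int64_t max_safe_blocks() {
    return std::numeric_limits<std::int32_t>::max() /
               (kBlockSize * kRowStride) + 1;
}
```

Under these assumptions the limit works out to 131,072 blocks, which matches the ">131K blocks" threshold stated above; 150,000 blocks exceeds it and 1,000 blocks does not.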

Solution: Pointer rebasing
- Add k_row_stride and v_row_stride parameters to pipeline
- Calculate int64 offset and rebase buffer pointer: base_ptr + (int64)offset
- Set window origin to 0 (remaining int32 offsets stay small relative to the new base)
- Call init_raw() to update AMD buffer resource descriptor
- Enabled only for hdim <= 64 (hdim=128 has different buffer layout)
- Falls back to original set_window_origin when strides not provided
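
The rebasing idea in the list above can be sketched in plain host C++. This is a minimal illustration under stated assumptions, not the ck_tile implementation: `WindowSketch`, `row_offset`, `rebase`, and `load` are hypothetical names, and the buffer-descriptor refresh (`init_raw()`) and the `set_window_origin` fallback are not modeled.

```cpp
#include <cstdint>

// Hypothetical stand-in for a tile window: a base pointer plus a small
// 32-bit origin offset. This is not the ck_tile window type.
struct WindowSketch {
    const float* base;   // buffer base pointer
    std::int32_t origin; // element offset of the window origin from base
};

// 64-bit element offset of a row. The cast happens before the multiply,
// so the product is never formed in 32 bits.
inline std::int64_t row_offset(std::int32_t row_index,
                               std::int32_t row_stride) {
    return static_cast<std::int64_t>(row_index) * row_stride;
}

// Rebase: fold the large offset into the pointer and reset the origin to 0,
// so later int32 offsets are relative to the new base.
inline WindowSketch rebase(const float* buffer, std::int32_t row_index,
                           std::int32_t row_stride) {
    return WindowSketch{buffer + row_offset(row_index, row_stride), 0};
}

// Read one element at a column offset within the rebased row.
inline float load(const WindowSketch& w, std::int32_t col) {
    return w.base[w.origin + col];
}
```

The point of the design is that only the pointer carries the large offset; every index applied after rebasing is a small column or tile offset that fits comfortably in int32.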

Test results:
- 150K blocks (overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 1K blocks (no overflow): CK vs Triton max diff 4.9e-4 (PASS)
- 131K blocks (large): CK vs Triton max diff 1.2e-4 (PASS)

Made-with: Claude Code

2026-05-06 12:16:30 +00:00

[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a)

2026-03-31 15:19:43 +00:00

ck_tile

Fix int32 overflow in CK-UA via pointer rebasing

2026-05-06 12:16:30 +00:00

rapidjson

…