When kCachePtrInt32OverflowPossible=true, we now:
1. Save original K/V buffer pointers at pipeline start
3. Always rebase by computing the offset from the original base pointer
3. Use k_row_stride/v_row_stride passed from kernel args
This fixes the bug where successive rebases would compound: each
rebase modified buf.p_data_ in place without tracking the original
base, so every new offset was applied on top of the previous ones.
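A minimal sketch of the difference between the old and new rebase
patterns, for illustration only. BufferView, rebase_compounding and
rebase_from_base are hypothetical names, and long_index_t is assumed to
be a 64-bit signed integer; only p_data_ is taken from the actual code.

```cpp
#include <cassert>
#include <cstdint>

using long_index_t = std::int64_t; // assumption: 64-bit signed index type

// Hypothetical stand-in for the K/V buffer view; only p_data_ is real.
struct BufferView
{
    const float* p_data_;
};

// Old pattern: offsets from the current pointer, so repeated calls accumulate.
inline void rebase_compounding(BufferView& buf, long_index_t offset)
{
    buf.p_data_ += offset;
}

// New pattern: always offsets from the base pointer saved at pipeline start.
inline void rebase_from_base(BufferView& buf, const float* base, long_index_t offset)
{
    assert(base != nullptr);
    buf.p_data_ = base + offset;
}
```

With the old pattern, rebasing by 8 and then by 16 leaves the pointer at
base + 24; with the new pattern it ends at base + 16, as intended.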
Key insight: computing block_offset and elem_offset in separate
long_index_t variables avoids the compiler type promotion issues that
caused assembly errors when the total offset was computed in a single
expression.
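The split computation might look roughly like the sketch below. The
function name total_offset, its parameter names, and the index_t /
long_index_t aliases are assumptions for illustration; the real kernel
code may differ, and this host-side version does not reproduce the
assembly error itself.

```cpp
#include <cstdint>

using index_t      = std::int32_t; // assumption: 32-bit index type
using long_index_t = std::int64_t; // assumption: 64-bit index type

// Hypothetical helper: widen each term into its own 64-bit variable
// before adding, instead of forming one mixed-width expression.
inline long_index_t total_offset(index_t block_idx,
                                 index_t block_stride,
                                 index_t row,
                                 index_t row_stride)
{
    const long_index_t block_offset = static_cast<long_index_t>(block_idx) * block_stride;
    const long_index_t elem_offset  = static_cast<long_index_t>(row) * row_stride;
    return block_offset + elem_offset;
}
```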
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>