tensor_coordinate::get_offset() returns index_t (int32), causing overflow
when page_idx * block_size * stride > 2^31 (~131K blocks for d64/GQA-8).
Fix: rebase K/V data pointer for each page using int64 arithmetic instead
of set_window_origin with large offsets. After rebasing p_data_ and
buffer_size_, call init_raw() to refresh the AMD buffer resource descriptor,
then set_window_origin({0,0}) to reset cached coordinates.
Tested: num_blocks up to 2M with nkh=1/8, blk=32/64. All pass.
Made-with: Cursor
Added split-KV fields to UnifiedAttentionVarlenKargs (num_splits,
i_split, lse_acc_ptr, o_acc_ptr + strides). Modified operator() to
compute per-split KV range using blocks_per_split.
INCOMPLETE: The pipeline returns normalized o_acc but the split-KV
combine kernel needs unnormalized o_acc + lse. Need to modify the
pipeline to optionally return m and l values alongside o_acc.
The kernel changes compile but the epilogue needs the split path
(write to float accumulators instead of final output).
Made-with: Cursor