Added split-KV fields to UnifiedAttentionVarlenKargs (num_splits,
i_split, lse_acc_ptr, o_acc_ptr, and their strides). Modified
operator() to compute the per-split KV range using blocks_per_split.
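
For reference, a minimal sketch of the per-split range computation,
assuming blocks_per_split is a ceiling division of the KV block count
by num_splits. KvRange and split_kv_range are hypothetical names for
illustration, not the identifiers used in the kernel.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical half-open range [begin, end) of KV blocks for one split.
struct KvRange {
    int begin;
    int end;
};

// Assumption: blocks are divided evenly with ceiling division, so trailing
// splits may receive fewer blocks, or none at all.
inline KvRange split_kv_range(int num_blocks, int num_splits, int i_split) {
    const int blocks_per_split = (num_blocks + num_splits - 1) / num_splits;
    // Clamp both ends so a split past the last block yields an empty range.
    const int begin = std::min(i_split * blocks_per_split, num_blocks);
    const int end   = std::min(begin + blocks_per_split, num_blocks);
    return KvRange{begin, end};
}
```

A split whose range is empty (begin == end) has nothing to accumulate,
so the combine step would need to ignore it.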
INCOMPLETE: the pipeline returns normalized o_acc, but the split-KV
combine kernel needs unnormalized o_acc plus lse. The pipeline must
be modified to optionally return the m and l values alongside o_acc.
The kernel changes compile, but the epilogue still needs a split path
that writes to the float accumulators instead of the final output.
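
A host-side sketch of what the combine step would compute for one query
row, assuming each split provides its row max m, its exp-sum l, and an
unnormalized o_acc (so lse = m + log(l)). SplitPartial and
combine_splits are hypothetical names, not part of the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-split partial result for a single query row.
struct SplitPartial {
    float m;              // row max of the scores seen by this split
    float l;              // sum of exp(score - m) over this split
    std::vector<float> o; // unnormalized output: sum of exp(score - m) * v
};

// Rescale every split to a common max, accumulate, then normalize once.
inline std::vector<float> combine_splits(const std::vector<SplitPartial>& parts,
                                         std::size_t head_dim) {
    float m_max = -std::numeric_limits<float>::infinity();
    for (const auto& p : parts)
        if (p.l > 0.0f) m_max = std::max(m_max, p.m);

    std::vector<float> out(head_dim, 0.0f);
    float l_total = 0.0f;
    for (const auto& p : parts) {
        if (p.l <= 0.0f) continue; // an empty split contributes nothing
        const float scale = std::exp(p.m - m_max);
        l_total += scale * p.l;
        for (std::size_t d = 0; d < head_dim; ++d) out[d] += scale * p.o[d];
    }
    if (l_total > 0.0f)
        for (float& x : out) x /= l_total;
    return out;
}
```

The single division at the end is what the normalized o_acc returned
by the current pipeline does not allow without the per-split m and l.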
Made-with: Cursor