composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Files

juuso-oskari c0e985d075 CK-UA: document why per-issue SRD-rebase path was tried and dropped

Replace the speculative TODO-style comment next to the K_mem_load /
V_mem_load dispatch with a record of the actual experiment: we
implemented async_load_tile_raw_rebased (buffer_load_dword_lds with a
per-issue SRD whose 48-bit base absorbs the wave-uniform page offset),
verified correctness on multiple big-cache decode shapes, and measured
it against the existing async_load_tile_raw_long path on an isolated
GPU. Rebased was at best tied with long and at worst ~6% slower
(b=1 sk=1M d=64 GQA8: 2.46 ms vs 2.32 ms; b=8 sk=200k d=128 GQA8:
2.12 ms vs 2.02 ms). The workloads are compute / softmax bound, not
K/V load bandwidth bound, so the buffer_load throughput edge never
materialises, while the per-issue SRD construction adds real SGPR
pressure.

No functional change in this commit -- only the explanatory comment is
updated so the next person who eyes the same idea finds the receipts
before re-implementing.
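The rebase idea in the message above comes down to one piece of address arithmetic: moving the wave-uniform page offset out of the per-lane offset and into the descriptor base leaves the effective address unchanged. The sketch below models only that arithmetic on the host. BufferView, rebase, effective_address and the example values are hypothetical names chosen for illustration; this is not the ck_tile implementation of async_load_tile_raw_rebased, and it does not reproduce the hardware descriptor layout. It also assumes, for illustration only, that the per-lane offset is a 32-bit quantity.

```cpp
#include <cstdint>

// Hypothetical host-side model of the address arithmetic only.
// Not the ck_tile code and not the real SRD bit layout.
constexpr std::uint64_t kBaseMask48 = (std::uint64_t{1} << 48) - 1;

struct BufferView {        // hypothetical stand-in for an SRD
    std::uint64_t base;    // 48-bit base address
};

// Fold a wave-uniform page offset into the base for a single load issue.
constexpr BufferView rebase(BufferView v, std::uint64_t page_offset_bytes) {
    return BufferView{(v.base + page_offset_bytes) & kBaseMask48};
}

// Address one lane reads from: base plus that lane's byte offset.
constexpr std::uint64_t effective_address(BufferView v, std::uint64_t lane_offset_bytes) {
    return v.base + lane_offset_bytes;
}

// Made-up example values, chosen so the combined offset exceeds 32 bits.
constexpr BufferView kCache{0x7f0000000000ull};
constexpr std::uint64_t kPageOffset = std::uint64_t{5} << 30;  // 5 GiB into the cache
constexpr std::uint64_t kLaneOffset = 256;

// Rebasing does not change which byte a lane reads.
static_assert(effective_address(rebase(kCache, kPageOffset), kLaneOffset) ==
                  effective_address(kCache, kPageOffset + kLaneOffset),
              "rebased and non-rebased addresses must agree");
// The combined offset would not fit an (assumed) 32-bit per-lane offset,
// while the remaining per-lane part does.
static_assert(kPageOffset + kLaneOffset > 0xffffffffull, "combined offset exceeds 32 bits");
static_assert(kLaneOffset <= 0xffffffffull, "per-lane part fits in 32 bits");
```

The cost the message points at is visible even in this toy model: rebase has to run once per load issue, which on the device means building a fresh descriptor in scalar registers each time.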

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-15 10:18:39 +00:00
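The "~6% slower" figure in the commit message can be checked directly from the four quoted timings. The sketch below does that at compile time; slowdown_percent is a hypothetical helper name, and the only inputs are the millisecond values quoted in the message.

```cpp
// Relative slowdown of candidate_ms versus baseline_ms, in percent.
// Hypothetical helper; the inputs below are the timings quoted in the commit message.
constexpr double slowdown_percent(double candidate_ms, double baseline_ms) {
    return (candidate_ms / baseline_ms - 1.0) * 100.0;
}

// b=1 sk=1M d=64 GQA8: rebased 2.46 ms vs long 2.32 ms
constexpr double kCaseA = slowdown_percent(2.46, 2.32);
// b=8 sk=200k d=128 GQA8: rebased 2.12 ms vs long 2.02 ms
constexpr double kCaseB = slowdown_percent(2.12, 2.02);

static_assert(kCaseA > 6.0 && kCaseA < 6.1, "first shape: about 6% slower");
static_assert(kCaseB > 4.9 && kCaseB < 5.0, "second shape: about 5% slower");
```

The two shapes work out to roughly 6% and 5%, consistent with the "at worst ~6% slower" summary.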

[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a)

2026-03-31 15:19:43 +00:00

ck_tile

CK-UA: document why per-issue SRD-rebase path was tried and dropped

2026-05-15 10:18:39 +00:00

rapidjson

Update pre-commit to fixed versions, run remod for ck_tile (#2895)

2025-10-16 15:29:17 -07:00