mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Files

juuso-oskari 1f69421434 CK-UA: dispatch K/V async load on cache_ptr_int32_overflow_possible

The shared-SRD buffer_load_dword_lds path that K_mem_load / V_mem_load use
wraps the per-lane voffset (int32 bytes) once
  num_blocks * page_size * row_stride * sizeof(T) > INT32_MAX,
silently returning wrong data on large paged-KV pools (e.g. >4 GB caches).
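
The overflow condition above can be checked on the host before launch. The following is a minimal sketch, not the code from this commit: the helper name `kv_pool_exceeds_int32` and its signature are hypothetical, and the example shape assumes `row_stride = hk * d` and 2-byte elements.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical host-side helper (illustrative name, not part of CK's API).
// Returns true when the byte size of the paged-KV pool exceeds what a signed
// 32-bit per-lane voffset can address.
constexpr bool kv_pool_exceeds_int32(std::int64_t num_blocks,
                                     std::int64_t page_size,
                                     std::int64_t row_stride,
                                     std::size_t elem_bytes)
{
    // All arithmetic is done in 64 bits so the check itself cannot wrap.
    return num_blocks * page_size * row_stride *
               static_cast<std::int64_t>(elem_bytes) >
           std::numeric_limits<std::int32_t>::max();
}

// Shape from the test command in this commit message, assuming
// row_stride = hk * d = 8 * 64 and 2-byte elements (both are assumptions).
static_assert(kv_pool_exceeds_int32(1200000, 16, 8 * 64, 2),
              "large pool needs the 64-bit path");
// A small pool stays on the 32-bit path.
static_assert(!kv_pool_exceeds_int32(1024, 16, 8 * 64, 2),
              "small pool fits in int32");
```

Under those assumptions the large shape comes to 19,660,800,000 bytes, well past INT32_MAX (2,147,483,647).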

Add a second path, async_load_tile_raw_long, that issues the same load via
__builtin_amdgcn_global_load_lds with per-lane 64-bit base pointers, lifting
both 4 GB limits (SRD size + voffset). Per-issue LDS pointers are computed
explicitly because the intrinsic sets m0 itself, so the old m0_set / m0_inc
bookkeeping doesn't apply. The path also clamps lane_elem_off to the live
buffer range to mimic the original SRD's hardware OOB behaviour.
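
The per-lane address computation described above might look roughly like the host-side sketch below. It only illustrates clamping a 64-bit element offset before forming a pointer. The function name `lane_global_address` and the clamp bounds `[0, num_elems - 1]` are assumptions; the real path runs on the device and passes the pointer to `__builtin_amdgcn_global_load_lds`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: form a per-lane 64-bit address from a base pointer and
// an element offset, clamping the offset to the live buffer range first.
template <typename T>
inline const T* lane_global_address(const T* base,
                                    std::int64_t lane_elem_off,
                                    std::int64_t num_elems)
{
    assert(num_elems > 0); // the clamp needs a non-empty buffer
    const std::int64_t last = num_elems - 1;
    // Clamp into [0, last] so an out-of-range lane never leaves the buffer.
    const std::int64_t clamped =
        std::max<std::int64_t>(0, std::min<std::int64_t>(lane_elem_off, last));
    // Pointer arithmetic is 64-bit, so no 4 GB voffset limit applies here.
    return base + clamped;
}
```

The clamp only guarantees that the address stays inside the buffer. What value an out-of-range lane should observe is decided by the actual kernel, not by this sketch.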

Dispatch is a wave-uniform runtime branch on a new
cache_ptr_int32_overflow_possible flag plumbed from
unified_attention_args through MakeKargs into the pipeline operator().
Small caches keep the original buffer_load throughput; only the (rare)
>4 GB cache pays the global_load_lds cost.

k_page_offsets / v_page_offsets are widened to long_index_t. The original
buffer_load path implicitly narrows back to int32 when forwarding through
async_get_vectorized_elements_raw, which is intentional and safe whenever
the overflow flag is false.
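
The following sketch illustrates why the narrowing is safe when the flag is false. The aliases are assumed to mirror ck_tile's `index_t` / `long_index_t`, and `narrow_page_offset` is a hypothetical stand-in for the implicit conversion, not a function in the library.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumed to mirror ck_tile's index types.
using index_t      = std::int32_t;
using long_index_t = std::int64_t;

// True when a widened page offset is representable on the 32-bit path.
constexpr bool fits_in_index_t(long_index_t off)
{
    return off >= std::numeric_limits<index_t>::min() &&
           off <= std::numeric_limits<index_t>::max();
}

// Hypothetical stand-in for the narrowing that happens when the widened
// offset is forwarded to the original buffer_load path.
inline index_t narrow_page_offset(long_index_t off, bool overflow_possible)
{
    // Precondition: the 32-bit path is only taken when the flag is false,
    // in which case every offset fits and the cast loses nothing.
    assert(!overflow_possible && fits_in_index_t(off));
    return static_cast<index_t>(off);
}
```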

For diagnostics, also derive a constexpr KWaveSpanInN =
(LaneGroups - 1) * NumWarps + 1 inside the pipeline; when this exceeds
page_size a single buffer_load spans multiple random pages, so the
per-issue SRD-rebase optimisation (not implemented yet) would not apply
even on a sub-4 GB cache. Informational only today.
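
The diagnostic can be illustrated with the formula given above. In this sketch the struct name `wave_span`, the helper `load_spans_multiple_pages`, and the example values for `LaneGroups` and `NumWarps` are all hypothetical; only the expression for `KWaveSpanInN` comes from the commit message.

```cpp
// Hypothetical wrapper around the constexpr described above.
template <int LaneGroups, int NumWarps>
struct wave_span
{
    // Formula as stated in the commit message.
    static constexpr int KWaveSpanInN = (LaneGroups - 1) * NumWarps + 1;
};

// True when one load touches rows from more than one page.
constexpr bool load_spans_multiple_pages(int k_wave_span_in_n, int page_size)
{
    return k_wave_span_in_n > page_size;
}

// Illustrative values only: LaneGroups = 4 and NumWarps = 8 give a span of 25.
static_assert(wave_span<4, 8>::KWaveSpanInN == 25, "(4 - 1) * 8 + 1");
// With 16-row pages the span crosses a page boundary.
static_assert(load_spans_multiple_pages(wave_span<4, 8>::KWaveSpanInN, 16),
              "25 rows do not fit in a 16-row page");
// With 32-row pages it does not.
static_assert(!load_spans_multiple_pages(wave_span<4, 8>::KWaveSpanInN, 32),
              "25 rows fit in a 32-row page");
```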

Test: ua-test-scripts correctness sweep (245/245 pass), plus
  test_single_shape.py -b 32 -sq 8192 -sk 120000 -hq 64 -hk 8 -d 64 \
      --num-blocks 1200000 --block-size 16 --test
which previously returned wrong data due to the int32 wrap and now passes
with max abs diff 1.22e-04 vs Triton.

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-15 09:00:43 +00:00

algorithm

[rocm-libraries] ROCm/rocm-libraries#5334 (commit bb5a3c8)

2026-03-20 15:47:22 +00:00

arch

CK-UA: dispatch K/V async load on cache_ptr_int32_overflow_possible

2026-05-15 09:00:43 +00:00

container

[rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b)

2026-03-30 14:20:20 +00:00

numeric

[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2)

2026-03-27 20:37:23 +00:00

tensor

CK-UA: dispatch K/V async load on cache_ptr_int32_overflow_possible

2026-05-15 09:00:43 +00:00

utility

[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0)

2026-04-01 07:32:36 +00:00

config.hpp

[rocm-libraries] ROCm/rocm-libraries#5088 (commit 36ca523)

2026-03-10 16:47:43 +00:00

README.md

introducing ck_tile! (#1216)

2024-04-15 19:27:12 -05:00

README.md

ck_tile/core

ck_tile/core contains all the basic functions and structures needed to create a GPU kernel with ck_tile. Users should include only the single header ck_tile/core.hpp to access all of this functionality. Everything lives in the ck_tile namespace. The coding style in this folder should resemble that of std (snake_case for structures and functions, CamelCase for template types, and so on).

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    basic device building blocks such as mma and buffer addressing
container/
    basic container data structures: array, sequence, tuple, ...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and the tile-level API
utility/
    other utility functions for both host and device