mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Files

juuso-oskari 87658a9518 CK-UA: hoist wave-uniform warp id out of the async-load issue loop

tile_scatter_gather::async_load_raw / async_load_raw_long recompute
get_warp_id() (threadIdx.x/warp_size + a convergent v_readfirstlane) at
every K/V load issue to form the m0 / LDS-wave base. The value is
wave-uniform and constant for the window's lifetime, but LLVM cannot
hoist or CSE it across the load loop: v_readfirstlane is convergent and
the m0 set is an asm-volatile with a memory clobber, which together pin
the recompute to each issue.

Materialize the warp id once at window construction (cached_warp_id_,
set only for the global-memory gather windows that issue these loads)
and read the cached SGPR in both async paths. ISA: the per-issue
s_lshr ÷64 and v_readfirstlane drop out of the loop (warp-id readfirstlane
sites 36 -> 11, the ÷64 shift down to 2 static sites).

Matched sweeps (line-tables-identical codegen, GQA-6 d128 page64):
  bf16 prefill: -7.25% aggregate, 12/12 shapes improved, 0 regressions
                (CK/Triton at sq>=5000 moves ~0.83x -> ~0.90x)
  fp8  all:     -0.4..-1.3% aggregate (fp8 prefill is gated by the
                ds_bpermute repack, which masks the addressing savings)
Correctness vs torch reference: PASS (fp8 + bf16).

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-06-01 11:05:21 +00:00

algorithm

[rocm-libraries] ROCm/rocm-libraries#5334 (commit bb5a3c8)

2026-03-20 15:47:22 +00:00

arch

CK-UA: relocate amd_async_global_load_lds_raw to its own header

2026-05-27 13:15:55 +00:00

container

[rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b)

2026-03-30 14:20:20 +00:00

numeric

[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2)

2026-03-27 20:37:23 +00:00

tensor

CK-UA: hoist wave-uniform warp id out of the async-load issue loop

2026-06-01 11:05:21 +00:00

utility

[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0)

2026-04-01 07:32:36 +00:00

config.hpp

[rocm-libraries] ROCm/rocm-libraries#5088 (commit 36ca523)

2026-03-10 16:47:43 +00:00

README.md

introducing ck_tile! (#1216 )

2024-04-15 19:27:12 -05:00

README.md

ck_tile/core

ck_tile/core contains every basic functions and structures to create a GPU kernel using ck_tile. User should only include ck_tile/core.hpp this single header to use all the functionality. Everything is under ck_tile namespace. The coding style under this folder should be similar to std (snake_case for structure/function, Camel for template types...)

algorithm/
    coordinate transform and some other reusable algorithm
arch/
    contains some basic device building block like mma, buffer addressing, etc...
container/
    contains basic container data structure, array/sequence/tuple/...
numeric/
    data type, and data type related math
tensor/
    tensor descriptors and tile level API
utility/
    other utility function for both host/device