composable_kernel/include/ck_tile/core
juuso-oskari 7772504f54 CK-UA: relocate amd_async_global_load_lds_raw to its own header
This helper (added by 1f6942143 to support the unified_attention >2GB-cache
decode path, then extended in 46e622539 with the clang<21 inline-asm
fallback) was inlined into amd_buffer_addressing.hpp and
amd_buffer_addressing_builtins.hpp purely for topical fit — both files
also house the other amd_async_* helpers. Functionally the helper has
exactly one caller (tile_scatter_gather::async_load_raw_long), and it
doesn't exist anywhere upstream.

Move it into its own header, include/ck_tile/core/arch/amd_global_load_lds_raw.hpp,
and revert the two long-standing HW-utility headers to bit-identical-to-
upstream. Net effect:

  amd_buffer_addressing.hpp           — 0 lines diff vs merge-base
  amd_buffer_addressing_builtins.hpp  — 0 lines diff vs merge-base
  amd_global_load_lds_raw.hpp         — new file (157 lines)
  tile_scatter_gather.hpp             — +1 include line

The CK_TILE_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN macro lives with the
helper in the new file, so the toolchain gate also leaves no footprint
in the addressing headers.

Verified zero perf delta on the two key UA shapes vs. the pre-relocation
build (b=128/sk=16384/d=128/bf16 and b=1/sk=1M/d=128/bf16); both PASS at
~1.51 ms / 5674 GB/s and 0.77 ms / 5609 GB/s respectively, matching
prior runs within run-to-run noise.

Motivation: shrink the surface area an eventual upstream PR would have
to defend on long-standing core HW headers. Anyone reviewing now sees
the addition as a single new file rather than a +233-line edit across
two of CK's most central utility headers.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-27 13:15:55 +00:00

ck_tile/core

ck_tile/core contains all the basic functions and structures needed to create a GPU kernel using ck_tile. Users should only need to include the single header ck_tile/core.hpp to get all of this functionality. Everything is under the ck_tile namespace. The coding style in this folder should be similar to std (snake_case for structures and functions, CamelCase for template types, ...).
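The naming convention described above can be illustrated with a small host-only sketch. It deliberately does not include ck_tile/core.hpp, so it compiles with any C++ compiler. The `example` namespace, `demo_array`, and `demo_sum` are hypothetical names made up for illustration and are not part of ck_tile.

```cpp
#include <cstddef>

// Hypothetical namespace standing in for ck_tile; nothing here is a real ck_tile API.
namespace example {

// Structure name in snake_case, template parameters in CamelCase.
template <typename DataType, std::size_t NumElements>
struct demo_array
{
    using value_type = DataType;

    // Member functions follow the std-like snake_case style as well.
    constexpr std::size_t size() const { return NumElements; }
    const DataType& operator[](std::size_t i) const { return data_[i]; }

    DataType data_[NumElements];
};

// Free function in snake_case, template parameters in CamelCase.
template <typename DataType, std::size_t NumElements>
DataType demo_sum(const demo_array<DataType, NumElements>& arr)
{
    DataType acc = DataType{0};
    for(std::size_t i = 0; i < NumElements; ++i)
        acc += arr[i];
    return acc;
}

} // namespace example
```

With this sketch, `example::demo_array<int, 3> a{{1, 2, 3}};` followed by `example::demo_sum(a)` sums the three elements. In real kernel code the corresponding types would come from the ck_tile namespace after including ck_tile/core.hpp.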

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    contains basic device building blocks such as mma, buffer addressing, etc.
container/
    contains basic container data structures: array, sequence, tuple, ...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and tile level API
utility/
    other utility functions for both host and device