composable_kernel/include/ck_tile/core/arch
juuso-oskari 7772504f54 CK-UA: relocate amd_async_global_load_lds_raw to its own header
This helper (added by 1f6942143 to support the unified_attention >2GB-cache
decode path, then extended in 46e622539 with the clang<21 inline-asm
fallback) was inlined into amd_buffer_addressing.hpp and
amd_buffer_addressing_builtins.hpp purely for topical fit — both files
also house the other amd_async_* helpers. Functionally the helper has
exactly one caller (tile_scatter_gather::async_load_raw_long), and it
doesn't exist anywhere upstream.

Move it into its own header, include/ck_tile/core/arch/amd_global_load_lds_raw.hpp,
and restore the two long-standing HW-utility headers so that they are
bit-identical to upstream. Net effect:

  amd_buffer_addressing.hpp           — 0 lines diff vs merge-base
  amd_buffer_addressing_builtins.hpp  — 0 lines diff vs merge-base
  amd_global_load_lds_raw.hpp         — new file (157 lines)
  tile_scatter_gather.hpp             — +1 include line

The CK_TILE_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN macro lives with the
helper in the new file, so the toolchain gate also leaves no footprint
in the addressing headers.
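
A minimal sketch of how a self-contained toolchain gate of this kind can live
next to its only consumer. The DEMO_* macro and demo_* names are hypothetical,
and keying the gate on the clang major version is an assumption drawn from the
"clang<21 inline-asm fallback" note above; the actual condition in
amd_global_load_lds_raw.hpp may differ.

```cpp
// Hypothetical stand-in for CK_TILE_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN.
// Assumption: builtin path on clang >= 21, inline-asm fallback otherwise.
// The real gate's condition is not shown in this commit message.
#if defined(__clang_major__) && (__clang_major__ >= 21)
#define DEMO_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN 1
#else
#define DEMO_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN 0
#endif

// Which implementation a helper gated this way would compile in.
enum class DemoLoadPath { Builtin, InlineAsmFallback };

constexpr DemoLoadPath demo_selected_path()
{
#if DEMO_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN
    return DemoLoadPath::Builtin;
#else
    return DemoLoadPath::InlineAsmFallback;
#endif
}

// The macro and the selected path must agree at compile time.
static_assert((demo_selected_path() == DemoLoadPath::Builtin) ==
                  (DEMO_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN == 1),
              "gate macro and selected path disagree");
```

Because both the macro and the function it selects sit in one header, any
other header stays free of the toolchain check.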

Verified no measurable perf delta vs. the pre-relocation build on the two
key UA shapes (b=128/sk=16384/d=128/bf16 and b=1/sk=1M/d=128/bf16): both
PASS, at ~1.51 ms / 5674 GB/s and 0.77 ms / 5609 GB/s respectively,
matching prior runs within run-to-run noise.

Motivation: shrink the surface area an eventual upstream PR would have
to defend on long-standing core HW headers. Anyone reviewing now sees
the addition as a single new file rather than a +233-line edit across
two of CK's most central utility headers.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-27 13:15:55 +00:00