mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 11:16:59 +00:00
This helper (added by 1f6942143 to support the unified_attention >2GB-cache decode path, then extended in 46e622539 with the clang<21 inline-asm fallback) was inlined into amd_buffer_addressing.hpp and amd_buffer_addressing_builtins.hpp purely for topical fit: both files also house the other amd_async_* helpers. Functionally the helper has exactly one caller (tile_scatter_gather::async_load_raw_long), and it doesn't exist anywhere upstream.

Move it into its own header, include/ck_tile/core/arch/amd_global_load_lds_raw.hpp, and revert the two long-standing HW-utility headers to bit-identical-to-upstream.

Net effect:
  amd_buffer_addressing.hpp: 0 lines diff vs merge-base
  amd_buffer_addressing_builtins.hpp: 0 lines diff vs merge-base
  amd_global_load_lds_raw.hpp: new file (157 lines)
  tile_scatter_gather.hpp: +1 include line

The CK_TILE_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN macro lives with the helper in the new file, so the toolchain gate also leaves no footprint in the addressing headers.

Verified zero perf delta on the two key UA shapes vs. the pre-relocation build (b=128/sk=16384/d=128/bf16 and b=1/sk=1M/d=128/bf16); both PASS at ~1.51 ms / 5674 GB/s and 0.77 ms / 5609 GB/s respectively, matching prior runs within run-to-run noise.

Motivation: shrink the surface area an eventual upstream PR would have to defend on long-standing core HW headers. Anyone reviewing now sees the addition as a single new file rather than a +233-line edit across two of CK's most central utility headers.

Co-authored-by: Cursor <cursoragent@cursor.com>
ck_tile/core
ck_tile/core contains all the basic functions and structures needed to create a GPU kernel using ck_tile. Users should only need to include the single header ck_tile/core.hpp to use all of this functionality. Everything is under the ck_tile namespace. The coding style under this folder should be similar to std (snake_case for structures/functions, CamelCase for template types...)
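As a rough illustration of the naming convention described above, the following sketch uses snake_case for a structure and a free function and CamelCase for template parameters. The namespace style_demo and the names fixed_array and sum_elements are hypothetical, made up for this example; they are not part of ck_tile, and real code would include ck_tile/core.hpp and use the ck_tile namespace instead.

```cpp
#include <cassert> // only needed for the usage example below
#include <cstddef>

// Hypothetical namespace for illustration only; not part of ck_tile.
namespace style_demo {

// snake_case structure name, CamelCase template parameters.
template <typename DataType, std::size_t NumElements>
struct fixed_array
{
    DataType data[NumElements];

    constexpr std::size_t size() const { return NumElements; }
    constexpr const DataType& operator[](std::size_t i) const { return data[i]; }
};

// snake_case free function, CamelCase template parameters.
template <typename DataType, std::size_t NumElements>
constexpr DataType sum_elements(const fixed_array<DataType, NumElements>& values)
{
    DataType total{};
    for(std::size_t i = 0; i < NumElements; ++i)
        total += values[i];
    return total;
}

} // namespace style_demo
```

A caller could write style_demo::fixed_array<int, 3> values{{1, 2, 3}}; and then check assert(style_demo::sum_elements(values) == 6);. The sketch only shows the naming pattern, not any actual ck_tile container.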
algorithm/
coordinate transforms and other reusable algorithms
arch/
contains basic device building blocks such as mma, buffer addressing, etc.
container/
contains basic container data structures: array/sequence/tuple/...
numeric/
data types and data-type-related math
tensor/
tensor descriptors and the tile-level API
utility/
other utility functions for both host and device