Files
composable_kernel/include
juuso-oskari 46e6225397 CK-UA: gate dwordx3/x4 global_load_lds builtin on clang≥21, inline-asm fallback
The size=12 and size=16 ImmArg overloads of __builtin_amdgcn_global_load_lds
for gfx950 only landed in AMD clang ~21 (present in ROCm ≥ 7.11 / clang 22,
absent in ROCm 7.1.1 / clang 20). Building this CK branch on the older
toolchain failed during semantic analysis of amd_buffer_addressing_builtins.hpp:

    error: invalid size value
       __builtin_amdgcn_global_load_lds(gptr, lptr, 16, ...);
    note: size must be 1, 2, or 4

The error is unavoidable as soon as the unified_attention pipeline is built —
its `if (cache_ptr_int32_overflow_possible)` dispatch is a runtime branch,
not `if constexpr`, so the `bytes ∈ {12, 16}` instantiations are compiled
regardless of whether any workload at runtime takes that path.

Fix: introduce CK_TILE_HAS_GLOBAL_LOAD_LDS_DWORDX4_BUILTIN, gated on
__clang_major__ >= 21 (overridable). When 0, emit
`global_load_lds_dwordx{1,3,4}` via inline asm, with M0 set explicitly
through `s_mov_b32` from the addrspace(3) `lptr` narrowed to its 32-bit
LDS byte offset and wave-uniformed via `readfirstlane`. The assembler
accepts the mnemonic and emits the same HW instruction the builtin
would lower to (verified zero perf delta vs. the builtin path across
the full decode regression sweep — all 8 (b, d, dtype) configs match
to within ≤ 1.5% run-to-run noise when the fallback is force-on).

Two simpler "issue N× size=4" decompositions were tried and rejected:
INST.OFFSET stepping by 4 reproduces the dwordx4 layout for no shape;
stepping by 256 with `gptr += 4` per issue happens to pass on one
big-cache decode shape (b=1 / sk=1M) but fails on b=128 / sk=16384 /
d=128 / bf16. The native dwordx4's in-LDS sub-issue ordering doesn't
reduce to any combination of dword INST.OFFSET steps we could find that
survives all decode shapes; asking the assembler for the literal
instruction sidesteps the question.

The dormant amd_buffer_addressing.hpp copy (used only when CK_TILE_USE_
BUFFER_ADDRESSING_BUILTIN is forced to 0, which doesn't happen on clang
≥ 20) gets the same treatment so toggling the macro doesn't reintroduce
the bug.

Allows building jukorhon/unified-attention-ck on ROCm 7.1.1 unchanged;
upgrading to a newer ROCm container remains the recommended option.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-27 12:45:18 +00:00
..