mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Files

Anton Gorenko ed8c9dd4f2 Add workaround for inefficient buffer_load to lds on 7.2

The 3rd argument of buffer_load_dwordx4 is a scalar register.
But the compiler generates a waterfall loop as if lanes could hold
different values, even though the original value comes from a scalar
register:

        v_mov_b32_e32 v187, s43
...
; %bb.65:
        s_mov_b64 exec, s[0:1]
        s_add_i32 s43, s40, 0x8400
        s_mov_b64 s[0:1], exec
        s_mov_b32 m0, s43
.LBB2_66:                               ; =>This Inner Loop Header: Depth=1
        v_readfirstlane_b32 s51, v187
        s_nop 1
        v_cmp_eq_u32_e32 vcc, s51, v187
        s_and_saveexec_b64 vcc, vcc
        s_nop 0
        buffer_load_dwordx4 v197, s[12:15], s51 offen lds
        s_xor_b64 exec, exec, vcc
        s_cbranch_execnz .LBB2_66
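
To make the cost of the loop above concrete, here is a minimal host-side sketch of what a waterfall loop does. It is an illustration only, not code from the repository: `kWaveSize`, `waterfall_iterations`, and the 64-lane wave size are assumptions made for this example. Each iteration picks the value held by the first active lane, handles every lane that holds the same value, and then removes those lanes from the active mask.

```cpp
#include <array>
#include <cstdint>

// Assumed wave size for this sketch; the exec mask in the listing is 64 bits wide.
constexpr int kWaveSize = 64;

// Hypothetical helper: returns how many times a waterfall loop would iterate
// for the given per-lane values.
int waterfall_iterations(const std::array<std::uint32_t, kWaveSize>& lane_value)
{
    std::uint64_t exec = ~std::uint64_t{0}; // all lanes start active
    int iterations     = 0;

    while(exec != 0)
    {
        // Like v_readfirstlane_b32: take the value of the first active lane.
        int first = 0;
        while(((exec >> first) & 1u) == 0)
            ++first;
        const std::uint32_t uniform = lane_value[first];

        // Like v_cmp_eq_u32 + s_and_saveexec_b64: select active lanes
        // that hold the same value.
        std::uint64_t match = 0;
        for(int lane = 0; lane < kWaveSize; ++lane)
        {
            if(((exec >> lane) & 1u) != 0 && lane_value[lane] == uniform)
                match |= std::uint64_t{1} << lane;
        }

        // The load would be issued here for the lanes in `match`.
        ++iterations;

        // Like s_xor_b64 exec, exec, vcc: drop the lanes just handled.
        exec ^= match;
    }
    return iterations;
}
```

When every lane holds the same value, as is the case when the value was copied from a scalar register, the loop body runs exactly once, so the readfirstlane, compare, and branch instructions are pure overhead.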

2026-06-19 16:20:21 +05:00

algorithm

[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76)

2026-05-27 06:56:58 -07:00

arch

Add workaround for inefficient buffer_load to lds on 7.2

2026-06-19 16:20:21 +05:00

container

[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)

2026-06-05 12:27:41 +00:00

numeric

[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)

2026-06-05 12:27:41 +00:00

tensor

[rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e)

2026-06-19 06:57:14 +00:00

utility

[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76)

2026-05-27 06:56:58 -07:00

config.hpp

[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f)

2026-06-05 12:27:41 +00:00

README.md

…

README.md

ck_tile/core

ck_tile/core contains all the basic functions and structures needed to create a GPU kernel using ck_tile. Users should include only the single header ck_tile/core.hpp to get all the functionality. Everything is under the ck_tile namespace. The coding style in this folder should be similar to std (snake_case for structures and functions, CamelCase for template types...)
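
As a small illustration of the naming convention described above, the sketch below uses snake_case for a structure and a function and CamelCase for template parameters. The namespace `example_style` and the names `demo_array` and `sum_elements` are hypothetical and do not exist in ck_tile; they only show the style.

```cpp
#include <cstddef>

namespace example_style { // hypothetical namespace, not part of ck_tile

// snake_case structure name, CamelCase template parameters
template <typename DataType, std::size_t NumElements>
struct demo_array
{
    DataType data[NumElements];

    constexpr std::size_t size() const { return NumElements; }
};

// snake_case function name, CamelCase template parameters
template <typename DataType, std::size_t NumElements>
constexpr DataType sum_elements(const demo_array<DataType, NumElements>& values)
{
    DataType total{};
    for(std::size_t i = 0; i < NumElements; ++i)
        total += values.data[i];
    return total;
}

} // namespace example_style
```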

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    basic device building blocks such as mma, buffer addressing, etc.
container/
    basic container data structures: array, sequence, tuple, ...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and tile-level API
utility/
    other utility functions for both host and device