mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 08:48:45 +00:00


Hosang Yoon e7e8801dc3 [rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7)

[CK_TILE] Use gfx11 float buffer atomics in FMHA Bwd

## Motivation

FlashAttention CK backward on gfx11 can hit out-of-bounds/tail writes in
the dQ accumulator atomic-add path when sequence rows are padded at the
tile level but not marked invalid in the DQDKDV main tensor view.

With the generic global atomic fallback, a tail element that is wrongly
treated as valid can issue a real pointer-based `atomicAdd`. With the
buffer atomic path, the write goes through a buffer resource that carries
bounds information, using the same backend already used by gfx9/gfx12.

This fixes the gfx11 FMHA BWD failure without changing the gfx11 default
for unrelated CK Tile kernels.

## Technical Details

This PR enables the existing CK Tile AMD buffer float atomic-add path
only for generated FMHA BWD gfx11 translation units.

gfx11 normally uses the generic global atomic fallback for
floating-point `buffer_view::atomic_add`. That fallback performs the
atomic through a raw computed pointer and depends on the software
validity predicate to avoid invalid elements. In FMHA BWD dQ
accumulation, padded tail rows can reach this path, so using the buffer
atomic backend is safer: it uses a buffer resource with base pointer,
bounds information, and an element offset, matching the backend already
used by gfx9/gfx12.
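
The difference between the two backends can be illustrated with a small host-side model. This is only a sketch: `buffer_resource_model`, `pointer_atomic_add_model`, and `buffer_atomic_add_model` are hypothetical names invented for this illustration and are not part of CK Tile. The additions are plain (non-atomic) float adds, and the model assumes that an out-of-range buffer access is simply dropped.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of a buffer resource (not the CK Tile type):
// a base pointer plus the number of elements that may legally be touched.
struct buffer_resource_model
{
    float*      base;
    std::size_t num_elements;
};

// Models the generic fallback: the write is issued through a raw computed
// pointer whenever the software validity predicate says "valid".
// Returns true if a write was issued.
inline bool pointer_atomic_add_model(float* base, std::size_t offset, bool is_valid, float value)
{
    if(!is_valid)
        return false;
    base[offset] += value; // nothing here knows where the tensor ends
    return true;
}

// Models the buffer path: the resource's bounds are checked in addition to
// the predicate, so an incorrectly-valid tail element is dropped (assumption).
inline bool buffer_atomic_add_model(const buffer_resource_model& res,
                                    std::size_t offset,
                                    bool is_valid,
                                    float value)
{
    assert(res.base != nullptr);
    if(!is_valid || offset >= res.num_elements)
        return false;
    res.base[offset] += value;
    return true;
}
```

In this model, a padded tail row whose predicate is wrongly `true` still produces a write through the pointer path, while the buffer path rejects it because the offset lies outside `num_elements`.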

Enabling `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` globally for gfx11 is
too broad and can break unrelated gfx11 CK builds such as GEMM. Instead,
`config.hpp` now preserves an explicitly pre-defined
`CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT`, while keeping the existing
default disabled for gfx11.
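
A minimal sketch of the "preserve a pre-defined value" pattern is shown below. It is not the actual contents of `config.hpp`: `SKETCH_TARGET_IS_GFX11` is a hypothetical stand-in for the real architecture check, `uses_buffer_float_atomics` is an illustrative helper, and the non-gfx11 default of `1` is an assumption based on the statement that gfx9/gfx12 already use the buffer backend.

```cpp
// Hypothetical stand-in for the architecture check; the real header would
// key off compiler-provided target macros instead.
#ifndef SKETCH_TARGET_IS_GFX11
#define SKETCH_TARGET_IS_GFX11 1
#endif

// If a translation unit already defined the macro (for example through a
// compile definition), keep that value; otherwise fall back to the default.
#ifndef CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT
#if SKETCH_TARGET_IS_GFX11
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT 0 // default stays disabled on gfx11
#else
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT 1 // assumed default elsewhere
#endif
#endif

// Illustrative helper so the selected backend can be inspected at compile time.
constexpr bool uses_buffer_float_atomics()
{
    return CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT != 0;
}
```

Under this scheme, a generated FMHA BWD source could opt in by defining `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` to `1` before the header is included, while every other gfx11 translation unit keeps the disabled default.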

## Test Plan

Validated the change with the FlashAttention CK full test suite with
backward pass enabled on gfx11:

    pytest -q -s tests/test_flash_attn_ck.py

## Test Result

FlashAttention CK gfx11 test result:
260680 passed, 152076 skipped

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

2026-05-30 00:10:26 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| algorithm | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| arch | [rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f) | 2026-05-29 17:02:45 +00:00 |
| container | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| numeric | [rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f) | 2026-05-29 17:02:45 +00:00 |
| tensor | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| utility | [rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) | 2026-05-27 06:56:58 -07:00 |
| config.hpp | [rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7) | 2026-05-30 00:10:26 +00:00 |
| README.md | introducing ck_tile! (#1216) | 2024-04-15 19:27:12 -05:00 |

README.md

ck_tile/core

ck_tile/core contains the basic functions and structures needed to create a GPU kernel using ck_tile. Users should include only the single header ck_tile/core.hpp to get all of its functionality. Everything lives in the ck_tile namespace. The coding style in this folder should resemble that of std (snake_case for structures and functions, CamelCase for template types, and so on).

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    basic device building blocks such as mma and buffer addressing
container/
    basic container data structures: array, sequence, tuple, ...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and the tile-level API
utility/
    other utility functions for both host and device