mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-10 16:28:38 +00:00

Hosang Yoon 720cc88a31 [rocm-libraries] ROCm/rocm-libraries#6253 (commit 61934c6)

[CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253)

## Motivation

- On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA
forward adds noticeable overhead and causes a meaningful performance gap
versus fp16. The asm-based path (mode 3) does not improve this on RDNA
and can perform even worse.
- In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower
than the corresponding fp16 path.
- This PR reduces that gap by switching FMHA forward to a different BF16
conversion strategy based on Triton’s canonical-NaN
round-to-nearest-even behavior.

## Technical Details

- Add a new `standard_cnan` BF16 conversion mode to CK Tile.
- Implement a canonical-NaN RTN (round-to-nearest-even) `float -> bf16`
conversion path based on the Triton implementation.
- Enable this conversion mode by default for FMHA forward builds
targeting gfx11/gfx12.
- Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some
`hdim=128` cases to keep kernel selection aligned with the updated
conversion behavior.
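
The conversion described above can be sketched on the host as follows. This is a minimal illustration of round-to-nearest-even narrowing with a single canonical NaN, not the CK Tile or Triton code: the function name `float_to_bf16_rtn_cnan` is hypothetical, and the canonical NaN pattern `0x7FFF` is an assumption chosen for illustration (the source does not state which bit pattern is used).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of the CK Tile API.
// Converts an IEEE-754 binary32 value to bf16 bits using
// round-to-nearest-even, mapping every NaN to one canonical pattern.
inline std::uint16_t float_to_bf16_rtn_cnan(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // bit-cast without undefined behavior

    // NaN: exponent all ones and a non-zero mantissa.
    if((bits & 0x7fffffffu) > 0x7f800000u)
        return 0x7fffu; // canonical NaN (assumed encoding, sign dropped)

    // Round to nearest even: add 0x7fff plus the lowest bit that survives,
    // so exact ties round toward the value whose low bit is zero.
    const std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7fffu + lsb;

    // Keep the upper 16 bits (sign, exponent, 7 mantissa bits).
    return static_cast<std::uint16_t>(bits >> 16);
}
```

For example, `float_to_bf16_rtn_cnan(1.0f)` returns `0x3F80`, and an input exactly halfway between two bf16 neighbours rounds to the one whose low bit is zero. Collapsing all NaNs to one pattern removes the need to preserve NaN payloads, which is the property the "canonical-NaN" name refers to.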

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result

- All tests passed when running `test_ck_tile_fmha`.
- BF16 FMHA forward performance improves by up to ~5% on gfx11.
- BF16 FMHA forward performance improves by up to ~10% on gfx12.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-04-20 14:52:24 -04:00

| Name | Last commit | Date |
| --- | --- | --- |
| algorithm | [rocm-libraries] ROCm/rocm-libraries#5334 (commit bb5a3c8) | 2026-03-20 15:45:45 +00:00 |
| arch | [rocm-libraries] ROCm/rocm-libraries#5857 (commit d77cd41) | 2026-04-20 14:28:23 +00:00 |
| container | [rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) | 2026-04-10 11:17:11 -04:00 |
| numeric | [rocm-libraries] ROCm/rocm-libraries#6253 (commit 61934c6) | 2026-04-20 14:52:24 -04:00 |
| tensor | [rocm-libraries] ROCm/rocm-libraries#6168 (commit 2968835) | 2026-04-20 15:32:24 +00:00 |
| utility | [rocm-libraries] ROCm/rocm-libraries#5508 (commit 0ad0aca) | 2026-04-14 09:25:01 +02:00 |
| config.hpp | [rocm-libraries] ROCm/rocm-libraries#6253 (commit 61934c6) | 2026-04-20 14:52:24 -04:00 |
| README.md | introducing ck_tile! (#1216) | 2024-04-15 19:27:12 -05:00 |

ck_tile/core

`ck_tile/core` contains the basic functions and structures needed to create a GPU kernel with ck_tile. Users should include only the single header `ck_tile/core.hpp` to access all of its functionality. Everything lives under the `ck_tile` namespace. The coding style in this folder should resemble that of the C++ standard library (snake_case for structures and functions, CamelCase for template types, and so on).
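
As a rough illustration of the naming convention described above, the sketch below uses snake_case for a structure and a function and CamelCase for template parameters. Everything in it is hypothetical: the `demo` namespace, `small_vector`, and `reduce_sum` are invented for this example and are not ck_tile types; real code would live under the `ck_tile` namespace.

```cpp
#include <cstddef>

namespace demo { // hypothetical namespace, stands in for ck_tile

// Structure name in snake_case, template parameters in CamelCase.
template <typename DataType, std::size_t Size>
struct small_vector
{
    DataType data[Size];

    constexpr std::size_t size() const { return Size; }
    constexpr const DataType& at(std::size_t i) const { return data[i]; }
};

// Free function in snake_case: sums all elements of the vector.
template <typename DataType, std::size_t Size>
DataType reduce_sum(const small_vector<DataType, Size>& vec)
{
    DataType acc{};
    for(std::size_t i = 0; i < Size; ++i)
        acc += vec.at(i);
    return acc;
}

} // namespace demo
```

With this sketch, `demo::reduce_sum(demo::small_vector<int, 3>{{1, 2, 3}})` evaluates to `6`.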

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    basic device building blocks such as mma, buffer addressing, etc.
container/
    basic container data structures: array, sequence, tuple, ...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and tile-level API
utility/
    other utility functions for both host and device