mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 16:59:10 +00:00

Files

Anton Gorenko 66d6714376 [rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd)

[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388)

## Motivation

Improve precision of mxfp4 without performance penalties.

## Technical Details

Since scale MFMAs perform the same when neither A nor B is fp8/bf8,
the second GEMM can use fp6 x fp4 instead of fp4 x fp4, while the
types of Q, K, V stay the same.
This improves overall precision significantly because fp6 has
32 non-negative values available for P quantization, compared to
just 8 for fp4.
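
The 32-versus-8 count can be checked with a small host-side sketch. It assumes the OCP MX element layouts E2M1 for fp4 and E2M3 for fp6 (exponent bias 1, no Inf/NaN encodings); this PR description does not say which fp6 layout is used, and the helper names below are hypothetical, not ck_tile APIs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <set>

// Decode the magnitude bits (sign bit excluded) of a small float format with
// ExpBits exponent bits, ManBits mantissa bits and the given exponent bias.
// Assumes the format has no Inf/NaN encodings (true for MX fp4/fp6).
template <int ExpBits, int ManBits, int Bias = 1>
float decode_magnitude(std::uint32_t bits)
{
    assert(bits < (1u << (ExpBits + ManBits)));
    const std::uint32_t man = bits & ((1u << ManBits) - 1u);
    const std::uint32_t exp = (bits >> ManBits) & ((1u << ExpBits) - 1u);
    const float frac = static_cast<float>(man) / static_cast<float>(1u << ManBits);
    if(exp == 0)
        return std::ldexp(frac, 1 - Bias); // subnormal: 0.frac * 2^(1 - Bias)
    return std::ldexp(1.0f + frac, static_cast<int>(exp) - Bias); // normal
}

// Collect every distinct non-negative value the format can represent.
template <int ExpBits, int ManBits, int Bias = 1>
std::set<float> non_negative_values()
{
    std::set<float> values;
    for(std::uint32_t bits = 0; bits < (1u << (ExpBits + ManBits)); ++bits)
        values.insert(decode_magnitude<ExpBits, ManBits, Bias>(bits));
    return values;
}
```

Under these assumptions, `non_negative_values<2, 1>()` holds the 8 fp4 magnitudes (0 up to 6) and `non_negative_values<2, 3>()` holds the 32 fp6 magnitudes (0 up to 7.5). Since P holds softmax probabilities, which are non-negative, only these magnitudes matter for its quantization.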

A compiler bug was found in
`__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in
LCOMPILER-561), but a workaround seems to fix all failing instances.

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-05-26 06:55:17 -07:00

algorithm

[rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f)

2026-05-20 17:25:22 +03:00

arch

[rocm-libraries] ROCm/rocm-libraries#7104 (commit 0fab8d8)

2026-05-26 10:49:36 +00:00

container

[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)

2026-05-22 02:43:50 +00:00

numeric

[rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd)

2026-05-26 06:55:17 -07:00

tensor

[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)

2026-05-22 02:43:50 +00:00

utility

[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)

2026-05-22 02:43:50 +00:00

config.hpp

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

README.md

introducing ck_tile! (#1216 )

2024-04-15 19:27:12 -05:00

README.md

ck_tile/core

ck_tile/core contains all the basic functions and structures needed to create a GPU kernel using ck_tile. Users should only need to include the single header ck_tile/core.hpp to use all of the functionality. Everything is under the ck_tile namespace. The coding style under this folder should be similar to std (snake_case for structures and functions, CamelCase for template types, ...).

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    basic device building blocks such as mma, buffer addressing, etc.
container/
    basic container data structures: array, sequence, tuple, ...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and tile-level API
utility/
    other utility functions for both host and device
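
As an illustration of the std-like naming convention described above, here is a minimal self-contained sketch. The names `sketch`, `static_buffer` and `reduce_sum` are hypothetical and are not taken from ck_tile; the block only shows snake_case for structures and functions and CamelCase for template parameters.

```cpp
#include <cstddef>

namespace sketch { // hypothetical namespace, stands in for ck_tile

// snake_case structure name, CamelCase template parameters
template <typename DataType, std::size_t NumElements>
struct static_buffer
{
    DataType data[NumElements];

    constexpr std::size_t size() const { return NumElements; }
    constexpr const DataType& at(std::size_t i) const { return data[i]; }
};

// snake_case free function, CamelCase template parameters
template <typename DataType, std::size_t NumElements>
constexpr DataType reduce_sum(const static_buffer<DataType, NumElements>& buf)
{
    DataType acc{};
    for(std::size_t i = 0; i < NumElements; ++i)
        acc += buf.at(i);
    return acc;
}

} // namespace sketch
```

In real code the equivalent entities would live in the ck_tile namespace and be pulled in through ck_tile/core.hpp rather than defined locally.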