mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-11 16:59:10 +00:00
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388) ## Motivation Improve precision of mxfp4 without performance penalties. ## Technical Details Since performance of scale MFMAs is the same when neither A nor B is fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the second GEMM, while types of Q, K, V stay the same. This allows to improve overall precision significantly because fp6 has 32 non-negative values used for P quantization compared to just 8 values for fp4. It was found that there is a compiler bug with `__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in LCOMPILER-561) but a workaround seems to fix all failing instances. ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
ck_tile/core
ck_tile/core contains every basic functions and structures to create a GPU kernel using ck_tile. User should only include ck_tile/core.hpp this single header to use all the functionality. Everything is under ck_tile namespace. The coding style under this folder should be similar to std (snake_case for structure/function, Camel for template types...)
algorithm/
coordinate transform and some other reusable algorithm
arch/
contains some basic device building block like mma, buffer addressing, etc...
container/
contains basic container data structure, array/sequence/tuple/...
numeric/
data type, and data type related math
tensor/
tensor descriptors and tile level API
utility/
other utility function for both host/device