mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-10 16:28:38 +00:00
[CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388)

## Motivation

Improve the precision of mxfp4 FMHA without a performance penalty.

## Technical Details

Scale MFMAs perform the same whenever neither A nor B is fp8/bf8, so the second GEMM can use fp6 x fp4 instead of fp4 x fp4 while the types of Q, K, and V stay the same. This significantly improves overall precision, because fp6 offers 32 non-negative values for quantizing P compared to just 8 for fp4.

A compiler bug affecting `__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in LCOMPILER-561) was found, but a workaround appears to fix all failing instances.

## Test Plan

```
ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4
```

## Test Result

The tests must pass.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
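
The value counts quoted under Technical Details can be checked with a small standalone sketch. It is not code from this PR: it assumes the OCP Microscaling element encodings (fp4 as E2M1 with bias 1, fp6 as either E2M3 with bias 1 or E3M2 with bias 3), ignores the shared block scale, and uses hypothetical helper names (`decode_nonneg`, `nearest`). The description does not say which fp6 encoding the kernel uses for P, so both are enumerated.

```python
# Illustrative only: enumerate the non-negative values of small float formats.
# Assumes OCP MX element encodings (no Inf/NaN codes, subnormals supported).

def decode_nonneg(exp_bits, man_bits, bias):
    """Return every non-negative value representable with the given layout."""
    values = []
    man_scale = 1 << man_bits
    for code in range(1 << (exp_bits + man_bits)):
        e = code >> man_bits          # exponent field
        m = code & (man_scale - 1)    # mantissa field
        if e == 0:
            # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
            v = (m / man_scale) * 2.0 ** (1 - bias)
        else:
            # Normal: implicit leading 1.
            v = (1 + m / man_scale) * 2.0 ** (e - bias)
        values.append(v)
    return values


def nearest(values, x):
    """Round x to the closest representable value (ties go to the first found)."""
    return min(values, key=lambda v: abs(v - x))


fp4_e2m1 = decode_nonneg(exp_bits=2, man_bits=1, bias=1)
fp6_e2m3 = decode_nonneg(exp_bits=2, man_bits=3, bias=1)
fp6_e3m2 = decode_nonneg(exp_bits=3, man_bits=2, bias=3)

if __name__ == "__main__":
    print("fp4 E2M1:", len(fp4_e2m1), "values:", fp4_e2m1)
    print("fp6 E2M3:", len(fp6_e2m3), "values, max", max(fp6_e2m3))
    print("fp6 E3M2:", len(fp6_e3m2), "values, max", max(fp6_e3m2))
    # Example: the same input lands closer to a representable fp6 value.
    x = 0.7
    print("fp4 ->", nearest(fp4_e2m1, x), " fp6 E2M3 ->", nearest(fp6_e2m3, x))
```

Either fp6 encoding has five magnitude bits and therefore 32 non-negative codes, against three magnitude bits and 8 codes for fp4. The finer grid is what reduces the rounding error when P is quantized.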