Files
composable_kernel/example/ck_tile
Copilot ceddfcc13c [CK_TILE] Refine FMHA Readme (#6003)
Updates the FMHA README to document fp8 precision support more
accurately, replacing the outdated "experimental" section and incomplete
CLI arg descriptions.

## Changes

- **`-prec` arg**: expanded supported values from `fp16/bf16/fp8/bf8` →
`fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4`
- **`-qscale` arg**: replaced single-line `1: per-tensor quantization`
with all four modes: `pt/1`, `bs/2`, `kvbs/3`, `mx/4`
- **FP8 support section**: replaced "FP8 experimental support" paragraph
with:
  - Supported targets: gfx942/gfx950 + ROCm 6.0+
- Table distinguishing `fp8` / `fp8bf16` / `fp8fp32` by Q/K/V input type
and output type
  - Table for all `-qscale` modes with descriptions
- Note that `-vlayout=r` (`seqlen*hdim` for V) is the only supported
layout for fp8 types

<!-- START COPILOT ORIGINAL PROMPT -->



<details>

<summary>Original prompt</summary>

Please open a PR against base branch `develop` in repository
`ROCm/rocm-libraries` applying the following documentation updates
within the composable kernel path.

## Scope
Update the file:
- `projects/composablekernel/example/ck_tile/01_fmha/README.md`

## Changes to apply
Apply the combined edits described in the diffs below (two consecutive
patches). Ensure the final file content includes **both** sets of
changes.

### Patch 1
- In the CLI args section:
  - Update `-qscale` description lines to include:
    - `pt or 1, per-tensor scale`
    - `bs or 2, block scale`
    - `kvbs or 3, Q per-tensor, K/V per-page block scale`
    - `mx or 4, microscaling (exclusively for mxfp8/mxfp4)`
- Update `-prec` supported data types from `fp16/bf16/fp8/bf8` to
`fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4`.

- Replace the existing "FP8 experimental support" section with an "FP8
support" section stating:
  - FP8 FMHA kernels supported on gfx942/gfx950 with ROCm 6.0+
- Precision selectable via `-prec=fp8` (or `fp8bf16`, `fp8fp32`) for
`tile_example_fmha_fwd`

- Add a table describing `-qscale` modes:
  - `n` or `0`: No quantization scale (default)
  - `pt` or `1`: Per-tensor quantization scale
  - `bs` or `2`: Per-block quantization scale
  - `kvbs` or `3`: Q per-tensor + K/V per-page block scale
- `mx` or `4`: Microscaling (MX format), exclusively for `mxfp8` and
`mxfp4`

- Add/keep note that currently only `-vlayout=r` (`seqlen*hdim` for V
matrix) is supported for fp8 data types.

### Patch 2
Further refine the "FP8 support" paragraph to explain the difference
between `fp8`, `fp8bf16`, and `fp8fp32` via a table:

| `-prec` value | Q/K/V input type | Output type | Description |
|---|---|---|---|
| `fp8` | fp8 | fp8 | Fully fp8: both inputs and output are in fp8 |
| `fp8bf16` | fp8 | bf16 | Mixed precision: fp8 inputs, bf16 output —
useful when the consumer expects a wider-range output format |
| `fp8fp32` | fp8 | fp32 | Mixed precision: fp8 inputs, fp32 output —
highest-precision output, suitable for debugging or further fp32
processing |

Keep the rest of the `-qscale` table and the `-vlayout=r` limitation
note.

## Notes
- PR title must be: `[CK_TILE] Add fp8 in FMHA readme`
- Ensure markdown formatting renders correctly (tables, code
formatting).
- Only modify the file listed above.


The following is the prior conversation context from the user's chat
exploration (may be truncated):

User: 能幫我上這個pr嗎 在composable kernel裡的路徑

diff --git a/example/ck_tile/01_fmha/README.md
b/example/ck_tile/01_fmha/README.md
index 0b526f4e9fc..1627435863b 100644
--- a/example/ck_tile/01_fmha/README.md
+++ b/example/ck_tile/01_fmha/README.md
@@ -62,14 +62,17 @@ args:
         -d_v    head dim for v, -1 means equal to d (default:-1)
-scale_s scale factor of S. 0 means equal to 1/sqrt(hdim). (default:0)
      -qscale    n or 0, no scaling (default:n)
-                1: per-tensor quantization.
+                pt or 1, per-tensor scale
+                bs or 2, block scale
+                kvbs or 3, Q per-tensor, K/V per-page block scale
+                mx or 4, microscaling (exclusively for mxfp8/mxfp4)
       -iperm    permute input (default:1)
                 if true, will be b*h*s*d, else b*s*h*d
       -operm    permute output (default:1)
        -bias    n or 0, no bias (default:n)
e(lementwise) or 1, elementwise bias with 1*1*s*s. e:1, 1*h*s*s. e:2,
b*h*s*s
                 a(libi) or 2, alibi with 1*h. a:1, b*h
-       -prec    data type. fp16/bf16/fp8/bf8 (default:fp16)
+ -prec data type. fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4
(default:fp16)
-mask 0: no mask, 1: top-left(same as 't'), 2:bottom-right(same as 'b')
(default:0)
                 't', top-left causal mask, 'b', bottom-r causal mask
't:l,r', top-left sliding window attn(swa) with FA style left right size
@@ -161,7 +164,17 @@ We support sequence padding and variable-length
processing in both batch and gro
 
Both approaches optimize memory access patterns while supporting
flexible sequence length requirements commonly found in transformer
inference scenarios.
 
-## FP8 experimental support
-As described in [this
blog](https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482),
we have an experimental support for fp8 fmha kernels, you can evaluate
the performance by setting the arg `-prec=fp8` to the
`tile_example_fmha_fwd`, on a gfx942 machine and ROCm 6.0+.
+## FP8 support
+FP8 FMHA kernels are supported on gfx942/gfx950 machines with ROCm
6.0+. You can select fp8 precision by setting the arg `-prec=fp8` (or
`fp8bf16`, `fp8fp32`) to the `tile_example_fmha_fwd`.
 
-Currently we only support `-vlayout=r`( `seqlen*hdim` for V matrix) for
fp8 and fp8bf16 now. Full feature support will come later.
+The following quantization scale modes are available via `-qscale`:
+
+| `-qscale` value | Description |
+|---|---|
+| `n` or `0` | No quantization sca...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*
>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: asleepzzz <4926646+asleepzzz@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2026-04-08 14:58:53 +08:00
..

CK Tile Example Suite

This directory contains a comprehensive suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example illustrates a key deep learning or HPC operation, implemented using tile-based parallelism, modular pipelines, and data movement policy.


What is CK Tile?

CK Tile is a composable GPU programming API that expresses kernels as a composition of "tiles"—rectangular blocks of computation and data movement. The pipeline & policy orchestrates data movement (global <-> LDS <-> registers), computation, and synchronization, enabling high efficiency and flexibility.


Example Index

Example Operation Description
01_fmha Fused Multi-Head Attention Tile-based FMHA with masking, quantization, and epilogue fusion
02_layernorm2d LayerNorm2D Blockwise layer normalization with fusion and quantization
03_gemm GEMM Matrix multiplication with tilewise parallelism
04_img2col im2col Image-to-column transformation for GEMM-based convolution
05_reduce Reduction Tilewise sum, max, mean reductions
06_permute Permute Generic tensor permutation (up to rank-8)
09_topk_softmax TopK-Softmax Rowwise softmax and top-k selection for MoE gating
10_rmsnorm2d RMSNorm2D Root mean square normalization for LLMs
11_add_rmsnorm2d_rdquant Add + RMSNorm2D + RDQuant Fused add, RMSNorm, and rowwise dynamic quantization
12_smoothquant SmoothQuant Per-channel scaling and quantization for int8 inference
13_moe_sorting MoE Sorting Token-to-expert rearrangement for MoE dispatch
14_moe_smoothquant MoE-SmoothQuant Expert-dependent quantization fused with top-k selection
15_fused_moe Fused MoE End-to-end fused MoE block: sorting, group-GEMM, activation, weighting
16_batched_gemm Batched GEMM Parallel computation of multiple GEMMs
17_grouped_gemm Grouped GEMM Multiple independent GEMMs with different shapes
18_flatmm FLATMM Flattened matrix multiplication for packed layouts
19_gemm_multi_d Multi-D GEMM GEMM with multiple side inputs (bias, residual, etc.)
35_batched_transpose Batched Transpose NCHW <-> NHWC and other layout conversions
36_copy Copy Minimal example for tile-based memory movement
37_transpose Block Transpose High-performance tiled transpose for large tensors

Technical Highlights


How to Build & Run

mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j

Each example produces its own executable in build/bin/.


Learning and Extending


References


Back to Composable Kernel Examples