[DOCS] Documentation Addition (Readme updates) (#2495)

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

* Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
This commit is contained in:
Vidyasagar Ananthan
2025-10-16 03:10:57 -07:00
committed by GitHub
parent 013ba3c737
commit 92c67a824f
120 changed files with 8188 additions and 221 deletions


@@ -1,44 +1,35 @@
# Layernorm2D forward
# LayerNorm2D Forward with CK Tile
This folder contains example for Layernorm2D forward using `ck_tile` tile-programming implementation.
This example demonstrates efficient 2D layer normalization using the CK Tile programming model, leveraging tile-based parallelism and advanced fusion for transformer and LLM workloads.
# Implementation and feature support
---
## welford online algorithm
We use the Welford online algorithm to update `mean`/`variance` block by block. For the `N <= 4096` case we can compute `mean`/`var`/`normalization` within one loop; we call this `one-pass`. For large `N` it is hard to keep `mean`/`var` in registers/LDS while computing the `normalization`, so we need to load the input twice: first to compute `mean`/`var` block by block, then again to compute the `normalization`. We call this `two-pass`.
## Algorithm and Math
## mean/variance save
In the training case, the mean/variance need to be stored out (TBD, not supported yet).
LayerNorm computes, for each row $x$:
$$
\mu = \frac{1}{N} \sum_{i=1}^N x_i,\quad \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2
$$
$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},\quad y_i = \gamma \hat{x}_i + \beta
$$
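As a sanity check, these formulas can be evaluated with a small dependency-free Python sketch (reference math only, not the CK Tile kernel; the function name is illustrative):

```python
import math

def layernorm_row(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm for a single row, matching the formulas above."""
    n = len(x)
    mu = sum(x) / n                              # mean over the row
    var = sum((v - mu) ** 2 for v in x) / n      # biased variance
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * inv * g + b for v, g, b in zip(x, gamma, beta)]
```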
## prenorm/postnorm
- **Welford's Algorithm**: Used for numerically stable, blockwise mean/variance computation. For $N \leq 4096$, a one-pass algorithm is used; for large $N$, a two-pass approach is adopted.
![](misc/pnorm.png)
---
Since [prenorm/postnorm](https://arxiv.org/pdf/1906.01787) is quite common in LLM blocks, this example supports this feature via kernel fusion. Note that `prenorm`/`postnorm` always needs an elementwise add of a `shortcut` before the actual layernorm computation, optionally storing the result out to global memory. You can use `-fadd=1` to test `pre-add+store`, or `-fadd=2` to test `pre-add` without store-out (not codegen'd by default).
## Features
## smooth-quant/dynamic-quant
We support smooth/dynamic quantization for `int8` output by setting `-fquant=1` and `-prec_o=int8`. In this case the output undergoes rowwise dynamic quantization as shown below. Note that smooth-quant requires a `(1*N)` per-channel input scale (fp32 in our example, though this is customizable), which is element-wise multiplied into each row of the tensor before the rowwise dynamic quant is computed. Setting `-fquant=2` skips the per-channel input-scale stage and performs only the dynamic quant. This case is supported by our kernel but not generated by default (TBD: add a filter in generate.py to support on-demand codegen).
![](misc/dquant.png)
- **Prenorm/Postnorm Fusion**: Fused residual addition before/after normalization for transformer blocks.
- **Smooth/Dynamic Quantization**: Rowwise int8 quantization with per-token scale, supporting smoothquant for LLMs.
- **Flexible Precision**: Supports fp16, bf16, int8 output.
- **Efficient for Large N**: Two-pass pipeline for $N > 4096$.
- **Highly Modular**: Easily extendable for new fusion or quantization strategies.
```
# assume output int8, hidden_states is [m, n] shape and in fp16/bf16
# [m, 1]
per_token_amax, _ = torch.max(
    input=torch.abs(hidden_states),
    dim=-1,
    keepdim=True
)
per_token_scale = per_token_amax.to(dtype=torch.float32) / 127.0
# quant hidden_states
hidden_states = (hidden_states / per_token_scale).to(dtype=torch.int8)
return hidden_states, per_token_scale
# hidden_states is now int8 and will be fed to the next layer as input
# per_token_scale will be used as the dequant factor in a later layer
```
---
## Build & Run
```
# in the root of ck_tile
mkdir build && cd build
@@ -47,7 +38,7 @@ make tile_example_layernorm2d_fwd -j
```
This will result in an executable `build/bin/tile_example_layernorm2d_fwd`
## example
## Example
```
args:
-m m dimension (default:3328)
@@ -69,7 +60,43 @@ args:
-jsonfile json file name to dump results (default:layernorm2d_fwd.json)
```
---
## Technical Details
## Welford online algorithm
We use the Welford online algorithm to update `mean`/`variance` block by block. For the `N <= 4096` case we can compute `mean`/`var`/`normalization` within one loop; we call this `one-pass`. For large `N` it is hard to keep `mean`/`var` in registers/LDS while computing the `normalization`, so we need to load the input twice: first to compute `mean`/`var` block by block, then again to compute the `normalization`. We call this `two-pass`.
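The per-element update and the block-by-block merge can be sketched in plain Python (a single-threaded illustration of the algorithm, not the CK Tile kernel; `m2` is the running sum of squared deviations, so `variance = m2 / count`):

```python
def welford_update(count, mean, m2, x):
    """One Welford step: fold a new element into the running (count, mean, m2)."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def welford_merge(count_a, mean_a, m2_a, count_b, mean_b, m2_b):
    """Parallel merge (Chan et al.): combine two per-block partials into one."""
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return count, mean, m2
```

Merging the partials of two blocks reproduces the statistics of the concatenated data, which is what lets the kernel process the row tile by tile.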
## mean/variance save
In the training case, the mean/variance need to be stored out (TBD, not supported yet).
## prenorm/postnorm
![](misc/pnorm.png)
Since [prenorm/postnorm](https://arxiv.org/pdf/1906.01787) is quite common in LLM blocks, this example supports this feature via kernel fusion. Note that `prenorm`/`postnorm` always needs an elementwise add of a `shortcut` before the actual layernorm computation, optionally storing the result out to global memory. You can use `-fadd=1` to test `pre-add+store`, or `-fadd=2` to test `pre-add` without store-out (not codegen'd by default).
## smooth-quant/dynamic-quant
We support smooth/dynamic quantization for `int8` output by setting `-fquant=1` and `-prec_o=int8`. In this case the output undergoes rowwise dynamic quantization as shown below. Note that smooth-quant requires a `(1*N)` per-channel input scale (fp32 in our example, though this is customizable), which is element-wise multiplied into each row of the tensor before the rowwise dynamic quant is computed. Setting `-fquant=2` skips the per-channel input-scale stage and performs only the dynamic quant. This case is supported by our kernel but not generated by default (TBD: add a filter in generate.py to support on-demand codegen).
![](misc/dquant.png)
```
# assume output int8, hidden_states is [m, n] shape and in fp16/bf16
# [m, 1]
per_token_amax, _ = torch.max(
input=torch.abs(hidden_states),
dim=-1,
keepdim=True
)
per_token_scale = per_token_amax.to(dtype=torch.float32) / 127.0
# quant hidden_states
hidden_states = (hidden_states / per_token_scale).to(dtype=torch.int8)
return hidden_states, per_token_scale
# hidden_states is now int8 and will be fed to the next layer as input
# per_token_scale will be used as the dequant factor in a later layer
```
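The torch snippet above is pseudocode; a dependency-free, runnable sketch of the same rowwise dynamic quantization looks like this (illustrative names, nested lists standing in for tensors):

```python
def rowwise_dynamic_quant(hidden_states):
    """Rowwise int8 dynamic quantization mirroring the torch pseudocode above."""
    out, scales = [], []
    for row in hidden_states:
        amax = max(abs(v) for v in row)          # per-row absolute max
        scale = amax / 127.0 if amax > 0 else 1.0
        # quantize and clamp to the int8 range
        out.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)                     # dequant factor for later layers
    return out, scales
```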
## limitations
Note that `fquant=2`, `fadd=2`, and `prec_sm/prec_sy` other than `fp32` are not generated by default, though our kernel template supports them (TBD: add a flag in generate.py to generate those instances on demand). Besides, the `N>8192` case uses the two-pass pipeline by default, and `-fquant=1/2` is not supported there yet. If you need `N>8192` together with `fused+residual+store`, you can use this example together with `12_smoothquant`, constructing layernorm+residual and smoothquant as two kernels.
@@ -83,5 +110,25 @@ Note that `fquant=2`, `fadd=2`, `prec_sm/prec_sy` other than `fp32` are not by d
# standard fp16 layernorm 2d, m=10, n=1024, fused-smooth-quant+fused-add-store, output in int8
./build/bin/tile_example_layernorm2d_fwd -m=10 -n=1024 -prec_o=int8 -fquant=1 -fadd=1
```
---
## Source Structure
- **Kernel**: `layernorm2d_fwd.hpp` (tile-programming kernel template)
- **Executable**: `layernorm2d_fwd.cpp` (argument parsing, kernel launch)
- **Codegen**: `generate.py` (instantiates kernels for different configs)
- **Misc**: `misc/` (algorithm diagrams, e.g., prenorm/postnorm, quantization)
---
## Related CK Tile Examples
- [01_fmha](../01_fmha/README.md): Fused multi-head attention (FMHA)
- [03_gemm](../03_gemm/README.md): Tile-programming GEMM
- [12_smoothquant](../12_smoothquant/README.md): Standalone smoothquant kernel
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,10 +1,45 @@
# GEMM Matrix Multiplication
# GEMM with CK Tile
This folder contains example for GEMM using ck_tile tile-programming implementation. Currently, it only supports the basic feature of the CK Tile GEMM, but creates the placeholders for the future support on different GEMM pipeline and different GEMM modules. In the near future, we will gradually migrate all the GEMM features from old CK to CK Tile.
This example demonstrates matrix multiplication (GEMM) using the CK Tile programming model, focusing on tile-based parallelism and modular kernel design.
## build
```
# in the root of ck_tile
---
## Algorithm and Math
GEMM computes:
$$
C = A \times B
$$
where $A$ is $[M, K]$, $B$ is $[N, K]$, and $C$ is $[M, N]$.
- **BlockTile GEMM**: Each Block Tile computes a tile of $C$ by loading tiles of $A$ and $B$, performing blockwise matrix multiply-accumulation, and writing results back with the epilogue.
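The block-tile scheme above can be sketched in plain Python (illustrative only; note that $B$ is stored as $[N, K]$ in this README, so each element is $C_{mn} = \sum_k A_{mk} B_{nk}$ — tile sizes here are arbitrary, not the kernel's):

```python
def tiled_gemm(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Blockwise C = A x B with A [M, K] and B [N, K], accumulating tile by tile."""
    M, K = len(A), len(A[0])
    N = len(B)
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):              # one block tile of C per (m0, n0)
        for n0 in range(0, N, tile_n):
            for k0 in range(0, K, tile_k):      # K loop: load A/B tiles, multiply-accumulate
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        for k in range(k0, min(k0 + tile_k, K)):
                            C[m][n] += A[m][k] * B[n][k]
    return C                                     # "epilogue": results written back
```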
---
## Tile Programming Model
- **Configuration**: Specifies how the kernel is initialized: block tile dimensions, warp layout, warp tile dimensions, and other tuning options.
- **Block Tile**: Each block tile is scheduled onto a compute unit of the AMD GPU.
- **Pipeline**: Modular design allows swapping different memory/computation pipelines (e.g., basic, memory-bound, compute-bound).
- **Block GEMM**: Block-level implementation coordinating warp iteration and memory layout within a block tile.
- **Warp GEMM**: Each warp's GEMM calculation.
- **Epilogue**: Transfers the accumulated result from registers to global memory.
---
## Features
- **Flexible Layouts**: Supports row/column-major and custom strides for $A$, $B$, $C$.
- **Split K**: Splits the block tile along the K dimension as well and adds the partial results back after the matrix multiply-accumulation. This gives higher performance when M and N are small and K is large.
- **Preshuffled GEMM**: For inference, preshuffles the B (weight) matrix into the warp layout so the GEMM calculation can bypass shared memory; typically the best-performing GEMM variant.
- **Precision**: Supports fp16, bf16, fp8, bf8, int4 (for B Matrix).
- **Validation**: CPU/GPU validation and error tolerance options.
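The Split K feature can be illustrated with a short sketch: partition K, compute a partial $C$ per split, then reduce the partials (illustrative only; the kernel does this reduction on-device):

```python
def splitk_gemm(A, B, split_k=2):
    """Split-K sketch: partition K, compute partial C per split, then add back."""
    M, K, N = len(A), len(A[0]), len(B)
    chunk = (K + split_k - 1) // split_k
    partials = []
    for s in range(split_k):
        k0, k1 = s * chunk, min((s + 1) * chunk, K)
        # partial GEMM over this K slice (B is [N, K] as in this README)
        P = [[sum(A[m][k] * B[n][k] for k in range(k0, k1)) for n in range(N)]
             for m in range(M)]
        partials.append(P)
    # "add it back": reduce the per-split partial results into C
    return [[sum(P[m][n] for P in partials) for n in range(N)] for m in range(M)]
```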
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
@@ -34,9 +69,30 @@ args:
-warmup number of iterations before benchmark the kernel (default:50)
-repeat number of iterations to benchmark the kernel (default:100)
-timer gpu:gpu timer, cpu:cpu timer (default:gpu)
-split_k splitK value (default:1)
-init 0:random, 1:linear, 2:constant(1) (default:0)
-persistent 0:non-persistent, 1:persistent (default:0)
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:gemm.json)
```
## Source Structure
- **Executables**: `gemm_basic.cpp`, `universal_gemm.cpp` (different kinds of GEMM implementation)
- **Utils**: `gemm_utils.hpp` (helper functions)
- **Build**: `CMakeLists.txt`, `run_gemm_example.inc`
- **Scripts**: `script/` (build and run helpers)
---
## Related CK Tile Examples
- [01_fmha](../01_fmha/README.md): Fused multi-head attention (FMHA)
- [18_flatmm](../18_flatmm/README.md): Preshuffled GEMM alternative solution
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,13 +1,51 @@
# Image to Column
# Image to Column (im2col) with CK Tile
This folder contains example for Image to Column using ck_tile tile-programming implementation.
This example demonstrates the im2col transformation using the CK Tile programming model, a key step for converting convolution into GEMM for efficient GPU execution.
## build
```
# in the root of ck_tile
---
## Algorithm and Math
Given an input image tensor $X$ and convolution kernel size, im2col rearranges sliding windows of $X$ into columns:
- For each patch, flatten and stack as a column in the output matrix.
- Enables convolution as matrix multiplication: $\text{im2col}(X) \times W$.
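The rearrangement can be sketched for a 2D image with stride 1 and no padding (illustrative only; this sketch stacks patches as rows, and whether patches become rows or columns is just a layout convention):

```python
def im2col(X, kh, kw):
    """Flatten each kh x kw sliding window of 2D image X into one row of a matrix."""
    H, W = len(X), len(X[0])
    cols = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            # flatten one patch in row-major order
            patch = [X[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            cols.append(patch)
    return cols  # shape: [(H-kh+1)*(W-kw+1), kh*kw]
```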
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of patches).
- **Pipeline**: Modular, can be extended for fused operations (e.g., quantization, activation).
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
make tile_example_img2col -j
./bin/tile_example_img2col -?
```
This will result in an executable `build/bin/tile_example_img2col`
---
## Source Structure
- **Kernel**: `image_to_column.hpp` (tile-programming kernel template)
- **Executable**: `image_to_column.cpp` (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): GEMM with tiles (im2col output as input)
- [05_reduce](../05_reduce/README.md): Reductions with tiles
- [06_permute](../06_permute/README.md): Permutation with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -0,0 +1,53 @@
# Reduction with CK Tile
This example demonstrates parallel reduction (sum, max, etc.) using the CK Tile programming model, a core operation for normalization, statistics, and aggregation in deep learning.
---
## Algorithm and Math
Given a tensor $X$ and a reduction axis, compute:
- **Sum**: $Y = \sum_i X_i$
- **Max**: $Y = \max_i X_i$
- **Mean**: $Y = \frac{1}{N} \sum_i X_i$
- **Tilewise Reduction**: Each thread block reduces a tile (block) of the input, using shared memory and register accumulation for efficiency.
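The two-level pattern — each tile reduces locally, then the partials are combined — can be sketched as (illustrative only, not the CK Tile kernel):

```python
def tilewise_sum(x, tile=4):
    """Two-level reduction: each tile reduces locally, then partials are combined."""
    partials = [sum(x[i:i + tile]) for i in range(0, len(x), tile)]
    return sum(partials)  # final combine step (on GPU: cross-block reduction)
```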
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block) of the input tensor.
- **Pipeline**: Modular, can be extended for fused reductions or post-processing.
---
## Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_reduce -j
./bin/tile_example_reduce -?
```
---
## Source Structure
- **Kernel**: `reduce.hpp` (tile-programming kernel template)
- **Executable**: `reduce.cpp` (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): GEMM with tiles
- [04_img2col](../04_img2col/README.md): im2col transformation
- [06_permute](../06_permute/README.md): Permutation with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,8 +1,31 @@
# permute
# Permute with CK Tile
This folder contains example for permute kernel, which is similiar to [torch.permute](https://pytorch.org/docs/stable/generated/torch.permute.html) (combined with [torch.contiguous](https://pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html)). Currently we implement a generic permute kernel that support up to rank 8 arbitrary permutation with a single kernel instance. Performance is not the first consideration, we prefer a simple and general kernel implementation using `ck_tile` in this example.
This example demonstrates generic tensor permutation, similar to [torch.permute](https://pytorch.org/docs/stable/generated/torch.permute.html) (combined with [torch.contiguous](https://pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html)). Currently we implement a generic permute kernel that supports arbitrary permutations up to rank 8 with a single kernel instance. Performance is not the first consideration; we prefer a simple and general kernel implementation using `ck_tile` in this example.
---
## Algorithm and Math
Given a tensor $X$ of shape $[d_0, d_1, ..., d_{n-1}]$ and a permutation $\pi$, compute:
$$
Y_{i_0, i_1, ..., i_{n-1}} = X_{i_{\pi(0)}, i_{\pi(1)}, ..., i_{\pi(n-1)}}
$$
- **Tilewise Permute**: Each thread block processes a tile (block) of the input, computes the permuted indices, and writes to the output.
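The index math above, on a flat row-major buffer, can be sketched as follows (illustrative reference, matching torch.permute semantics: output dim $d$ comes from input dim $\pi(d)$):

```python
from itertools import product

def permute(X, shape, perm):
    """Permute a flat row-major buffer X of the given shape by perm."""
    out_shape = [shape[p] for p in perm]

    def flat_index(idx, shp):
        # row-major linearization of a multi-index
        f = 0
        for i, s in zip(idx, shp):
            f = f * s + i
        return f

    Y = [0] * len(X)
    for out_idx in product(*[range(s) for s in out_shape]):
        in_idx = [0] * len(shape)
        for d, p in enumerate(perm):
            in_idx[p] = out_idx[d]       # Y[out_idx] = X[in_idx]
        Y[flat_index(out_idx, out_shape)] = X[flat_index(in_idx, shape)]
    return Y, out_shape
```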
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input tensor.
- **Alternative Implementation**: For rank-7 tensors, a swizzled layout is supported for matrix core-friendly data loading.
---
## Build & Run
### Arguments
```
args:
-v whether to do CPU validation or not (default:1)
@@ -10,18 +33,18 @@ args:
-shape the shape of the input tensor (default:2,3,4)
-perm permute perm (default:2,1,0)
```
## build
```
# in the root of ck_tile
mkdir build && cd build
../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_example_permute -j
```
This will result in an executable `build/bin/tile_example_permute`
## some examples
### Further Examples
```
# torch
x=torch.randn(2,3,4,6)
@@ -31,16 +54,41 @@ y=x.permute(0,3,2,1).contiguous()
./build/bin/tile_example_permute -shape=2,3,4,6 -perm=0,3,2,1
```
or you can try the smoke_test
You can try the smoke_test:
```
# in the root of ck_tile, after you build this example
sh example/ck_tile/06_permute/script/smoke_test.sh
```
### alternative implementation
we have an alternative implementation under `alternative_impl/` folder, that can swizzle the tensor to be more friendly for data loading for matrix core layout. This can be enabled when dealing with a `rank-7` tensor, with a fixed pattern of either `0,1,4,2,5,3,6` or `0,1,2,4,5,3,6`. There are other shape limitation of this implementation, check the source code of `permute.cpp` for detail.
### Alternative Implementation
We have an alternative implementation under the `alternative_impl/` folder that swizzles the tensor to be more friendly for matrix-core data loading. It can be enabled for a `rank-7` tensor with a fixed pattern of either `0,1,4,2,5,3,6` or `0,1,2,4,5,3,6`. There are other shape limitations in this implementation; check the source code of `permute.cpp` for details.
```
# example
./build/bin/tile_example_permute -shape=3,6,4,32,16,2,8 -perm=0,1,4,2,5,3,6 # b_n0_k0_n1_k1_n2_k2
./build/bin/tile_example_permute -shape=3,8,4,16,16,4,8 -perm=0,1,2,4,5,3,6 # b_n0_n1_k0_k1_n2_k2
```
---
## Source Structure
- **Kernel**: `permute.hpp` (tile-programming kernel template)
- **Executable**: `permute.cpp` (argument parsing, kernel launch)
- **Alternative**: `alternative_impl/` (swizzled layout for rank-7 tensors)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): GEMM with tiles
- [05_reduce](../05_reduce/README.md): Reductions with tiles
- [35_batched_transpose](../35_batched_transpose/README.md): Batched transpose with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,9 +1,31 @@
# topk-softmax
# TopK-Softmax with CK Tile
This folder contains example for topk-softmax kernel using ck_tile tile-programming implementation. This kernel is often used in Moe model, before launching the fused-moe-gemm block. The input is a `token*expert` 2d matrix. The op will do a softmax per row(`expert`), then find the `topk` value for each row. Output is a `token*topk` weight(usually fp32) and index(int32) 2d tensor.
This example demonstrates a tile-programming implementation of TopK-Softmax, commonly used in Mixture-of-Experts (MoE) models to select the top-k experts per token, typically before launching the fused-moe-gemm block. The input is a `token*expert` 2D matrix. The op performs a softmax per row (over `expert`), then finds the `topk` values for each row. The output is a `token*topk` weight tensor (typically fp32) and a `token*topk` index tensor (int32).
## build
```
---
## Algorithm and Math
Given a matrix $X$ of shape $[\text{tokens}, \text{experts}]$:
1. **Softmax per row**: $S_{i,j} = \frac{\exp(X_{i,j})}{\sum_k \exp(X_{i,k})}$
2. **TopK selection**: For each row $i$, select the $k$ largest $S_{i,j}$ and their indices.
**Output**:
- $[\text{tokens}, k]$ weights (fp32)
- $[\text{tokens}, k]$ indices (int32)
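The two steps above can be sketched as a rowwise reference in plain Python (illustrative only; the max-subtraction for numerical stability is standard practice, not something this README specifies):

```python
import math

def topk_softmax(X, k):
    """Rowwise softmax followed by top-k selection, returning (weights, indices)."""
    weights, indices = [], []
    for row in X:
        m = max(row)                                  # stabilize the exponentials
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        s = [v / total for v in e]                    # softmax of the row
        top = sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:k]
        indices.append(top)
        weights.append([s[j] for j in top])
    return weights, indices
```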
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of rows).
- **Pipeline**: Modular, can be extended for fused operations.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
@@ -11,8 +33,9 @@ make tile_example_topk_softmax -j
```
This will result in an executable `build/bin/tile_example_topk_softmax`
## example
```
### Arguments
```bash
args:
-v whether to do CPU validation or not (default:1)
-pr_i input data type. fp16/fp32 (representing 8/16/32 bit data) (default:fp16)
@@ -28,3 +51,24 @@ args:
-jsonfile json file name to dump results (default:topk_softmax.json)
```
---
## Source Structure
- **Kernel**: [`topk_softmax_api.hpp`](topk_softmax_api.hpp) (tile-programming kernel template)
- **Executable**: [`topk_softmax.cpp`](topk_softmax.cpp) (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block using TopK-Softmax
- [05_reduce](../05_reduce/README.md): Reductions with tiles
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)


@@ -1,18 +1,45 @@
# Rmsnorm2D forward
# RMSNorm2D Forward with CK Tile
This folder contains example for Rmsnorm2D forward using ck_tile tile-programming implementation.
This example demonstrates 2D Root Mean Square Layer Normalization (RMSNorm) using the CK Tile programming model, a normalization technique widely used in LLMs and transformers.
## build
```
---
## Algorithm and Math
For each row $x$:
$$
\text{rms}(x) = \sqrt{\frac{1}{N} \sum_{i=1}^N x_i^2 + \epsilon}
$$
$$
y_i = \frac{x_i}{\text{rms}(x)} \cdot \gamma_i
$$
where $\gamma$ is a learnable scale parameter.
- **Tilewise RMSNorm**: Each thread block processes a tile (row or block), computes the mean square, normalizes, and applies scale.
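The formulas above reduce to a few lines of reference Python (math check only, not the CK Tile kernel; the function name is illustrative):

```python
import math

def rmsnorm_row(x, gamma, eps=1e-6):
    """Reference RMSNorm for one row, matching the formulas above."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)  # root mean square
    return [v / rms * g for v, g in zip(x, gamma)]         # normalize and scale
```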
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input matrix.
- **Pipeline**: Modular, can be extended for fused operations.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_rmsnorm2d_fwd -j`nproc`
```
This will result in an executable `build/bin/tile_rmsnorm2d_fwd`
## cmdline
```
### Arguments
```bash
args:
-m m dimension (default:3328)
-n n dimension (default:4096)
@@ -37,3 +64,24 @@ args:
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:rmsnorm2d_fwd.json)
```
---
## Source Structure
- **Kernel**: [`rmsnorm2d_fwd.hpp`](rmsnorm2d_fwd.hpp) (tile-programming kernel template)
- **Executable**: [`rmsnorm2d_fwd.cpp`](rmsnorm2d_fwd.cpp) (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`, `generate.py`, `script/`
---
## Related CK Tile Examples
- [02_layernorm2d](../02_layernorm2d/README.md): LayerNorm2D with tiles
- [12_smoothquant](../12_smoothquant/README.md): SmoothQuant with tiles
- [05_reduce](../05_reduce/README.md): Reductions with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)


@@ -1,9 +1,35 @@
# Add + Rmsnorm2D + rowwise dynamic quantization forward
# Add + RMSNorm2D + Rowwise Dynamic Quantization (RDQuant) with CK Tile
This folder contains example for add + Rmsnorm2D + rowwise dynamic quantization forward using ck_tile tile-programming implementation. Rdquant is short for rowwise dynamic quantization here.
This example demonstrates a fused kernel for elementwise addition, 2D RMSNorm, and rowwise dynamic quantization using the CK Tile programming model. This pattern is common in LLMs for efficient normalization and quantized inference.
## build
```
---
## Algorithm and Math
Given input $X$ and residual $R$:
1. **Elementwise Add**: $Z = X + R$
2. **RMSNorm**: $\text{rms}(Z) = \sqrt{\frac{1}{N} \sum_{i=1}^N Z_i^2 + \epsilon}$, $Y_i = \frac{Z_i}{\text{rms}(Z)} \cdot \gamma_i$
3. **Rowwise Dynamic Quantization**:
- For each row, $s = \max(|Y|) / 127$
- $Q_i = \text{round}(Y_i / s)$, $Q_i \in \text{int8}$
**Output**:
- Quantized tensor $Q$ (int8)
- Per-row scale $s$ (fp32)
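The three fused steps can be sketched for a single row in plain Python (illustrative reference only; the CK Tile kernel performs this per tile on-device):

```python
import math

def add_rmsnorm_rdquant(x, r, gamma, eps=1e-6):
    """One row: add residual, RMSNorm, then rowwise int8 dynamic quantization."""
    z = [a + b for a, b in zip(x, r)]                       # elementwise add
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)   # RMS of the row
    y = [v / rms * g for v, g in zip(z, gamma)]             # normalize + scale
    s = max(abs(v) for v in y) / 127.0                      # per-row quant scale
    q = [max(-128, min(127, round(v / s))) for v in y]      # int8 values
    return q, s
```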
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (row or block).
- **Tile Engine**: Loads tiles, performs add, RMSNorm, and quantization.
- **Pipeline**: Modular, can be extended for further fusion.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
@@ -11,8 +37,9 @@ make tile_add_rmsnorm2d_rdquant_fwd -j`nproc`
```
This will result in an executable `build/bin/tile_add_rmsnorm2d_rdquant_fwd`
## cmdline
```
### Arguments
```bash
args:
-m m dimension (default:3328)
-n n dimension (default:4096)
@@ -28,3 +55,24 @@ args:
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:add_rmsnorm2d_rdquant_fwd.json)
```
---
## Source Structure
- **Kernel**: [`add_rmsnorm2d_rdquant_fwd.hpp`](add_rmsnorm2d_rdquant_fwd.hpp) (tile-programming kernel template)
- **Executable**: [`add_rmsnorm2d_rdquant_fwd.cpp`](add_rmsnorm2d_rdquant_fwd.cpp), [`example_add_rmsnorm2d_rdquant_fwd.cpp`](example_add_rmsnorm2d_rdquant_fwd.cpp)
- **Build**: `CMakeLists.txt`, `instances/`, `script/`
---
## Related CK Tile Examples
- [10_rmsnorm2d](../10_rmsnorm2d/README.md): RMSNorm2D with tiles
- [12_smoothquant](../12_smoothquant/README.md): SmoothQuant with tiles
- [02_layernorm2d](../02_layernorm2d/README.md): LayerNorm2D with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)


@@ -1,10 +1,33 @@
# smoothquant
# SmoothQuant with CK Tile
This folder contains example for smoothquant using ck_tile tile-programming implementation.
This example demonstrates SmoothQuant, a quantization technique for transformer models, using the CK Tile programming model. SmoothQuant enables efficient int8 inference by scaling activations and weights to balance quantization error.
## build
```
# in the root of ck_tile
---
## Algorithm and Math
Given input $X$ and per-channel scale $S$:
1. **Scale**: $Y_{i,j} = X_{i,j} \cdot S_j$
2. **Rowwise Dynamic Quantization**:
- For each row, $s = \max(|Y|) / 127$
- $Q_{i,j} = \text{round}(Y_{i,j} / s)$, $Q_{i,j} \in \text{int8}$
**Output**:
- Quantized tensor $Q$ (int8)
- Per-row scale $s$ (fp32)
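The two steps above, for a single row, can be sketched as (illustrative reference only; per-channel scales would come from calibration in a real pipeline):

```python
def smoothquant_row(x, channel_scale):
    """One row of SmoothQuant: per-channel scale, then rowwise int8 dynamic quant."""
    y = [v * s for v, s in zip(x, channel_scale)]   # apply per-channel scale S_j
    amax = max(abs(v) for v in y)
    scale = amax / 127.0 if amax > 0 else 1.0       # per-row dynamic scale
    q = [max(-128, min(127, round(v / scale))) for v in y]
    return q, scale
```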
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (row or block).
- **Pipeline**: Modular, can be extended for further fusion.
---
## Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_smoothquant -j`nproc`


@@ -1,17 +1,44 @@
# moe-sorting
# MoE Sorting with CK Tile
This folder contains example for moe-sorting kernel using ck_tile tile-programming implementation. This kernel is often used in Moe model, before launching the fused-moe-gemm block. The input&weight is a `token*topk` 2d matrix. The op rearange the input weight ids into different experts and feed into fuse moe gemm kernel.
This example demonstrates MoE (Mixture-of-Experts) sorting using the CK Tile programming model. MoE sorting rearranges token-to-expert assignments for efficient dispatch to expert GEMMs, a key step in large language models with MoE layers. This kernel is often used in MoE models before launching the fused-moe-gemm block. The input and weight form a `token*topk` 2D matrix. The op rearranges the input weight ids into per-expert groups and feeds them into the fused moe gemm kernel.
## build
```
---
## Algorithm and Math
Given:
- **Input**: $[\text{tokens}, \text{topk}]$ indices and weights (from TopK-Softmax)
- **Goal**: Rearrange tokens so each expert receives its assigned tokens in contiguous blocks
**Steps:**
1. For each token, for each of its top-k experts, assign the token to the expert's input buffer.
2. Output:
- Expert-wise token lists (indices)
- Corresponding weights
This enables efficient batched GEMM per expert.
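The steps above can be sketched in plain Python (illustrative only; names are hypothetical and on-device details such as block padding are omitted):

```python
def moe_sort(topk_ids, topk_weights, num_experts):
    """Group (token, weight) pairs by expert so each expert gets a contiguous list."""
    token_lists = [[] for _ in range(num_experts)]
    weight_lists = [[] for _ in range(num_experts)]
    for token, (ids, ws) in enumerate(zip(topk_ids, topk_weights)):
        for e, w in zip(ids, ws):           # each token appears under topk experts
            token_lists[e].append(token)
            weight_lists[e].append(w)
    return token_lists, weight_lists        # expert-wise token indices and weights
```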
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of tokens or experts).
- **Pipeline**: Modular, can be extended for further fusion or dispatch.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_example_moe_sorting -j`nproc`
```
This will result in an executable `build/bin/tile_example_moe_sorting`
## example
This will result in an executable `build/bin/tile_example_moe_sorting`.
### Arguments
```
args:
-v turn CPU validation on (1) or off (0). (default:1)
@@ -39,3 +66,24 @@ args:
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:moe_sorting.json)
```
---
## Source Structure
- **Kernel**: [`moe_sorting_api.hpp`](moe_sorting_api.hpp) (tile-programming kernel template)
- **Executable**: [`moe_sorting.cpp`](moe_sorting.cpp), [`moe_sorting_api.cpp`](moe_sorting_api.cpp)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [09_topk_softmax](../09_topk_softmax/README.md): TopK-Softmax for MoE gating
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# MoE-SmoothQuant with CK Tile
This example demonstrates MoE-SmoothQuant, a fused quantization operation for Mixture-of-Experts (MoE) models, using the CK Tile programming model. Unlike standard SmoothQuant, the input scale is expert-dependent and the operation is fused with top-k expert selection: for each token, the outputs of its top-k experts are quantized with the corresponding expert scales. The scales come from a per-expert `[expert, hidden]` tensor, so the kernel reuses the `topk-id` from the preceding `topk-softmax` to select each token's experts, and expands the output and per-token scale by `topk`.
The diagram below depicts the moe-smoothquant dataflow.
![](misc/moe-sm.png)
---
## Algorithm and Math
Given:
- **Input**: $X$ of shape $[\text{tokens}, \text{topk}, \text{hidden}]$
- **Expert scales**: $S$ of shape $[\text{experts}, \text{hidden}]$
- **TopK indices**: $I$ of shape $[\text{tokens}, \text{topk}]$
**Steps:**
1. For each token $t$ and its $k$ selected experts:
- Select scale $S_{I_{t,k}, :}$ for the $k$-th expert.
- Scale: $Y_{t,k,j} = X_{t,k,j} \cdot S_{I_{t,k}, j}$
2. **Rowwise Dynamic Quantization** (per token-expert pair):
- $s_{t,k} = \max_j |Y_{t,k,j}| / 127$
- $Q_{t,k,j} = \text{round}(Y_{t,k,j} / s_{t,k})$, $Q_{t,k,j} \in \text{int8}$
**Output**:
- Quantized tensor $Q$ (int8)
- Per-token-expert scale $s$ (fp32)
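The scaling and rowwise quantization steps can be sketched for a single (token, expert) row; this is a minimal plain-Python reference, not the CK Tile kernel:

```python
def moe_smoothquant_row(x_row, expert_scale_row):
    """Scale one (token, expert) row by its expert scale, then int8-quantize.

    x_row            : hidden-dim activations for one token/expert pair
    expert_scale_row : per-hidden-unit smoothing scale of the chosen expert
    Returns (int8 values, per-row fp scale).
    """
    y = [x * s for x, s in zip(x_row, expert_scale_row)]        # Y = X * S
    row_scale = max(abs(v) for v in y) / 127.0                  # s = max|Y| / 127
    q = [max(-128, min(127, round(v / row_scale))) for v in y]  # round + clamp to int8
    return q, row_scale
```

In the fused kernel, `expert_scale_row` is selected indirectly through the token's top-k expert index.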
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of tokens, experts, or hidden units).
- **Kernel flow**: Loads input, selects expert scales via top-k indices, applies scaling and quantization, and writes results.
- **Pipeline**: Modular, can be extended for further fusion.
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_moe_smoothquant -j`nproc`
./bin/tile_example_moe_smoothquant -?
```
This will result in an executable `build/bin/tile_example_moe_smoothquant`
---
## Source Structure
- **Kernel**: [`moe_smoothquant.hpp`](moe_smoothquant.hpp) (tile-programming kernel template)
- **Executable**: [`moe_smoothquant.cpp`](moe_smoothquant.cpp)
- **Build**: `CMakeLists.txt`, `instances/`, `misc/`, `script/`
---
## Technical Notes
- **Expert-dependent scaling**: Each token's top-k experts use their own per-hidden-unit scale, requiring indirect indexing and efficient memory access.
- **Fused with top-k**: The kernel uses top-k indices from gating to select the correct expert scale for each token.
- **Rowwise quantization**: Each token-expert pair is quantized independently for maximum accuracy.
---
## Related CK Tile Examples
- [09_topk_softmax](../09_topk_softmax/README.md): TopK-Softmax for MoE gating
- [13_moe_sorting](../13_moe_sorting/README.md): MoE sorting for expert dispatch
- [12_smoothquant](../12_smoothquant/README.md): Standard SmoothQuant
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Fused-MoE with CK Tile
This example implements a highly optimized fused Mixture-of-Experts (MoE) block using the CK Tile programming model. The design fuses MoE sorting, group-GEMM, activation, and top-k weighting into a single kernel, minimizing memory traffic and maximizing throughput for large language models.
---
## Algorithm and Math
### MoE Block Structure
Given:
- **Input**: $X$ of shape $[\text{tokens}, \text{hidden}]$
- **TopK indices/weights**: $I, W$ from gating (shape $[\text{tokens}, \text{topk}]$)
- **Expert weights**: $[\text{experts}, \text{hidden}, \text{hidden}]$
**Steps:**
1. **MoE Sorting**: Rearrange tokens so each expert receives its assigned tokens in contiguous blocks (see [13_moe_sorting](../13_moe_sorting/README.md)).
2. **Group-GEMM**: For each expert, perform GEMM on its assigned tokens:
$$
Y^{(e)} = X^{(e)} W^{(e)}
$$
3. **Activation + TopK Weighting**: Apply activation (e.g., GELU) and multiply by top-k weights.
4. **Scatter/Gather**: Write results back to the original token order.
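The four steps above can be sketched end to end in plain Python; this is an illustrative reference for the dataflow (names are hypothetical, and the GEMM here is a naive triple loop), not the fused CK Tile kernel:

```python
import math

def gelu(v):
    # tanh-approximation GELU, one common choice for the activation step
    return 0.5 * v * (1.0 + math.tanh(0.7978845608 * (v + 0.044715 * v ** 3)))

def fused_moe(x, topk_ids, topk_w, expert_w):
    """Reference for the fused-MoE dataflow: sort -> expert GEMM -> act*weight -> scatter.

    x        : [tokens][hidden] activations
    expert_w : [experts][hidden][hidden] per-expert weight matrices
    Returns y : [tokens][hidden], summed over each token's top-k experts.
    """
    tokens, hidden = len(x), len(x[0])
    y = [[0.0] * hidden for _ in range(tokens)]   # output buffer zeroed up front
    for e, w_e in enumerate(expert_w):            # each expert sees its tokens as one block
        for t in range(tokens):
            for k, eid in enumerate(topk_ids[t]):
                if eid != e:
                    continue
                for j in range(hidden):
                    # GEMM row x[t] @ w_e, then activation and top-k weighting;
                    # the kernel accumulates this with atomics
                    acc = sum(x[t][i] * w_e[i][j] for i in range(hidden))
                    y[t][j] += topk_w[t][k] * gelu(acc)
    return y
```

The real kernel replaces the token loop with sorted contiguous expert blocks (step 1) so each expert runs a dense group-GEMM.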
### Technical Details
- **Scatter/Gather Group-GEMM**: Uses indirect indexing to map tokens to experts and back.
- **Block Partitioning**: Tokens are partitioned into slices per expert, with padding for alignment.
- **Atomic Accumulation**: Second GEMM uses atomics for accumulation to support overlapping tokens.
- **Buffer Zeroing**: Output buffer is zeroed in the sorting step, eliminating extra kernels.
- **Pre-shuffled Weights**: Expert weights are pre-shuffled for coalesced memory access.
- **Micro-kernel Pipeline**: Uses block-inline-asm micro-kernels for peak performance, while retaining composability.
## Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_fused_moe -j
./bin/tile_example_fused_moe -?
```
---
## Source Structure
- **Kernel**: [`fused_moe.hpp`](fused_moe.hpp), [`fused_moegemm.hpp`](fused_moegemm.hpp), [`fused_moesorting.hpp`](fused_moesorting.hpp)
- **Executable**: [`main.cpp`](main.cpp)
- **Build**: `CMakeLists.txt`, `instances/`, `misc/`
---
## Technical Notes
This is a scatter/gather group-GEMM based solution, similar to that of [vllm moe](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), but with more kernel fusion to boost performance.
![](misc/moe-0.png)
The benefit of this fused-moe design is reduced memory traffic and higher throughput from kernel fusion.

### Arguments
```
args:
...
-repeat hot iter (default:20)
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:fused_moe.json)
```
## Related CK Tile Examples
- [13_moe_sorting](../13_moe_sorting/README.md): MoE sorting for expert dispatch
- [09_topk_softmax](../09_topk_softmax/README.md): TopK-Softmax for MoE gating
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Batched GEMM with CK Tile
This example demonstrates batched matrix multiplication (Batched GEMM) using the CK Tile programming model, enabling efficient parallel computation of multiple independent GEMMs in a single kernel launch.
---
## Algorithm and Math
Given:
- $A$: $[\text{batch}, M, K]$
- $B$: $[\text{batch}, K, N]$
- $C$: $[\text{batch}, M, N]$
For each batch $b$:
$$
C^{(b)} = A^{(b)} \times B^{(b)}
$$
- **Tilewise Batched GEMM**: Each thread block processes a tile of $C$ for a specific batch, loading corresponding tiles from $A$ and $B$, performing blockwise matrix multiply-accumulate, and writing results.
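The per-batch computation can be written as a minimal plain-Python reference (naive loops, no tiling), useful for validating the kernel's output:

```python
def batched_gemm(a, b):
    """C[b] = A[b] @ B[b] for each batch; plain-Python reference.

    a : [batch][M][K], b : [batch][K][N] -> returns [batch][M][N]
    """
    out = []
    for A, B in zip(a, b):   # one independent GEMM per batch
        M, K, N = len(A), len(B), len(B[0])
        C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
             for i in range(M)]
        out.append(C)
    return out
```

The CK Tile kernel assigns each thread block one tile of one batch's $C$ instead of looping serially.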
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of $C$ for a given batch.
- **Pipeline**: Modular, supports different memory/computation pipelines.
---
## Features
- **Flexible Layouts**: Supports row/column-major and custom strides for $A$, $B$, $C$.
- **Batching**: Efficiently computes multiple GEMMs in parallel.
- **Precision**: Supports fp16, bf16, fp8, bf8.
- **Validation**: CPU/GPU validation and error tolerance options.
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
make tile_example_batched_gemm -j
```
This will result in an executable `build/bin/tile_example_batched_gemm`
### Arguments
```bash
args:
-m m dimension (default:512)
-n n dimension (default:1024)
...
-split_k splitK value (default:1)
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:cktile_batched_gemm.json)
```
---
## Source Structure
- **Kernel**: [`batched_gemm.hpp`](batched_gemm.hpp) (tile-programming kernel template)
- **Executable**: [`batched_gemm.cpp`](batched_gemm.cpp)
- **Build**: `CMakeLists.txt`, `run_batched_gemm_example.inc`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block (uses group/batched GEMM)
- [13_moe_sorting](../13_moe_sorting/README.md): MoE sorting for expert dispatch
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Grouped GEMM
The `Grouped GEMM` operators are versions of GEMM that run multiple GEMM operations.
### Preshuffle and Persistence
The grouped GEMM examples include the following advanced optimization features:
#### Weight Preshuffle
Weight preshuffle is an optimization technique that reorganizes the B matrix (weights) in memory to improve data access patterns and reduce memory bandwidth requirements. This is particularly beneficial for inference workloads where the same weights are reused across multiple batches.
#### Persistence
Persistence mode is a GPU optimization where thread blocks remain active on the GPU.
- **Usage**: `invoke_gemm<ALayout, BLayout, CLayout, true>` enables persistence
#### Multi-D Operations
Multi-D operations extend the standard GEMM operation by supporting additional elementwise operations on the result tensor. This feature is particularly useful for workloads that require post-processing of the GEMM output.
- **Implementation**: Available in `grouped_gemm_multi_d.cpp`
- **Operation**: E = C × D₀ × D₁ (where C = A × B is the standard GEMM result)
- **Configuration**: Uses `GemmConfigV3`, `GemmConfigV4`, `GemmConfigMemory` template configuration with 2 D tensors
- **Data Types**: Supports fp16, fp8
- **Benefits**: Enables complex operations like scaling, activation functions, or other elementwise transformations in a single kernel call
- **Build Target**: `make tile_example_grouped_gemm_multi_d -j`
Multi-D operations support both persistence and non-persistence modes.
Weight preshuffle supports only non-persistence mode.
```
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
make tile_example_grouped_gemm -j
# The preshuffle example
make tile_example_grouped_gemm_preshuffle -j
```
K[i] = 512 + 384 * i
stride_A[i] = K[i]
stride_B[i] = K[i]
stride_C[i] = N[i]
```
## Source Structure
- **Kernel**: [`grouped_gemm.hpp`](grouped_gemm.hpp) (tile-programming kernel template)
- **Executables**: [`grouped_gemm.cpp`](grouped_gemm.cpp)
- **Build**: `CMakeLists.txt`, `run_grouped_gemm_example.inc`
---
## Related CK Tile Examples
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block (uses grouped GEMM)
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# FLATMM Matrix Multiplication with CK Tile
This example demonstrates FLATMM (flattened matrix multiplication) using the CK Tile programming model. FLATMM is a variant of GEMM optimized for certain memory layouts and batch processing patterns. Currently it supports only the basic CK Tile FLATMM features, with placeholders for future FLATMM pipelines and modules; the remaining FLATMM features will gradually migrate from classic CK to CK Tile.
---
## Algorithm and Math
Given:
- $A$: $[\text{batch}, M, K]$
- $B$: $[\text{batch}, K, N]$
- $C$: $[\text{batch}, M, N]$
For each batch $b$:
$$
C^{(b)} = A^{(b)} \times B^{(b)}
$$
- **FLATMM**: An alternative formulation of the preshuffled GEMM found in [03_gemm](../03_gemm/README.md).
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_flatmm_basic -j
```
This will result in an executable `build/bin/tile_example_flatmm_basic`
### Arguments
```
args:
-m m dimension (default:256)
...
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:flatmm_basic.json)
```
---
## Source Structure
- **Kernel**: [`flatmm_basic.hpp`](flatmm_basic.hpp) (tile-programming kernel template)
- **Executable**: [`flatmm_basic.cpp`](flatmm_basic.cpp)
- **Build**: `CMakeLists.txt`, `run_flatmm_example.inc`, `script/`
---
## Related CK Tile Examples
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
- [17_grouped_gemm](../17_grouped_gemm/README.md): Grouped GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Multiple D GEMM with CK Tile
This example demonstrates GEMM with multiple D tensors (multi-output GEMM) using the CK Tile programming model. This is useful for fused operations where the GEMM output is combined with multiple side inputs (e.g., bias, residual, or other elementwise sources).
---
## Algorithm and Math
Given:
- $A$: $[M, K]$
- $B$: $[K, N]$
- $D_0, D_1, ..., D_n$: $[M, N]$ (multiple side inputs)
- $E$: $[M, N]$ (output)
The operation:
$$
E = f(A \times B, D_0, D_1, ..., D_n)
$$
where $f$ is a fused elementwise function (e.g., add, multiply, activation).
- **Tilewise Multi-D GEMM**: Each thread block processes a tile of $E$, loading corresponding tiles from $A$, $B$, and all $D_i$, performing blockwise GEMM and fused elementwise operations.
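The fused operation $E = f(A \times B, D_0, \dots, D_n)$ can be sketched in plain Python; the default `f` below is an elementwise add of all D tensors (bias/residual style), an illustrative choice rather than the example's fixed epilogue:

```python
def gemm_multi_d(a, b, ds, f=lambda c, *d: c + sum(d)):
    """E = f(A @ B, D0, ..., Dn) with an elementwise fusion function f.

    a : [M][K], b : [K][N], ds : list of [M][N] side inputs.
    The default f adds every D tensor to the GEMM result.
    """
    M, K, N = len(a), len(b), len(b[0])
    return [[f(sum(a[i][k] * b[k][j] for k in range(K)),   # GEMM accumulator C[i][j]
               *(d[i][j] for d in ds))                     # side inputs at (i, j)
             for j in range(N)]
            for i in range(M)]
```

In the kernel, `f` runs in the epilogue while the $C$ tile is still in registers, which is what makes the fusion free of extra memory traffic.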
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of $E$.
- **Pipeline**: Modular, supports different memory/computation pipelines and multi-D fusion.
---
## Features
- **Multiple D Inputs**: Supports arbitrary number of side inputs for fusion.
- **Flexible Layouts**: Supports row/column-major and custom strides for all tensors.
- **SplitK**: Supports K-batching for large K dimensions.
- **Validation**: GPU validation and benchmarking options.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_gemm_multi_d_fp16 -j
```
This will result in an executable `build/bin/tile_example_gemm_multi_d_fp16`
### Arguments
```
args:
-m m dimension (default:3840)
...
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:cktile_gemm_multi_d_fp16.json)
```
---
## Source Structure
- **Kernel**: [`gemm_multi_d_fp16.hpp`](gemm_multi_d_fp16.hpp) (tile-programming kernel template)
- **Executable**: [`gemm_multi_d_fp16.cpp`](gemm_multi_d_fp16.cpp)
- **Utils**: [`utils.hpp`](utils.hpp)
- **Build**: `CMakeLists.txt`, `run_gemm_multi_d_fp16_example.inc`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
- [17_grouped_gemm](../17_grouped_gemm/README.md): Grouped GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Batched Transpose with CK Tile
This example demonstrates batched tensor transpose using the CK Tile programming model. It supports common layout conversions such as NCHW <-> NHWC, which are essential for deep learning frameworks and hardware accelerators. Currently it supports batched transpose from NCHW to NHWC and from NHWC to NCHW; starting from NCHW you can therefore reach NHWC directly, or NWCH via two transposes. The current implementation reads a single data point per access; a vectorized transpose is planned.
---
## Algorithm and Math
Given a batch of tensors $X$ of shape $[N, C, H, W]$, the transpose operation rearranges axes to produce $Y$ of shape $[N, H, W, C]$ (NCHW to NHWC) or other permutations.
For each element:
$$
Y_{n, h, w, c} = X_{n, c, h, w}
$$
- **Tilewise Batched Transpose**: Each thread block processes a tile (block) of the input, computes the permuted indices, and writes to the output.
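The index permutation above can be written as a minimal plain-Python reference for NCHW -> NHWC, handy for validating the kernel:

```python
def nchw_to_nhwc(x):
    """Y[n][h][w][c] = X[n][c][h][w]; plain-Python reference for the index math."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[n][c][h][w] for c in range(C)]   # innermost axis becomes C
              for w in range(W)]
             for h in range(H)]
            for n in range(N)]
```

The kernel performs the same permutation tile by tile so that both reads and writes stay coalesced.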
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input tensor for a given batch.
- **Pipeline**: Modular, can be extended for vectorized or fused operations.
---
## Features
- **Flexible Layouts**: Supports NCHW <-> NHWC and other axis permutations.
- **Batching**: Efficiently transposes multiple tensors in parallel.
- **Validation**: CPU validation and benchmarking options.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
# Make the transpose executable
make tile_example_batched_transpose -j
```
This will result in an executable `build/bin/tile_example_batched_transpose`
### Arguments
```bash
args:
-N input batch size (default:2)
-C input channel size. (default:16)
...
-k_name set to 1 will print kernel name (default:0)
-warmup warmup iterations to run this kernel (default:50)
-repeat number of iterations to run this kernel (default:100)
```
---
## Source Structure
- **Kernel**: [`batched_transpose_example.hpp`](batched_transpose_example.hpp) (tile-programming kernel template)
- **Executables**: [`batched_transpose_example.cpp`](batched_transpose_example.cpp), [`batched_transpose_api.cpp`](batched_transpose_api.cpp)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [06_permute](../06_permute/README.md): Generic permutation with tiles
- [03_gemm](../03_gemm/README.md): GEMM with tiles
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Batched Transpose (Block Transpose) with CK Tile
This example demonstrates a high-performance batched block transpose kernel using the CK Tile programming model, with a focus on architectures like gfx950. The kernel is optimized for tiled memory access and is suitable for layout conversions such as NCHW <-> NHWC in deep learning. The transpose load imposes some constraints on the input tile distribution.
---
## Algorithm and Math
Given a batch of tensors $X$ of shape $[N, C, H, W]$, the block transpose operation rearranges axes to produce $Y$ of shape $[N, H, W, C]$ (NCHW to NHWC) or other permutations.
For each element:
$$
Y_{n, h, w, c} = X_{n, c, h, w}
$$
- **Blockwise Transpose**: Each thread block processes a tile (block) of the input, reading and writing in a coalesced, tiled fashion for optimal memory throughput.
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input tensor for a given batch.
- **Policy**: [`transpose_policy.hpp`](transpose_policy.hpp) defines tile/block size and memory access patterns for optimal performance.
---
## Features
- **Optimized for Architecture**: Designed for architectures like gfx950 with constraints on input tile distribution.
- **Flexible Layouts**: Supports NCHW <-> NHWC and other axis permutations.
- **Batching**: Efficiently transposes multiple tensors in parallel.
- **Validation**: CPU validation and benchmarking options.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
# Make the transpose executable
make tile_example_transpose -j
```
This will result in an executable `build/bin/tile_example_transpose`
### Arguments
```bash
args:
-N input batch size (default:2)
-C input channel size. (default:64)
-H input height size. (default:1)
-W input width size. (default:64)
-v whether do CPU validation or not (default: 1)
-layout_in input tensor data layout - NCHW by default
-layout_out output tensor data layout - NHWC by default
-seed seed to be used, -1 means random every time (default:-1)
-k_name set to 1 will print kernel name (default:0)
```
---
## Source Structure
- **Kernel**: [`block_transpose.hpp`](block_transpose.hpp), [`batched_transpose_kernel.hpp`](batched_transpose_kernel.hpp), [`transpose_policy.hpp`](transpose_policy.hpp)
- **Executables**: [`transpose_example.cpp`](transpose_example.cpp), [`transpose_api.cpp`](transpose_api.cpp)
- **Build**: `CMakeLists.txt`
---
## Related CK Tile Examples
- [35_batched_transpose](../35_batched_transpose/README.md): Batched transpose with tiles
- [06_permute](../06_permute/README.md): Generic permutation with tiles
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# CK Tile Example Suite
This directory contains a comprehensive suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example illustrates a key deep learning or HPC operation, implemented using tile-based parallelism, modular pipelines, and data movement policies.
---
## What is CK Tile?
CK Tile is a composable GPU programming API that expresses kernels as a composition of "tiles": rectangular blocks of computation and data movement. Pipelines and policies orchestrate data movement (global <-> LDS <-> registers), computation, and synchronization, enabling high efficiency and flexibility.
---
## Example Index
| Example | Operation | Description |
|---------|-----------|-------------|
| [01_fmha](01_fmha/README.md) | Fused Multi-Head Attention | Tile-based FMHA with masking, quantization, and epilogue fusion |
| [02_layernorm2d](02_layernorm2d/README.md) | LayerNorm2D | Blockwise layer normalization with fusion and quantization |
| [03_gemm](03_gemm/README.md) | GEMM | Matrix multiplication with tilewise parallelism |
| [04_img2col](04_img2col/README.md) | im2col | Image-to-column transformation for GEMM-based convolution |
| [05_reduce](05_reduce/README.md) | Reduction | Tilewise sum, max, mean reductions |
| [06_permute](06_permute/README.md) | Permute | Generic tensor permutation (up to rank-8) |
| [09_topk_softmax](09_topk_softmax/README.md) | TopK-Softmax | Rowwise softmax and top-k selection for MoE gating |
| [10_rmsnorm2d](10_rmsnorm2d/README.md) | RMSNorm2D | Root mean square normalization for LLMs |
| [11_add_rmsnorm2d_rdquant](11_add_rmsnorm2d_rdquant/README.md) | Add + RMSNorm2D + RDQuant | Fused add, RMSNorm, and rowwise dynamic quantization |
| [12_smoothquant](12_smoothquant/README.md) | SmoothQuant | Per-channel scaling and quantization for int8 inference |
| [13_moe_sorting](13_moe_sorting/README.md) | MoE Sorting | Token-to-expert rearrangement for MoE dispatch |
| [14_moe_smoothquant](14_moe_smoothquant/README.md) | MoE-SmoothQuant | Expert-dependent quantization fused with top-k selection |
| [15_fused_moe](15_fused_moe/README.md) | Fused MoE | End-to-end fused MoE block: sorting, group-GEMM, activation, weighting |
| [16_batched_gemm](16_batched_gemm/README.md) | Batched GEMM | Parallel computation of multiple GEMMs |
| [17_grouped_gemm](17_grouped_gemm/README.md) | Grouped GEMM | Multiple independent GEMMs with different shapes |
| [18_flatmm](18_flatmm/README.md) | FLATMM | Flattened matrix multiplication for packed layouts |
| [19_gemm_multi_d](19_gemm_multi_d/README.md) | Multi-D GEMM | GEMM with multiple side inputs (bias, residual, etc.) |
| [35_batched_transpose](35_batched_transpose/README.md) | Batched Transpose | NCHW <-> NHWC and other layout conversions |
| [36_copy](36_copy/README.md) | Copy | Minimal example for tile-based memory movement |
| [37_transpose](37_transpose/README.md) | Block Transpose | High-performance tiled transpose for large tensors |
---
## Technical Highlights
- **Tile Distribution**: See [`include/ck_tile/tile_program/tile_distribution/`](../../include/ck_tile/tile_program/tile_distribution/) for mapping tiles to thread blocks.
- **Block Tile Pipelines**: See [`include/ck_tile/tile_program/block_tile_pipeline/`](../../include/ck_tile/tile_program/block_tile_pipeline/) for memory/computation pipelines.
- **Policies and Utilities**: Many examples use custom policies for tile/block size and memory access.
---
## How to Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j
```
Each example produces its own executable in `build/bin/`.
---
## Learning and Extending
- **Start Simple**: Try [03_gemm](03_gemm/README.md) or [36_copy](36_copy/README.md) to learn tile basics.
- **Explore Fusion**: See [11_add_rmsnorm2d_rdquant](11_add_rmsnorm2d_rdquant/README.md), [15_fused_moe](15_fused_moe/README.md), or [14_moe_smoothquant](14_moe_smoothquant/README.md) for advanced fusion.
- **Experiment**: Modify tile sizes, layouts, or pipelines to explore performance and flexibility.
---
## References
- [CK Tile Programming API Documentation](https://github.com/ROCm/composable_kernel/tree/develop/include/ck_tile)
- [Block Tile Pipeline Source](https://github.com/ROCm/composable_kernel/tree/develop/include/ck_tile/tile_program/block_tile_pipeline)
- [Tile Distribution Source](https://github.com/ROCm/composable_kernel/tree/develop/include/ck_tile/tile_program/tile_distribution)
---
[Back to Composable Kernel Examples](../README.md)