[DOCS] Documentation Addition (Readme updates) (#2495)

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

* Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
This commit is contained in:
Vidyasagar Ananthan
2025-10-16 03:10:57 -07:00
committed by GitHub
parent 013ba3c737
commit 92c67a824f
120 changed files with 8188 additions and 221 deletions


@@ -1,44 +1,35 @@
# Layernorm2D forward
# LayerNorm2D Forward with CK Tile
This folder contains example for Layernorm2D forward using `ck_tile` tile-programming implementation.
This example demonstrates efficient 2D layer normalization using the CK Tile programming model, leveraging tile-based parallelism and advanced fusion for transformer and LLM workloads.
# Implementation and feature support
---
## welford online algorithm
We use the Welford online algorithm to update `mean`/`variance` block by block. For the `N <= 4096` case we can compute `mean`/`var`/`normalization` within one loop; we call this `one-pass`. For large `N` it is hard to keep `mean`/`var` in registers/LDS while computing the `normalization`, so we need to load the input twice: first to compute `mean`/`var` block by block, then again to compute the `normalization`. We call this `two-pass`.
## Algorithm and Math
## mean/variance save
In the training case, the mean/variance need to be stored out (TBD, not supported yet).
LayerNorm computes, for each row $x$:
$$
\mu = \frac{1}{N} \sum_{i=1}^N x_i,\quad \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2
$$
$$
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},\quad y_i = \gamma \hat{x}_i + \beta
$$
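As a sanity check, these formulas can be evaluated with a small dependency-free Python sketch (reference math only, not the CK Tile kernel; the function name is illustrative):

```python
import math

def layernorm_row(x, gamma, beta, eps=1e-5):
    """Reference LayerNorm for a single row, matching the formulas above."""
    n = len(x)
    mu = sum(x) / n                              # mean over the row
    var = sum((v - mu) ** 2 for v in x) / n      # biased variance
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * inv * g + b for v, g, b in zip(x, gamma, beta)]
```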
## prenorm/postnorm
- **Welford's Algorithm**: Used for numerically stable, blockwise mean/variance computation. For $N \leq 4096$, a one-pass algorithm is used; for large $N$, a two-pass approach is adopted.
![](misc/pnorm.png)
---
Since [prenorm/postnorm](https://arxiv.org/pdf/1906.01787) is quite common in LLM blocks, this example supports this feature via kernel fusion. Note that `prenorm`/`postnorm` always needs an elementwise add of a `shortcut` before the actual layernorm computation, optionally storing the result out to global memory. You can use `-fadd=1` to test `pre-add+store`, or `-fadd=2` to test `pre-add` without store-out (not codegen'd by default).
## Features
## smooth-quant/dynamic-quant
We support smooth/dynamic quantization for `int8` output by setting `-fquant=1` and `-prec_o=int8`. In this case the output undergoes rowwise dynamic quantization as shown below. Note that smooth-quant requires a `(1*N)` per-channel input scale (fp32 in our example, though this is customizable), which is element-wise multiplied into each row of the tensor before the rowwise dynamic quant is computed. Setting `-fquant=2` skips the per-channel input-scale stage and performs only the dynamic quant. This case is supported by our kernel but not generated by default (TBD: add a filter in generate.py to support on-demand codegen).
![](misc/dquant.png)
- **Prenorm/Postnorm Fusion**: Fused residual addition before/after normalization for transformer blocks.
- **Smooth/Dynamic Quantization**: Rowwise int8 quantization with per-token scale, supporting smoothquant for LLMs.
- **Flexible Precision**: Supports fp16, bf16, int8 output.
- **Efficient for Large N**: Two-pass pipeline for $N > 4096$.
- **Highly Modular**: Easily extendable for new fusion or quantization strategies.
```
# assume output int8, hidden_states is [m, n] shape and in fp16/bf16
# [m, 1]
per_token_amax, _ = torch.max(
    input=torch.abs(hidden_states),
    dim=-1,
    keepdim=True
)
per_token_scale = per_token_amax.to(dtype=torch.float32) / 127.0
# quant hidden_states
hidden_states = (hidden_states / per_token_scale).to(dtype=torch.int8)
return hidden_states, per_token_scale
# hidden_states is now int8 and will be fed to the next layer as input
# per_token_scale will be used as the dequant factor in a later layer
```
---
## Build & Run
```
# in the root of ck_tile
mkdir build && cd build
@@ -47,7 +38,7 @@ make tile_example_layernorm2d_fwd -j
```
This will result in an executable `build/bin/tile_example_layernorm2d_fwd`
## example
## Example
```
args:
-m m dimension (default:3328)
@@ -69,7 +60,43 @@ args:
-jsonfile json file name to dump results (default:layernorm2d_fwd.json)
```
---
## Technical Details
## Welford online algorithm
We use the Welford online algorithm to update `mean`/`variance` block by block. For the `N <= 4096` case we can compute `mean`/`var`/`normalization` within one loop; we call this `one-pass`. For large `N` it is hard to keep `mean`/`var` in registers/LDS while computing the `normalization`, so we need to load the input twice: first to compute `mean`/`var` block by block, then again to compute the `normalization`. We call this `two-pass`.
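The per-element update and the block-by-block merge can be sketched in plain Python (a single-threaded illustration of the algorithm, not the CK Tile kernel; `m2` is the running sum of squared deviations, so `variance = m2 / count`):

```python
def welford_update(count, mean, m2, x):
    """One Welford step: fold a new element into the running (count, mean, m2)."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def welford_merge(count_a, mean_a, m2_a, count_b, mean_b, m2_b):
    """Parallel merge (Chan et al.): combine two per-block partials into one."""
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return count, mean, m2
```

Merging the partials of two blocks reproduces the statistics of the concatenated data, which is what lets the kernel process the row tile by tile.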
## mean/variance save
In the training case, the mean/variance need to be stored out (TBD, not supported yet).
## prenorm/postnorm
![](misc/pnorm.png)
Since [prenorm/postnorm](https://arxiv.org/pdf/1906.01787) is quite common in LLM blocks, this example supports this feature via kernel fusion. Note that `prenorm`/`postnorm` always needs an elementwise add of a `shortcut` before the actual layernorm computation, optionally storing the result out to global memory. You can use `-fadd=1` to test `pre-add+store`, or `-fadd=2` to test `pre-add` without store-out (not codegen'd by default).
## smooth-quant/dynamic-quant
We support smooth/dynamic quantization for `int8` output by setting `-fquant=1` and `-prec_o=int8`. In this case the output undergoes rowwise dynamic quantization as shown below. Note that smooth-quant requires a `(1*N)` per-channel input scale (fp32 in our example, though this is customizable), which is element-wise multiplied into each row of the tensor before the rowwise dynamic quant is computed. Setting `-fquant=2` skips the per-channel input-scale stage and performs only the dynamic quant. This case is supported by our kernel but not generated by default (TBD: add a filter in generate.py to support on-demand codegen).
![](misc/dquant.png)
```
# assume output int8, hidden_states is [m, n] shape and in fp16/bf16
# [m, 1]
per_token_amax, _ = torch.max(
input=torch.abs(hidden_states),
dim=-1,
keepdim=True
)
per_token_scale = per_token_amax.to(dtype=torch.float32) / 127.0
# quant hidden_states
hidden_states = (hidden_states / per_token_scale).to(dtype=torch.int8)
return hidden_states, per_token_scale
# hidden_states is now int8 and will be fed to the next layer as input
# per_token_scale will be used as the dequant factor in a later layer
```
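The torch snippet above is pseudocode; a dependency-free, runnable sketch of the same rowwise dynamic quantization looks like this (illustrative names, nested lists standing in for tensors):

```python
def rowwise_dynamic_quant(hidden_states):
    """Rowwise int8 dynamic quantization mirroring the torch pseudocode above."""
    out, scales = [], []
    for row in hidden_states:
        amax = max(abs(v) for v in row)          # per-row absolute max
        scale = amax / 127.0 if amax > 0 else 1.0
        # quantize and clamp to the int8 range
        out.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)                     # dequant factor for later layers
    return out, scales
```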
## limitations
Note that `fquant=2`, `fadd=2`, and `prec_sm/prec_sy` other than `fp32` are not generated by default, though our kernel template supports them (TBD: add a flag in generate.py to generate those instances on demand). Besides, the `N>8192` case uses the two-pass pipeline by default, and `-fquant=1/2` is not supported there yet. If you need `N>8192` together with `fused+residual+store`, you can use this example together with `12_smoothquant`, constructing layernorm+residual and smoothquant as two kernels.
@@ -83,5 +110,25 @@ Note that `fquant=2`, `fadd=2`, `prec_sm/prec_sy` other than `fp32` are not by d
# standard fp16 layernorm 2d, m=10, n=1024, fused-smooth-quant+fused-add-store, output in int8
./build/bin/tile_example_layernorm2d_fwd -m=10 -n=1024 -prec_o=int8 -fquant=1 -fadd=1
```
---
## Source Structure
- **Kernel**: `layernorm2d_fwd.hpp` (tile-programming kernel template)
- **Executable**: `layernorm2d_fwd.cpp` (argument parsing, kernel launch)
- **Codegen**: `generate.py` (instantiates kernels for different configs)
- **Misc**: `misc/` (algorithm diagrams, e.g., prenorm/postnorm, quantization)
---
## Related CK Tile Examples
- [01_fmha](../01_fmha/README.md): Fused multi-head attention (FMHA)
- [03_gemm](../03_gemm/README.md): Tile-programming GEMM
- [12_smoothquant](../12_smoothquant/README.md): Standalone smoothquant kernel
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,10 +1,45 @@
# GEMM Matrix Multiplication
# GEMM with CK Tile
This folder contains example for GEMM using ck_tile tile-programming implementation. Currently, it only supports the basic feature of the CK Tile GEMM, but creates the placeholders for the future support on different GEMM pipeline and different GEMM modules. In the near future, we will gradually migrate all the GEMM features from old CK to CK Tile.
This example demonstrates matrix multiplication (GEMM) using the CK Tile programming model, focusing on tile-based parallelism and modular kernel design.
## build
```
# in the root of ck_tile
---
## Algorithm and Math
GEMM computes:
$$
C = A \times B
$$
where $A$ is $[M, K]$, $B$ is $[N, K]$, and $C$ is $[M, N]$.
- **BlockTile GEMM**: Each Block Tile computes a tile of $C$ by loading tiles of $A$ and $B$, performing blockwise matrix multiply-accumulation, and writing results back with the epilogue.
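The block-tile scheme above can be sketched in plain Python (illustrative only; note that $B$ is stored as $[N, K]$ in this README, so each element is $C_{mn} = \sum_k A_{mk} B_{nk}$ — tile sizes here are arbitrary, not the kernel's):

```python
def tiled_gemm(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Blockwise C = A x B with A [M, K] and B [N, K], accumulating tile by tile."""
    M, K = len(A), len(A[0])
    N = len(B)
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):              # one block tile of C per (m0, n0)
        for n0 in range(0, N, tile_n):
            for k0 in range(0, K, tile_k):      # K loop: load A/B tiles, multiply-accumulate
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        for k in range(k0, min(k0 + tile_k, K)):
                            C[m][n] += A[m][k] * B[n][k]
    return C                                     # "epilogue": results written back
```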
---
## Tile Programming Model
- **Configuration**: Specifies how the kernel is initialized: block tile dimensions, warp layout, warp tile dimensions, and other tuning options.
- **Block Tile**: Each block tile is scheduled onto a compute unit of the AMD GPU.
- **Pipeline**: Modular design allows swapping different memory/computation pipelines (e.g., basic, memory-bound, compute-bound).
- **Block GEMM**: Block-level implementation coordinating warp iteration and memory layout within a block tile.
- **Warp GEMM**: Each warp's GEMM calculation.
- **Epilogue**: Transfers the accumulated result from registers to global memory.
---
## Features
- **Flexible Layouts**: Supports row/column-major and custom strides for $A$, $B$, $C$.
- **Split K**: Splits the block tile along the K dimension as well and adds the partial results back after the matrix multiply-accumulation. This gives higher performance when M and N are small and K is large.
- **Preshuffled GEMM**: For inference, preshuffles the B (weight) matrix into the warp layout so the GEMM calculation can bypass shared memory; typically the best-performing GEMM variant.
- **Precision**: Supports fp16, bf16, fp8, bf8, int4 (for B Matrix).
- **Validation**: CPU/GPU validation and error tolerance options.
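The Split K feature can be illustrated with a short sketch: partition K, compute a partial $C$ per split, then reduce the partials (illustrative only; the kernel does this reduction on-device):

```python
def splitk_gemm(A, B, split_k=2):
    """Split-K sketch: partition K, compute partial C per split, then add back."""
    M, K, N = len(A), len(A[0]), len(B)
    chunk = (K + split_k - 1) // split_k
    partials = []
    for s in range(split_k):
        k0, k1 = s * chunk, min((s + 1) * chunk, K)
        # partial GEMM over this K slice (B is [N, K] as in this README)
        P = [[sum(A[m][k] * B[n][k] for k in range(k0, k1)) for n in range(N)]
             for m in range(M)]
        partials.append(P)
    # "add it back": reduce the per-split partial results into C
    return [[sum(P[m][n] for P in partials) for n in range(N)] for m in range(M)]
```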
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
@@ -34,9 +69,30 @@ args:
-warmup number of iterations before benchmark the kernel (default:50)
-repeat number of iterations to benchmark the kernel (default:100)
-timer gpu:gpu timer, cpu:cpu timer (default:gpu)
-split_k splitK value (default:1)
-init 0:random, 1:linear, 2:constant(1) (default:0)
-persistent 0:non-persistent, 1:persistent (default:0)
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:gemm.json)
```
## Source Structure
- **Executables**: `gemm_basic.cpp`, `universal_gemm.cpp` (different kinds of GEMM implementation)
- **Utils**: `gemm_utils.hpp` (helper functions)
- **Build**: `CMakeLists.txt`, `run_gemm_example.inc`
- **Scripts**: `script/` (build and run helpers)
---
## Related CK Tile Examples
- [01_fmha](../01_fmha/README.md): Fused multi-head attention (FMHA)
- [18_flatmm](../18_flatmm/README.md): Preshuffled GEMM alternative solution
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,13 +1,51 @@
# Image to Column
# Image to Column (im2col) with CK Tile
This folder contains example for Image to Column using ck_tile tile-programming implementation.
This example demonstrates the im2col transformation using the CK Tile programming model, a key step for converting convolution into GEMM for efficient GPU execution.
## build
```
# in the root of ck_tile
---
## Algorithm and Math
Given an input image tensor $X$ and convolution kernel size, im2col rearranges sliding windows of $X$ into columns:
- For each patch, flatten and stack as a column in the output matrix.
- Enables convolution as matrix multiplication: $\text{im2col}(X) \times W$.
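The rearrangement can be sketched for a 2D image with stride 1 and no padding (illustrative only; this sketch stacks patches as rows, and whether patches become rows or columns is just a layout convention):

```python
def im2col(X, kh, kw):
    """Flatten each kh x kw sliding window of 2D image X into one row of a matrix."""
    H, W = len(X), len(X[0])
    cols = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            # flatten one patch in row-major order
            patch = [X[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            cols.append(patch)
    return cols  # shape: [(H-kh+1)*(W-kw+1), kh*kw]
```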
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of patches).
- **Pipeline**: Modular, can be extended for fused operations (e.g., quantization, activation).
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
make tile_example_img2col -j
./bin/tile_example_img2col -?
```
This will result in an executable `build/bin/tile_example_img2col`
---
## Source Structure
- **Kernel**: `image_to_column.hpp` (tile-programming kernel template)
- **Executable**: `image_to_column.cpp` (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): GEMM with tiles (im2col output as input)
- [05_reduce](../05_reduce/README.md): Reductions with tiles
- [06_permute](../06_permute/README.md): Permutation with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -0,0 +1,53 @@
# Reduction with CK Tile
This example demonstrates parallel reduction (sum, max, etc.) using the CK Tile programming model, a core operation for normalization, statistics, and aggregation in deep learning.
---
## Algorithm and Math
Given a tensor $X$ and a reduction axis, compute:
- **Sum**: $Y = \sum_i X_i$
- **Max**: $Y = \max_i X_i$
- **Mean**: $Y = \frac{1}{N} \sum_i X_i$
- **Tilewise Reduction**: Each thread block reduces a tile (block) of the input, using shared memory and register accumulation for efficiency.
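The two-level pattern — each tile reduces locally, then the partials are combined — can be sketched as (illustrative only, not the CK Tile kernel):

```python
def tilewise_sum(x, tile=4):
    """Two-level reduction: each tile reduces locally, then partials are combined."""
    partials = [sum(x[i:i + tile]) for i in range(0, len(x), tile)]
    return sum(partials)  # final combine step (on GPU: cross-block reduction)
```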
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block) of the input tensor.
- **Pipeline**: Modular, can be extended for fused reductions or post-processing.
---
## Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_reduce -j
./bin/tile_example_reduce -?
```
---
## Source Structure
- **Kernel**: `reduce.hpp` (tile-programming kernel template)
- **Executable**: `reduce.cpp` (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): GEMM with tiles
- [04_img2col](../04_img2col/README.md): im2col transformation
- [06_permute](../06_permute/README.md): Permutation with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,8 +1,31 @@
# permute
# Permute with CK Tile
This folder contains example for permute kernel, which is similiar to [torch.permute](https://pytorch.org/docs/stable/generated/torch.permute.html) (combined with [torch.contiguous](https://pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html)). Currently we implement a generic permute kernel that support up to rank 8 arbitrary permutation with a single kernel instance. Performance is not the first consideration, we prefer a simple and general kernel implementation using `ck_tile` in this example.
This example demonstrates generic tensor permutation, similar to [torch.permute](https://pytorch.org/docs/stable/generated/torch.permute.html) (combined with [torch.contiguous](https://pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html)). Currently we implement a generic permute kernel that supports arbitrary permutations up to rank 8 with a single kernel instance. Performance is not the first consideration; we prefer a simple and general kernel implementation using `ck_tile` in this example.
---
## Algorithm and Math
Given a tensor $X$ of shape $[d_0, d_1, ..., d_{n-1}]$ and a permutation $\pi$, compute:
$$
Y_{i_0, i_1, ..., i_{n-1}} = X_{i_{\pi(0)}, i_{\pi(1)}, ..., i_{\pi(n-1)}}
$$
- **Tilewise Permute**: Each thread block processes a tile (block) of the input, computes the permuted indices, and writes to the output.
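The index math above, on a flat row-major buffer, can be sketched as follows (illustrative reference, matching torch.permute semantics: output dim $d$ comes from input dim $\pi(d)$):

```python
from itertools import product

def permute(X, shape, perm):
    """Permute a flat row-major buffer X of the given shape by perm."""
    out_shape = [shape[p] for p in perm]

    def flat_index(idx, shp):
        # row-major linearization of a multi-index
        f = 0
        for i, s in zip(idx, shp):
            f = f * s + i
        return f

    Y = [0] * len(X)
    for out_idx in product(*[range(s) for s in out_shape]):
        in_idx = [0] * len(shape)
        for d, p in enumerate(perm):
            in_idx[p] = out_idx[d]       # Y[out_idx] = X[in_idx]
        Y[flat_index(out_idx, out_shape)] = X[flat_index(in_idx, shape)]
    return Y, out_shape
```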
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input tensor.
- **Alternative Implementation**: For rank-7 tensors, a swizzled layout is supported for matrix core-friendly data loading.
---
## Build & Run
### Arguments
```
args:
-v whether to do CPU validation or not (default:1)
@@ -10,18 +33,18 @@ args:
-shape the shape of the input tensor (default:2,3,4)
-perm permute perm (default:2,1,0)
```
## build
```
# in the root of ck_tile
mkdir build && cd build
../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_example_permute -j
```
This will result in an executable `build/bin/tile_example_permute`
## some examples
### Further Examples
```
# torch
x=torch.randn(2,3,4,6)
@@ -31,16 +54,41 @@ y=x.permute(0,3,2,1).contiguous()
./build/bin/tile_example_permute -shape=2,3,4,6 -perm=0,3,2,1
```
or you can try the smoke_test
You can try the smoke_test:
```
# in the root of ck_tile, after you build this example
sh example/ck_tile/06_permute/script/smoke_test.sh
```
### alternative implementation
we have an alternative implementation under `alternative_impl/` folder, that can swizzle the tensor to be more friendly for data loading for matrix core layout. This can be enabled when dealing with a `rank-7` tensor, with a fixed pattern of either `0,1,4,2,5,3,6` or `0,1,2,4,5,3,6`. There are other shape limitation of this implementation, check the source code of `permute.cpp` for detail.
### Alternative Implementation
We have an alternative implementation under the `alternative_impl/` folder that swizzles the tensor to be more friendly for matrix-core data loading. It can be enabled for a `rank-7` tensor with a fixed pattern of either `0,1,4,2,5,3,6` or `0,1,2,4,5,3,6`. There are other shape limitations in this implementation; check the source code of `permute.cpp` for details.
```
# example
./build/bin/tile_example_permute -shape=3,6,4,32,16,2,8 -perm=0,1,4,2,5,3,6 # b_n0_k0_n1_k1_n2_k2
./build/bin/tile_example_permute -shape=3,8,4,16,16,4,8 -perm=0,1,2,4,5,3,6 # b_n0_n1_k0_k1_n2_k2
```
---
## Source Structure
- **Kernel**: `permute.hpp` (tile-programming kernel template)
- **Executable**: `permute.cpp` (argument parsing, kernel launch)
- **Alternative**: `alternative_impl/` (swizzled layout for rank-7 tensors)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): GEMM with tiles
- [05_reduce](../05_reduce/README.md): Reductions with tiles
- [35_batched_transpose](../35_batched_transpose/README.md): Batched transpose with tiles
For distribution, see `include/ck_tile/tile_program/tile_distribution/`.
---
[Back to CK Tile Examples](../README.md)


@@ -1,9 +1,31 @@
# topk-softmax
# TopK-Softmax with CK Tile
This folder contains example for topk-softmax kernel using ck_tile tile-programming implementation. This kernel is often used in Moe model, before launching the fused-moe-gemm block. The input is a `token*expert` 2d matrix. The op will do a softmax per row(`expert`), then find the `topk` value for each row. Output is a `token*topk` weight(usually fp32) and index(int32) 2d tensor.
This example demonstrates a tile-programming implementation of TopK-Softmax, commonly used in Mixture-of-Experts (MoE) models to select the top-k experts per token, typically before launching the fused-moe-gemm block. The input is a `token*expert` 2D matrix. The op performs a softmax per row (over `expert`), then finds the `topk` values for each row. The output is a `token*topk` weight tensor (typically fp32) and a `token*topk` index tensor (int32).
## build
```
---
## Algorithm and Math
Given a matrix $X$ of shape $[\text{tokens}, \text{experts}]$:
1. **Softmax per row**: $S_{i,j} = \frac{\exp(X_{i,j})}{\sum_k \exp(X_{i,k})}$
2. **TopK selection**: For each row $i$, select the $k$ largest $S_{i,j}$ and their indices.
**Output**:
- $[\text{tokens}, k]$ weights (fp32)
- $[\text{tokens}, k]$ indices (int32)
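The two steps above can be sketched as a rowwise reference in plain Python (illustrative only; the max-subtraction for numerical stability is standard practice, not something this README specifies):

```python
import math

def topk_softmax(X, k):
    """Rowwise softmax followed by top-k selection, returning (weights, indices)."""
    weights, indices = [], []
    for row in X:
        m = max(row)                                  # stabilize the exponentials
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        s = [v / total for v in e]                    # softmax of the row
        top = sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:k]
        indices.append(top)
        weights.append([s[j] for j in top])
    return weights, indices
```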
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of rows).
- **Pipeline**: Modular, can be extended for fused operations.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
@@ -11,8 +33,9 @@ make tile_example_topk_softmax -j
```
This will result in an executable `build/bin/tile_example_topk_softmax`
## example
```
### Arguments
```bash
args:
-v whether to do CPU validation or not (default:1)
-pr_i input data type. fp16/fp32 (representing 8/16/32 bit data) (default:fp16)
@@ -28,3 +51,24 @@ args:
-jsonfile json file name to dump results (default:topk_softmax.json)
```
---
## Source Structure
- **Kernel**: [`topk_softmax_api.hpp`](topk_softmax_api.hpp) (tile-programming kernel template)
- **Executable**: [`topk_softmax.cpp`](topk_softmax.cpp) (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block using TopK-Softmax
- [05_reduce](../05_reduce/README.md): Reductions with tiles
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)


@@ -1,18 +1,45 @@
# Rmsnorm2D forward
# RMSNorm2D Forward with CK Tile
This folder contains example for Rmsnorm2D forward using ck_tile tile-programming implementation.
This example demonstrates 2D Root Mean Square Layer Normalization (RMSNorm) using the CK Tile programming model, a normalization technique widely used in LLMs and transformers.
## build
```
---
## Algorithm and Math
For each row $x$:
$$
\text{rms}(x) = \sqrt{\frac{1}{N} \sum_{i=1}^N x_i^2 + \epsilon}
$$
$$
y_i = \frac{x_i}{\text{rms}(x)} \cdot \gamma_i
$$
where $\gamma$ is a learnable scale parameter.
- **Tilewise RMSNorm**: Each thread block processes a tile (row or block), computes the mean square, normalizes, and applies scale.
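The formulas above reduce to a few lines of reference Python (math check only, not the CK Tile kernel; the function name is illustrative):

```python
import math

def rmsnorm_row(x, gamma, eps=1e-6):
    """Reference RMSNorm for one row, matching the formulas above."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)  # root mean square
    return [v / rms * g for v, g in zip(x, gamma)]         # normalize and scale
```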
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input matrix.
- **Pipeline**: Modular, can be extended for fused operations.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_rmsnorm2d_fwd -j`nproc`
```
This will result in an executable `build/bin/tile_rmsnorm2d_fwd`
## cmdline
```
### Arguments
```bash
args:
-m m dimension (default:3328)
-n n dimension (default:4096)
@@ -37,3 +64,24 @@ args:
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:rmsnorm2d_fwd.json)
```
---
## Source Structure
- **Kernel**: [`rmsnorm2d_fwd.hpp`](rmsnorm2d_fwd.hpp) (tile-programming kernel template)
- **Executable**: [`rmsnorm2d_fwd.cpp`](rmsnorm2d_fwd.cpp) (argument parsing, kernel launch)
- **Build**: `CMakeLists.txt`, `generate.py`, `script/`
---
## Related CK Tile Examples
- [02_layernorm2d](../02_layernorm2d/README.md): LayerNorm2D with tiles
- [12_smoothquant](../12_smoothquant/README.md): SmoothQuant with tiles
- [05_reduce](../05_reduce/README.md): Reductions with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)


@@ -1,9 +1,35 @@
# Add + Rmsnorm2D + rowwise dynamic quantization forward
# Add + RMSNorm2D + Rowwise Dynamic Quantization (RDQuant) with CK Tile
This folder contains example for add + Rmsnorm2D + rowwise dynamic quantization forward using ck_tile tile-programming implementation. Rdquant is short for rowwise dynamic quantization here.
This example demonstrates a fused kernel for elementwise addition, 2D RMSNorm, and rowwise dynamic quantization using the CK Tile programming model. This pattern is common in LLMs for efficient normalization and quantized inference.
## build
```
---
## Algorithm and Math
Given input $X$ and residual $R$:
1. **Elementwise Add**: $Z = X + R$
2. **RMSNorm**: $\text{rms}(Z) = \sqrt{\frac{1}{N} \sum_{i=1}^N Z_i^2 + \epsilon}$, $Y_i = \frac{Z_i}{\text{rms}(Z)} \cdot \gamma_i$
3. **Rowwise Dynamic Quantization**:
- For each row, $s = \max(|Y|) / 127$
- $Q_i = \text{round}(Y_i / s)$, $Q_i \in \text{int8}$
**Output**:
- Quantized tensor $Q$ (int8)
- Per-row scale $s$ (fp32)
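The three fused steps can be sketched for a single row in plain Python (illustrative reference only; the CK Tile kernel performs this per tile on-device):

```python
import math

def add_rmsnorm_rdquant(x, r, gamma, eps=1e-6):
    """One row: add residual, RMSNorm, then rowwise int8 dynamic quantization."""
    z = [a + b for a, b in zip(x, r)]                       # elementwise add
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)   # RMS of the row
    y = [v / rms * g for v, g in zip(z, gamma)]             # normalize + scale
    s = max(abs(v) for v in y) / 127.0                      # per-row quant scale
    q = [max(-128, min(127, round(v / s))) for v in y]      # int8 values
    return q, s
```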
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (row or block).
- **Tile Engine**: Loads tiles, performs add, RMSNorm, and quantization.
- **Pipeline**: Modular, can be extended for further fusion.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
@@ -11,8 +37,9 @@ make tile_add_rmsnorm2d_rdquant_fwd -j`nproc`
```
This will result in an executable `build/bin/tile_add_rmsnorm2d_rdquant_fwd`
## cmdline
```
### Arguments
```bash
args:
-m m dimension (default:3328)
-n n dimension (default:4096)
@@ -28,3 +55,24 @@ args:
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:add_rmsnorm2d_rdquant_fwd.json)
```
---
## Source Structure
- **Kernel**: [`add_rmsnorm2d_rdquant_fwd.hpp`](add_rmsnorm2d_rdquant_fwd.hpp) (tile-programming kernel template)
- **Executable**: [`add_rmsnorm2d_rdquant_fwd.cpp`](add_rmsnorm2d_rdquant_fwd.cpp), [`example_add_rmsnorm2d_rdquant_fwd.cpp`](example_add_rmsnorm2d_rdquant_fwd.cpp)
- **Build**: `CMakeLists.txt`, `instances/`, `script/`
---
## Related CK Tile Examples
- [10_rmsnorm2d](../10_rmsnorm2d/README.md): RMSNorm2D with tiles
- [12_smoothquant](../12_smoothquant/README.md): SmoothQuant with tiles
- [02_layernorm2d](../02_layernorm2d/README.md): LayerNorm2D with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)


@@ -1,10 +1,33 @@
# smoothquant
# SmoothQuant with CK Tile
This folder contains example for smoothquant using ck_tile tile-programming implementation.
This example demonstrates SmoothQuant, a quantization technique for transformer models, using the CK Tile programming model. SmoothQuant enables efficient int8 inference by scaling activations and weights to balance quantization error.
## build
```
# in the root of ck_tile
---
## Algorithm and Math
Given input $X$ and per-channel scale $S$:
1. **Scale**: $Y_{i,j} = X_{i,j} \cdot S_j$
2. **Rowwise Dynamic Quantization**:
- For each row, $s = \max(|Y|) / 127$
- $Q_{i,j} = \text{round}(Y_{i,j} / s)$, $Q_{i,j} \in \text{int8}$
**Output**:
- Quantized tensor $Q$ (int8)
- Per-row scale $s$ (fp32)
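The two steps above, for a single row, can be sketched as (illustrative reference only; per-channel scales would come from calibration in a real pipeline):

```python
def smoothquant_row(x, channel_scale):
    """One row of SmoothQuant: per-channel scale, then rowwise int8 dynamic quant."""
    y = [v * s for v, s in zip(x, channel_scale)]   # apply per-channel scale S_j
    amax = max(abs(v) for v in y)
    scale = amax / 127.0 if amax > 0 else 1.0       # per-row dynamic scale
    q = [max(-128, min(127, round(v / scale))) for v in y]
    return q, scale
```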
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (row or block).
- **Pipeline**: Modular, can be extended for further fusion.
---
## Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_smoothquant -j`nproc`


@@ -1,17 +1,44 @@
# moe-sorting
# MoE Sorting with CK Tile
This folder contains example for moe-sorting kernel using ck_tile tile-programming implementation. This kernel is often used in Moe model, before launching the fused-moe-gemm block. The input&weight is a `token*topk` 2d matrix. The op rearange the input weight ids into different experts and feed into fuse moe gemm kernel.
This example demonstrates MoE (Mixture-of-Experts) sorting using the CK Tile programming model. MoE sorting rearranges token-to-expert assignments for efficient dispatch to expert GEMMs, a key step in large language models with MoE layers. This kernel is often used in MoE models before launching the fused-moe-gemm block. The input and weight form a `token*topk` 2D matrix. The op rearranges the input weight ids into per-expert groups and feeds them into the fused moe gemm kernel.
## build
```
---
## Algorithm and Math
Given:
- **Input**: $[\text{tokens}, \text{topk}]$ indices and weights (from TopK-Softmax)
- **Goal**: Rearrange tokens so each expert receives its assigned tokens in contiguous blocks
**Steps:**
1. For each token, for each of its top-k experts, assign the token to the expert's input buffer.
2. Output:
- Expert-wise token lists (indices)
- Corresponding weights
This enables efficient batched GEMM per expert.
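The steps above can be sketched in plain Python (illustrative only; names are hypothetical and on-device details such as block padding are omitted):

```python
def moe_sort(topk_ids, topk_weights, num_experts):
    """Group (token, weight) pairs by expert so each expert gets a contiguous list."""
    token_lists = [[] for _ in range(num_experts)]
    weight_lists = [[] for _ in range(num_experts)]
    for token, (ids, ws) in enumerate(zip(topk_ids, topk_weights)):
        for e, w in zip(ids, ws):           # each token appears under topk experts
            token_lists[e].append(token)
            weight_lists[e].append(w)
    return token_lists, weight_lists        # expert-wise token indices and weights
```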
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of tokens or experts).
- **Pipeline**: Modular, can be extended for further fusion or dispatch.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch> # you can replace this <arch> to gfx90a, gfx942...
make tile_example_moe_sorting -j`nproc`
```
This will result in an executable `build/bin/tile_example_moe_sorting`
## example
This will result in an executable `build/bin/tile_example_moe_sorting`.
### Arguments
```
args:
-v turn CPU validation on (1) or off (0). (default:1)
@@ -39,3 +66,24 @@ args:
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:moe_sorting.json)
```
---
## Source Structure
- **Kernel**: [`moe_sorting_api.hpp`](moe_sorting_api.hpp) (tile-programming kernel template)
- **Executable**: [`moe_sorting.cpp`](moe_sorting.cpp), [`moe_sorting_api.cpp`](moe_sorting_api.cpp)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [09_topk_softmax](../09_topk_softmax/README.md): TopK-Softmax for MoE gating
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# MoE-SmoothQuant with CK Tile
This example demonstrates MoE-SmoothQuant, a fused quantization operation for Mixture-of-Experts (MoE) models, using the CK Tile programming model. Unlike standard SmoothQuant, the input scale is expert-dependent and the operation is fused with top-k expert selection: for each token, the outputs of its top-k experts are quantized with the corresponding expert scales. The scales come from a per-expert `[expert, hidden]` tensor, so the kernel reuses the `topk-id` from the preceding `topk-softmax` to select each token's experts, and expands the output and per-token scale by `topk`.
The diagram below depicts the moe-smoothquant dataflow.
![](misc/moe-sm.png)
---
## Algorithm and Math
Given:
- **Input**: $X$ of shape $[\text{tokens}, \text{topk}, \text{hidden}]$
- **Expert scales**: $S$ of shape $[\text{experts}, \text{hidden}]$
- **TopK indices**: $I$ of shape $[\text{tokens}, \text{topk}]$
**Steps:**
1. For each token $t$ and its $k$ selected experts:
- Select scale $S_{I_{t,k}, :}$ for the $k$-th expert.
- Scale: $Y_{t,k,j} = X_{t,k,j} \cdot S_{I_{t,k}, j}$
2. **Rowwise Dynamic Quantization** (per token-expert pair):
- $s_{t,k} = \max_j |Y_{t,k,j}| / 127$
- $Q_{t,k,j} = \text{round}(Y_{t,k,j} / s_{t,k})$, $Q_{t,k,j} \in \text{int8}$
**Output**:
- Quantized tensor $Q$ (int8)
- Per-token-expert scale $s$ (fp32)
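The scaling and rowwise quantization steps can be sketched for a single (token, expert) row; this is a minimal plain-Python reference, not the CK Tile kernel:

```python
def moe_smoothquant_row(x_row, expert_scale_row):
    """Scale one (token, expert) row by its expert scale, then int8-quantize.

    x_row            : hidden-dim activations for one token/expert pair
    expert_scale_row : per-hidden-unit smoothing scale of the chosen expert
    Returns (int8 values, per-row fp scale).
    """
    y = [x * s for x, s in zip(x_row, expert_scale_row)]        # Y = X * S
    row_scale = max(abs(v) for v in y) / 127.0                  # s = max|Y| / 127
    q = [max(-128, min(127, round(v / row_scale))) for v in y]  # round + clamp to int8
    return q, row_scale
```

In the fused kernel, `expert_scale_row` is selected indirectly through the token's top-k expert index.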
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile (block of tokens, experts, or hidden units).
- **Kernel flow**: Loads input, selects expert scales via top-k indices, applies scaling and quantization, and writes results.
- **Pipeline**: Modular, can be extended for further fusion.
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_moe_smoothquant -j`nproc`
./bin/tile_example_moe_smoothquant -?
```
This will result in an executable `build/bin/tile_example_moe_smoothquant`
---
## Source Structure
- **Kernel**: [`moe_smoothquant.hpp`](moe_smoothquant.hpp) (tile-programming kernel template)
- **Executable**: [`moe_smoothquant.cpp`](moe_smoothquant.cpp)
- **Build**: `CMakeLists.txt`, `instances/`, `misc/`, `script/`
---
## Technical Notes
- **Expert-dependent scaling**: Each token's top-k experts use their own per-hidden-unit scale, requiring indirect indexing and efficient memory access.
- **Fused with top-k**: The kernel uses top-k indices from gating to select the correct expert scale for each token.
- **Rowwise quantization**: Each token-expert pair is quantized independently for maximum accuracy.
---
## Related CK Tile Examples
- [09_topk_softmax](../09_topk_softmax/README.md): TopK-Softmax for MoE gating
- [13_moe_sorting](../13_moe_sorting/README.md): MoE sorting for expert dispatch
- [12_smoothquant](../12_smoothquant/README.md): Standard SmoothQuant
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Fused-MoE with CK Tile
This example implements a highly optimized fused Mixture-of-Experts (MoE) block using the CK Tile programming model. The design fuses MoE sorting, group-GEMM, activation, and top-k weighting into a single kernel, minimizing memory traffic and maximizing throughput for large language models.
---
## Algorithm and Math
### MoE Block Structure
Given:
- **Input**: $X$ of shape $[\text{tokens}, \text{hidden}]$
- **TopK indices/weights**: $I, W$ from gating (shape $[\text{tokens}, \text{topk}]$)
- **Expert weights**: $[\text{experts}, \text{hidden}, \text{hidden}]$
**Steps:**
1. **MoE Sorting**: Rearrange tokens so each expert receives its assigned tokens in contiguous blocks (see [13_moe_sorting](../13_moe_sorting/README.md)).
2. **Group-GEMM**: For each expert, perform GEMM on its assigned tokens:
$$
Y^{(e)} = X^{(e)} W^{(e)}
$$
3. **Activation + TopK Weighting**: Apply activation (e.g., GELU) and multiply by top-k weights.
4. **Scatter/Gather**: Write results back to the original token order.
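The four steps above can be sketched end to end in plain Python; this is an illustrative reference for the dataflow (names are hypothetical, and the GEMM here is a naive triple loop), not the fused CK Tile kernel:

```python
import math

def gelu(v):
    # tanh-approximation GELU, one common choice for the activation step
    return 0.5 * v * (1.0 + math.tanh(0.7978845608 * (v + 0.044715 * v ** 3)))

def fused_moe(x, topk_ids, topk_w, expert_w):
    """Reference for the fused-MoE dataflow: sort -> expert GEMM -> act*weight -> scatter.

    x        : [tokens][hidden] activations
    expert_w : [experts][hidden][hidden] per-expert weight matrices
    Returns y : [tokens][hidden], summed over each token's top-k experts.
    """
    tokens, hidden = len(x), len(x[0])
    y = [[0.0] * hidden for _ in range(tokens)]   # output buffer zeroed up front
    for e, w_e in enumerate(expert_w):            # each expert sees its tokens as one block
        for t in range(tokens):
            for k, eid in enumerate(topk_ids[t]):
                if eid != e:
                    continue
                for j in range(hidden):
                    # GEMM row x[t] @ w_e, then activation and top-k weighting;
                    # the kernel accumulates this with atomics
                    acc = sum(x[t][i] * w_e[i][j] for i in range(hidden))
                    y[t][j] += topk_w[t][k] * gelu(acc)
    return y
```

The real kernel replaces the token loop with sorted contiguous expert blocks (step 1) so each expert runs a dense group-GEMM.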
### Technical Details
- **Scatter/Gather Group-GEMM**: Uses indirect indexing to map tokens to experts and back.
- **Block Partitioning**: Tokens are partitioned into slices per expert, with padding for alignment.
- **Atomic Accumulation**: Second GEMM uses atomics for accumulation to support overlapping tokens.
- **Buffer Zeroing**: Output buffer is zeroed in the sorting step, eliminating extra kernels.
- **Pre-shuffled Weights**: Expert weights are pre-shuffled for coalesced memory access.
- **Micro-kernel Pipeline**: Uses block-inline-asm micro-kernels for peak performance, while retaining composability.
## Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_fused_moe -j
./bin/tile_example_fused_moe -?
```
---
## Source Structure
- **Kernel**: [`fused_moe.hpp`](fused_moe.hpp), [`fused_moegemm.hpp`](fused_moegemm.hpp), [`fused_moesorting.hpp`](fused_moesorting.hpp)
- **Executable**: [`main.cpp`](main.cpp)
- **Build**: `CMakeLists.txt`, `instances/`, `misc/`
---
## Technical Notes
This is a scatter/gather group-GEMM based solution, similar to that of [vllm moe](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), but with more kernel fusion to boost performance.
![](misc/moe-0.png)
The benefit of this fused-moe design is reduced memory traffic and higher throughput from kernel fusion.

### Arguments
```
args:
...
-repeat hot iter (default:20)
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:fused_moe.json)
```
## Related CK Tile Examples
- [13_moe_sorting](../13_moe_sorting/README.md): MoE sorting for expert dispatch
- [09_topk_softmax](../09_topk_softmax/README.md): TopK-Softmax for MoE gating
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Batched GEMM with CK Tile
This example demonstrates batched matrix multiplication (Batched GEMM) using the CK Tile programming model, enabling efficient parallel computation of multiple independent GEMMs in a single kernel launch.
---
## Algorithm and Math
Given:
- $A$: $[\text{batch}, M, K]$
- $B$: $[\text{batch}, K, N]$
- $C$: $[\text{batch}, M, N]$
For each batch $b$:
$$
C^{(b)} = A^{(b)} \times B^{(b)}
$$
- **Tilewise Batched GEMM**: Each thread block processes a tile of $C$ for a specific batch, loading corresponding tiles from $A$ and $B$, performing blockwise matrix multiply-accumulate, and writing results.
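The per-batch computation can be written as a minimal plain-Python reference (naive loops, no tiling), useful for validating the kernel's output:

```python
def batched_gemm(a, b):
    """C[b] = A[b] @ B[b] for each batch; plain-Python reference.

    a : [batch][M][K], b : [batch][K][N] -> returns [batch][M][N]
    """
    out = []
    for A, B in zip(a, b):   # one independent GEMM per batch
        M, K, N = len(A), len(B), len(B[0])
        C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
             for i in range(M)]
        out.append(C)
    return out
```

The CK Tile kernel assigns each thread block one tile of one batch's $C$ instead of looping serially.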
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of $C$ for a given batch.
- **Pipeline**: Modular, supports different memory/computation pipelines.
---
## Features
- **Flexible Layouts**: Supports row/column-major and custom strides for $A$, $B$, $C$.
- **Batching**: Efficiently computes multiple GEMMs in parallel.
- **Precision**: Supports fp16, bf16, fp8, bf8.
- **Validation**: CPU/GPU validation and error tolerance options.
---
## Build & Run
```bash
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
make tile_example_batched_gemm -j
```
This will result in an executable `build/bin/tile_example_batched_gemm`
### Arguments
```bash
args:
-m m dimension (default:512)
-n n dimension (default:1024)
...
-split_k splitK value (default:1)
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:cktile_batched_gemm.json)
```
---
## Source Structure
- **Kernel**: [`batched_gemm.hpp`](batched_gemm.hpp) (tile-programming kernel template)
- **Executable**: [`batched_gemm.cpp`](batched_gemm.cpp)
- **Build**: `CMakeLists.txt`, `run_batched_gemm_example.inc`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block (uses group/batched GEMM)
- [13_moe_sorting](../13_moe_sorting/README.md): MoE sorting for expert dispatch
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Grouped GEMM
The `Grouped GEMM` operators are versions of GEMM that run multiple GEMM operations.
### Preshuffle and Persistence
The grouped GEMM examples include the following advanced optimization features:
#### Weight Preshuffle
Weight preshuffle is an optimization technique that reorganizes the B matrix (weights) in memory to improve data access patterns and reduce memory bandwidth requirements. This is particularly beneficial for inference workloads where the same weights are reused across multiple batches.
#### Persistence
Persistence mode is a GPU optimization where thread blocks remain active on the GPU.
- **Usage**: `invoke_gemm<ALayout, BLayout, CLayout, true>` enables persistence
#### Multi-D Operations
Multi-D operations extend the standard GEMM operation by supporting additional elementwise operations on the result tensor. This feature is particularly useful for workloads that require post-processing of the GEMM output.
- **Implementation**: Available in `grouped_gemm_multi_d.cpp`
- **Operation**: E = C × D₀ × D₁ (where C = A × B is the standard GEMM result)
- **Configuration**: Uses `GemmConfigV3`, `GemmConfigV4`, `GemmConfigMemory` template configuration with 2 D tensors
- **Data Types**: Supports fp16, fp8
- **Benefits**: Enables complex operations like scaling, activation functions, or other elementwise transformations in a single kernel call
- **Build Target**: `make tile_example_grouped_gemm_multi_d -j`
Multi-D operations support both persistence and non-persistence modes.
Weight preshuffle supports only non-persistence mode.
```
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh ../ <arch>
make tile_example_grouped_gemm -j
# The preshuffle example
make tile_example_grouped_gemm_preshuffle -j
```
K[i] = 512 + 384 * i
stride_A[i] = K[i]
stride_B[i] = K[i]
stride_C[i] = N[i]
```
## Source Structure
- **Kernel**: [`grouped_gemm.hpp`](grouped_gemm.hpp) (tile-programming kernel template)
- **Executables**: [`grouped_gemm.cpp`](grouped_gemm.cpp)
- **Build**: `CMakeLists.txt`, `run_grouped_gemm_example.inc`
---
## Related CK Tile Examples
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
- [15_fused_moe](../15_fused_moe/README.md): Fused MoE block (uses grouped GEMM)
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# FLATMM Matrix Multiplication with CK Tile
This example demonstrates FLATMM (flattened matrix multiplication) using the CK Tile programming model. FLATMM is a variant of GEMM optimized for certain memory layouts and batch processing patterns. Currently it supports only the basic CK Tile FLATMM features, with placeholders for future FLATMM pipelines and modules; the remaining FLATMM features will gradually migrate from classic CK to CK Tile.
---
## Algorithm and Math
Given:
- $A$: $[\text{batch}, M, K]$
- $B$: $[\text{batch}, K, N]$
- $C$: $[\text{batch}, M, N]$
For each batch $b$:
$$
C^{(b)} = A^{(b)} \times B^{(b)}
$$
- **FLATMM**: An alternative formulation of the preshuffled GEMM found in [03_gemm](../03_gemm/README.md).
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_flatmm_basic -j
```
This will result in an executable `build/bin/tile_example_flatmm_basic`
### Arguments
```
args:
-m m dimension (default:256)
...
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:flatmm_basic.json)
```
---
## Source Structure
- **Kernel**: [`flatmm_basic.hpp`](flatmm_basic.hpp) (tile-programming kernel template)
- **Executable**: [`flatmm_basic.cpp`](flatmm_basic.cpp)
- **Build**: `CMakeLists.txt`, `run_flatmm_example.inc`, `script/`
---
## Related CK Tile Examples
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
- [17_grouped_gemm](../17_grouped_gemm/README.md): Grouped GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Multiple D GEMM with CK Tile
This example demonstrates GEMM with multiple D tensors (multi-output GEMM) using the CK Tile programming model. This is useful for fused operations where the GEMM output is combined with multiple side inputs (e.g., bias, residual, or other elementwise sources).
---
## Algorithm and Math
Given:
- $A$: $[M, K]$
- $B$: $[K, N]$
- $D_0, D_1, ..., D_n$: $[M, N]$ (multiple side inputs)
- $E$: $[M, N]$ (output)
The operation:
$$
E = f(A \times B, D_0, D_1, ..., D_n)
$$
where $f$ is a fused elementwise function (e.g., add, multiply, activation).
- **Tilewise Multi-D GEMM**: Each thread block processes a tile of $E$, loading corresponding tiles from $A$, $B$, and all $D_i$, performing blockwise GEMM and fused elementwise operations.
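The fused operation $E = f(A \times B, D_0, \dots, D_n)$ can be sketched in plain Python; the default `f` below is an elementwise add of all D tensors (bias/residual style), an illustrative choice rather than the example's fixed epilogue:

```python
def gemm_multi_d(a, b, ds, f=lambda c, *d: c + sum(d)):
    """E = f(A @ B, D0, ..., Dn) with an elementwise fusion function f.

    a : [M][K], b : [K][N], ds : list of [M][N] side inputs.
    The default f adds every D tensor to the GEMM result.
    """
    M, K, N = len(a), len(b), len(b[0])
    return [[f(sum(a[i][k] * b[k][j] for k in range(K)),   # GEMM accumulator C[i][j]
               *(d[i][j] for d in ds))                     # side inputs at (i, j)
             for j in range(N)]
            for i in range(M)]
```

In the kernel, `f` runs in the epilogue while the $C$ tile is still in registers, which is what makes the fusion free of extra memory traffic.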
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of $E$.
- **Pipeline**: Modular, supports different memory/computation pipelines and multi-D fusion.
---
## Features
- **Multiple D Inputs**: Supports arbitrary number of side inputs for fusion.
- **Flexible Layouts**: Supports row/column-major and custom strides for all tensors.
- **SplitK**: Supports K-batching for large K dimensions.
- **Validation**: GPU validation and benchmarking options.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_gemm_multi_d_fp16 -j
```
This will result in an executable `build/bin/tile_example_gemm_multi_d_fp16`
### Arguments
```
args:
-m m dimension (default:3840)
...
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:cktile_gemm_multi_d_fp16.json)
```
---
## Source Structure
- **Kernel**: [`gemm_multi_d_fp16.hpp`](gemm_multi_d_fp16.hpp) (tile-programming kernel template)
- **Executable**: [`gemm_multi_d_fp16.cpp`](gemm_multi_d_fp16.cpp)
- **Utils**: [`utils.hpp`](utils.hpp)
- **Build**: `CMakeLists.txt`, `run_gemm_multi_d_fp16_example.inc`
---
## Related CK Tile Examples
- [03_gemm](../03_gemm/README.md): Single GEMM with tiles
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
- [17_grouped_gemm](../17_grouped_gemm/README.md): Grouped GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Batched Transpose with CK Tile
This example demonstrates batched tensor transpose using the CK Tile programming model. It supports common layout conversions such as NCHW <-> NHWC, which are essential for deep learning frameworks and hardware accelerators. Currently it supports batched transpose from NCHW to NHWC and from NHWC to NCHW; starting from NCHW you can therefore reach NHWC directly, or NWCH via two transposes. The current implementation reads a single data point per access; a vectorized transpose is planned.
---
## Algorithm and Math
Given a batch of tensors $X$ of shape $[N, C, H, W]$, the transpose operation rearranges axes to produce $Y$ of shape $[N, H, W, C]$ (NCHW to NHWC) or other permutations.
For each element:
$$
Y_{n, h, w, c} = X_{n, c, h, w}
$$
- **Tilewise Batched Transpose**: Each thread block processes a tile (block) of the input, computes the permuted indices, and writes to the output.
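The index permutation above can be written as a minimal plain-Python reference for NCHW -> NHWC, handy for validating the kernel:

```python
def nchw_to_nhwc(x):
    """Y[n][h][w][c] = X[n][c][h][w]; plain-Python reference for the index math."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[n][c][h][w] for c in range(C)]   # innermost axis becomes C
              for w in range(W)]
             for h in range(H)]
            for n in range(N)]
```

The kernel performs the same permutation tile by tile so that both reads and writes stay coalesced.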
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input tensor for a given batch.
- **Pipeline**: Modular, can be extended for vectorized or fused operations.
---
## Features
- **Flexible Layouts**: Supports NCHW <-> NHWC and other axis permutations.
- **Batching**: Efficiently transposes multiple tensors in parallel.
- **Validation**: CPU validation and benchmarking options.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
# Make the transpose executable
make tile_example_batched_transpose -j
```
This will result in an executable `build/bin/tile_example_batched_transpose`
### Arguments
```bash
args:
-N input batch size (default:2)
-C input channel size. (default:16)
...
-k_name set to 1 will print kernel name (default:0)
-warmup warmup iterations to run this kernel (default:50)
-repeat number of iterations to run this kernel (default:100)
```
---
## Source Structure
- **Kernel**: [`batched_transpose_example.hpp`](batched_transpose_example.hpp) (tile-programming kernel template)
- **Executables**: [`batched_transpose_example.cpp`](batched_transpose_example.cpp), [`batched_transpose_api.cpp`](batched_transpose_api.cpp)
- **Build**: `CMakeLists.txt`, `script/`
---
## Related CK Tile Examples
- [06_permute](../06_permute/README.md): Generic permutation with tiles
- [03_gemm](../03_gemm/README.md): GEMM with tiles
- [16_batched_gemm](../16_batched_gemm/README.md): Batched GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# Batched Transpose (Block Transpose) with CK Tile
This example demonstrates a high-performance batched block transpose kernel using the CK Tile programming model, with a focus on architectures like gfx950. The kernel is optimized for tiled memory access and is suitable for layout conversions such as NCHW <-> NHWC in deep learning. The transpose load imposes some constraints on the input tile distribution.
---
## Algorithm and Math
Given a batch of tensors $X$ of shape $[N, C, H, W]$, the block transpose operation rearranges axes to produce $Y$ of shape $[N, H, W, C]$ (NCHW to NHWC) or other permutations.
For each element:
$$
Y_{n, h, w, c} = X_{n, c, h, w}
$$
- **Blockwise Transpose**: Each thread block processes a tile (block) of the input, reading and writing in a coalesced, tiled fashion for optimal memory throughput.
---
## Tile Programming Model
- **Tiles**: Each thread block processes a tile of the input tensor for a given batch.
- **Policy**: [`transpose_policy.hpp`](transpose_policy.hpp) defines tile/block size and memory access patterns for optimal performance.
---
## Features
- **Optimized for Architecture**: Designed for architectures like gfx950 with constraints on input tile distribution.
- **Flexible Layouts**: Supports NCHW <-> NHWC and other axis permutations.
- **Batching**: Efficiently transposes multiple tensors in parallel.
- **Validation**: CPU validation and benchmarking options.
---
## Build & Run
```bash
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
# Make the transpose executable
make tile_example_transpose -j
```
This will result in an executable `build/bin/tile_example_transpose`
### Arguments
```bash
args:
-N input batch size (default:2)
-C input channel size. (default:64)
-H input height size. (default:1)
-W input width size. (default:64)
-v whether do CPU validation or not (default: 1)
-layout_in input tensor data layout - NCHW by default
-layout_out output tensor data layout - NHWC by default
-seed seed to be used, -1 means random every time (default:-1)
-k_name set to 1 will print kernel name (default:0)
```
---
## Source Structure
- **Kernel**: [`block_transpose.hpp`](block_transpose.hpp), [`batched_transpose_kernel.hpp`](batched_transpose_kernel.hpp), [`transpose_policy.hpp`](transpose_policy.hpp)
- **Executables**: [`transpose_example.cpp`](transpose_example.cpp), [`transpose_api.cpp`](transpose_api.cpp)
- **Build**: `CMakeLists.txt`
---
## Related CK Tile Examples
- [35_batched_transpose](../35_batched_transpose/README.md): Batched transpose with tiles
- [06_permute](../06_permute/README.md): Generic permutation with tiles
- [03_gemm](../03_gemm/README.md): GEMM with tiles
For distribution, see [`include/ck_tile/tile_program/tile_distribution/`](../../../include/ck_tile/tile_program/tile_distribution/).
---
[Back to CK Tile Examples](../README.md)

# CK Tile Example Suite
This directory contains a comprehensive suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example illustrates a key deep learning or HPC operation, implemented using tile-based parallelism, modular pipelines, and data movement policies.
---
## What is CK Tile?
CK Tile is a composable GPU programming API that expresses kernels as a composition of "tiles": rectangular blocks of computation and data movement. Pipelines and policies orchestrate data movement (global <-> LDS <-> registers), computation, and synchronization, enabling high efficiency and flexibility.
---
## Example Index
| Example | Operation | Description |
|---------|-----------|-------------|
| [01_fmha](01_fmha/README.md) | Fused Multi-Head Attention | Tile-based FMHA with masking, quantization, and epilogue fusion |
| [02_layernorm2d](02_layernorm2d/README.md) | LayerNorm2D | Blockwise layer normalization with fusion and quantization |
| [03_gemm](03_gemm/README.md) | GEMM | Matrix multiplication with tilewise parallelism |
| [04_img2col](04_img2col/README.md) | im2col | Image-to-column transformation for GEMM-based convolution |
| [05_reduce](05_reduce/README.md) | Reduction | Tilewise sum, max, mean reductions |
| [06_permute](06_permute/README.md) | Permute | Generic tensor permutation (up to rank-8) |
| [09_topk_softmax](09_topk_softmax/README.md) | TopK-Softmax | Rowwise softmax and top-k selection for MoE gating |
| [10_rmsnorm2d](10_rmsnorm2d/README.md) | RMSNorm2D | Root mean square normalization for LLMs |
| [11_add_rmsnorm2d_rdquant](11_add_rmsnorm2d_rdquant/README.md) | Add + RMSNorm2D + RDQuant | Fused add, RMSNorm, and rowwise dynamic quantization |
| [12_smoothquant](12_smoothquant/README.md) | SmoothQuant | Per-channel scaling and quantization for int8 inference |
| [13_moe_sorting](13_moe_sorting/README.md) | MoE Sorting | Token-to-expert rearrangement for MoE dispatch |
| [14_moe_smoothquant](14_moe_smoothquant/README.md) | MoE-SmoothQuant | Expert-dependent quantization fused with top-k selection |
| [15_fused_moe](15_fused_moe/README.md) | Fused MoE | End-to-end fused MoE block: sorting, group-GEMM, activation, weighting |
| [16_batched_gemm](16_batched_gemm/README.md) | Batched GEMM | Parallel computation of multiple GEMMs |
| [17_grouped_gemm](17_grouped_gemm/README.md) | Grouped GEMM | Multiple independent GEMMs with different shapes |
| [18_flatmm](18_flatmm/README.md) | FLATMM | Flattened matrix multiplication for packed layouts |
| [19_gemm_multi_d](19_gemm_multi_d/README.md) | Multi-D GEMM | GEMM with multiple side inputs (bias, residual, etc.) |
| [35_batched_transpose](35_batched_transpose/README.md) | Batched Transpose | NCHW <-> NHWC and other layout conversions |
| [36_copy](36_copy/README.md) | Copy | Minimal example for tile-based memory movement |
| [37_transpose](37_transpose/README.md) | Block Transpose | High-performance tiled transpose for large tensors |
---
## Technical Highlights
- **Tile Distribution**: See [`include/ck_tile/tile_program/tile_distribution/`](../../include/ck_tile/tile_program/tile_distribution/) for mapping tiles to thread blocks.
- **Block Tile Pipelines**: See [`include/ck_tile/tile_program/block_tile_pipeline/`](../../include/ck_tile/tile_program/block_tile_pipeline/) for memory/computation pipelines.
- **Policies and Utilities**: Many examples use custom policies for tile/block size and memory access.
---
## How to Build & Run
```bash
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j
```
Each example produces its own executable in `build/bin/`.
---
## Learning and Extending
- **Start Simple**: Try [03_gemm](03_gemm/README.md) or [36_copy](36_copy/README.md) to learn tile basics.
- **Explore Fusion**: See [11_add_rmsnorm2d_rdquant](11_add_rmsnorm2d_rdquant/README.md), [15_fused_moe](15_fused_moe/README.md), or [14_moe_smoothquant](14_moe_smoothquant/README.md) for advanced fusion.
- **Experiment**: Modify tile sizes, layouts, or pipelines to explore performance and flexibility.
---
## References
- [CK Tile Programming API Documentation](https://github.com/ROCm/composable_kernel/tree/develop/include/ck_tile)
- [Block Tile Pipeline Source](https://github.com/ROCm/composable_kernel/tree/develop/include/ck_tile/tile_program/block_tile_pipeline)
- [Tile Distribution Source](https://github.com/ROCm/composable_kernel/tree/develop/include/ck_tile/tile_program/tile_distribution)
---
[Back to Composable Kernel Examples](../README.md)