Files
composable_kernel/example/ck_tile
Kiefer van Teutem b5f8bef97f [rocm-libraries] ROCm/rocm-libraries#6088 (commit 6ac353c)
[CK Tile][MFMA/WMMA unification] Add support for packed datatypes (tiny types) (#6088)

## Motivation
This MR makes all the changes required for the unified architecture to
be able to deal with packed datatypes i.e. int4, fp4, fp6, and bf6. The
crux is that layout parameters should be interpreted as describing the
pure mathematical matrix fragments, while the ext_vectors and tile
distribution encodings describe everything in terms of packed datatype
units. This matches how packed types are dealt with in ck_tile and
should play nicely with the load and store tile ops once we integrate
the unified framework into CK tile.

The bf6 datatype was added to CK tile in the form of pk_bf6x16_t and
pk_bf6x32_t, which did not exist before.

The ext_vector implementations of pk_fp6x16_t and pk_bf6x16_t (vec size
1 and 2) were extended to make the subscripting operator work as
expected.

The layout test was adapted to be compatible with all packed datatypes,
and all new intrinsics were added to the test.

This MR adds ALL intrinsics across ALL architectures which use packed
datatypes, as well as ALL scale intrinsics:

mfma_scale_f32_16x16x128_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
mfma_scale_f32_32x32x64_f8f6f4 gfx950 (F8xF8, BF8xBF8, F4xF4, F6xF6,
BF6xBF6)
wmma_i32_16x16x16_iu4_w32
wmma_i32_16x16x16_iu4_w32_gfx12
wmma_i32_16x16x32_iu4_w32_gfx12

## Testing
All intrinsics were tested on all architectures.
2026-05-20 12:36:13 +00:00
..

CK Tile Example Suite

This directory contains a comprehensive suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example illustrates a key deep learning or HPC operation, implemented using tile-based parallelism, modular pipelines, and data movement policy.


What is CK Tile?

CK Tile is a composable GPU programming API that expresses kernels as a composition of "tiles"—rectangular blocks of computation and data movement. The pipeline & policy orchestrates data movement (global <-> LDS <-> registers), computation, and synchronization, enabling high efficiency and flexibility.


Example Index

Example Operation Description
01_fmha Fused Multi-Head Attention Tile-based FMHA with masking, quantization, and epilogue fusion
02_layernorm2d LayerNorm2D Blockwise layer normalization with fusion and quantization
03_gemm GEMM Matrix multiplication with tilewise parallelism
04_img2col im2col Image-to-column transformation for GEMM-based convolution
05_reduce Reduction Tilewise sum, max, mean reductions
06_permute Permute Generic tensor permutation (up to rank-8)
09_topk_softmax TopK-Softmax Rowwise softmax and top-k selection for MoE gating
10_rmsnorm2d RMSNorm2D Root mean square normalization for LLMs
11_add_rmsnorm2d_rdquant Add + RMSNorm2D + RDQuant Fused add, RMSNorm, and rowwise dynamic quantization
12_smoothquant SmoothQuant Per-channel scaling and quantization for int8 inference
13_moe_sorting MoE Sorting Token-to-expert rearrangement for MoE dispatch
14_moe_smoothquant MoE-SmoothQuant Expert-dependent quantization fused with top-k selection
15_fused_moe Fused MoE End-to-end fused MoE block: sorting, group-GEMM, activation, weighting
16_batched_gemm Batched GEMM Parallel computation of multiple GEMMs
17_grouped_gemm Grouped GEMM Multiple independent GEMMs with different shapes
18_flatmm FLATMM Flattened matrix multiplication for packed layouts
19_gemm_multi_d Multi-D GEMM GEMM with multiple side inputs (bias, residual, etc.)
35_batched_transpose Batched Transpose NCHW <-> NHWC and other layout conversions
36_copy Copy Minimal example for tile-based memory movement
37_transpose Block Transpose High-performance tiled transpose for large tensors

Technical Highlights


How to Build & Run

mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j

Each example produces its own executable in build/bin/.


Learning and Extending


References


Back to Composable Kernel Examples