CK Tile Example Suite

This directory contains a comprehensive suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example illustrates a key deep learning or HPC operation, implemented using tile-based parallelism, modular pipelines, and data-movement policies.


What is CK Tile?

CK Tile is a composable GPU programming API that expresses kernels as a composition of "tiles": rectangular blocks of computation and data movement. Pipelines and policies orchestrate data movement (global <-> LDS <-> registers), computation, and synchronization, enabling both high efficiency and flexibility.
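
The layering can be pictured with a small standalone C++ sketch. This is purely conceptual: none of the types below (Tile, DirectCopyPolicy, ScalePipeline) exist in ck_tile; they only mirror the tile / policy / pipeline roles described above.

// Conceptual sketch only: plain C++ that mimics the tile/pipeline/policy layering.
#include <array>
#include <cstdio>

// A "tile": a fixed-size rectangular block of data.
template <typename T, int Rows, int Cols>
struct Tile {
    std::array<T, Rows * Cols> data{};
};

// A "policy" describes how data moves between memory spaces
// (trivially here; in a real kernel this is global <-> LDS <-> register traffic).
struct DirectCopyPolicy {
    template <typename TileType>
    static void move(const TileType& src, TileType& dst) {
        dst.data = src.data;
    }
};

// A "pipeline" composes the data-movement stages with the per-tile computation.
template <typename Policy>
struct ScalePipeline {
    template <typename T, int R, int C>
    static Tile<T, R, C> run(const Tile<T, R, C>& in, T scale) {
        Tile<T, R, C> staged;        // stand-in for an LDS staging buffer
        Policy::move(in, staged);    // data-movement stage
        for (auto& v : staged.data)  // compute stage
            v *= scale;
        return staged;               // epilogue / store stage
    }
};

int main() {
    Tile<float, 4, 4> t;
    t.data.fill(1.0f);
    auto out = ScalePipeline<DirectCopyPolicy>::run(t, 2.0f);
    std::printf("out[0] = %g\n", out.data[0]); // prints 2
    return 0;
}

In CK Tile itself the staging buffers live in LDS and registers and the policy/pipeline layers additionally handle synchronization, as described above, but the division of responsibilities is the same.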


Example Index

Example                  | Operation                  | Description
-------------------------|----------------------------|-----------------------------------------------------------------------
01_fmha                  | Fused Multi-Head Attention | Tile-based FMHA with masking, quantization, and epilogue fusion
02_layernorm2d           | LayerNorm2D                | Blockwise layer normalization with fusion and quantization
03_gemm                  | GEMM                       | Matrix multiplication with tilewise parallelism
04_img2col               | im2col                     | Image-to-column transformation for GEMM-based convolution
05_reduce                | Reduction                  | Tilewise sum, max, mean reductions
06_permute               | Permute                    | Generic tensor permutation (up to rank-8)
09_topk_softmax          | TopK-Softmax               | Rowwise softmax and top-k selection for MoE gating
10_rmsnorm2d             | RMSNorm2D                  | Root mean square normalization for LLMs
11_add_rmsnorm2d_rdquant | Add + RMSNorm2D + RDQuant  | Fused add, RMSNorm, and rowwise dynamic quantization
12_smoothquant           | SmoothQuant                | Per-channel scaling and quantization for int8 inference
13_moe_sorting           | MoE Sorting                | Token-to-expert rearrangement for MoE dispatch
14_moe_smoothquant       | MoE-SmoothQuant            | Expert-dependent quantization fused with top-k selection
15_fused_moe             | Fused MoE                  | End-to-end fused MoE block: sorting, group-GEMM, activation, weighting
16_batched_gemm          | Batched GEMM               | Parallel computation of multiple GEMMs
17_grouped_gemm          | Grouped GEMM               | Multiple independent GEMMs with different shapes
18_flatmm                | FLATMM                     | Flattened matrix multiplication for packed layouts
19_gemm_multi_d          | Multi-D GEMM               | GEMM with multiple side inputs (bias, residual, etc.)
35_batched_transpose     | Batched Transpose          | NCHW <-> NHWC and other layout conversions
36_copy                  | Copy                       | Minimal example for tile-based memory movement
37_transpose             | Block Transpose            | High-performance tiled transpose for large tensors

Technical Highlights


How to Build & Run

mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j

Each example produces its own executable in build/bin/.
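
To run one of the examples after the build completes, invoke its executable from build/bin/ (the <example> placeholder below stands for whichever binary you want; list the directory to see the names your build actually produced, and consult the example's own directory for its arguments):

ls build/bin/
./build/bin/<example>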


Learning and Extending


References


Back to Composable Kernel Examples