Files
composable_kernel/example/ck_tile
juuso-oskari 06e1a70e7a CK-UA: constexpr page_size (Tier 3) — prefill_d64 fp8 -15.8%, prefill_d128 fp8 -6.3%
Promote the runtime `page_size` argument to a non-type template parameter
`kPageSize_` on UnifiedAttentionPipeline. Thread it through
unified_attention_kernel_traits and dispatch_variant<V> so the host-side
dispatcher routes on args.page_blk_size ∈ {16, 32, 64} to a constexpr-
pinned prefill instance; values outside that menu (or any decode variant)
fall back to the existing kPageSize_=0 runtime-page-size instance.

Two wins fold together on the prefill tiers:

1. Strength-reduction. Every `/ page_size`, `* page_size`, and `% page_size`
   in the per-tile address chain collapses to a literal-folded shift /
   multiply-by-magic (`/ 32` → shr 5, etc).

2. Wider Tier-0/Tier-2 gate. The scalar-promote + LDS-cache fast path now
   uses the *real* precondition `KY0_step_N <= kPageSize` at compile time
   instead of the conservative `KY0_step_N <= 16` hedge — so prefill_d128
   bf16/fp16 (KY0_step_N=32), prefill_d64 fp8 (KY0_step_N=32), and
   prefill_d64 bf16/fp16 (KY0_step_N=64) also enter the fast path at
   their natural page sizes.

Measured impact (sq=sk=75600, MI355, n=30 iters, GQA-8):

  variant            KY0_step_N  ps   before   after    Δ
  prefill_d128 fp8   16          32   119.0    111.5    -6.3 %
  prefill_d128 bf16  32          32   132.7    130.3    -1.8 %
  prefill_d64  fp8   32          32    80.9     68.1   -15.8 %
  prefill_d64  bf16  64          64    74.4     73.4    -1.3 %

Decode variants stay on the kPageSize_=0 instances (Tier-0 gate gates them
out anyway — <8 warps — and the binary-size cost isn't justified). All
sweep_fp8.sh shapes + 21 multi-seed multi-sk-length prefill shapes
correctness-PASS. Pre-existing Tier-2 LDS-cache limit (4096 entries)
documented in the pipeline header — same constraint applies to the
kPageSize_=0 fallback so this is not a regression.

36 new prefill instance files: prefill_d{64,128} × {fp16, bf16, fp8} ×
{mask, nmask} × {ps16, ps32, ps64}.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 12:46:39 +00:00
..

CK Tile Example Suite

This directory contains a comprehensive suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example illustrates a key deep learning or HPC operation, implemented using tile-based parallelism, modular pipelines, and data movement policy.


What is CK Tile?

CK Tile is a composable GPU programming API that expresses kernels as a composition of "tiles"—rectangular blocks of computation and data movement. The pipeline & policy orchestrates data movement (global <-> LDS <-> registers), computation, and synchronization, enabling high efficiency and flexibility.


Example Index

Example Operation Description
01_fmha Fused Multi-Head Attention Tile-based FMHA with masking, quantization, and epilogue fusion
02_layernorm2d LayerNorm2D Blockwise layer normalization with fusion and quantization
03_gemm GEMM Matrix multiplication with tilewise parallelism
04_img2col im2col Image-to-column transformation for GEMM-based convolution
05_reduce Reduction Tilewise sum, max, mean reductions
06_permute Permute Generic tensor permutation (up to rank-8)
09_topk_softmax TopK-Softmax Rowwise softmax and top-k selection for MoE gating
10_rmsnorm2d RMSNorm2D Root mean square normalization for LLMs
11_add_rmsnorm2d_rdquant Add + RMSNorm2D + RDQuant Fused add, RMSNorm, and rowwise dynamic quantization
12_smoothquant SmoothQuant Per-channel scaling and quantization for int8 inference
13_moe_sorting MoE Sorting Token-to-expert rearrangement for MoE dispatch
14_moe_smoothquant MoE-SmoothQuant Expert-dependent quantization fused with top-k selection
15_fused_moe Fused MoE End-to-end fused MoE block: sorting, group-GEMM, activation, weighting
16_batched_gemm Batched GEMM Parallel computation of multiple GEMMs
17_grouped_gemm Grouped GEMM Multiple independent GEMMs with different shapes
18_flatmm FLATMM Flattened matrix multiplication for packed layouts
19_gemm_multi_d Multi-D GEMM GEMM with multiple side inputs (bias, residual, etc.)
35_batched_transpose Batched Transpose NCHW <-> NHWC and other layout conversions
36_copy Copy Minimal example for tile-based memory movement
37_transpose Block Transpose High-performance tiled transpose for large tensors

Technical Highlights


How to Build & Run

mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j

Each example produces its own executable in build/bin/.


Learning and Extending


References


Back to Composable Kernel Examples