mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Files

Kiefer van Teutem 2089713f94 [rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5)

[CK TILE] Unification Work – Remove unification Flag structs in favor of new WarpGemmParams (#8227)

## Motivation

Recently, the way flags are sent down to the intrinsics was changed in
CK Tile. At the point where the WarpGemm is invoked, an arbitrary number
of template parameters can be passed, and these are passed down all the
way to the lowest level intrinsics wrappers. Here
`WarpGemmParamsParser<>` is used to extract flags for the intrinsics.

In this MR we adapt the unification framework (the amdgcn_mma struct and
MmaPipelines) to work in the same way. With that change, our custom
intrinsic Flag structs no longer serve a purpose, so they are removed.
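
As a rough illustration of the mechanism described above, the following
C++17 sketch forwards a parameter pack unchanged through a middle layer and
only inspects it at the bottom. All names in it (`TransposeC`, `SwizzleA`,
`ParamsParser`, `intrinsic_wrapper`, `warp_gemm_like`) are hypothetical and
are not CK Tile identifiers; the real `WarpGemmParamsParser<>` may be
structured quite differently.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical flag tags; illustrative only, not CK Tile identifiers.
struct TransposeC {};
struct SwizzleA {};

// Hypothetical parser: reports whether Flag appears in the pack Params...
template <typename... Params>
struct ParamsParser
{
    template <typename Flag>
    static constexpr bool has()
    {
        // Fold over the pack; an empty pack yields false.
        return (std::is_same_v<Flag, Params> || ...);
    }
};

// Lowest-level wrapper: receives whatever pack the caller supplied and
// queries only the flag it understands, ignoring the rest.
template <typename... Params>
constexpr int intrinsic_wrapper(int a, int b)
{
    if constexpr(ParamsParser<Params...>::template has<TransposeC>())
        return b * 10 + a; // stand-in for a "transposed" result
    else
        return a * 10 + b; // stand-in for the default result
}

// Intermediate layer: forwards the pack untouched.
template <typename... Params>
constexpr int warp_gemm_like(int a, int b)
{
    return intrinsic_wrapper<Params...>(a, b);
}

// No flags: default path. With TransposeC anywhere in the pack: swapped path.
static_assert(warp_gemm_like<>(1, 2) == 12);
static_assert(warp_gemm_like<SwizzleA, TransposeC>(1, 2) == 21);
```

The point of the pattern is that intermediate layers never need to know
which flags exist, so adding a new flag only touches the call site and the
wrapper that consumes it.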

Unrelated, but I also tried removing the MmaPipeline flags because they
aren't used for anything except CTranspose, which is already available.
This also makes test_amdgcn_mma_pipeline completely redundant, so I removed
that as well.

2026-06-26 12:00:58 +00:00

01_fmha

[rocm-libraries] ROCm/rocm-libraries#8492 (commit 46b6a06)

2026-06-17 06:22:26 +00:00

02_layernorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 13:54:08 -08:00

03_gemm

[rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e)

2026-06-19 06:57:14 +00:00

04_img2col

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

05_reduce

[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)

2026-06-04 15:00:17 +00:00

06_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

09_topk_softmax

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

10_rmsnorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 13:54:08 -08:00

11_add_rmsnorm2d_rdquant

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

12_smoothquant

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 11:17:11 -04:00

13_moe_sorting

Update unsigned long literals and format specifiers to work correctly in Windows (#3483)

2026-01-02 22:16:41 -07:00

14_moe_smoothquant

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 11:17:11 -04:00

15_fused_moe

[rocm-libraries] ROCm/rocm-libraries#6807 (commit ddda8ac)

2026-05-26 11:07:18 -04:00

16_batched_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

17_grouped_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

18_flatmm

[rocm-libraries] ROCm/rocm-libraries#7359 (commit dd62f9f)

2026-05-29 17:02:45 +00:00

19_gemm_multi_d

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

20_grouped_convolution

[rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)

2026-06-18 01:22:50 +00:00

21_elementwise

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

22_gemm_multi_abd

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

35_batched_transpose

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

36_pooling

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

37_transpose

[DOCS] Documentation Addition (Readme updates) (#2495)

2025-10-16 03:10:57 -07:00

38_block_scale_gemm

[rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320)

2026-06-05 11:41:49 +00:00

40_streamk_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

41_batched_contraction

[rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)

2026-06-04 15:00:17 +00:00

42_mx_gemm

[rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01)

2026-06-15 08:28:55 +00:00

49_sageattention

[rocm-libraries] ROCm/rocm-libraries#7833 (commit 8a444cd)

2026-05-29 15:29:58 +00:00

50_sparse_attn

[rocm-libraries] ROCm/rocm-libraries#7833 (commit 8a444cd)

2026-05-29 15:29:58 +00:00

51_tile_distr_enc_reg_map

[rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5)

2026-06-26 12:00:58 +00:00

52_cshuffle_lds

[rocm-libraries] ROCm/rocm-libraries#5383 (commit b660b8c)

2026-04-14 20:43:23 -07:00

CMakeLists.txt

[rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)

2026-04-30 11:32:23 -07:00

README.md

[DOCS] Documentation Addition (Readme updates) (#2495)

2025-10-16 03:10:57 -07:00

remod.py

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

README.md

CK Tile Example Suite

This directory contains a suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example implements a key deep learning or HPC operation using tile-based parallelism, modular pipelines, and data movement policies.

What is CK Tile?

CK Tile is a composable GPU programming API that expresses kernels as a composition of "tiles": rectangular blocks of computation and data movement. Pipelines and policies orchestrate data movement (global <-> LDS <-> registers), computation, and synchronization, so the same kernel structure can be tuned for efficiency without being rewritten.
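
To make the tile idea concrete, here is a minimal host-side C++ sketch of a tiled matrix multiply. It does not use the CK Tile API: `tiled_gemm` and `naive_gemm` are hypothetical names, the small stack arrays merely stand in for LDS staging buffers and register accumulators, and the loops run sequentially on the CPU rather than across GPU threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: plain triple loop, row-major C = A * B.
inline void naive_gemm(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C,
                       std::size_t M, std::size_t N, std::size_t K)
{
    for(std::size_t i = 0; i < M; ++i)
        for(std::size_t j = 0; j < N; ++j)
        {
            float acc = 0.f;
            for(std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

// Tiled version: each (m0, n0) output tile is computed from staged input tiles.
template <std::size_t TileM, std::size_t TileN, std::size_t TileK>
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C,
                std::size_t M, std::size_t N, std::size_t K)
{
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);

    float a_tile[TileM * TileK]; // stand-in for an LDS staging buffer
    float b_tile[TileK * TileN]; // stand-in for an LDS staging buffer

    for(std::size_t m0 = 0; m0 < M; m0 += TileM)
        for(std::size_t n0 = 0; n0 < N; n0 += TileN)
        {
            // Edge tiles may be smaller than the nominal tile size.
            const std::size_t mt = std::min(TileM, M - m0);
            const std::size_t nt = std::min(TileN, N - n0);
            float acc[TileM * TileN] = {}; // stand-in for register accumulators

            for(std::size_t k0 = 0; k0 < K; k0 += TileK)
            {
                const std::size_t kt = std::min(TileK, K - k0);

                // Data movement: "global" memory -> staging buffers.
                for(std::size_t i = 0; i < mt; ++i)
                    for(std::size_t k = 0; k < kt; ++k)
                        a_tile[i * TileK + k] = A[(m0 + i) * K + (k0 + k)];
                for(std::size_t k = 0; k < kt; ++k)
                    for(std::size_t j = 0; j < nt; ++j)
                        b_tile[k * TileN + j] = B[(k0 + k) * N + (n0 + j)];

                // Computation: accumulate the partial product of the two tiles.
                for(std::size_t i = 0; i < mt; ++i)
                    for(std::size_t j = 0; j < nt; ++j)
                        for(std::size_t k = 0; k < kt; ++k)
                            acc[i * TileN + j] +=
                                a_tile[i * TileK + k] * b_tile[k * TileN + j];
            }

            // Data movement: accumulators -> "global" memory.
            for(std::size_t i = 0; i < mt; ++i)
                for(std::size_t j = 0; j < nt; ++j)
                    C[(m0 + i) * N + (n0 + j)] = acc[i * TileN + j];
        }
}
```

Calling `tiled_gemm<2, 4, 2>(A, B, C, M, N, K)` and `naive_gemm(A, B, R, M, N, K)` on the same inputs should produce matching results; changing the template arguments changes only how the work is partitioned, which is the knob the examples in this directory expose in far more elaborate form.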

Example Index

Example	Operation	Description
01_fmha	Fused Multi-Head Attention	Tile-based FMHA with masking, quantization, and epilogue fusion
02_layernorm2d	LayerNorm2D	Blockwise layer normalization with fusion and quantization
03_gemm	GEMM	Matrix multiplication with tilewise parallelism
04_img2col	im2col	Image-to-column transformation for GEMM-based convolution
05_reduce	Reduction	Tilewise sum, max, mean reductions
06_permute	Permute	Generic tensor permutation (up to rank-8)
09_topk_softmax	TopK-Softmax	Rowwise softmax and top-k selection for MoE gating
10_rmsnorm2d	RMSNorm2D	Root mean square normalization for LLMs
11_add_rmsnorm2d_rdquant	Add + RMSNorm2D + RDQuant	Fused add, RMSNorm, and rowwise dynamic quantization
12_smoothquant	SmoothQuant	Per-channel scaling and quantization for int8 inference
13_moe_sorting	MoE Sorting	Token-to-expert rearrangement for MoE dispatch
14_moe_smoothquant	MoE-SmoothQuant	Expert-dependent quantization fused with top-k selection
15_fused_moe	Fused MoE	End-to-end fused MoE block: sorting, group-GEMM, activation, weighting
16_batched_gemm	Batched GEMM	Parallel computation of multiple GEMMs
17_grouped_gemm	Grouped GEMM	Multiple independent GEMMs with different shapes
18_flatmm	FLATMM	Flattened matrix multiplication for packed layouts
19_gemm_multi_d	Multi-D GEMM	GEMM with multiple side inputs (bias, residual, etc.)
35_batched_transpose	Batched Transpose	NCHW <-> NHWC and other layout conversions
36_copy	Copy	Minimal example for tile-based memory movement
37_transpose	Block Transpose	High-performance tiled transpose for large tensors

Technical Highlights

Tile Distribution: See include/ck_tile/tile_program/tile_distribution/ for mapping tiles to thread blocks.
Block Tile Pipelines: See include/ck_tile/tile_program/block_tile_pipeline/ for memory/computation pipelines.
Policies and Utilities: Many examples use custom policies for tile/block size and memory access.
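
As a generic illustration of what "mapping tiles to thread blocks" involves, the C++ sketch below converts a linear block index into the origin and extent of the output tile that block would own. It is not taken from the CK Tile headers: `TileCoord`, `ceil_div`, `num_tiles`, and `tile_for_block` are hypothetical names, and the row-major block ordering is an assumption chosen for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Origin and extent of one tile inside an M x N output (hypothetical type).
struct TileCoord
{
    std::size_t row0, col0; // top-left element of the tile
    std::size_t rows, cols; // actual extent, clipped at the matrix edge
};

constexpr std::size_t ceil_div(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b;
}

// Total number of tiles, i.e. the number of blocks a launch would need.
constexpr std::size_t num_tiles(std::size_t M, std::size_t N,
                                std::size_t TileM, std::size_t TileN)
{
    return ceil_div(M, TileM) * ceil_div(N, TileN);
}

// Map a linear block index to its tile, assuming row-major tile ordering.
constexpr TileCoord tile_for_block(std::size_t block_id,
                                   std::size_t M, std::size_t N,
                                   std::size_t TileM, std::size_t TileN)
{
    const std::size_t tiles_n = ceil_div(N, TileN);
    const std::size_t row0    = (block_id / tiles_n) * TileM;
    const std::size_t col0    = (block_id % tiles_n) * TileN;
    return TileCoord{row0, col0,
                     std::min(TileM, M - row0),
                     std::min(TileN, N - col0)};
}

// A 100 x 70 output with 32 x 32 tiles needs 4 x 3 = 12 blocks.
static_assert(num_tiles(100, 70, 32, 32) == 12);
```

With those sizes, block 5 lands on the tile starting at row 32, column 64, and its column extent is clipped to 6 because the matrix is only 70 columns wide. Real distributions also decide how elements inside a tile are spread over warps and lanes, which this sketch leaves out.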

How to Build & Run

mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j

Each example produces its own executable in build/bin/.

Learning and Extending

Start Simple: Try 03_gemm or 36_copy to learn tile basics.
Explore Fusion: See 11_add_rmsnorm2d_rdquant, 15_fused_moe, or 14_moe_smoothquant for advanced fusion.
Experiment: Modify tile sizes, layouts, or pipelines to explore performance and flexibility.

References

Back to Composable Kernel Examples