mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-10 08:18:26 +00:00

JP-Fernando e7798e9560 [rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac)

Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112)

## Motivation

This work applies XOR shuffle (swizzle) to the current `comp_async`
GEMM pipeline and to the `gemm_mx` pipeline.
XOR swizzling helps avoid LDS bank conflicts: it redistributes data
across LDS banks so that threads simultaneously accessing different
rows land on different banks.
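
The bank redistribution described above can be illustrated with a small
host-side sketch. It is not the CK Tile implementation: the bank count,
the row width, and the function names are all assumptions chosen for
illustration. It compares how many distinct banks are touched when 32
threads each read the same column from a different row, with and without
XOR-ing the column index by the row index.

```cpp
// Hypothetical parameters for illustration only (not taken from CK Tile).
constexpr int kNumBanks = 32; // assume 32 banks, each one 4-byte word wide
constexpr int kRowWords = 32; // assume each tile row spans 32 words

// Bank hit when reading (row, col) from a plain row-major tile.
constexpr int bank_linear(int row, int col)
{
    return (row * kRowWords + col) % kNumBanks;
}

// Bank hit when the column index is XOR-ed with the row index before storing.
constexpr int bank_swizzled(int row, int col)
{
    return (row * kRowWords + (col ^ (row % kRowWords))) % kNumBanks;
}

// Count the distinct banks touched when kNumBanks threads read the same
// column, one row per thread. A bitmask records which banks were hit.
template <typename F>
constexpr int distinct_banks(F bank_of, int col)
{
    unsigned seen = 0;
    for(int row = 0; row < kNumBanks; ++row)
        seen |= 1u << bank_of(row, col);
    int count = 0;
    for(; seen != 0; seen &= seen - 1)
        ++count;
    return count;
}
```

Under these assumptions, `distinct_banks(bank_linear, 0)` evaluates to 1
(every thread hits the same bank), while `distinct_banks(bank_swizzled, 0)`
evaluates to 32 (each thread hits its own bank).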

## Technical Details

The implementation follows the approach used in the existing
eight-waves pipeline.
Currently, XOR swizzle support is available for FP8 and BF8 types.
FP4 support is also available for MX GEMM.
If the types do not match, or if the async vector width is of an
unsupported size, the pipeline falls back to the previously existing
('unswizzled') path.
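
The fallback rule above can be sketched as a compile-time selection.
Everything here is hypothetical: the tag types, `LdsPath`,
`select_lds_path`, and the 16-byte width are stand-ins, since the commit
message does not state which vector widths are supported, and "types not
matching" is read here as "either operand type is outside the supported
set".

```cpp
#include <type_traits>

// Stand-in tag types for illustration; these are not the CK Tile element types.
struct fp8_t {};
struct bf8_t {};
struct fp4_t {};
struct fp16_t {};

enum class LdsPath { Swizzled, Unswizzled };

// Assumption: only 16-byte async vectors are supported (illustrative value).
constexpr bool is_supported_width(int vector_bytes) { return vector_bytes == 16; }

// FP8 and BF8 qualify everywhere; FP4 qualifies only for MX GEMM.
template <typename T>
constexpr bool is_supported_type(bool is_mx_gemm)
{
    return std::is_same<T, fp8_t>::value || std::is_same<T, bf8_t>::value ||
           (is_mx_gemm && std::is_same<T, fp4_t>::value);
}

// Pick the swizzled path only when both operand types and the vector width
// qualify; otherwise fall back to the unswizzled path.
template <typename AType, typename BType>
constexpr LdsPath select_lds_path(bool is_mx_gemm, int vector_bytes)
{
    return (is_supported_type<AType>(is_mx_gemm) &&
            is_supported_type<BType>(is_mx_gemm) &&
            is_supported_width(vector_bytes))
               ? LdsPath::Swizzled
               : LdsPath::Unswizzled;
}
```

For example, `select_lds_path<fp4_t, fp4_t>(false, 16)` yields
`LdsPath::Unswizzled` in this sketch, because FP4 is only accepted when
`is_mx_gemm` is true.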

## Test Plan

Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM
pipeline.
Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for
the MX GEMM pipeline.

## Test Result

The tests passed on the `Alola` cluster with MI350 hardware.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

2026-05-21 09:36:41 +02:00

01_fmha

[rocm-libraries] ROCm/rocm-libraries#7530 (commit 378e049)

2026-05-19 00:09:23 +08:00

02_layernorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 13:54:08 -08:00

03_gemm

[rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac)

2026-05-21 09:36:41 +02:00

04_img2col

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

05_reduce

Dlejeune/ck tile 2d multiple reductions (#3147)

2026-01-09 11:16:37 +01:00

06_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

09_topk_softmax

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

10_rmsnorm2d

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 13:54:08 -08:00

11_add_rmsnorm2d_rdquant

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

12_smoothquant

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 11:17:11 -04:00

13_moe_sorting

Update unsigned long literals and format specifiers to work correctly in Windows (#3483)

2026-01-02 22:16:41 -07:00

14_moe_smoothquant

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 11:17:11 -04:00

15_fused_moe

[CK_TILE] Generate random tensor values with multiple threads (#3324)

2025-12-09 11:02:33 +08:00

16_batched_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

17_grouped_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

18_flatmm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

19_gemm_multi_d

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

20_grouped_convolution

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

21_elementwise

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

22_gemm_multi_abd

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

35_batched_transpose

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

36_pooling

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

37_transpose

[DOCS] Documentation Addition (Readme updates) (#2495)

2025-10-16 03:10:57 -07:00

38_block_scale_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

40_streamk_gemm

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

41_batched_contraction

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

42_mx_gemm

[rocm-libraries] ROCm/rocm-libraries#5552 (commit 369c7a2)

2026-05-19 11:53:19 -07:00

49_sageattention

[rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)

2026-04-30 11:32:23 -07:00

50_sparse_attn

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 13:54:08 -08:00

51_tile_distr_enc_reg_map

[rocm-libraries] ROCm/rocm-libraries#6088 (commit 6ac353c)

2026-05-20 12:36:13 +00:00

52_cshuffle_lds

[rocm-libraries] ROCm/rocm-libraries#5383 (commit b660b8c)

2026-04-14 20:43:23 -07:00

CMakeLists.txt

[rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)

2026-04-30 11:32:23 -07:00

README.md

[DOCS] Documentation Addition (Readme updates) (#2495)

2025-10-16 03:10:57 -07:00

remod.py

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

README.md

CK Tile Example Suite

This directory contains a suite of examples demonstrating the CK Tile programming model for high-performance GPU kernels. Each example illustrates a key deep learning or HPC operation, implemented using tile-based parallelism, modular pipelines, and data-movement policies.

What is CK Tile?

CK Tile is a composable GPU programming API that expresses kernels as compositions of "tiles": rectangular blocks of computation and data movement. Pipelines and policies orchestrate data movement (global <-> LDS <-> registers), computation, and synchronization, which enables both high efficiency and flexibility.
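
To make the tile idea concrete, here is a minimal CPU sketch in plain C++. It does not use the CK Tile API; the function name `tiled_gemm` and its parameters are hypothetical. It walks a matrix multiplication one rectangular tile at a time, which is the decomposition that a GPU kernel would map onto thread blocks, LDS, and registers.

```cpp
#include <algorithm>
#include <vector>

// Plain C++ illustration of tiling; this is not the CK Tile API.
// Computes C = A * B for row-major A (m x k) and B (k x n), visiting the
// iteration space in tile x tile x tile blocks.
std::vector<float> tiled_gemm(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int m, int n, int k, int tile = 2)
{
    std::vector<float> c(m * n, 0.0f);
    for(int i0 = 0; i0 < m; i0 += tile)         // tile rows of C
        for(int j0 = 0; j0 < n; j0 += tile)     // tile columns of C
            for(int k0 = 0; k0 < k; k0 += tile) // tiles along the reduction
                // Work inside one tile; std::min clips partial edge tiles.
                for(int i = i0; i < std::min(i0 + tile, m); ++i)
                    for(int j = j0; j < std::min(j0 + tile, n); ++j)
                    {
                        float acc = c[i * n + j];
                        for(int kk = k0; kk < std::min(k0 + tile, k); ++kk)
                            acc += a[i * k + kk] * b[kk * n + j];
                        c[i * n + j] = acc;
                    }
    return c;
}
```

On a GPU, each `(i0, j0)` output tile would typically be assigned to a thread block, and each `k0` step would stage the corresponding input tiles through LDS; the loop order here only mirrors that structure and makes no performance claim.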

Example Index

| Example | Operation | Description |
|---|---|---|
| 01_fmha | Fused Multi-Head Attention | Tile-based FMHA with masking, quantization, and epilogue fusion |
| 02_layernorm2d | LayerNorm2D | Blockwise layer normalization with fusion and quantization |
| 03_gemm | GEMM | Matrix multiplication with tilewise parallelism |
| 04_img2col | im2col | Image-to-column transformation for GEMM-based convolution |
| 05_reduce | Reduction | Tilewise sum, max, mean reductions |
| 06_permute | Permute | Generic tensor permutation (up to rank-8) |
| 09_topk_softmax | TopK-Softmax | Rowwise softmax and top-k selection for MoE gating |
| 10_rmsnorm2d | RMSNorm2D | Root mean square normalization for LLMs |
| 11_add_rmsnorm2d_rdquant | Add + RMSNorm2D + RDQuant | Fused add, RMSNorm, and rowwise dynamic quantization |
| 12_smoothquant | SmoothQuant | Per-channel scaling and quantization for int8 inference |
| 13_moe_sorting | MoE Sorting | Token-to-expert rearrangement for MoE dispatch |
| 14_moe_smoothquant | MoE-SmoothQuant | Expert-dependent quantization fused with top-k selection |
| 15_fused_moe | Fused MoE | End-to-end fused MoE block: sorting, group-GEMM, activation, weighting |
| 16_batched_gemm | Batched GEMM | Parallel computation of multiple GEMMs |
| 17_grouped_gemm | Grouped GEMM | Multiple independent GEMMs with different shapes |
| 18_flatmm | FLATMM | Flattened matrix multiplication for packed layouts |
| 19_gemm_multi_d | Multi-D GEMM | GEMM with multiple side inputs (bias, residual, etc.) |
| 35_batched_transpose | Batched Transpose | NCHW <-> NHWC and other layout conversions |
| 36_copy | Copy | Minimal example for tile-based memory movement |
| 37_transpose | Block Transpose | High-performance tiled transpose for large tensors |

Technical Highlights

Tile Distribution: See include/ck_tile/tile_program/tile_distribution/ for mapping tiles to thread blocks.
Block Tile Pipelines: See include/ck_tile/tile_program/block_tile_pipeline/ for memory/computation pipelines.
Policies and Utilities: Many examples use custom policies for tile/block size and memory access.

How to Build & Run

mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make -j

Each example produces its own executable in build/bin/.

Learning and Extending

Start Simple: Try 03_gemm or 36_copy to learn tile basics.
Explore Fusion: See 11_add_rmsnorm2d_rdquant, 15_fused_moe, or 14_moe_smoothquant for advanced fusion.
Experiment: Modify tile sizes, layouts, or pipelines to explore performance and flexibility.
