Cong Ma d06f35027a [rocm-libraries] ROCm/rocm-libraries#4354 (commit d41f08a)
[CK TILE] fix numerical errors of preshuffle_b

This pull request introduces several improvements and fixes related to
quantized grouped GEMM (General Matrix Multiply) pipelines and their
supporting utilities.

# The numerical issue

## Steps to reproduce
Run:
```bash
./bin/tile_example_gemm_weight_preshuffle -prec=fp8
./bin/tile_example_gemm_weight_preshuffle -prec=int4
```

# Solution
The main changes address type correctness, improve data layout and
shuffling logic, and expand test coverage to better validate different
GEMM configurations.

**Key changes include:**

### Data layout and shuffling logic

* Refactored the logic in `shuffle_b_permuteN` to use `constexpr`
variables for `KLane` and `ItemsPerAccess`, simplifying tile view
construction and correcting the permutation order for improved
efficiency and correctness (`tensor_shuffle_utils.hpp`).
* Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline
policies to account for internal data type conversion (e.g., from
`pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in
quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`,
`gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`).
[[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275)
[[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105)

### Test infrastructure enhancements

* Unit tests did not catch this issue because fp8 was not covered.
Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`)
to support additional GEMM tile shapes, and updated the tests to run
with these configurations for broader coverage
(`test_gemm_pipeline_util.hpp`).
[[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103)
[[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269)

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-02-11 07:05:46 +00:00

Quick Tour for New Users

The Grouped GEMM operators are versions of GEMM that run multiple GEMM operations within a single kernel call. Each GEMM operation performs a matrix multiplication. Unlike regular batched GEMM operations where both matrices must be of the same size and have the same configuration, Grouped GEMM operations can take matrices with different sizes and configurations, making them more flexible for diverse workloads.

Preshuffle and Persistence

The grouped GEMM examples include the following advanced optimization features:

Weight Preshuffle

Weight preshuffle is an optimization technique that reorganizes the B matrix (weights) in memory to improve data access patterns and reduce memory bandwidth requirements. This is particularly beneficial for inference workloads where the same weights are reused across multiple batches.

  • Implementation: Available in grouped_gemm_preshuffle.cpp
  • Configuration: Uses the GemmConfigPreshuffleDecode and GemmConfigPreshufflePrefill template configurations
  • Constraints: Currently supports only A(Row major) + B(Column major) → C(Row major) layouts

Persistence Mode

Persistence mode is a GPU optimization where thread blocks remain active on the compute units to process multiple work items sequentially, reducing kernel launch overhead and improving occupancy.

  • Template Parameter: Controlled by the Persistent boolean template parameter in invoke_gemm
  • Usage: invoke_gemm<ALayout, BLayout, CLayout, true> enables persistence

Multi-D Operations

Multi-D operations extend the standard GEMM operation by supporting additional elementwise operations on the result tensor. This feature is particularly useful for workloads that require post-processing of the GEMM output.

  • Implementation: Available in grouped_gemm_multi_d.cpp
  • Operation: E = C × D₀ × D₁ (where C = A × B is the standard GEMM result)
  • Configuration: Uses the GemmConfigV3, GemmConfigV4, and GemmConfigMemory template configurations with 2 D tensors
  • Data Types: Supports fp16, bf16, fp8
  • Benefits: Enables complex operations like scaling, activation functions, or other elementwise transformations in a single kernel call
  • Build Target: make tile_example_grouped_gemm_multi_d -j

Multi-D operations support both persistent and non-persistent modes. Weight preshuffle supports only non-persistent mode.

Build

```bash
# in the root of ck_tile
mkdir build && cd build
../script/cmake-ck-dev.sh ../ <arch>
make tile_example_grouped_gemm -j
# The preshuffle example
make tile_example_grouped_gemm_preshuffle -j
# The multi-D operations example
make tile_example_grouped_gemm_multi_d -j
# The quant grouped gemm fp8 example
make tile_example_quant_grouped_gemm -j
```

Each example builds a corresponding executable: build/bin/tile_example_grouped_gemm, build/bin/tile_example_grouped_gemm_preshuffle, build/bin/tile_example_grouped_gemm_multi_d, and build/bin/tile_example_quant_grouped_gemm.

Example

args:
 -Ms          M dimensions - (Default: empty).
 -Ns          N dimensions - (Default: empty).
 -Ks          K dimensions - (Default: empty).
 -stride_As   Tensor A strides - (Default: empty).
 -stride_Bs   Tensor B strides - (Default: empty).
 -stride_Cs   Tensor C strides - (Default: empty).
 -a_layout    A tensor data layout - (Default: Row).
 -b_layout    B tensor data layout - (Default: Col).
 -c_layout    C tensor data layout - (Default: Row).
 -prec        data type. fp16/bf16/fp8 - (Default: fp16).
 -validate    0. No validation, 1. Validation on CPU. (Default: 1).
 -warmup      Number of warmup iterations before benchmarking the kernel. (Default: 10).
 -repeat      Number of iterations to benchmark the kernel. (Default: 100).
 -group_count Group count. (Default: 16).
 -kbatch      kbatch for SplitK (Default: 1).
 -json        0: No Json, 1: Dump Results in Json format (Default: 0).
 -jsonfile    json file name to dump results (Default: grouped_gemm.json).

If any of Ms, Ns, Ks, stride_As, stride_Bs, or stride_Cs are missing or their sizes don't match group_count, the example generates defaults per group index i (0-based):

M[i] = 256 + 256 * i
N[i] = 256 + 512 * i
K[i] = 512 + 384 * i

stride_A[i] = K[i]
stride_B[i] = K[i]
stride_C[i] = N[i]

Source Structure


For tile distribution, see include/ck_tile/tile_program/tile_distribution/.

