mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 13:11:25 +00:00

Quant GEMM Matrix Multiplication

This folder contains examples of quant GEMMs using the ck_tile tile-programming implementation.

AQuant kernel with blocks of A matrix sharing scales: custom GEMM pipeline
BQuant kernel with blocks of B matrix sharing scales: custom GEMM pipeline
Row and Column-wise scaled: scaling implemented in Epilogue
Tensor-wise scaled: scaling implemented in Epilogue
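The difference between the block-scaled and epilogue-scaled variants above can be sketched with a small CPU reference. This is only an illustration, not the ck_tile implementation: it assumes that "blocks of A sharing scales" means every run of `group_k` consecutive K elements in a row of A shares one scale, and it holds quantized values as `float` for simplicity. The names `aquant_gemm_ref`, `rowcol_scale_epilogue`, and `group_k` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference for an AQuant-style GEMM (assumption: scales are
// shared by groups of `group_k` consecutive K elements in each row of A).
// a:       M x K, row-major, quantized values stored as float
// a_scale: M x (K / group_k), row-major, one scale per group
// b:       K x N, row-major
std::vector<float> aquant_gemm_ref(const std::vector<float>& a,
                                   const std::vector<float>& a_scale,
                                   const std::vector<float>& b,
                                   std::size_t M, std::size_t N, std::size_t K,
                                   std::size_t group_k)
{
    assert(group_k > 0 && K % group_k == 0);
    const std::size_t groups = K / group_k;
    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t g = 0; g < groups; ++g)
            {
                // Accumulate the unscaled partial product of one K group...
                float partial = 0.0f;
                for(std::size_t k = g * group_k; k < (g + 1) * group_k; ++k)
                    partial += a[m * K + k] * b[k * N + n];
                // ...then apply that group's scale inside the K loop.
                acc += a_scale[m * groups + g] * partial;
            }
            c[m * N + n] = acc;
        }
    }
    return c;
}

// Hypothetical epilogue for row/column-wise scaling: the GEMM itself runs
// unscaled, and each output element is scaled once afterwards.
// Tensor-wise scaling is the special case where all entries of a scale
// vector are equal.
void rowcol_scale_epilogue(std::vector<float>& c,
                           const std::vector<float>& row_scale,
                           const std::vector<float>& col_scale,
                           std::size_t M, std::size_t N)
{
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            c[m * N + n] *= row_scale[m] * col_scale[n];
}
```

In this sketch the block-scaled variant has to apply scales inside the K loop, which is consistent with the list above describing a custom GEMM pipeline for it, while the row/column and tensor-wise variants only touch the finished accumulator.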

build

# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx942) or leave it blank
../script/cmake-ck-dev.sh  ../ <arch>
# Compile the quant kernels
make tile_example_gemm_quant_basic -j

This produces the executable build/bin/tile_example_gemm_quant_basic.

example

args:
          -b    batch size (default:1)
          -m    m dimension (default:1024)
          -n    n dimension (default:2048)
          -k    k dimension (default:64)
   -a_layout    Tensor A data layout (default: R)
   -b_layout    Tensor B data layout (default: C)
   -c_layout    Tensor C data layout (default: R)
   -stride_a    Tensor A stride (default:0)
   -stride_b    Tensor B stride (default:0)
   -stride_c    Tensor C stride (default:0)
          -v    0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:1)
          -e    Absolute error tolerance (default:1e-5)
       -prec    data type. fp8/bf8/i4fp8/i4bf8/i4f32fp8/i4f32bf8 (default:fp8)
     -warmup    number of iterations before benchmark the kernel (default:10)
     -repeat    number of iterations to benchmark the kernel (default:100)
      -timer    gpu:gpu timer, cpu:cpu timer (default:gpu)
 -quant_mode    Which quant method to use (aquant, bquant, tensor, rowcol)
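The stride arguments default to 0. A common convention, assumed here rather than confirmed by this README, is that 0 means "use the packed stride implied by the layout". The helper below is a hypothetical sketch of that convention; `resolve_stride` is not a name taken from the example sources.

```cpp
#include <cstddef>

// Hypothetical helper (assumption: a stride of 0 requests the packed
// leading dimension). For a row-major rows x cols matrix the leading
// dimension is cols; for a column-major one it is rows.
std::size_t resolve_stride(std::size_t rows,
                           std::size_t cols,
                           std::size_t stride,
                           bool row_major)
{
    if(stride != 0)
        return stride; // the user supplied an explicit stride
    return row_major ? cols : rows;
}
```

Under this assumption, the default problem size (m=1024, n=2048, k=64) with the default layouts would give a stride of 64 for A (row-major, M x K), 64 for B (column-major, K x N), and 2048 for C (row-major, M x N).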

Users need to select the config in the cpp file that matches the chosen quant mode:

Selection                               quant_mode (runtime argument)   Config in cpp file
AQuant                                  aquant                          GemmConfigQuant
AQuant with Preshuffle                  aquant                          GemmConfigPreshuffleQuant
BQuant                                  bquant                          GemmConfigQuant
Preshuffled weight matrix with BQuant   bquant                          GemmConfigPreshuffleB_Bquant_decode (or) GemmConfigPreshuffleB_Bquant_prefill
RowCol quant                            rowcolquant                     GemmConfigRowColQuant