mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-02 04:31:25 +00:00

Files

Cong Ma 452791a3ba Preshuffle AQ matrix in block scale gemm (#2624)

* Preshuffle AQ matrix in block scale gemm

* turns the output to fp16. Increase the repetition time.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

2025-08-12 21:32:51 -07:00

File                         Last commit                                                        Date
CMakeLists.txt               Preshuffle AQ matrix in block scale gemm (#2624)                   2025-08-12 21:32:51 -07:00
gemm_aquant_basic.cpp        Preshuffle AQ matrix in block scale gemm (#2624)                   2025-08-12 21:32:51 -07:00
gemm_aquant_preshuffle.cpp   Preshuffle AQ matrix in block scale gemm (#2624)                   2025-08-12 21:32:51 -07:00
gemm_utils.hpp               Preshuffle AQ matrix in block scale gemm (#2624)                   2025-08-12 21:32:51 -07:00
README.md                    ck_tile kernel for gemm with groupwise quantized A tensor (#2473)  2025-07-23 00:10:16 -07:00
run_gemm_aquant_example.inc  Preshuffle AQ matrix in block scale gemm (#2624)                   2025-08-12 21:32:51 -07:00

README.md

GEMM Matrix Multiplication

This folder contains an example of block-scale GEMM, in which the A tensor is groupwise quantized, implemented with the ck_tile tile-programming API.
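To make the idea of a block-scale GEMM concrete, here is a minimal host-side reference sketch. It is not the ck_tile kernel and does not reflect its actual data layout: the function name `block_scale_gemm_ref`, the int8 storage for A, the one-scale-per-row-per-K-group arrangement, and the row-major layouts are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference (not the ck_tile API): computes C = dequant(A) * B,
// where A is stored as int8 values plus one float scale per (row, K-group).
// All matrices are assumed row-major and densely packed.
std::vector<float> block_scale_gemm_ref(const std::vector<std::int8_t>& a_q, // M x K
                                        const std::vector<float>& a_scale,   // M x (K / group_size)
                                        const std::vector<float>& b,         // K x N
                                        std::size_t M,
                                        std::size_t N,
                                        std::size_t K,
                                        std::size_t group_size)
{
    assert(group_size > 0 && K % group_size == 0);
    const std::size_t num_groups = K / group_size;
    assert(a_q.size() == M * K);
    assert(b.size() == K * N);
    assert(a_scale.size() == M * num_groups);

    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t g = 0; g < num_groups; ++g)
            {
                // Accumulate the unscaled partial dot product for this K-group.
                float partial = 0.0f;
                for(std::size_t k = g * group_size; k < (g + 1) * group_size; ++k)
                {
                    partial += static_cast<float>(a_q[m * K + k]) * b[k * N + n];
                }
                // Apply the group's scale once per partial sum.
                acc += a_scale[m * num_groups + g] * partial;
            }
            c[m * N + n] = acc;
        }
    }
    return c;
}
```

Applying the scale once per K-group, rather than once per element, is what distinguishes this from dequantizing A up front. A GPU kernel would typically do the same at tile granularity.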

build

# from the root of the repository
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
# build the basic aquant pipeline GEMM example
make tile_example_gemm_aquant_basic -j

This produces the executable build/bin/tile_example_gemm_aquant_basic.

example

args:
          -b    batch size (default:1)
          -m    m dimension (default:1024)
          -n    n dimension (default:2048)
          -k    k dimension (default:64)
   -a_layout    Tensor A data layout (default: R)
   -b_layout    Tensor B data layout (default: R)
   -c_layout    Tensor C data layout (default: R)
   -stride_a    Tensor A stride (default:0)
   -stride_b    Tensor B stride (default:0)
   -stride_c    Tensor C stride (default:0)
          -v    0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:2)
          -e    Absolute error tolerance (default:1e-5)
       -prec    data type. fp16/bf16/fp8/bf8/int8 (default:fp16)
     -warmup    number of iterations before benchmark the kernel (default:10)
     -repeat    number of iterations to benchmark the kernel (default:100)
      -timer    gpu:gpu timer, cpu:cpu timer (default:gpu)
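The stride arguments default to 0, which is not a usable leading dimension. The sketch below shows one plausible way such a default could be resolved from the shape and layout, assuming that 0 means "use the densely packed stride". The helper name `resolve_stride` is hypothetical, and the example's actual resolution logic may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (an assumption, not taken from the example sources):
// returns the leading-dimension stride for a rows x cols matrix.
// layout is 'R' for row-major or 'C' for column-major, matching the
// -a_layout / -b_layout / -c_layout values listed above.
std::size_t resolve_stride(std::size_t rows, std::size_t cols, std::size_t stride, char layout)
{
    assert(layout == 'R' || layout == 'C');
    if(stride != 0)
    {
        return stride; // an explicit stride is used as given
    }
    // Packed default: a row-major matrix steps by its column count,
    // a column-major matrix by its row count.
    return layout == 'R' ? cols : rows;
}
```

Under this assumption, the default problem size (m=1024, n=2048, k=64, all layouts R) would give a stride of 64 for A (M x K), 2048 for B (K x N), and 2048 for C (M x N).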