mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 05:01:25 +00:00

GEMM Matrix Multiplication

This folder contains an example of block-scale GEMM implemented with the ck_tile tile-programming API.
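To make "block scale" concrete, the following is a minimal CPU reference sketch of a GEMM whose A operand is quantized in groups along K, with one scale per group. It is not taken from the ck_tile sources: the function name `reference_gemm_aquant`, the row-major layouts, the int8 element type for A, the scale layout, and the `group_size` parameter are all assumptions chosen for illustration, and the example's actual kernel may organize scales differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for a GEMM with a groupwise-quantized A tensor.
// Assumptions (illustrative only, not the ck_tile implementation):
//   - A is M x K, row-major, stored as int8 quantized values.
//   - B is K x N, row-major, stored as float.
//   - a_scale holds one float per (row of A, group of K), row-major,
//     i.e. M x (K / group_size) entries.
//   - K is a multiple of group_size.
std::vector<float> reference_gemm_aquant(const std::vector<std::int8_t>& a_q,
                                         const std::vector<float>& a_scale,
                                         const std::vector<float>& b,
                                         std::size_t M,
                                         std::size_t N,
                                         std::size_t K,
                                         std::size_t group_size)
{
    assert(group_size > 0 && K % group_size == 0);
    const std::size_t num_groups = K / group_size;

    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t g = 0; g < num_groups; ++g)
            {
                // Accumulate the unscaled partial product over one K-group.
                float partial = 0.0f;
                for(std::size_t kk = 0; kk < group_size; ++kk)
                {
                    const std::size_t k = g * group_size + kk;
                    partial += static_cast<float>(a_q[m * K + k]) * b[k * N + n];
                }
                // Apply the group's scale once per partial sum.
                acc += a_scale[m * num_groups + g] * partial;
            }
            c[m * N + n] = acc;
        }
    }
    return c;
}
```

Applying the scale once per group, rather than once per element, is the point of the scheme: the inner loop runs on the low-precision values and the dequantization cost is amortized over `group_size` multiply-adds.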

build

# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh  ../ <arch>
# build the basic aquant (quantized A tensor) GEMM example
make tile_example_gemm_aquant_basic -j

This produces the executable build/bin/tile_example_gemm_aquant_basic.

example

args:
          -b    batch size (default:1)
          -m    m dimension (default:1024)
          -n    n dimension (default:2048)
          -k    k dimension (default:64)
   -a_layout    Tensor A data layout (default: R)
   -b_layout    Tensor B data layout (default: R)
   -c_layout    Tensor C data layout (default: R)
   -stride_a    Tensor A stride (default:0)
   -stride_b    Tensor B stride (default:0)
   -stride_c    Tensor C stride (default:0)
          -v    0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:2)
          -e    Absolute error tolerance (default:1e-5)
       -prec    data type. fp16/bf16/fp8/bf8/int8 (default:fp16)
     -warmup    number of warm-up iterations run before benchmarking the kernel (default:10)
     -repeat    number of iterations used to benchmark the kernel (default:100)
      -timer    gpu:gpu timer, cpu:cpu timer (default:gpu)