[CK TILE GEMM] Refactor block_scale_gemm examples (#3181)

* [CK TILE GEMM] Refactor block_scale_gemm examples
  - Split cpp file to reduce building time
  - Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
  - Update Readme
* [CK TILE GEMM] Refactor block_scale_gemm examples
  - Add support for rowcol and tensor GEMM operations
* [CK TILE GEMM] Refactor block_scale_gemm examples
  - Update README
* [CK TILE GEMM] Refactor block_scale_gemm examples
  - Set quant group size to (1, 1, 64) for targets excluding gfx950, where warp tile size (16, 16, 128) is incompatible.
2026-05-02 20:51:23 +00:00 · 2025-11-13 00:43:40 -07:00
parent 9af30f04b6
commit 6fd8ddabe7
14 changed files with 805 additions and 495 deletions
--- a/example/ck_tile/38_block_scale_gemm/README.md
+++ b/example/ck_tile/38_block_scale_gemm/README.md
@@ -40,23 +40,31 @@ This will result in an executable `build/bin/tile_example_gemm_quant_basic`
 ## example
 ```
 args:
-          -b    batch size (default:1)
-          -m    m dimension (default:1024)
-          -n    n dimension (default:2048)
-          -k    k dimension (default:64)
-   -a_layout    Tensor A data layout (default: R)
-   -b_layout    Tensor B data layout (default: C)
-   -c_layout    Tensor C data layout (default: R)
-   -stride_a    Tensor A stride (default:0)
-   -stride_b    Tensor B stride (default:0)
-   -stride_c    Tensor C stride (default:0)
-          -v    0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:1)
-          -e    Absolute error tolerance (default:1e-5)
-       -prec    data type. fp8/bf8/i4fp8/i4bf8/i4f32fp8/i4f32bf8 (default:fp8)
-     -warmup    number of iterations before benchmark the kernel (default:10)
-     -repeat    number of iterations to benchmark the kernel (default:100)
-      -timer    gpu:gpu timer, cpu:cpu timer (default:gpu)
- -quant_mode    Which quant method to use (aquant, bquant, tensor, rowcol)
+             -h    Print help message (default:false)
+             -m    m dimension (default:3840)
+             -n    n dimension (default:4096)
+             -k    k dimension (default:2048)
+      -a_layout    A tensor data layout - Row or Column (default:R)
+      -b_layout    B tensor data layout - Row or Column (default:C)
+     -bq_layout    Bq tensor data layout - Row or Column (default:C)
+      -c_layout    C tensor data layout - Row or Column (default:R)
+      -stride_a    Tensor A stride (default:0)
+      -stride_q    Tensor AQ stride (default:0)
+      -stride_b    Tensor B stride (default:0)
+      -stride_c    Tensor C stride (default:0)
+             -v    0: No validation, 1: Validation on CPU, 2: Validation on GPU (default:1)
+          -prec    Data type. For AQuant: fp8, bf8, i4fp8, or i4bf8;  for Bquant: fp8, bf8, fp8i4, or bf8i4 (default for both AQuant and Bquant: fp8)
+        -warmup    Number of iterations before benchmarking the kernel (default:50)
+        -repeat    Number of iterations to benchmark the kernel (default:1000)
+         -timer    gpu:gpu timer, cpu:cpu timer (default:gpu)
+       -split_k    SplitK value (default:1)
+        -device    Device id that will be used to run the kernel (default:0)
+          -init    0:random, 1:linear, 2:constant(1) (default:0)
+   -flush_cache    Flush cache before running the kernel (default:true)
+-rotating_count    Rotating count (default:1000)
+    -quant_mode    Choose aquant, bquant, tensor or rowcol (default:bquant)
+   -preshuffleb    Enable preshuffle of tensor B (default:false)
+    -group_size    Quantization group size as MxNxK, e.g., 1x1x128, 1x32x128, 1x64x128 (default:1x1x128)
 ```
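
The `-group_size` option above takes its value as an `MxNxK` string such as `1x1x128`. As a rough illustration of that format only, the sketch below splits such a string into three positive integers. `parse_group_size` is a hypothetical helper written for this note; it is not the parser the example uses, and it does not check whether a given group size is supported by any kernel configuration.

```cpp
#include <array>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not part of CK Tile): parses "MxNxK" into {M, N, K}.
// Returns std::nullopt if the text is not exactly three positive integers
// separated by a lowercase 'x'.
std::optional<std::array<int, 3>> parse_group_size(std::string_view text)
{
    std::array<int, 3> dims{};
    const char* cur = text.data();
    const char* end = text.data() + text.size();

    for(std::size_t i = 0; i < dims.size(); ++i)
    {
        // Every field after the first must be preceded by 'x'.
        if(i > 0)
        {
            if(cur == end || *cur != 'x')
                return std::nullopt;
            ++cur;
        }

        // from_chars stops at the first non-digit and reports where it stopped.
        const auto [next, ec] = std::from_chars(cur, end, dims[i]);
        if(ec != std::errc{} || dims[i] <= 0)
            return std::nullopt;
        cur = next;
    }

    // Reject trailing characters such as a fourth field.
    if(cur != end)
        return std::nullopt;

    return dims;
}
```

With this sketch, `parse_group_size("1x32x128")` yields `{1, 32, 128}`, while malformed inputs such as `1x1` or `1x0x128` yield an empty optional.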

 Users need to select the correct config mapping for each quant mode: