supporting prefill shapes for preshuffle block scale gemm (#2975)

* debugging
* debugging for prefill shapes
* comment unused code
* fix for prefill shapes
* clearing up the code
* add int4 to universal gemm example
* clang formatted
* adding test for prefill shapes in block scale gemm
* lil improv on the block pipeline
* Address Review Comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2026-05-02 04:31:25 +00:00 · 2025-10-10 15:36:24 -07:00
parent 9d060d3e3c
commit 3c39d279ab
10 changed files with 137 additions and 89 deletions
--- a/example/ck_tile/38_block_scale_gemm/README.md
+++ b/example/ck_tile/38_block_scale_gemm/README.md
@@ -4,8 +4,18 @@ This folder contains examples of quant GEMMs using the ck_tile tile-programming

 - AQuant kernel with blocks of A matrix sharing scales: custom GEMM pipeline
 - BQuant kernel with blocks of B matrix sharing scales: custom GEMM pipeline
- Row and Column-wise scaled: scaling implemented in Epilogue
- Tensor-wise scaled: scaling implemented in Epilogue
+- Row and Column-wise scaled: each row of the A matrix and each column of the B matrix shares a single quantization scale, and the element-wise scaling is applied in the epilogue.
+- Tensor-wise scaled: a single scalar scale is shared across the whole A or B tensor.
+
+---
+
+## Features
+
+- **Preshuffled GEMM**: Shuffles the B (weight) matrix into the warp layout so that the GEMM computation can bypass shared memory. This is the best-performing GEMM option.
+- **TransposeC**: Transposes the C matrix output layout so that scale reads are coalesced as well as possible.
+- **Preshuffled Quant**: Preshuffles the input matrix so that multiple quant warp blocks can be loaded along the selected dimension.
+- **Precision**: Supports fp16, bf16, fp8, bf8, int4 (for B Matrix).
+- **Validation**: CPU/GPU validation and error tolerance options.

 ## build
 ```