[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

[CK] Add split-K support for ABQuantGrouped in block_scale_gemm (#4816)

## Changes

### Split-K support in `gemm_quant_kernel.hpp`

- **`SplitKBatchOffset`**: Added `aq_group_offset` and `aq_k_split_offset` fields (mirroring the existing `bq_*` fields for B) to track each split-K batch's position within the AQ scale tensor. For `ABQuantGrouped`, both offsets are computed from `k_id * KRead` divided by `AQuantGroupSize::kK`.
- **`MakeAQBlockWindow`**: Added an `aq_group_offset` parameter (defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group dimension reflects only the remaining K-groups from the split-K offset, consistent with how `MakeBQBlockWindow` handles the BQ tensor.
- **`RunGemm`**: Threads the `aq_k_split_offset` through to `MakeAQBlockWindow` when in split-K mode.

### Constraints in `IsSupportedArgument()`

Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped:

1. **Mode check**: split-K is only allowed for `BQuantGrouped` (no preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant mode with `k_batch > 1` returns `false`.
2. **B quant group alignment**: `KRead` (per-batch K slice) must be divisible by `BQuantGroupSize::kK`. Each batch must operate on complete B quantization groups; a partial group would require splitting a scale value across batches.
3. **A quant group alignment** (new, ABQuantGrouped only): `KRead` must also be divisible by `AQuantGroupSize::kK` for the same reason applied to the AQ scale tensor.
4. **Minimum 2 K-tile iterations per batch** (new): The software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead, so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When `KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch reads into the next batch's memory region and produces incorrect results. Configurations where `K == k_batch * KPerBlock` are therefore rejected.

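The alignment and pipeline-depth constraints (2 to 4) and the AQ group offset can be illustrated with a small host-side sketch. This is not the code in `gemm_quant_kernel.hpp`: the `SplitKParams` struct and both function names are hypothetical, the mode check (constraint 1) is left out, and `KRead` is taken as a given input rather than derived from `K` and `k_batch`.

```cpp
#include <cstdint>

// Hypothetical stand-in for values the real kernel takes from its
// template parameters and kernel arguments.
struct SplitKParams
{
    std::int64_t k_read;      // per-batch K slice (KRead)
    std::int64_t k_per_block; // K tile size (KPerBlock)
    std::int64_t a_group_k;   // AQuantGroupSize::kK
    std::int64_t b_group_k;   // BQuantGroupSize::kK
    std::int64_t k_batch;     // number of split-K batches
};

// Sketch of constraints 2-4 as described above; constraint 1 (quant mode)
// is assumed to have been checked by the caller.
bool split_k_alignment_ok(const SplitKParams& p)
{
    if(p.k_batch <= 1)
        return true; // the constraints only gate split-K
    if(p.k_read % p.b_group_k != 0)
        return false; // a batch would end inside a B quant group
    if(p.k_read % p.a_group_k != 0)
        return false; // a batch would end inside an A quant group
    // The pipeline prefetches one tile ahead, so each batch needs >= 2 tiles.
    return p.k_read / p.k_per_block >= 2;
}

// Index of the first AQ K-group belonging to split-K batch k_id,
// following the "k_id * KRead divided by AQuantGroupSize::kK" description.
std::int64_t aq_group_offset(std::int64_t k_id, const SplitKParams& p)
{
    return (k_id * p.k_read) / p.a_group_k;
}
```

With the decode tile (`K_tile=256`) and `k_batch = 2`, a per-batch slice of 512 gives two tiles per batch and passes, while a slice of 256 gives exactly one tile per batch and is rejected by the last check.
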
### Example update (`run_gemm_quant_example.inc`)

Updated the comment above the `IsSupportedArgument` call to document that split-K is now supported for both `BQuantGrouped` (no preshuffle) and `ABQuantGrouped` (no `APreshuffleQuant`).

## Unit Tests

Two new test files cover decode and prefill tile shapes across a range of `k_batch` values (2–8), data types (FP8, BF8), and quantization group sizes (1×1×128 and 1×128×128 for B):

- `test_gemm_quant_abquant_splitk_decode.cpp`: uses the decode tile shape (M=16, N=64, K_tile=256)
- `test_gemm_quant_abquant_splitk_prefill.cpp`: uses the prefill tile shape (M=128, N=128, K_tile=128)

Each test calls `run_test_with_validation`, which runs the kernel and checks correctness against a CPU reference. Configurations excluded from tests are annotated with comments explaining which constraint they violate (typically the `per_batch_num_loop >= 2` requirement).

## Prerequisites

This PR depends on #4429, which must be merged before this one.
2026-06-06 05:55:39 +00:00 · 2026-02-26 23:57:17 +00:00
parent 6549c320fc
commit c8a8449eec
23 changed files with 796 additions and 418 deletions
--- a/example/ck_tile/38_block_scale_gemm/CMakeLists.txt
+++ b/example/ck_tile/38_block_scale_gemm/CMakeLists.txt
@@ -14,7 +14,11 @@ if(GPU_TARGETS MATCHES "gfx94|gfx95|gfx12")
    set(EXE_NAME tile_example_gemm_quant)
    add_executable(${EXE_NAME}
        gemm_quant.cpp
-        gemm_abquant_quantgrouped.cpp
+        gemm_abquant_quantgrouped_fp8.cpp
+        gemm_abquant_quantgrouped_fp4.cpp
+        gemm_abquant_quantgrouped_bf8.cpp
+        gemm_abquant_quantgrouped_preshuffleb_fp8.cpp
+        gemm_abquant_quantgrouped_preshuffleb_bf8.cpp
        gemm_abquant_quantgrouped_preshuffleb_preshufflequant.cpp
        gemm_aquant_quantgrouped.cpp
        gemm_aquant_quantgrouped_preshufflequant.cpp
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped.cpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped.cpp
@@ -1,202 +0,0 @@
-// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
-// SPDX-License-Identifier: MIT
-
-#include "run_gemm_quant_example.inc"
-
-#if defined(CK_USE_GFX950)
-template <typename T>
-using GemmConfig = GemmConfigEightWarps<T>;
-template <typename T>
-using GemmConfigPrefill = GemmConfigPreshuffleBEightWarps<T>;
-#else
-template <typename T>
-using GemmConfig = GemmConfigABQuantPrefill<T>;
-template <typename T>
-using GemmConfigPrefill = GemmConfigPreshuffleB_ABQuant_Prefill<T>;
-#endif
-
-static auto _ = []() {
-    auto& lut                               = get_kernel_lut();
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "non-preshuffleb",
-                               "non-preshufflequant",
-                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigABQuantPrefill<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "non-preshuffleb",
-                               "non-preshufflequant",
-                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfig<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"bf8",
-                               "abquant",
-                               "non-preshuffleb",
-                               "non-preshufflequant",
-                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigABQuantPrefill<ck_tile::bf8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"bf8",
-                               "abquant",
-                               "non-preshuffleb",
-                               "non-preshufflequant",
-                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfig<ck_tile::bf8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "preshuffleb",
-                               "non-preshufflequant",
-                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "preshuffleb",
-                               "non-preshufflequant",
-                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"bf8",
-                               "abquant",
-                               "preshuffleb",
-                               "non-preshufflequant",
-                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"bf8",
-                               "abquant",
-                               "preshuffleb",
-                               "non-preshufflequant",
-                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "non-preshuffleb",
-                               "preshufflequant",
-                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPreshuffleBQuantPrefill<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "non-preshuffleb",
-                               "preshufflequant",
-                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPreshuffleBQuantPrefill<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings(
-        {"fp4", "abquant", "non-preshuffleb", "non-preshufflequant", "1x128x128"})] =
-        [](const ck_tile::ArgParser& arg_parser) {
-            using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-            using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-            using TypeConfig      = decltype(GemmQuantTypeConfig<ck_tile::pk_fp4_t,
-                                                                 ck_tile::pk_fp4_t,
-                                                                 ck_tile::half_t,
-                                                                 float>{});
-            return run_gemm_example_prec_type<GemmConfigABQuantPrefill<ck_tile::pk_fp4_raw_t>,
-                                              TypeConfig,
-                                              AQuantGroupSize,
-                                              BQuantGroupSize,
-                                              ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-        };
-    lut[hash_multiple_strings(
-        {"fp4", "abquant", "preshuffleb", "non-preshufflequant", "1x128x128"})] =
-        [](const ck_tile::ArgParser& arg_parser) {
-            using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-            using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-            using TypeConfig      = decltype(GemmQuantTypeConfig<ck_tile::pk_fp4_t,
-                                                                 ck_tile::pk_fp4_t,
-                                                                 ck_tile::half_t,
-                                                                 float>{});
-            return run_gemm_example_prec_type<
-                GemmConfigPreshuffleB_ABQuant_Prefill<ck_tile::pk_fp4_raw_t>,
-                TypeConfig,
-                AQuantGroupSize,
-                BQuantGroupSize,
-                ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-        };
-    return 0;
-}();
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped.h
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped.h
@@ -0,0 +1,18 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#pragma once
+
+#include "gemm_utils.hpp"
+
+#if defined(CK_USE_GFX950)
+template <typename T, bool TransposeC = true>
+using GemmConfig = GemmConfigEightWarps<T, TransposeC>;
+template <typename T, bool TransposeC = true>
+using GemmConfigPrefill = GemmConfigPreshuffleBEightWarps<T, TransposeC>;
+#else
+template <typename T, bool TransposeC = true>
+using GemmConfig = GemmConfigABQuantPrefill<T, TransposeC>;
+template <typename T, bool TransposeC = true>
+using GemmConfigPrefill = GemmConfigPreshuffleB_ABQuant_Prefill<T, TransposeC>;
+#endif
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_bf8.cpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_bf8.cpp
@@ -0,0 +1,41 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#include "run_gemm_quant_example.inc"
+#include "gemm_abquant_quantgrouped.h"
+
+static auto _ = []() {
+    auto& lut = get_kernel_lut();
+
+    lut[hash_multiple_strings({"bf8",
+                               "abquant",
+                               "non-preshuffleb",
+                               "non-preshufflequant",
+                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigABQuantPrefill<ck_tile::bf8_t, false>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    lut[hash_multiple_strings({"bf8",
+                               "abquant",
+                               "non-preshuffleb",
+                               "non-preshufflequant",
+                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfig<ck_tile::bf8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    return 0;
+}();
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_fp4.cpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_fp4.cpp
@@ -0,0 +1,41 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#include "run_gemm_quant_example.inc"
+#include "gemm_abquant_quantgrouped.h"
+
+static auto _ = []() {
+    auto& lut = get_kernel_lut();
+    lut[hash_multiple_strings(
+        {"fp4", "abquant", "non-preshuffleb", "non-preshufflequant", "1x128x128"})] =
+        [](const ck_tile::ArgParser& arg_parser) {
+            using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+            using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+            using TypeConfig      = decltype(GemmQuantTypeConfig<ck_tile::pk_fp4_t,
+                                                                 ck_tile::pk_fp4_t,
+                                                                 ck_tile::half_t,
+                                                                 float>{});
+            return run_gemm_example_prec_type<GemmConfigABQuantPrefill<ck_tile::pk_fp4_raw_t>,
+                                              TypeConfig,
+                                              AQuantGroupSize,
+                                              BQuantGroupSize,
+                                              ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+        };
+    lut[hash_multiple_strings(
+        {"fp4", "abquant", "preshuffleb", "non-preshufflequant", "1x128x128"})] =
+        [](const ck_tile::ArgParser& arg_parser) {
+            using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+            using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+            using TypeConfig      = decltype(GemmQuantTypeConfig<ck_tile::pk_fp4_t,
+                                                                 ck_tile::pk_fp4_t,
+                                                                 ck_tile::half_t,
+                                                                 float>{});
+            return run_gemm_example_prec_type<
+                GemmConfigPreshuffleB_ABQuant_Prefill<ck_tile::pk_fp4_raw_t>,
+                TypeConfig,
+                AQuantGroupSize,
+                BQuantGroupSize,
+                ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+        };
+    return 0;
+}();
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_fp8.cpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_fp8.cpp
@@ -0,0 +1,70 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#include "run_gemm_quant_example.inc"
+#include "gemm_abquant_quantgrouped.h"
+
+static auto _ = []() {
+    auto& lut                               = get_kernel_lut();
+    lut[hash_multiple_strings({"fp8",
+                               "abquant",
+                               "non-preshuffleb",
+                               "non-preshufflequant",
+                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigABQuantPrefill<ck_tile::fp8_t, false>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    lut[hash_multiple_strings({"fp8",
+                               "abquant",
+                               "non-preshuffleb",
+                               "non-preshufflequant",
+                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfig<ck_tile::fp8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    lut[hash_multiple_strings({"fp8",
+                               "abquant",
+                               "non-preshuffleb",
+                               "preshufflequant",
+                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigPreshuffleBQuantPrefill<ck_tile::fp8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    lut[hash_multiple_strings({"fp8",
+                               "abquant",
+                               "non-preshuffleb",
+                               "preshufflequant",
+                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigPreshuffleBQuantPrefill<ck_tile::fp8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    return 0;
+}();
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_preshuffleb_bf8.cpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_preshuffleb_bf8.cpp
@@ -0,0 +1,41 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#include "run_gemm_quant_example.inc"
+#include "gemm_abquant_quantgrouped.h"
+
+static auto _ = []() {
+    auto& lut = get_kernel_lut();
+
+    lut[hash_multiple_strings({"bf8",
+                               "abquant",
+                               "preshuffleb",
+                               "non-preshufflequant",
+                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    lut[hash_multiple_strings({"bf8",
+                               "abquant",
+                               "preshuffleb",
+                               "non-preshufflequant",
+                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    return 0;
+}();
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_preshuffleb_fp8.cpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_preshuffleb_fp8.cpp
@@ -0,0 +1,40 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#include "run_gemm_quant_example.inc"
+#include "gemm_abquant_quantgrouped.h"
+
+static auto _ = []() {
+    auto& lut                               = get_kernel_lut();
+    lut[hash_multiple_strings({"fp8",
+                               "abquant",
+                               "preshuffleb",
+                               "non-preshufflequant",
+                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::fp8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    lut[hash_multiple_strings({"fp8",
+                               "abquant",
+                               "preshuffleb",
+                               "non-preshufflequant",
+                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
+        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+        using TypeConfig =
+            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
+        return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::fp8_t>,
+                                          TypeConfig,
+                                          AQuantGroupSize,
+                                          BQuantGroupSize,
+                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+    };
+    return 0;
+}();
--- a/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_preshuffleb_preshufflequant.cpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_abquant_quantgrouped_preshuffleb_preshufflequant.cpp
@@ -4,41 +4,41 @@
 #include "38_block_scale_gemm/gemm_utils.hpp"
 #include "run_gemm_quant_example.inc"

-template <typename T>
+template <typename T, bool TransposeC>
 using GemmConfigPreshuffleB_PreshuffleBQuant =
-    GemmConfigPreshuffleB_ABQuant_PreshuffleBQuant_Prefill<T>;
+    GemmConfigPreshuffleB_ABQuant_PreshuffleBQuant_Prefill<T, TransposeC>;

 static auto _ = []() {
-    auto& lut                               = get_kernel_lut();
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "preshuffleb",
-                               "preshufflequant",
-                               "1x1x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPreshuffleB_PreshuffleBQuant<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
-    lut[hash_multiple_strings({"fp8",
-                               "abquant",
-                               "preshuffleb",
-                               "preshufflequant",
-                               "1x128x128"})] = [](const ck_tile::ArgParser& arg_parser) {
-        using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
-        using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
-        using TypeConfig =
-            decltype(GemmQuantTypeConfig<ck_tile::fp8_t, ck_tile::fp8_t, ck_tile::half_t, float>{});
-        return run_gemm_example_prec_type<GemmConfigPreshuffleB_PreshuffleBQuant<ck_tile::fp8_t>,
-                                          TypeConfig,
-                                          AQuantGroupSize,
-                                          BQuantGroupSize,
-                                          ck_tile::QuantType::ABQuantGrouped>(arg_parser);
-    };
+    auto& lut = get_kernel_lut();
+    lut[hash_multiple_strings({"fp8", "abquant", "preshuffleb", "preshufflequant", "1x1x128"})] =
+        [](const ck_tile::ArgParser& arg_parser) {
+            using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+            using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+            using TypeConfig      = decltype(GemmQuantTypeConfig<ck_tile::fp8_t,
+                                                                 ck_tile::fp8_t,
+                                                                 ck_tile::half_t,
+                                                                 float>{});
+            return run_gemm_example_prec_type<
+                GemmConfigPreshuffleB_PreshuffleBQuant<ck_tile::fp8_t, false>,
+                TypeConfig,
+                AQuantGroupSize,
+                BQuantGroupSize,
+                ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+        };
+    lut[hash_multiple_strings({"fp8", "abquant", "preshuffleb", "preshufflequant", "1x128x128"})] =
+        [](const ck_tile::ArgParser& arg_parser) {
+            using AQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 1, 128>>;
+            using BQuantGroupSize = ck_tile::QuantGroupShape<ck_tile::sequence<1, 128, 128>>;
+            using TypeConfig      = decltype(GemmQuantTypeConfig<ck_tile::fp8_t,
+                                                                 ck_tile::fp8_t,
+                                                                 ck_tile::half_t,
+                                                                 float>{});
+            return run_gemm_example_prec_type<
+                GemmConfigPreshuffleB_PreshuffleBQuant<ck_tile::fp8_t, true>,
+                TypeConfig,
+                AQuantGroupSize,
+                BQuantGroupSize,
+                ck_tile::QuantType::ABQuantGrouped>(arg_parser);
+        };
    return 0;
 }();
--- a/example/ck_tile/38_block_scale_gemm/gemm_utils.hpp
+++ b/example/ck_tile/38_block_scale_gemm/gemm_utils.hpp
@@ -223,7 +223,7 @@ struct GemmConfigPreshuffleB_PreshuffleBQuant_Prefill
    static constexpr bool BPreshuffleQuant = true;
 };

-template <typename PrecType>
+template <typename PrecType, bool TransposeC_ = true>
 struct GemmConfigPreshuffleB_ABQuant_Prefill : public GemmConfigPreshuffleB_BQuant_Prefill<PrecType>
 {
    static constexpr ck_tile::index_t M_Warp = 2;
@@ -231,17 +231,17 @@ struct GemmConfigPreshuffleB_ABQuant_Prefill : public GemmConfigPreshuffleB_BQua
    static constexpr ck_tile::index_t K_Warp = 1;

    static constexpr bool kPadK      = false;
-    static constexpr bool TransposeC = true;
+    static constexpr bool TransposeC = TransposeC_;
 };

-template <typename PrecType>
+template <typename PrecType, bool TransposeC_ = true>
 struct GemmConfigPreshuffleB_ABQuant_PreshuffleBQuant_Prefill
-    : public GemmConfigPreshuffleB_ABQuant_Prefill<PrecType>
+    : public GemmConfigPreshuffleB_ABQuant_Prefill<PrecType, TransposeC_>
 {
    static constexpr bool BPreshuffleQuant = true;
 };

-template <typename PrecType>
+template <typename PrecType, bool TransposeC_ = true>
 struct GemmConfigPreshuffleB_ABQuant_Decode : public GemmConfigPreshuffleB_BQuant_Prefill<PrecType>
 {
    static constexpr ck_tile::index_t M_Tile = 16;
@@ -249,7 +249,7 @@ struct GemmConfigPreshuffleB_ABQuant_Decode : public GemmConfigPreshuffleB_BQuan
    static constexpr ck_tile::index_t K_Tile = 256 / sizeof(PrecType);

    static constexpr bool kPadK      = false;
-    static constexpr bool TransposeC = true;
+    static constexpr bool TransposeC = TransposeC_;
 };

 template <typename PrecType>
@@ -271,11 +271,11 @@ struct GemmConfigQuantPrefill : public GemmConfigBase
    // static constexpr auto Scheduler = ck_tile::GemmPipelineScheduler::Interwave;
 };

-template <typename PrecType>
+template <typename PrecType, bool TransposeC_ = true>
 struct GemmConfigABQuantPrefill : public GemmConfigQuantPrefill<PrecType>
 {
    static constexpr bool kPadK      = false;
-    static constexpr bool TransposeC = true;
+    static constexpr bool TransposeC = TransposeC_;
 };

 // Used for A=16bit and B=8bit. The warp tile has KPack=16
@@ -296,8 +296,8 @@ struct GemmConfigMixedPrecision : public GemmConfigBase
    static constexpr ck_tile::index_t K_Warp_Tile = 64;
 };

-template <typename PrecType>
-struct GemmConfigEightWarps : public GemmConfigABQuantPrefill<PrecType>
+template <typename PrecType, bool TransposeC_ = true>
+struct GemmConfigEightWarps : public GemmConfigABQuantPrefill<PrecType, TransposeC_>
 {
    static constexpr ck_tile::index_t M_Warp = 4;
    static constexpr ck_tile::index_t N_Warp = 2; // NWarps == 2 for ping-pong!
@@ -308,12 +308,11 @@ struct GemmConfigEightWarps : public GemmConfigABQuantPrefill<PrecType>
    static constexpr ck_tile::index_t K_Tile = 128 / sizeof(PrecType) * K_Warp;

    static constexpr bool kPadK      = false;
-    static constexpr bool TransposeC = true;
    static constexpr int kBlockPerCu = 1;
 };

-template <typename PrecType>
-struct GemmConfigPreshuffleBEightWarps : public GemmConfigEightWarps<PrecType>
+template <typename PrecType, bool TransposeC_ = true>
+struct GemmConfigPreshuffleBEightWarps : public GemmConfigEightWarps<PrecType, TransposeC_>
 {
    static constexpr bool PreshuffleB      = true;
    static constexpr bool DoubleSmemBuffer = true;
--- a/example/ck_tile/38_block_scale_gemm/run_gemm_quant_example.inc
+++ b/example/ck_tile/38_block_scale_gemm/run_gemm_quant_example.inc
@@ -235,8 +235,10 @@ float gemm_calc_quant(const ck_tile::QuantGemmHostArgs& args, const ck_tile::str
        const dim3 grids  = Kernel::GridSize(args.M, args.N, args.k_batch);
        const dim3 blocks = Kernel::BlockSize();

-        // Split-K validation is handled by Kernel::IsSupportedArgument
-        // Split-K is only supported for BQuantGrouped without preshuffle
+        // Split-K validation is handled by Kernel::IsSupportedArgument.
+        // Split-K is supported for:
+        //   - BQuantGrouped without preshuffle
+        //   - ABQuantGrouped without APreshuffleQuant
        if(!Kernel::IsSupportedArgument(kargs))
        {
            throw std::runtime_error("Wrong! Arguments not supported! Skipping gemm!\n");