mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-02 04:31:25 +00:00

Files

Brock Hargreaves 2a16d53cce [rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

[CK] Address a bunch of errors associated with targeting
 gfx1200 on Windows (#5045)

## Motivation

Still addressing errors that are blocking the merge of TheRock PR:
https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382

## Technical Details

1. Several fmha Python scripts write native Windows paths (with
backslashes) that CMake then fails to parse. I addressed one of these in
an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812, and this
PR addresses more that the gfx1200 target exposes:

```
[composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library):
[composable_kernel configure]   Syntax error in cmake code when parsing string
[composable_kernel configure]
[composable_kernel configure]     B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp
[composable_kernel configure]
[composable_kernel configure]   Invalid character escape '\b'.
```
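
One way to avoid this class of error is to normalize paths to forward slashes before a generator script writes them into anything CMake parses. The sketch below is illustrative only and is not the patch from this PR; `to_cmake_path` and `format_source_list` are hypothetical helper names, and the short paths are made-up stand-ins for the one in the log.

```python
from pathlib import PureWindowsPath


def to_cmake_path(native_path: str) -> str:
    """Convert a Windows-style path to forward slashes.

    CMake treats a backslash inside a string as the start of an escape
    sequence, so "B:\\build" is read as "B:" followed by the escape "\\b".
    """
    # PureWindowsPath parses backslash separators on any host OS.
    return PureWindowsPath(native_path).as_posix()


def format_source_list(native_paths) -> str:
    """Join generated source paths into a semicolon-separated CMake list."""
    return ";".join(to_cmake_path(p) for p in native_paths)


if __name__ == "__main__":
    # Hypothetical generated file names, shortened for readability.
    generated = [
        r"B:\build\example\fmha_fwd_a.cpp",
        r"B:\build\example\fmha_fwd_b.cpp",
    ]
    print(format_source_list(generated))
```

Forward slashes are accepted by CMake on Windows, so the same output works on every platform.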

2. In the following compiler error, gemm_prec_str<ADataType, BDataType>
is passed to concat(...) as a function instead of being called with
operator(), i.e., gemm_prec_str<ADataType, BDataType>(). There are
multiple instances of this; I wonder what non-MSVC compilers do here:

```
[composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast]
[composable_kernel]   119 |     ((oss << sep << rest), ...);
[composable_kernel]       |                     ^~~~
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here
[composable_kernel]   248 |         return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName());
[composable_kernel]       |                ^
```

There are plenty of other places where we use gemm_prec_str with the
operator(), so I'm pretty sure these were just typos...but I'd like some
eyes on it.

3. There are 2 tests that fail to build on Windows, which I've excluded
from the build but will open bug tickets for:
    1.  gemm_weight_preshuffle
    2.  grouped_gemm_preshuffle

Here's a sample of the compiler error for these tests:

```
[composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj
[composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe  -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt   -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments 
-Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16:
[composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89:
[composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations]
[composable_kernel]   110 |         const char* vp = std::getenv(name);
[composable_kernel]       |                               ^
[composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here
[composable_kernel]  1183 |     _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s)
[composable_kernel]       |                    ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE'
[composable_kernel]   368 |         #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT(    \
[composable_kernel]       |                                                       ^
[composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT'
[composable_kernel]   358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text))
[composable_kernel]       |                                               ^
[composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation)
[composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918)
[composable_kernel] Target: x86_64-pc-windows-msvc
[composable_kernel] Thread model: posix
[composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin
[composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s).
[composable_kernel] ninja: build stopped: subcommand failed.
[composable_kernel FAILED WITH CODE 1 in 238 seconds]
ninja: build stopped: subcommand failed.
```

## Test Plan

Wait for internal CI and make sure build compiles locally.

## Test Result

Waiting on CI

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

2026-03-03 21:55:14 +00:00

CMakeLists.txt

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_abquant_quantgrouped_bf8.cpp

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_abquant_quantgrouped_fp4.cpp

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_abquant_quantgrouped_fp8.cpp

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_abquant_quantgrouped_preshuffleb_bf8.cpp

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_abquant_quantgrouped_preshuffleb_fp8.cpp

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_abquant_quantgrouped_preshuffleb_preshufflequant.cpp

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_abquant_quantgrouped.h

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

gemm_aquant_quantgrouped_preshufflequant.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_aquant_quantgrouped.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_bf8.cpp

feat: add split_k support for block scale gemm bquant mode. (#3653 )

2026-02-02 14:41:53 -08:00

gemm_bquant_quantgrouped_bf8i4.cpp

feat: add split_k support for block scale gemm bquant mode. (#3653 )

2026-02-02 14:41:53 -08:00

gemm_bquant_quantgrouped_fp8.cpp

feat: add split_k support for block scale gemm bquant mode. (#3653 )

2026-02-02 14:41:53 -08:00

gemm_bquant_quantgrouped_fp8i4.cpp

feat: add split_k support for block scale gemm bquant mode. (#3653 )

2026-02-02 14:41:53 -08:00

gemm_bquant_quantgrouped_mx_bf16bf8.cpp

[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)

2026-02-24 17:57:02 +00:00

gemm_bquant_quantgrouped_mx_bf16bf16.cpp

[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)

2026-02-24 17:57:02 +00:00

gemm_bquant_quantgrouped_mx_bf16fp4.cpp

[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)

2026-02-24 17:57:02 +00:00

gemm_bquant_quantgrouped_preshuffleb_bf8.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshuffleb_bf8i4.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshuffleb_fp8.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshuffleb_fp8i4.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshuffleb_preshufflequant_bf8.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshuffleb_preshufflequant_bf8i4.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshuffleb_preshufflequant_fp8.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshuffleb_preshufflequant_fp8i4.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshufflequant_bf8.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshufflequant_bf8i4.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshufflequant_fp8.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_bquant_quantgrouped_preshufflequant_fp8i4.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_quant_rowcol.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_quant_tensor.cpp

[CK_TILE] ABQuant New Preshuffle (#3638 )

2026-01-27 23:46:49 -08:00

gemm_quant.cpp

[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)

2026-02-24 17:57:02 +00:00

gemm_utils.hpp

[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)

2026-02-26 23:57:17 +00:00

README.md

[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)

2026-02-24 17:57:02 +00:00

run_gemm_quant_example.inc

[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)

2026-03-03 21:55:14 +00:00

README.md

Quant GEMM Matrix Multiplication

This folder contains examples of quant GEMMs using the ck_tile tile-programming implementation.

- AQuant kernel with blocks of A matrix sharing scales: custom GEMM pipeline
- BQuant kernel with blocks of B matrix sharing scales: custom GEMM pipeline
- Row and column-wise scaled: all elements in a row of the A matrix and all elements in a column of the B matrix share the same quantization scale, and the element-wise scaling is completed in the epilogue.
- Tensor-wise scaled: the same scalar scale is shared across the whole tensor of A or B

Quantization Mode Comparison

| Quant Mode | A Matrix Organization | A Scale Shape | B Matrix Organization | B Scale Shape |
|---|---|---|---|---|
| AQuant | Blocks along K dimension. Each M×GroupSize block shares one scale | `[M, K/GroupSize]` | Not quantized | N/A |
| BQuant | Not quantized | N/A | Blocks along K dimension. Each GroupSize×N block shares one scale | `[K/GroupSize, N]` |
| RowColQuant | Per-row quantization. All K elements in each row share one scale | `[M, 1]` | Per-column quantization. All K elements in each column share one scale | `[1, N]` |
| TensorQuant | Tensor-wise quantization. All M×K elements share one scale | `[1]` | Tensor-wise quantization. All K×N elements share one scale | `[1]` |
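
To make the scale shapes concrete, here is a minimal Python sketch that computes them from the problem size. It is not part of the example code; `scale_shapes` is a hypothetical helper, the mode names are borrowed from the `-quant_mode` argument, and the sketch assumes K is an exact multiple of the group size.

```python
def scale_shapes(quant_mode, m, n, k, group_size=128):
    """Return (a_scale_shape, b_scale_shape); None means 'not quantized'."""
    if quant_mode in ("aquant", "bquant") and k % group_size != 0:
        # Assumption for this sketch: K is a whole number of groups.
        raise ValueError("k must be divisible by group_size")
    groups = k // group_size
    shapes = {
        "aquant": ((m, groups), None),   # A: [M, K/GroupSize]
        "bquant": (None, (groups, n)),   # B: [K/GroupSize, N]
        "rowcol": ((m, 1), (1, n)),      # A: [M, 1], B: [1, N]
        "tensor": ((1,), (1,)),          # one scalar scale per tensor
    }
    return shapes[quant_mode]


if __name__ == "__main__":
    # Default problem size of the example: m=3840, n=4096, k=2048.
    for mode in ("aquant", "bquant", "rowcol", "tensor"):
        print(mode, scale_shapes(mode, 3840, 4096, 2048))
```

With the default sizes and a group size of 128, K is split into 16 groups, so BQuant needs a 16×4096 scale tensor.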

Features

- Preshuffled GEMM: Shuffle the B (weight) matrix into the warp layout and bypass shared memory when doing the GEMM calculation. This is the best-performing GEMM option.
- TransposeC: Transpose the C matrix output layout for the best coalesced scale reading.
- Preshuffled Quant: Preshuffle the input matrix to load multiple Quant warp blocks along the selected dimension.
- Precision: Supports fp16, bf16, fp8, bf8, int4 (for the B matrix), and uint8 (for the B matrix; split into two fp4 values in the pipeline).
- Validation: CPU/GPU validation and error tolerance options.
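
As a rough picture of what a CPU validation pass checks in BQuant mode, the sketch below computes a reference result in plain Python and compares it within a tolerance. It is not the ck_tile reference implementation: `bquant_gemm_reference` and `all_close` are hypothetical names, the tolerances are arbitrary, and it assumes one scale per K-group and column, matching a B scale shape of `[K/GroupSize, N]`.

```python
def bquant_gemm_reference(a, b_q, b_scale, group_size):
    """C[m][n] = sum_k A[m][k] * B_q[k][n] * b_scale[k // group_size][n]."""
    m_dim, k_dim, n_dim = len(a), len(b_q), len(b_q[0])
    c = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for k in range(k_dim):
                # Dequantize B on the fly with the scale of its K-group.
                acc += a[m][k] * b_q[k][n] * b_scale[k // group_size][n]
            c[m][n] = acc
    return c


def all_close(actual, expected, rtol=1e-3, atol=1e-3):
    """Element-wise check: |x - y| <= atol + rtol * |y|."""
    return all(
        abs(x - y) <= atol + rtol * abs(y)
        for row_a, row_e in zip(actual, expected)
        for x, y in zip(row_a, row_e)
    )


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0, 4.0]]            # 1 x 4
    b_q = [[1.0], [1.0], [1.0], [1.0]]    # 4 x 1, "quantized" values
    b_scale = [[0.5], [2.0]]              # 2 K-groups x 1 column
    c = bquant_gemm_reference(a, b_q, b_scale, group_size=2)
    print(c, all_close(c, [[15.5]]))
```

Here the first two K elements use scale 0.5 and the last two use scale 2.0, giving 0.5 + 1.0 + 6.0 + 8.0 = 15.5.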

build

```
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx942) or leave it blank
../script/cmake-ck-dev.sh  ../ <arch>
# Compile the quant kernels
make tile_example_gemm_quant -j
```

This will result in an executable build/bin/tile_example_gemm_quant

example

```
args:
               -h    Print help message (default:false)
               -m    m dimension (default:3840)
               -n    n dimension (default:4096)
               -k    k dimension (default:2048)
        -a_layout    A tensor data layout - R for Row or C for Column (default:R)
        -b_layout    B tensor data layout - R for Row or C for Column (default:C)
       -bq_layout    Bq tensor data layout - R for Row or C for Column (default:C)
        -c_layout    C tensor data layout - R for Row or C for Column (default:R)
        -stride_a    Tensor A stride (default:0)
        -stride_q    Tensor AQ stride (default:0)
        -stride_b    Tensor B stride (default:0)
        -stride_c    Tensor C stride (default:0)
               -v    0: No validation, 1: Validation on CPU, 2: Validation on GPU (default:1)
            -prec    Data type. For AQuant: fp8, bf8, i4fp8, or i4bf8;  for Bquant: fp8, bf8, fp8i4, bf8i4, mxbf16bf16, mxbf16bf8 or mxbf16fp4 (default for both AQuant and Bquant: fp8)
          -warmup    Number of iterations before benchmarking the kernel (default:50)
          -repeat    Number of iterations to benchmark the kernel (default:1000)
           -timer    gpu:gpu timer, cpu:cpu timer (default:gpu)
         -split_k    SplitK value (default:1)
          -device    Device id that will be used to run the kernel (default:0)
            -init    0:random, 1:linear, 2:constant(1) (default:0)
     -flush_cache    Flush cache before running the kernel (default:true)
  -rotating_count    Rotating count (default:1000)
      -quant_mode    Choose aquant, bquant, tensor or rowcol (default:bquant)
     -preshuffleb    Enable preshuffle of tensor B (default:false)
 -preshufflequant    Enable preshuffle of quant tensor (default:false)
      -group_size    Quantization group size as MxNxK, e.g., 1x1x128, 1x32x128, 1x64x128 (default:1x1x128)
```
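
The `-group_size` value packs three integers into one `MxNxK` string. The Python sketch below shows one way such a string could be split and checked; it is not the parser the example binary uses, and `parse_group_size` is a hypothetical name.

```python
def parse_group_size(text: str):
    """Split an 'MxNxK' string such as '1x32x128' into three positive ints."""
    parts = text.lower().split("x")
    if len(parts) != 3:
        raise ValueError(f"expected MxNxK, got {text!r}")
    # int() raises ValueError on empty or non-numeric fields.
    m, n, k = (int(p) for p in parts)
    if min(m, n, k) < 1:
        raise ValueError(f"group sizes must be positive, got {text!r}")
    return m, n, k


if __name__ == "__main__":
    for value in ("1x1x128", "1x32x128", "1x64x128"):
        print(value, "->", parse_group_size(value))
```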

Users need to select the correct mapping of config for each quant mode:

| Selection | quant_mode as runtime argument | Corresponding cpp file | GemmConfig at the top of cpp file |
|---|---|---|---|
| AQuant | aquant | gemm_aquant_quantgrouped.cpp | GemmConfigQuantDecode |
| AQuant with Preshuffle quant | aquant | gemm_aquant_quantgrouped_preshufflequant.cpp | GemmConfigPreshuffleQuantDecode |
| BQuant | bquant | gemm_bquant_quantgrouped_<prec_type>.cpp | GemmConfigQuantDecode (or) GemmConfigQuantPrefill |
| BQuant with Preshuffle quant | bquant | gemm_bquant_quantgrouped_preshufflequant.cpp | GemmConfigPreshuffleQuantDecode (or) GemmConfigPreshuffleBQuantPrefill |
| PreShuffle B with BQuant | bquant | gemm_bquant_quantgrouped_preshuffleb.cpp | GemmConfigPreshuffleB_BQuant_Decode (or) GemmConfigPreshuffleB_BQuant_Prefill |
| PreShuffle B with preshuffle BQuant | bquant | gemm_bquant_quantgrouped_preshuffleb_preshufflequant.cpp | GemmConfigPreshuffleB_PreshuffleBQuant_Decode (or) GemmConfigPreshuffleB_PreshuffleBQuant_Prefill |
| RowCol quant | rowcol | gemm_quant_rowcol.cpp | GemmConfigRowColQuant |
