ck_tile kernel for gemm with groupwise quantized A tensor (#2473)

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-04-20 06:49:15 +00:00

* ck_tile kernel for gemm with groupwise quantized A or B tensor.

This change introduces new pipelines with the Intrawave scheduler and block gemm primitives that load the scale tensor into registers and perform dequantization after MFMA on the C tensor in registers.

Scale tensor data (AQ/BQ) is split across threads in registers and is not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32

Group size can be as small as the K length of the underlying WarpGemm primitive.
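
The scheme described above (accumulate each K group without scaling, then apply that group's scale to the accumulator) can be sketched with a host-side reference. This is a minimal sketch, not the ck_tile kernel: the function name `gemm_aquant_reference`, the row-major layouts, the `M x (K / group)` shape of the scale tensor, and the use of `float` as a stand-in for fp8/bf8/i4 values are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for a GEMM with a groupwise quantized A tensor.
// Assumptions (not taken from the commit):
//   a_q      : M x K, row-major, quantized values stored as float for simplicity
//   aq_scale : M x (K / group), row-major, one scale per row and per K group
//   b        : K x N, row-major
// The scale is applied to the partial sum of each group, i.e. after the
// multiply-accumulate, mirroring "dequantization post MFMA on C".
std::vector<float> gemm_aquant_reference(const std::vector<float>& a_q,
                                         const std::vector<float>& aq_scale,
                                         const std::vector<float>& b,
                                         std::size_t M,
                                         std::size_t N,
                                         std::size_t K,
                                         std::size_t group)
{
    assert(group > 0 && K % group == 0);
    const std::size_t groups = K / group;
    std::vector<float> c(M * N, 0.0f);

    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t g = 0; g < groups; ++g)
            {
                // Unscaled partial product over one quantization group.
                float partial = 0.0f;
                for(std::size_t kk = 0; kk < group; ++kk)
                {
                    const std::size_t k = g * group + kk;
                    partial += a_q[m * K + k] * b[k * N + n];
                }
                // Dequantize the partial accumulator with the group's scale.
                acc += aq_scale[m * groups + g] * partial;
            }
            c[m * N + n] = acc;
        }
    }
    return c;
}
```

With `M = N = 1`, `K = 4`, `group = 2`, `a_q = {1, 2, 3, 4}`, `aq_scale = {0.5, 2}` and `b = {1, 1, 1, 1}`, the two partial sums are 3 and 7, so the result is `0.5 * 3 + 2 * 7 = 15.5`.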

For Gemm problems with a quantized B tensor, this change also introduces preliminary support for a flatmm pipeline that loads the B tensor directly into registers.

* [Block Scale Gemm] Only run gemm quant examples on __gfx94__

- Only run gemm quant examples on __gfx94__ because they use
`v_cvt_pk_fp8_f32`
- Format the code

* [Block Scale Gemm] Remove Bquant Gemm BlockScale

This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.

* [Block Scale Gemm] Format code with clang-format-12

The latest clang-format (v19) in ROCm 7.0 generates different results than
clang-format-12, which is used in CK CI.

Format code with clang-format-12 for consistency.

* [Block Scale Gemm] Split the k direction loop

- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
to make the logic clearer.
- Disable C transposition.

* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm

* [Block Scale Gemm] Update copyright

* test

* Add TailHandler

* Move TileDistributionEncodingPatternAQ

* Refactor

* refactor

* fix bug

* Address the PR comment

* Format the code

* [Block Scale Gemm] Add unit tests

* [Block Scale Gemm] Add support to 16x16x32 MFMA

- Add support to 16x16x32 MFMA
- Fix a bug when exchanging data across lanes

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

Cong Ma, 2025-07-23 01:10:16 -06:00, committed by GitHub

parent 67b2821623
commit e62710e461

35 changed files with 4864 additions and 13 deletions

									
include/ck_tile/ops/flatmm/pipeline/tile_flatmm_shape.hpp (3 lines changed)

@@ -29,6 +29,9 @@ struct TileFlatmmShape
    static constexpr index_t flatKPerWarp  = WarpTile::at(idxK) * WarpTile::at(idxN);
    static constexpr index_t flatKPerBlock = flatKPerWarp * kK / WarpTile::at(idxK);
    static constexpr bool PermuteA = false;
    static constexpr bool PermuteB = false;
    CK_TILE_HOST static std::string GetName()
    {
        // clang-format off
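
To make the two `flatK` expressions in this hunk concrete, here is a small compile-time sketch. The tile sizes (a warp tile with N = 16 and K = 32, and a block K of 128) are example values chosen for illustration, not values taken from the commit, and `index_t` is a local stand-in for the ck_tile type.

```cpp
#include <cstdint>

// Local stand-in for ck_tile's index_t (assumption for this sketch).
using index_t = std::int32_t;

// Hypothetical example tile sizes; the real values come from the
// BlockTile / WarpTile template arguments of TileFlatmmShape.
constexpr index_t warpTileN = 16;  // plays the role of WarpTile::at(idxN)
constexpr index_t warpTileK = 32;  // plays the role of WarpTile::at(idxK)
constexpr index_t kK        = 128; // K extent of the block tile

// Same arithmetic as in the hunk above.
constexpr index_t flatKPerWarp  = warpTileK * warpTileN;
constexpr index_t flatKPerBlock = flatKPerWarp * kK / warpTileK;

static_assert(flatKPerWarp == 512, "32 * 16");
static_assert(flatKPerBlock == 2048, "512 * 128 / 32");
// The warp K extent cancels, so flatKPerBlock equals kK * warpTileN.
static_assert(flatKPerBlock == kK * warpTileN, "128 * 16");
```

The last assertion shows that, whenever `kK` is a multiple of the warp tile's K, the expression reduces to the block K extent multiplied by the warp tile's N extent.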
