Commit Graph

61 Commits

Author SHA1 Message Date
Erwin Terpstra
b41bfece83 [rocm-libraries] ROCm/rocm-libraries#4268 (commit d2fca53)
[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant pipeline (#4268)

## Proposed changes

Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline.

## Checklist

- [X] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [X] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [X] I have added inline documentation which enables the maintainers
with understanding the motivation
- [X] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [X] I have run `clang-format` on all changed files
- [X] Any dependent changes have been merged
2026-02-10 13:59:03 +00:00
Yi DING
d5acfd8d52 [rocm-libraries] ROCm/rocm-libraries#4451 (commit 091bf0f)
[CK_TILE] Blockscale Gemm Fix Multi-Arch Compilation

## Motivation
This PR updates CK_TILE blockscale GEMM-quant kernels and launch helpers
to compile across multiple GPU architectures by introducing compile-time
availability gating and a new attribute tag mechanism for kernel
symbol/attribute specialization.

## Technical Details
- Add an architecture-guarded `kIsAvailable` flag to the gfx950 pipeline
and propagate availability handling into `QuantGemmKernel`.
- Extend `make_kernel`/`kentry` to accept an `Attr` tag enabling
per-kernel compile-time attributes (e.g., `no-packed-fp32-ops`) and
unique symbols.
- Update the blockscale GEMM quant example to pass kernel attributes and
adjust gfx950 gating.
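
A minimal sketch of the compile-time availability gating described above, assuming a C++17 compiler. The names `Gfx950PipelineSketch`, `GenericPipelineSketch`, and `QuantKernelSketch` are hypothetical stand-ins and not the actual CK_TILE classes; only the `kIsAvailable` flag name and the `__gfx950__` macro come from this log.

```cpp
#include <cassert>

// Hypothetical pipeline that only exists on one architecture. The flag is
// true only when the compiler is targeting gfx950.
struct Gfx950PipelineSketch
{
#if defined(__gfx950__)
    static constexpr bool kIsAvailable = true;
#else
    static constexpr bool kIsAvailable = false;
#endif
};

// Hypothetical pipeline with no architecture restriction.
struct GenericPipelineSketch
{
    static constexpr bool kIsAvailable = true;
};

// Hypothetical kernel wrapper that propagates the pipeline's availability.
template <typename Pipeline>
struct QuantKernelSketch
{
    static constexpr bool kIsAvailable = Pipeline::kIsAvailable;

    // Returns 2 * x when the pipeline is available and -1 otherwise.
    // The discarded `if constexpr` branch is never instantiated, so
    // architecture-specific code in it cannot break other targets.
    int run(int x) const
    {
        if constexpr(kIsAvailable)
            return 2 * x;
        else
            return -1;
    }
};
```

With this shape, a multi-architecture build can instantiate the same kernel template for every target and let the flag decide which body is compiled.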

## Test Plan
- CI
- Local test: `cmake .. --preset dev -DGPU_TARGETS='gfx942;gfx950' -GNinja && ninja tile_example_gemm_quant`
- Local test with ROCm/aiter#1954

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-10 12:42:19 +00:00
kensclin
5b3e527c88 [rocm-libraries] ROCm/rocm-libraries#4280 (commit b7de1e1)
[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)

## Proposed changes

Add blockscale GEMM support for the EightWarps configuration on gfx950.

## Checklist

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [x] Any dependent changes have been merged

2026-02-09 03:55:52 +00:00
Aviral Goel
3e77721755 feat: add split_k support for block scale gemm bquant mode. (#3653)
* WIP: add splitk to bquant

* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types

* chore: remove temporary test script

* fix: incorrect tile window length for split bq tensor window

* chore: improve comments

* test: add unit tests to cover bquant splitk functionality

* fix: conflict resolution by renaming variables
2026-02-02 14:41:53 -08:00
Erwin Terpstra
6a6177a246 [CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603)
* chore: split block scale example instances in more separate files to speed up compile times

* wip: fp4 scaffolding for abquant

* feat: add fp4 decoding-while-loading to abquant pipeline

* feat: add support for fp4 CPU verification in abquant

* chore: add time tracking to reference calculation

* feat: add a4w4 test for blockscale gemm

* feat: optimize reference calculation by preconverting values to AccType

* feat: add fp4 to fp8 look-up table

* fix: reference to wrong ComputeDataType field in QuantProblem

* feat: type utilities for determining MFMA compute types

* feat: packed fp4 for abquant weight preshuffle

* feat: add separate tests for a4w4 base case, padding and preshuffleB

* fix: fp4 conversion on gfx950 attempting to use non-supported method

* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size

* chore: add fp4 preshuffleb mode to block scale example

* chore: sanity check for packed types being 1 byte

* chore: clarify tensor dimension indices with constants

* chore: replace traits check with specialized check for packed types

* style: some minor refactoring and cleanup

* fix: correct conversion table for FNUZ fp8

* chore: add fp4 instances to main abquant instances again

* chore: use same initialization branch for int4 and fp4

* chore: add missing initialization for fp4 in block scale gemm example
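
As an illustration of the look-up-table decoding mentioned in the bullets above, here is a minimal sketch. It assumes the fp4 type uses the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and that the low nibble of a packed byte holds the first element. Both are assumptions, not facts confirmed by this log. The sketch decodes to `float` rather than to fp8, and the names `kFp4ToFloatSketch` and `decode_fp4_sketch` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Value of every 4-bit E2M1 code. Bit 3 is the sign; codes 0..7 are the
// non-negative magnitudes.
constexpr std::array<float, 16> kFp4ToFloatSketch = {
    0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f};

// Decode element `idx` (0 or 1) from a byte that packs two fp4 values.
// Assumption: element 0 lives in the low nibble.
constexpr float decode_fp4_sketch(std::uint8_t packed, int idx)
{
    const std::uint8_t nibble = idx == 0 ? (packed & 0x0F) : (packed >> 4);
    return kFp4ToFloatSketch[nibble];
}
```

A table like this trades a few bytes of constant storage for a branch-free decode, which is why a look-up table suits decoding while loading.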

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-30 04:40:50 -07:00
Khushbu Agarwal
9b168082b7 [CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629)
* initial commit

* preshuffleQuant support for ABQuant

* fix mxfp4 to use correct QuantGroupSize

* addressing review comments and separated Preshufflequant for A and B

* updated grouped gemm example for updated traits definition

* fix for CI failure

* updated grouped_gemm_abquant test for updated traits definition
2026-01-28 19:45:09 -08:00
Yi DING
8e3d84aba3 [CK_TILE] ABQuant New Preshuffle (#3638)
* Refactor

* Gemm quant improvement

* Change preshuffle

* Fix

* Fix grouped gemm ut

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-27 23:46:49 -08:00
Aviral Goel
b8751e505d feat: Add Interwave scheduler for aquant memory pipeline (#3540)
* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipeline in aquant

* test: add unit test for aquant memory pipeline

* fix: compilation error on gfx950

* chore: remove debug statements from the code

* test: resolve merge conflict

* test: remove non rcr unit tests from test suite
2026-01-26 11:27:42 -08:00
kensclin
31a35ecab4 GEMM Blockscale ABQuant Optimization (#3620)
* GEMM Blockscale ABQuant Optimization

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix precommit error

* clean

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2026-01-22 09:39:38 -08:00
Cong Ma
f9104ef9b3 [CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584) 2026-01-16 15:27:39 -08:00
Khushbu Agarwal
118afa455c [CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and corresponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* G128 working for both prefill and decode shapes for preshufflequant

* clean up after merging with develop

* fixing group 64 for decode shapes

* non preshufflequant working for group size 128

* enable preshuffleb and preshufflequant with various group sizes

* reduce build time by splitting example into diff datatype files

* Adding tests for preshuffleQuant

* address review comment

* fix for gfx1201

* compile time fix for gfx1201

* clang formatted

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-14 10:00:19 -08:00
Aviral Goel
e809861d49 refactor: remove Default scheduler implementation as it is not used anymore (#3542)
* refactor: remove Default scheduler implementation as it is not used anymore

* refactor: remove dead code from gemm universal kernel

* chore: add descriptive comments about amd intrinsic hardware sync instructions

* fix: label existing memory pipeline for aquant as intrawave
2026-01-12 09:51:06 -08:00
damien-lejeune
4216d43da8 Dlejeune/ck tile 2d multiple reductions (#3147)
* WIP

* Add Unit tests for the Multi Reduction Kernel

* clang format

* Rename multiblock to threadwise

* Multiblock WIP

* Fix multi reduce multi block unit tests

* Multi Reduce Tile Engine: WIP

* refactoring + try addressing precision error

* Fix multiops examples

* Cleanup

* Clean up tile engine's reduce op

* Update changelog

* Fix remod/clang

* Fix dates

* Fix documentation & missing file

* Fix comments

* Use the update_tile api in the multi-block kernel

* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests

* Add TilePartitioner

* Cleanup

* Add warning when no data to process, in the example

* Refactoring Reduce kernel Tile Partitioner + cleanup

* Move the tile partitioner to its own file

* Add missing includes

* Fix copyright header with update_amd_copyright_headers.py

* Fix change of interface in Reduce2dProblem

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-01-09 11:16:37 +01:00
Khushbu Agarwal
aaa35f0bbf [CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and corresponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* clean up after merging with develop

* added comments for the tile window and tile distribution encoding

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-06 12:46:59 -08:00
kyle-256
76696ace44 [CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452)
* update grouped_gemm blockwise kernel

* update config

* update kernel

* update examples

* remove test code for now

* sync test files with origin/develop

* update example

* fix code lint

* fix code-lint

* update test code

* run clang format

* run pre-commit

* update api
2026-01-06 12:36:04 -08:00
kensclin
2309c86054 [CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495)
* [CK_TILE] add preshuffleB mode for ABQuant GEMM

* fix precommit error

* use template method call for cvt_scale_to_fp32

* fix precommit error

* add test code

* fix precommit error

* switch abquant  gemmconfig to default

* Add changelog.md

* fix precommit error

* fix conflict
2026-01-06 12:35:01 -08:00
joyeamd
2b563ad048 Joye/revise wp pipeline (#3493)
* [CK_TILE] unify double and single lds implementation (#108)

Unify LDS buffer management API for single and double buffering modes

This change consolidates the Local Data Share (LDS) buffer management by:

- Merging single and double LDS buffer APIs into a unified interface
- Implementing ping-pong address calculation in pipeline when double LDS is enabled
- Computing pong buffer addresses dynamically using base address offsets
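
The ping-pong address calculation can be sketched as follows. This is a simplified illustration only: `lds_offset_sketch` and its parameters are hypothetical, and it assumes the pong buffer sits directly after the ping buffer at `base + buffer_bytes`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte offset of the LDS buffer used by loop
// iteration `iter`. With a single buffer every iteration uses `base`.
// With double buffering, even iterations use the ping buffer at `base`
// and odd iterations use the pong buffer at `base + buffer_bytes`.
constexpr std::size_t lds_offset_sketch(std::size_t base,
                                        std::size_t buffer_bytes,
                                        bool double_lds,
                                        std::size_t iter)
{
    return double_lds ? base + (iter % 2) * buffer_bytes : base;
}

// Single-buffer mode ignores the iteration index.
static_assert(lds_offset_sketch(0, 4096, false, 3) == 0, "single buffer");
```

Because the pong address is derived from the base offset, one code path can serve both modes and only the `double_lds` switch differs.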

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update wp_pipeline

* fix a c++17 issue

* update for ci errors

* fix ci issues

* include a header to fix ci errors

* fix some rebase issues

* update with rebase

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-05 13:49:26 -08:00
Max Podkorytov
e339101e9c [CK-Tile] move out memory operation from cshuffle epilogue class (#3359)
* initial poc

* factor out common parts in operator()

* cv4

* rest of the universal gemm pipelines

* fix test

* remove boilerplate from tile engine

* fix example

* fix example

* format

* fix tests build for gemm

* remove base pipeline codegen from gemm instance builder

* unify v3 logic with the rest of universal gemm pipelines

* fix build for multi abd test

* fix test gemm multi d

* fix build for weight preshuffle

* fix grouped gemm test

* fix grouped gemm multi d test

* fix grouped gemm preshuffle

* fix grouped gemm example except for quant

* fix gemm preshuffle

* fix splitk 2 stage example

* fix batched gemm example

* fix multid example

* fix multiabd example

* fix batched gemm test

* fixup

* fix examples build

* fix grouped gemm test build

* fix smoke builder

* hacky poc

* fix tile engine

* kill the lambda

* maybe fix test build

* more fixes

* clang-format

* save temp

* clang-format

* mostly fix examples

* clang-format

* remove dead code

* more cleanup

* fix fmha bwd build (default epilogue set/add appears to be broken)

* fix default epilogue tests but not correctness

* clang-format

* fix bquant

* clang-format

* cleanup dead code

* rearrange make windows for readability

* restore changes to IsSupportedArgument

* fix smoke-builder

* clang-format

* fixup rename class

* build fixes

* clang-format

* fix builder

* fixup

* remove set from builder tests

* fix test

* clang-format

* re-refactor the kernels

* clang-format

* fix header license

* remove memory operation from conv bwd test

* clang-format

* clang-format example,include

* clang-format test

* build fixes

* clang-format

* solve compilation error

* fix the CI

* solve compilation error

* clang format

* solve merge conflict

* solve merge conflict

* solve the gfx11 error

* solve test error

* moar build fixes

* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-04 03:28:14 -08:00
Erwin Terpstra
e08efa551f [CK_TILE] Grouped gemm quant tensor layouts (#3414)
* feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time

* chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times

* chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems

* fix: missing check for transposed load in bquant pipeline

* chore: lower unit test tensors dimensions a bit for faster tests

* chore: set grouped gemm example M/N warp tile to 16

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-24 23:01:23 -08:00
kensclin
7f68f3c4fa Enable padding blockscale for abquant (#3453)
* Enable padding blockscale for abquant

* run clang-format

* Reduce unnecessary testing

* remove cout
2025-12-24 09:12:40 -08:00
kensclin
0500fcc017 Support A/B Quantization in Blockscale GEMM (#3343)
* Support A/B Quantization in Blockscale GEMM

* Implement review suggested changes

* Implement review suggested changes

* Sync with develop

* fix pre-commit error

* Add unit tests for blockscale AB-Quantization

* fix pre-commit error

* fix pre-commit error

* fix compile error

* fix compile error

* fix clang-format

* fix clang-format

* fix enumeration values not handled in switch

* rebase file

* Add missing enums to data_type_sizeof (#3430)

Fixes broken build on gfx942. This was some test code that got merged at the same time.

* [CK_BUILDER] CK Tile header installation for builder, algorithm concept improvements (#3419)

* Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen
* Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts

---------

Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com>

* Add build trace diagnostics to CI. (#3432)

* generate and visualize build traces for all archs

* generate build traces in all cases

* fix jenkins logic

* fix typo

* use more threads for parsing dependency map

* add script to parse ninja traces and issue warnings

* fix python script syntax and header

* fix python syntax one more time

* fix python syntax

* Support A/B Quantization in Blockscale GEMM

* Implement review suggested changes

* Sync with develop

* Add unit tests for blockscale AB-Quantization

* fix enumeration values not handled in switch

* rebase file

* rebase file

---------

Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: DarylHawkinsAMD <Daryl.Hawkins@amd.com>
Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-17 07:13:47 -08:00
Sami Remes
a0cdb0b493 [CK_TILE] Fix some inconsistencies with OverrideBDatatype in BQuant GEMM (#3394)
* Fix some inconsistencies with OverrideBDatatype

* fix formatting

* Fix BGlobalPrefetch, no static

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-15 07:18:38 -08:00
Bartłomiej Kocot
3b773109e5 [CK TILE][AICK-439] Fix cshuffle epilogue wave per shuffle (#3364)
* [CK TILE] Fix cshuffle epilogue wave per shuffle

* Align shuffle per tile with smem

* fixes

* Fixes for double smem

* fix
2025-12-15 12:59:48 +01:00
Aviral Goel
45c4ea510c chore: add copyright to pass the CI (#3407) 2025-12-11 10:34:15 -08:00
eliotwang
715671e419 Bf16*fp4 gemm (#2801)
* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* rebase to new develop

* fix clang format

* update code according to reviewer's comment

* Update README.md

* update code according to reviewer's comment

* update code according to reviewer's comment

* Update CMakeLists.txt

* Update README.md

* Update CMakeLists.txt

* Delete files

* Delete files

* Add unit tests

* Update test_gemm_quant_base.hpp

* merge bf16*fp4 example to develop branch

* fix clang format

* fix clang format

* Update CMakeLists.txt

* fix ci test

* fix clang format

* resolve conflicts

---------

Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-11 07:20:29 -08:00
Sami Remes
c363a98d41 [CK_TILE] Support more layouts for BQuant GEMM (#3349)
* WIP: preparing to add transpose bq support

* WIP: handle both row/col layout for BQ windows/tile dstr

* Fix build

* WIP: adding some test, debugging numerical errors

* Fix all but pkint4 tests

* Remove test_gemm_quant_typed.cpp again

* update disabled tests

* add conversion from pkint4 for b matrix

* fix formatting

* fix formatting

* Fix tr_load and use override b datatype for clarity

* fix formatting

* make bquant preshuffle tests bqlayout column-major
2025-12-08 13:05:56 -08:00
Erwin Terpstra
fe07b5a1bf [CK Tile] Grouped GEMM aquant mode and non-persistent kernel (#3337)
* wip: add aquant to grouped gemm quant example

* fix: properly handle hot loop count in aquant pipeline

* fix: add separate GemmConfig structs for AQuant, automatically select the correct one

* feat: finish support for a non-persistent kernel invocation for grouped gemm quant, and add support code to example

* refactor: cleaned up grouped gemm quant example a bit by reusing pipeline selection logic

* chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants

* feat: add quant grouped gemm tests cases for aquant (regular and transpose C) and non-persistent kernel

* fix: update base pipeline classes according to changes in develop branch

* Revert "chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants"

This reverts commit b3fd4d326d.

* feat: remove aquant config from grouped gemm quant example, update to add persistency as runtime parameter

* chore: removed work-around for aquant bug that has been fixed

* chore: fix typo in command-line parameters

* fix: correct K warp tile size for gfx950

* chore: incorrect warp tile configuration on gfx942
2025-12-08 12:19:22 -08:00
Thomas Ning
86a84ae611 Add the gfx1011 support on CK Tile with the SGPR builtin reading protection (#3350)
* Finish the fixes

* add the gfx1010 support macro

* Fix the compilation error
2025-12-05 14:18:30 -08:00
Khushbu Agarwal
6b1bceca7b [CK_Tile] Enable PreshuffleB for 2d block scale Gemm (#3298)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and corresponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* fixing bq tensor calculation

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-05 09:57:52 -08:00
Cong Ma
ed080f5a56 Congma/ck tile/aquant mem pipeline (#3346)
* [CK TILE GEMM QUANT] Fix the bug in HotLoopTail of memory pipeline
2025-12-05 09:35:27 -07:00
kensclin
ffc3120f63 Ck tile/gemm blockscale opt (#3227)
* GEMM block scale optimization kernel

* GEMM block scale optimization kernel

* Fix: Apply clang-format for style consistency

* Fix: Apply clang-format for style consistency

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-03 22:07:23 -08:00
Aviral Goel
6cb0bc2d11 feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193)
* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* feat(gemm_quant): add RRR and CRR layout support for aquant gemm

* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm

* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests

* fix: test cases compilation failure due to PR# 2095

* fix: make condition to filter out tests for gfx950 more explicit

* need to support the gfx950

* fix: add layout support for gfx950

* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)

* WIP: add support for pk_int4_t for aquant mode layouts RR and CR

* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant

* fix: compile time error for gfx950

* fix: minor bug where is_a_load_tr_v() was missing

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* docs: update changelog with new layout support info

* Update CHANGELOG.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* refactor: break test instances into multiple cpp files to reduce build time (#3319)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* refactor: break test instances into multiple cpp files to reduce build time

* chore: rename file for better code readability

* fix: merge conflict resolution

* fix: remove memory pipeline because new layout is not compatible

* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()

* refactor: address review comments

* solve the conflict

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-02 14:59:07 -08:00
Cong Ma
23fb253c4e Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325)
* [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples

* [CK TILE GEMM Quant] update tile distribution of aquant

* [CK TILE GEMM Quant] update aquant register offset calculation

* [CK TILE GEMM Quant] Reimplement aquant register offset calculation

* [CK TILE GEMM Quant] Add more unit tests of Aquant

- Test M128xN128xK128

* [CK TILE GEMM Quant] Add more comments to Gemm Aquant
2025-12-01 15:04:37 -08:00
Sami Remes
f981554c39 [CK_TILE] Fix Quant GEMM build (#3320)
* Fix build

* Fix ck_tile example 38 & 40

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-11-28 20:33:53 +08:00
Thomas Ning
a38aeceb21 Fix and improve the gemm quant pipeline infrastructure (#3245) 2025-11-26 18:04:27 -08:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
Aviral Goel
35a4b26af0 fix: add dynamic selection of pipelines for aquant mode (#3282)
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly
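
A minimal sketch of the selection logic in the bullets above, assuming C++17. The class names `MemPipelineSketch`, `CompV3PipelineSketch`, and `AQuantPipelineSketch` are hypothetical and are not the real CK_TILE pipeline types.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical memory pipeline: rejects PreshuffleQuant at compile time.
template <bool PreshuffleQuant>
struct MemPipelineSketch
{
    static_assert(!PreshuffleQuant,
                  "memory pipeline does not support PreshuffleQuant");
    static constexpr int id = 1;
};

// Hypothetical v3 compute pipeline: accepts either setting.
template <bool PreshuffleQuant>
struct CompV3PipelineSketch
{
    static constexpr int id = 3;
};

// Pick the v3 pipeline when PreshuffleQuant is true, otherwise the memory
// pipeline. Only the selected type is ever instantiated, so the
// static_assert above does not fire for the branch that is not chosen.
template <bool PreshuffleQuant>
using AQuantPipelineSketch =
    std::conditional_t<PreshuffleQuant,
                       CompV3PipelineSketch<PreshuffleQuant>,
                       MemPipelineSketch<PreshuffleQuant>>;
```

The static assertion acts as a safety net: a caller that bypasses the alias and names the memory pipeline directly with PreshuffleQuant enabled gets a compile error instead of a wrong result.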
2025-11-26 10:58:09 +04:00
Khushbu Agarwal
8111572785 [CK_Tile] Support for preshuffle weight(B) quant tensor for block scale gemm (#3165)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and corresponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* addressing review comments

* fixing CI issue

* addressing review comments

* formatting

* formatting

* fixing aquant operator overloading

* formatting

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-11-24 07:48:42 -08:00
Max Podkorytov
a3a4eb12bd [CK-Tile] Remove usage of tile partitioner's full gemm shape (#3204)
The GEMM shape should be taken from the pipeline instead (which gets it from a problem description struct).
2025-11-18 09:56:40 -08:00
SamiAario-AMD
f2cfc6b94e Remove "basic" and universal GEMM tests, and incorporate their test cases into the GEMM pipeline tests (#3094)
* Add missing copyright statements

* Use ck_tile::host_tensor_descriptor instead of a custom lambda

* Refactor use of check_data_type in test classes

* Use TEST_SUITE_NAME with TYPED_TEST_SUITE

* Remove an unused namespace

* Make dim3 const

* Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile

* Add missing precision type combinations to CompV4 from CompV3

* Move the INT8 tests around for consistency with KernelTypesCompV3Wmma

* Add missing precision type combinations to CompV3Wmma from CompV3

* Remove the basic and universal tests and their dependencies

* On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B

* Use ADataType and BDataType instead of ComputeDataType for WarpGemm

* Explicitly set some return types to void

* Use more general typenames in InterleavedPKTypeLoader

* Add load_interleaved_pk_type.hpp to common.hpp

* Use std::is_same_v in load_int4_tile

* Add handling of LoadTranspose to load_int4_tile

* Factor out common code in several places using load_int4_tile

* Add support for pk_int4_t using load_int4_tile

* Fix formatting
2025-11-13 11:01:27 -08:00
Bartłomiej Kocot
92c1f4981a [CK_BUILDER] Add grouped conv fwd ck tile traits (#3183)
* [CK BUILDER] Add grouped conv fwd ck tile traits

* Update instance_traits_tile_grouped_convolution_forward.hpp

* Update grouped_convolution_forward_kernel.hpp
2025-11-11 13:55:33 -08:00
linqunAMD
1b1c46e508 [CK_TILE] Fix gemm_quant (#3186)
2025-11-11 08:23:57 -08:00
Khushbu Agarwal
06c651b100 formatting (#3182)
2025-11-11 07:42:26 -08:00
Sami Remes
16e85cf179 [CK_TILE] B matrix 2D block scale gemm (#3074)
* Refactor quant group size to be configurable for M/N/K, not just K

* add some asserts for configurations not implemented

* start setting of group size for N dimension

* enable 2d for reference quant gemm

* WIP: trying to figure out tile dstr and/or indexing for scale matrix

* WIP

* Fix handling of n dim blocks in tile windows etc

* remove commented code and enable all tests again

* fix formatting

* Add more specialized tile distributions

* Enable NWarps replication for bquant tile dstr

* fix formatting

* fix format

* Fix some issues from the merge

* fix formatting

* one more fix to tile dstr, and revert debug initialization

* Remove commented code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* simplify conditions that are needed for tile distributions

* only enable the working group sizes in tests

* fix formatting

* Update tile distribution for 2D bquant

* add some documentation and 2d block scale example

* fix formatting

* Add to Changelog and restructure the quant 2d example

* fix CMake

* support the change for blockscale 2d

* fix the test file

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-11-02 16:49:20 -08:00
Aviral Goel
8f1274d9b6 test(grouped_gemm): add unit tests for grouped_gemm bquant with preshuffleB true (#3119)
* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* add bquant to grouped_gemm

* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* fix a bug in test_grouped_gemm_util

* skip test when using wmma on grouped_quant kernel

* change cmake

* fix a bug in test_grouped_gemm_util

* skip test when using wmma on grouped_quant kernel

* change cmake

* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm

* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* feat: add bf8 support

* chore: remove unnecessary decltype usage

* chore: add default quant_mode to function signature as fallback

* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel

Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.

* chore: set default quant mode in function signature

* test: add additional test cases to cover edge case of no hotloop

* change code based on comments

* WIP: bquant preshuffle b compiles but gives numerical error

* feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel

* refactor: refactor code after merge commit

* chore: remove print statements

* test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases
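
The fix described above (deriving has_hot_loop, num_loop, and tail_number separately for each GEMM problem in the group) can be sketched roughly as follows. This is only an illustration under stated assumptions: `LoopParams`, `TailNumber`, `compute_loop_params`, the `prefetch_stages` threshold, and the odd/even tail rule are hypothetical and are not the CK Tile API; the real pipelines define their own rules for the hot loop and the tail.

```cpp
// Hypothetical sketch, not the CK Tile implementation.
// Shows why loop parameters must be derived from each problem's own K.

enum class TailNumber { Odd, Even }; // assumed two-way tail classification

struct LoopParams
{
    bool has_hot_loop;   // true when the main (steady-state) loop runs at all
    int num_loop;        // number of K-tiles to iterate over
    TailNumber tail;     // which tail handling path to take
};

// K: reduction dimension of one GEMM problem in the group.
// k_per_block: K-tile size of the kernel (assumed > 0).
// prefetch_stages: assumed number of tiles consumed outside the hot loop.
constexpr LoopParams compute_loop_params(int K, int k_per_block, int prefetch_stages = 2)
{
    const int num_loop       = (K + k_per_block - 1) / k_per_block; // ceil(K / k_per_block)
    const bool has_hot_loop  = num_loop > prefetch_stages;
    const TailNumber tail    = (num_loop % 2 == 0) ? TailNumber::Even : TailNumber::Odd;
    return LoopParams{has_hot_loop, num_loop, tail};
}
```

With these assumptions, a problem with K = 256 and a problem with K = 1024 (both with a K-tile of 128) yield different values for num_loop and has_hot_loop, so sharing one set of default values across the group would take the wrong code path for at least one of them.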

---------

Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-31 12:07:06 -07:00
Aviral Goel
4368fd9f57 [CK_TILE] Add Bquant to Grouped Gemm (#3063)
* update test cases

* format codes

* use GTEST_FAIL

* add bquant to grouped_gemm

* fix a bug in test_grouped_gemm_util

* skip test when using wmma on grouped_quant kernel

* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* fix a bug in test_grouped_gemm_util

* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm

* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* feat: add bf8 support

* chore: remove unnecessary decltype usage

* chore: add default quant_mode to function signature as fallback

* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel

Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.

* chore: set default quant mode in function signature

* test: add additional test cases to cover edge case of no hotloop

* chore: clang formatting

---------

Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-28 10:20:24 -04:00
Khushbu Agarwal
b11f53a484 Fix quant scale matrix layout for block scale gemm (#3079)
* Adding support for TiledPermuteN

* Adding test

* moving shuffle functions to common place

* resolving commit hook

* fix formatting
2025-10-27 13:56:07 -07:00
kyle-256
3c12a02827 [CK_TILE] add tensorwise quant in grouped gemm (#3007)
* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* fix a bug in test_grouped_gemm_util

* skip test when using wmma on grouped_quant kernel

* change cmake

* change code based on comments

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-24 07:41:54 -07:00
Johannes Graner
d40b50b9d5 Update pre-commit to fixed versions, run remod for ck_tile (#2895)
* Fix ruff linter errors

* Fix remod dos2unix command

* Clang format

* Ignore utility in remod

* Run remod

* Specify clang-format version in pre-commit

* Specify ruff version

* Include PoolKernelArgs in reference_pool

* Add calculate_total_elements to reference batched contraction

* Fix calculate_total_elements declaration

* Refactor remod pre-commit hook

* Fix Aquant tests

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-16 15:29:17 -07:00
Khushbu Agarwal
3c39d279ab supporting prefill shapes for preshuffle block scale gemm (#2975)
* debugging

* debugging for prefill shapes

* comment unused code

* fix for prefill shapes

* clearing up the code

* add int4 to universal gemm example

* clang formatted

* adding test for prefill shapes in block scale gemm

* small improvement to the block pipeline

* Address Review Comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-10 15:36:24 -07:00