composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-01 12:11:19 +00:00

Author	SHA1	Message	Date
Erwin Terpstra	b41bfece83	[rocm-libraries] ROCm/rocm-libraries#4268 (commit d2fca53) [CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant pipeline (#4268) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [X] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [X] I have added inline documentation which enables the maintainers with understanding the motivation - [X] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [X] Any dependent changes have been merged	2026-02-10 13:59:03 +00:00
Yi DING	d5acfd8d52	[rocm-libraries] ROCm/rocm-libraries#4451 (commit 091bf0f) [CK_TILE] Blockscale Gemm Fix Multi-Arch Compilation ## Motivation This PR updates CK_TILE blockscale GEMM-quant kernels and launch helpers to compile across multiple GPU architectures by introducing compile-time availability gating and a new attribute tag mechanism for kernel symbol/attribute specialization. ## Technical Details - Add an architecture-guarded `kIsAvailable` flag to the gfx950 pipeline and propagate availability handling into `QuantGemmKernel`. - Extend `make_kernel`/`kentry` to accept an `Attr` tag enabling per-kernel compile-time attributes (e.g., `no-packed-fp32-ops`) and unique symbols. - Update the blockscale GEMM quant example to pass kernel attributes and adjust gfx950 gating. ## Test Plan - CI - Local test: `cmake .. --preset dev -DGPU_TARGETS='gfx942;gfx950' -GNinja && ninja tile_example_gemm_quant` - Local test with ROCm/aiter#1954 ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-10 12:42:19 +00:00
kensclin	5b3e527c88	[rocm-libraries] ROCm/rocm-libraries#4280 (commit b7de1e1) [CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes gemm blockscale eightwarps support ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered	2026-02-09 03:55:52 +00:00
Aviral Goel	3e77721755	feat: add split_k support for block scale gemm bquant mode. (#3653 ) * WIP: add splitk to bquant * feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types * chore: remove temporary test script * fix: incorrect tile window length for splitted bq tensor window * chore: improve comments * test: add unit tests to cover bquant splitk functionality * fix: conflict resolution by renaming variables	2026-02-02 14:41:53 -08:00
Erwin Terpstra	6a6177a246	[CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603 ) * chore: split block scale example instances in more separate files to speed up compile times * wip: fp4 scaffolding for abquant * feat: add fp4 decoding-while-loading to abquant pipeline * feat: add support for fp4 CPU verification in abquant * chore: add time tracking to reference calculation * feat: add a4w4 test for blockscale gemm * feat: optimize reference calculation by preconverting values to AccType * feat: add fp4 to fp8 look-up table * fix: reference to wrong ComputeDataType field in QuantProblem * feat: type utilities for determining MFMA compute types * feat: packed fp4 for abquant weight preshuffle * feat: add separate tests for a4w4 base case, padding and preshuffleB * fix: fp4 conversion on gfx950 attempting to use non-supported method * fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size * chore: add fp4 preshuffleb mode to block scale example * chore: sanity check for packed types being 1 byte * chore: clarify tensor dimension indices with constants * chore: replace traits check with specialized check for packed types * style: some minor refactoring and cleanup * fix: correct conversion table for FNUZ fp8 * chore: add fp4 instances to main abquant instances again * chore: use same initialization branch for int4 and fp4 * chore: add missing initialization for fp4 in block scale gemm example --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-30 04:40:50 -07:00
Khushbu Agarwal	9b168082b7	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629 ) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and seperated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition	2026-01-28 19:45:09 -08:00
Yi DING	8e3d84aba3	[CK_TILE] ABQuant New Preshuffle (#3638 ) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-27 23:46:49 -08:00
Aviral Goel	b8751e505d	feat: Add Interwave scheduler for aquant memory pipeline (#3540 ) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite	2026-01-26 11:27:42 -08:00
kensclin	31a35ecab4	GEMM Blockscale ABQuant Optimization (#3620 ) * GEMM Blockscale ABQuant Optimization * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix precommit error * clean * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2026-01-22 09:39:38 -08:00
Cong Ma	f9104ef9b3	[CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584 )	2026-01-16 15:27:39 -08:00
Khushbu Agarwal	118afa455c	[CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * initial commit with prints * debugging * fine-grained working * debugging medium grained * fixing the tile window * formatting * enabling prefill shapes * working prefill shapes * formatted * clean up * code cleanup * bug fix after merging with develop * G128 working for both prefill and decode shapes for preshufflequant * clean up after merging with develop * fixing group 64 for decode shapes * non preshufflequant working for group size 128 * enable preshuffleb and preshufflequant with variour group sizes * reduce build time by splitting example into diff datatype files * Adding tests for preshuffleQuant * address review comment * fix for gfx1201 * compile time fix for gfx1201 * clang formatted --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>	2026-01-14 10:00:19 -08:00
Aviral Goel	e809861d49	refactor: remove Default scheduler implementation as it not used anymore (#3542 ) * refactor: remove Default scheduler implementation as it not used anymore * refactor: remove dead code from gemm universal kernel * chore: add descriptive comments about amd intrinsic hardware sync instructions * fix: label existing memory pipeline for aquant as intrawave	2026-01-12 09:51:06 -08:00
damien-lejeune	4216d43da8	Dlejeune/ck tile 2d multiple reductions (#3147 ) * WIP * Add Unit tests for the Multi Reduction Kernel * clang format * Rename multiblock to threadwise * Multiblock WIP * Fix multi reduce multi block unit tests * Multi Reduce Tile Engine: WIP * refactoring + try addressing precision error * Fix multiops examples * Cleanup * Clean up tile engine's reduce op * Update changelog * Fix remod/clang * Fix dates * Fix documentation & missing file * Fix comments * Use the update_tile api in the multi-block kernel * Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests * Add TileParitioner * Cleanup * Add warning when no data to process, in the example * Refactoring Reduce kernel Tile Partioner + cleanup * Move the tile partioner to its own file * Add missing includes * Fix copyright header with update_amd_copyright_headers.py * Fix change of interface in Reduce2dProblem --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-01-09 11:16:37 +01:00
Khushbu Agarwal	aaa35f0bbf	[CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * initial commit with prints * debugging * fine-grained working * debugging medium grained * fixing the tile window * formatting * enabling prefill shapes * working prefill shapes * formatted * clean up * code cleanup * bug fix after merging with develop * clean up after merging with develop * added comments for the tile window and tile distribution encoding --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>	2026-01-06 12:46:59 -08:00
kyle-256	76696ace44	[CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452 ) * update grouped_gemm blockwise kernel * update config * update kernel * update examples * remove test code for now * sync test files with origin/develop * update example * fix code lint * fix code-lint * update test code * run clang format * run pre-commit * update api	2026-01-06 12:36:04 -08:00
kensclin	2309c86054	[CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495 ) * [CK_TILE] add preshuffleB mode for ABQuant GEMM * fix precommit error * use template method call for cvt_scale_to_fp32 * fix precommit error * add test code * fix precommit error * switch abquant gemmconfig to default * Add changelog.md * fix precommit error * fix conflict	2026-01-06 12:35:01 -08:00
joyeamd	2b563ad048	Joye/revise wp pipeline (#3493 ) * [CK_TILE] unify double and single lds implementation (#108) Unify LDS buffer management API for single and double buffering modes This change consolidates the Local Data Store (LDS) buffer management by: Merging single and double LDS buffer APIs into a unified interface Implementing ping-pong address calculation in pipeline when double LDS is enabled Computing pong buffer addresses dynamically using base address offsets --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update wp_pipeline * fix a c++17 issue * update for ci errors * fix ci issues * include a header to fix ci errors * fix some rebase issues * update with rebase --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-05 13:49:26 -08:00
Max Podkorytov	e339101e9c	[CK-Tile] move out memory operation from cshuffle epilogue class (#3359 ) * initial poc * factor out common parts in operator() * cv4 * rest of the universal gemm pipelines * fix test * remove boilerplate from tile engine * fix example * fix example * format * fix tests build for gemm * remove base pipeline codegen from gemm instance builder * unify v3 logic with the rest of universal gemm pipelines * fix build for multi abd test * fix test gemm multi d * fix build for weight preshuffle * fix grouped gemm test * fix grouped gemm multi d test * fix grouped gemm preshuffle * fix grouped gemm example except for quant * fix gemm preshuffle * fix splitk 2 stage example * fix batched gemm example * fix multid example * fix multiabd example * fix batched gemm test * fixup * fix examples build * fix grouped gemm test build * fix smoke builder * hacky poc * fix tile engine * kill the lambda * maybe fix test build * more fixes * clang-format * save temp * clang-format * mostly fix examples * clang-format * remove dead code * more cleanup * fix fmha bwd build (default epilogue set/add appears to be broken) * fix default epilogue tests but not correctness * clang-format * fix bquant * clang-format * cleanup dead code * rearrange make windows for readability * restore changes to IsSupportedArgument * fix smoke-builder * clang-format * fixup rename class * build fixes * clang-format * fix builder * fixup * remove set from builder tests * fix test * clang-format * re-refactor the kernels * clang-format * fix header license * remove memory operation from conv bwd test * clang-format * clang-format example,include * clang-format test * build fixes * clang-format * solve compilation error * fix the CI * solve compilation error * clang format * solve merge conflict * solve merge conflict * solve the gfx11 error * solve test error * moar build fixes * remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-04 03:28:14 -08:00
Erwin Terpstra	e08efa551f	[CK_TILE] Grouped gemm quant tensor layouts (#3414 ) * feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time * chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times * chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems * fix: missing check for transposed load in bquant pipeline * chore: lower unit test tensors dimensions a bit for faster tests * chore: set grouped gemm example M/N warp tile to 16 --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-24 23:01:23 -08:00
kensclin	7f68f3c4fa	Enable padding blockscale for abquant (#3453 ) * Enable padding blockscale for abquant * run clang-format * Reduce unnecessary testing * remove cout	2025-12-24 09:12:40 -08:00
kensclin	0500fcc017	Support A/B Quantization in Blockscale GEMM (#3343 ) * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Implement review suggested changes * Implement review suggested changes * Sync with develop * fix pre-commit error * Add unit tests for blockscale AB-Quantization * fix pre-commit error * fix pre-commit error * fix compile error * fix compile error * fix clang-format * fix clang-format * fix enumeration values not handled in switch * rebase file * Add missing enums to data_type_sizeof (#3430) Fixes broken build on gfx942. This was some test code that got merged at the same time. * [CK_BUILDER] CK Tile header installation for builder, algorithm concept improvements (#3419) * Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen * Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts --------- Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com> * Add build trace diagnostics to CI. (#3432) * generate and visualize build traces for all archs * generate build traces in all cases * fix jenkins logic * fix typo * use more threads for parsing dependency map * add script to parse ninja traces and issue warnings * fix python script syntax and header * fix python syntax one more time * fix python syntax * Support A/B Quantization in Blockscale GEMM * Implement review suggested changes * Sync with develop * Add unit tests for blockscale AB-Quantization * fix enumeration values not handled in switch * rebase file * rebase file --------- Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: DarylHawkinsAMD <Daryl.Hawkins@amd.com> Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-17 07:13:47 -08:00
Sami Remes	a0cdb0b493	[CK_TILE] Fix some inconsistencies with OverrideBDatatype in BQuant GEMM (#3394 ) * Fix some inconsistencies with OverrideBDatatype * fix formatting * Fix BGlobalPrefetch, no static --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-15 07:18:38 -08:00
Bartłomiej Kocot	3b773109e5	[CK TILE][AICK-439] Fix cshuffle epilogue wave per shuffle (#3364 ) * [CK TILE] Fix cshufle epligoue wave per shuffle * Align shuffle per tile with smem * fixes * Fixes for double smem * fix	2025-12-15 12:59:48 +01:00
Aviral Goel	45c4ea510c	chore: add copyright to pass the CI (#3407 )	2025-12-11 10:34:15 -08:00
eliotwang	715671e419	Bf16fp4 gemm (#2801 ) support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * rebase to new develop * fix clang format * update code according to reviewer's comment * Update README.md * update code according to reviewer's comment * update code according to reviewer's comment * Update CMakeLists.txt * Update README.md * Update CMakeLists.txt * Delete files * Delete files * Add unit tests * Update test_gemm_quant_base.hpp * merge bf16fp4 example to develop branch fix clang format * fix clang format * Update CMakeLists.txt * fix ci test * fix clang format * resolve conflicts --------- Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu> Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-11 07:20:29 -08:00
Sami Remes	c363a98d41	[CK_TILE] Support more layouts for BQuant GEMM (#3349 ) * WIP: preparing to add transpose bq support * WIP: handle both row/col layout for BQ windows/tile dstr * Fix build * WIP: adding some test, debugging numerical errors * Fix all but pkint4 tests * Remove test_gemm_quant_typed.cpp again * update disabled tests * add conversion from pkint4 for b matrix * fix formatting * fix formatting * Fix tr_load and use override b datatype for clarity * fix formatting * make bquant preshuffle tests bqlayout column-major	2025-12-08 13:05:56 -08:00
Erwin Terpstra	fe07b5a1bf	[CK Tile] Grouped GEMM aquant mode and non-persistent kernel (#3337 ) * wip: add aquant to grouped gemm quant example * fix: properly handle hot loop count in aquant pipeline * fix: add separate GemmConfig structs for AQuant, automatically select the correct one * feat: finish support for a non-persistent kernel invocation for grouped gemm quant, and add support code to example * refactor: cleaned up grouped gemm quant example a bit by reusing pipeline selection logic * chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants * feat: add quant grouped gemm tests cases for aquant (regular and transpose C) and non-persistent kernel * fix: update base pipeline classes according to changes in develop branch * Revert "chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants" This reverts commit `b3fd4d326d`. * feat: remove aquant config from grouped gemm quant example, update to add persistency as runtime parameter * chore: removed work-around for aquant bug that has been fixed * chore: fix typo in command-line parameters * fix: correct K warp tile size for gfx950 * chore: incorrect warp tile configuration on gfx942	2025-12-08 12:19:22 -08:00
Thomas Ning	86a84ae611	Add the gfx1011 support on CK Tile with the SGPR builtin reading protection (#3350 ) * Finish the fixes * add the gfx1010 support macro * Fix the compilation error	2025-12-05 14:18:30 -08:00
Khushbu Agarwal	6b1bceca7b	[CK_Tile] Enable PreshuffleB for 2d block scale Gemm (#3298 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * fixing bq tensor calculation --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-05 09:57:52 -08:00
Cong Ma	ed080f5a56	Congma/ck tile/aquant mem pipeline (#3346 ) * [CK TILE GEMM QUANT] Fix the bug in HotLoopTail of memory pipeline	2025-12-05 09:35:27 -07:00
kensclin	ffc3120f63	Ck tile/gemm blockscale opt (#3227 ) * GEMM block scale optimization kernel * GEMM block scale optimization kernel * Fix: Apply clang-format for style consistency * Fix: Apply clang-format for style consistency --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-03 22:07:23 -08:00
Aviral Goel	6cb0bc2d11	feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193 ) * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * feat(gemm_quant): add RRR and CRR layout support for aquant gemm * test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm * fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests * fix: test cases compilation failure due to PR# 2095 * fix: make condition to filter out tests for gfx950 more explicit * need to support the gfx950 * fix: add layout suppot for gfx950 * Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277) * WIP: add support for pk_int4_t for aquant mode layouts RR and CR * test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant * fix: compile time error for gfx950 * fix: minor bug where is_a_load_tr_v() was mising * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318) * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant * test: add unit tests for new layout support CCRC for aquant block scale gemm * docs: update changelog with new layout support info * Update CHANGELOG.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * refactor: break test instances into multiple cpp files to reduce build time (#3319) * feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant * test: add unit tests for new layout support CCRC for aquant block scale gemm * refactor: break test instances into multiple cpp files to reduce build time * chore: rename file for better code readability * fix: merge conflict resolution * fix: remove memory pipeline because new layout is not compatible * build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr() * refactor: address review comments * solve the conflict --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-12-02 14:59:07 -08:00
Cong Ma	23fb253c4e	Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325 ) * [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples * [CK TILE GEMM Quant] update tile distribution of aquant * [CK TILE GEMM Quant] update aquant register offset calculation * [CK TILE GEMM Quant] Reimplement aquant register offset calculation * [CK TILE GEMM Quant] Add more unit tests of Aquant - Test M128xN128xK128 * [CK TILE GEMM Quant] Add more comments to Gemm Aquant	2025-12-01 15:04:37 -08:00
Sami Remes	f981554c39	[CK_TILE] Fix Quant GEMM build (#3320 ) * Fix build * Fix ck_tile example 38 & 40 --------- Co-authored-by: Yi DING <yi.ding@amd.com>	2025-11-28 20:33:53 +08:00
Thomas Ning	a38aeceb21	Fix and improve the gemm quant pipeline infrastructure (#3245 )	2025-11-26 18:04:27 -08:00
Aviral Goel	de6466481f	chore(copyright): update copyright header for include directory (#3293 )	2025-11-26 11:00:05 -07:00
Aviral Goel	35a4b26af0	fix: add dynamic selection of pipelines for aquant mode (#3282 ) - Add conditional selection to use v3 pipeline when PreshuffleQuant is true - Add static assertion in memory pipeline to prevent PreshuffleQuant usage - Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases - Update BaseGemmPipeline selection to handle all quant modes properly	2025-11-26 10:58:09 +04:00
Khushbu Agarwal	8111572785	[CK_Tile] Support for preshuffle weight(B) quant tensor for block scale gemm (#3165 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * addressing review comments * fixing CI issue * addressing reveiw comments * formatting * formatting * fixing aquant operator overlaoding * formatting --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-11-24 07:48:42 -08:00
Max Podkorytov	a3a4eb12bd	[CK-Tile] Remove usage of tile partitioner's full gemm shape (#3204 ) gemm shape should be used from the pipeline instead (where it gets from a problem description struct)	2025-11-18 09:56:40 -08:00
SamiAario-AMD	f2cfc6b94e	Remove "basic" and universal GEMM tests, and incorporate their test cases into the GEMM pipeline tests (#3094 ) * Add missing copyright statements * Use ck_tile::host_tensor_descriptor instead of a custom lambda * Refactor use of check_data_type in test classes * Use TEST_SUITE_NAME with TYPED_TEST_SUITE * Remove an unused namespace * Make dim3 const * Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile * Add missing precision type combinations to CompV4 from CompV3 * Move the INT8 tests around for consistency with KernelTypesCompV3Wmma * Add missing precision type combinations to CompV3Wmma from CompV3 * Remove the basic and universal tests and their dependencies * On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B * Use ADataType and BDataType instead of ComputeDataType for WarpGemm * Explicitly set some return types to void * Use more general typenames in InterleavedPKTypeLoader * Add load_interleaved_pk_type.hpp to common.hpp * Use std::is_same_v in load_int4_tile * Add handling of LoadTranspose to load_int4_tile * Factor out common code in several places using load_int4_tile * Add support for pk_int4_t using load_int4_tile * Fix formatting	2025-11-13 11:01:27 -08:00
Bartłomiej Kocot	92c1f4981a	[CK_BUILDER] Add grouped conv fwd ck tile traits (#3183 ) * [CK BUILDER] Add grouped conv fwd ck tile traits * Update instance_traits_tile_grouped_convolution_forward.hpp * Update grouped_convolution_forward_kernel.hpp	2025-11-11 13:55:33 -08:00
linqunAMD	1b1c46e508	[CK_TILE] Fix gemm_quant (#3186 )	2025-11-11 08:23:57 -08:00
Khushbu Agarwal	06c651b100	formatting (#3182 )	2025-11-11 07:42:26 -08:00
Sami Remes	16e85cf179	[CK_TILE] B matrix 2D block scale gemm (#3074 ) * Refactor quant group size to be configurable for M/N/K, not just K * add some asserts for configurations not implemented * start setting of group size for N dimension * enable 2d for reference quant gemm * WIP: trying to figure out tile dstr and/or indexing for scale matrix * WIP * Fix handling of n dim blocks in tile windows etc * remove commented code and enable all tests again * fix formatting * Add more specialized tile distributions * Enable NWarps replication for bquant tile dstr * fix formatting * fix format * Fix some issues from the merge * fix formatting * one more fix to tile dstr, and revert debug initialization * Remove commented code Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * simplify conditions that are needed for tile distributions * only enable the working group sizes in tests * fix formatting * Update tile distribution for 2D bquant * add some documentation and 2d block scale example * fix formatting * Add in Changlog and restructure the quant 2d example * fix CMake * support the change for blockscale 2d * fix the test file --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-11-02 16:49:20 -08:00
Aviral Goel	8f1274d9b6	test(grouped_gemm): add unit tests for grouped_gemm bquant with preshuffleB true (#3119 ) * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * clang format * use GTEST_FAIL * add bquant to grouped_gemm * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * clang format * use GTEST_FAIL * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * change cmake * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * change cmake * tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm * Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * feat: add bf8 support * chore: remove unnecessary decltype usage * chore: add default quant_mode to function signature as fallback * fix: pass correct runtime pipeline params in grouped_gemm bquant kernel Calculate has_hot_loop, num_loop, and tail_number on device side for each GEMM problem instead of using default values. This fixes incorrect results when different problems in the group have different K dimensions. * chore: set default quant mode in function signature * test: add additional test cases to cover edge case of no hotloop * change code based on comments * WIP: bquant preshuffle b compiles but gives numerical error * feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel * refactor: refactor code after merge commit * chore: remove print statements * test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases --------- Co-authored-by: kyle-256 <Kyle.Zhao@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-31 12:07:06 -07:00
Aviral Goel	4368fd9f57	[CK_TILE] Add Bquant to Grouped Gemm (#3063 ) * update test cases * format codes * use GTEST_FAIL * add bquant to grouped_gemm * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * fix a bug in test_grouped_gemm_util * tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm * Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * feat: add bf8 support * chore: remove unnecessary decltype usage * chore: add default quant_mode to function signature as fallback * fix: pass correct runtime pipeline params in grouped_gemm bquant kernel Calculate has_hot_loop, num_loop, and tail_number on device side for each GEMM problem instead of using default values. This fixes incorrect results when different problems in the group have different K dimensions. * chore: set default quant mode in function signature * test: add additional test cases to cover edge case of no hotloop * chore: clang formatting --------- Co-authored-by: kyle-256 <Kyle.Zhao@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-28 10:20:24 -04:00
Khushbu Agarwal	b11f53a484	Fix quant scale matrix layout for block scale gemm (#3079 ) * Adding support for TiledPermuteN * Adding test * moving shuffle functions to common place * resolving commit hook * fix formatting	2025-10-27 13:56:07 -07:00
kyle-256	3c12a02827	[CK_TILE] add tensorwise quant in grouped gemm (#3007 ) * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * clang format * use GTEST_FAIL * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * change cmake * change code based on comments --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-10-24 07:41:54 -07:00
Johannes Graner	d40b50b9d5	Update pre-commit to fixed versions, run remod for ck_tile (#2895 ) * Fix ruff linter errors * Fix remod dos2unix command * Clang format * Ignore utility in remod * Run remod * Specify clang-format version in pre-commit * Specify ruff version * Include PoolKernelArgs in reference_pool * Add calculate_total_elements to reference batched contraction * Fix calculate_total_elements declaration * Refactor remod pre-commit hook * Fix Aquant tests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-10-16 15:29:17 -07:00
Khushbu Agarwal	3c39d279ab	supporting prefill shapes for preshuffle block scale gemm (#2975 ) * debugging * debugging for prefill shapes * comment unused code * fix for prefill shapes * clearing up the code * add int4 to universal gemm example * clang formatted * adding test for prefill shapes in block scale gemm * lil improv on the block pipeline * Address Review Comment --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-10-10 15:36:24 -07:00

1 2

61 Commits