composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Author	SHA1	Message	Date
Sami Aario	1ee1307d49	Candidate fix 13	2026-02-24 11:26:05 +00:00
Sami Aario	1a5ec9efdb	Candidate fix 12	2026-02-24 10:06:23 +00:00
Sami Aario	f35c9da001	Candidate fix 11	2026-02-23 13:54:53 +00:00
Sami Aario	3dad12583b	Candidate fix 10	2026-02-23 13:18:40 +00:00
Sami Aario	aebc095d0e	Candidate fix 9	2026-02-23 12:22:47 +00:00
Sami Aario	c82c7fe2b3	Candidate fix 8	2026-02-23 10:35:51 +00:00
Sami Aario	c79ab1f84a	Candidate fix 7	2026-02-23 09:03:20 +00:00
Sami Aario	1de8bc9501	Candidate fix 6	2026-02-20 16:25:36 +00:00
Sami Aario	a6ffc9c6e5	Candidate fix 5	2026-02-20 13:40:12 +00:00
Sami Aario	e2a85ee7a0	Candidate fix 4	2026-02-20 13:40:12 +00:00
Sami Aario	5d40ac6c1b	Candidate fix 3	2026-02-20 13:40:12 +00:00
Sami Aario	709608f843	Candidate fix 2	2026-02-20 13:40:12 +00:00
Sami Aario	de1a228b34	Candidate fix	2026-02-20 13:40:12 +00:00
Sami Aario	8b462b04ce	Clear load_tile_transpose_convert_with_offset	2026-02-20 13:40:12 +00:00
Sami Aario	3ec60914ad	Add include statements added by remod.py	2026-02-03 13:52:41 +00:00
Sami Aario	31c91a9535	Formatting changes	2026-02-03 13:52:41 +00:00
Sami Aario	ad2d10a633	Switch to an implementation of DetermineWarpPrecType that explicitly defines the A and B types - This is for improved clarity and finer control of the datatypes to use	2026-02-03 13:52:41 +00:00
Sami Aario	298fd29fba	Add and use load_tile_transpose_convert for mixed precision transpose loading	2026-02-03 13:52:41 +00:00
Sami Aario	7fef648bca	Refactor type conversions out of MakeBLdsBlockDescriptor, WIP!	2026-02-03 13:52:41 +00:00
Sami Aario	1b610f4aaf	Add type conversions to V4 pipeline, WIP!	2026-02-03 13:52:40 +00:00
Sami Aario	3a792017fb	Add functionality and tests for fp16 x fp8 and fp8 x fp16	2026-02-03 13:52:40 +00:00
Sami Aario	f8c4868a59	Add functionality and tests for bf16 x fp8 and fp8 x bf16	2026-02-03 13:52:40 +00:00
Sami Aario	3f4a85146c	Add MFMA warp gemm for float, float, float, 32, 32, 16	2026-02-03 13:52:40 +00:00
Sami Aario	7f22e8c66a	Add and use load_with_type_convert	2026-02-03 13:52:40 +00:00
Sami Aario	b41ed6e371	Introduce DetermineWarpPrecType for determining warp GEMM precision types	2026-02-03 13:52:40 +00:00
Sami Aario	f2fcc4a461	Add NumAccess as a template parameter to WarpGemmAttributeMfma::get_warp_dstr_encoding	2026-02-03 13:52:40 +00:00
Sami Aario	933e09f6c3	Rename the parameters of load_interleaved_pk_type and load_and_convert_tile	2026-02-03 13:52:40 +00:00
SamiAario-AMD	8c8715904e	Merge branch 'develop' into LWPCK-3549-cleanups	2026-02-03 13:28:08 +02:00
Aviral Goel	3e77721755	feat: add split_k support for block scale gemm bquant mode. (#3653) * WIP: add splitk to bquant * feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types * chore: remove temporary test script * fix: incorrect tile window length for splitted bq tensor window * chore: improve comments * test: add unit tests to cover bquant splitk functionality * fix: conflict resolution by renaming variables	2026-02-02 14:41:53 -08:00
Jan Patrick Lehr	069500464d	[Compiler] Addressing new compiler warnings (#3640) * [Compiler] Addressing new compiler warnings Clang enables new lifetime warnings in production and we see build errors due to this with the staging compiler. The attributes added in this PR are suggested by the compiler. However, I'm not very familiar with the code base, so the changes may be incorrect. * Update some more instances * Adds file-level ignores via clang diagnostic pragma The number of instances was large, so I decided to use file-level scope to disable the warning via pragma clang diagnostic ignored. It also showed this warning coming from the gtest dependency. For that, I did add the respective command line flag to the CMake variables. I don't know if this is acceptable or not. * This adds the remaining instances For a build on gfx90a. * fix clang format * Adding couple more instances from gfx1200 build * Fixed another few instances --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-02 09:39:48 -08:00
Sami Aario	4eceb2fc69	Fix a build break	2026-02-02 15:24:35 +00:00
Sami Aario	aa247e2d63	Fix a build break	2026-02-02 14:30:39 +00:00
Sami Aario	70be645270	Fix a build break	2026-02-02 10:27:58 +00:00
Sami Aario	348b555cc3	Merge remote-tracking branch 'origin/develop' into LWPCK-3549-cleanups	2026-02-02 10:00:44 +00:00
ZheWang	e6bcd192d4	Mx fp6 flatmm (#3601) * add fp6 data-type and support sync/async dwordx3 load/store * clang-format * pre-commit * 1st commit * default mnk pass ut * fix a distrubution * fix * fix bdram distr * update * pass ut * improve perf * update * clean code * resolve copilot comment * reslove comment * clang-format --------- Co-authored-by: ZheWang <zhewan@amd.com>	2026-02-02 16:04:40 +08:00
Po Yen Chen	8c1788757a	[CK_TILE] Fix incompatible vector type arguments for the intrinsic calls (#3672) * Change call to the intrinsics * fix clang format * Undo changes under include/ck/utility * Use named variable as vector size --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-01-30 12:02:49 -08:00
jiangyon.ren	4d2f8c111e	[CK_TILE][FMHA] Add sparse attention VSA (#3341) * add sparse attention VSA * fix the pre-commit * Add jenga test and pre-commit * add bf16 for vsa * add jenga support bf16 * remove lse arg * split kernel code to block & kernel * fix the pre-commit * fix the pre-commit * fix the copyrights * fix the copyright * fix the copyright & rename block to pipeline * fix the copyright and pipeline * remove lse & dropout & add fmt * fix the jenga&VSA code review * remove the useless code & resolved the comments * remove useless code * remove useless code * Clean up code * Remove more unused code * Re-format .hpp * Refactor codegen scripts --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2026-01-31 00:59:47 +08:00
Erwin Terpstra	6a6177a246	[CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603) * chore: split block scale example instances in more separate files to speed up compile times * wip: fp4 scaffolding for abquant * feat: add fp4 decoding-while-loading to abquant pipeline * feat: add support for fp4 CPU verification in abquant * chore: add time tracking to reference calculation * feat: add a4w4 test for blockscale gemm * feat: optimize reference calculation by preconverting values to AccType * feat: add fp4 to fp8 look-up table * fix: reference to wrong ComputeDataType field in QuantProblem * feat: type utilities for determining MFMA compute types * feat: packed fp4 for abquant weight preshuffle * feat: add separate tests for a4w4 base case, padding and preshuffleB * fix: fp4 conversion on gfx950 attempting to use non-supported method * fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size * chore: add fp4 preshuffleb mode to block scale example * chore: sanity check for packed types being 1 byte * chore: clarify tensor dimension indices with constants * chore: replace traits check with specialized check for packed types * style: some minor refactoring and cleanup * fix: correct conversion table for FNUZ fp8 * chore: add fp4 instances to main abquant instances again * chore: use same initialization branch for int4 and fp4 * chore: add missing initialization for fp4 in block scale gemm example --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-30 04:40:50 -07:00
MHYangAMD	6ff0737843	Fix redundant cast in model sensitive rmsnorm (#3681) * Fix redundant cast * Fix linting	2026-01-30 10:52:19 +08:00
Khushbu Agarwal	9b168082b7	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and seperated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition	2026-01-28 19:45:09 -08:00
Jeff Huang	e3556fed04	Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657) - Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather - Add is_gather_dim() and get_gather_index() for multi-dim page lookup - Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1}) - Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread - Add static_assert for RowMajor VLayout requirement in batch prefill Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-01-29 07:18:41 +08:00
Sami Aario	fc1b683d18	Fix a build break	2026-01-28 15:44:02 +00:00
SamiAario-AMD	d0e9dc510e	Merge branch 'develop' into LWPCK-3549-cleanups	2026-01-28 17:14:23 +02:00
Yi DING	8e3d84aba3	[CK_TILE] ABQuant New Preshuffle (#3638) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-27 23:46:49 -08:00
damien-lejeune	91e32f305f	[CK Tile] multi reduce improvements (#3607) * WIP: refactoring * Swap operation/data nested loops order * Improve memory coalescing * Add comments * Enforce same identity element for the reduce operations * Re-add compile time constant * Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-01-27 12:56:09 -08:00
Illia Silin	b26cb596b0	fix some syntax errors (#3658)	2026-01-27 09:59:39 -08:00
SamiAario-AMD	72fa29bad5	Merge branch 'develop' into LWPCK-3549-cleanups	2026-01-27 15:39:38 +02:00
Bartłomiej Kocot	3d67e6c492	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt	2026-01-27 11:04:11 +02:00
Aviral Goel	b8751e505d	feat: Add Interwave scheduler for aquant memory pipeline (#3540) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite	2026-01-26 11:27:42 -08:00
Thomas Ning	3900e1e7ce	Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634) * Finished the work * Fix the pipeline	2026-01-26 10:29:28 -08:00

722 Commits