composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 03:37:38 +00:00

Author	SHA1	Message	Date
Sami Aario	1ee1307d49	Candidate fix 13	2026-02-24 11:26:05 +00:00
Sami Aario	1a5ec9efdb	Candidate fix 12	2026-02-24 10:06:23 +00:00
Sami Aario	f35c9da001	Candidate fix 11	2026-02-23 13:54:53 +00:00
Sami Aario	3dad12583b	Candidate fix 10	2026-02-23 13:18:40 +00:00
Sami Aario	aebc095d0e	Candidate fix 9	2026-02-23 12:22:47 +00:00
Sami Aario	c82c7fe2b3	Candidate fix 8	2026-02-23 10:35:51 +00:00
Sami Aario	c79ab1f84a	Candidate fix 7	2026-02-23 09:03:20 +00:00
Sami Aario	1de8bc9501	Candidate fix 6	2026-02-20 16:25:36 +00:00
Sami Aario	a6ffc9c6e5	Candidate fix 5	2026-02-20 13:40:12 +00:00
Sami Aario	e2a85ee7a0	Candidate fix 4	2026-02-20 13:40:12 +00:00
Sami Aario	5d40ac6c1b	Candidate fix 3	2026-02-20 13:40:12 +00:00
Sami Aario	709608f843	Candidate fix 2	2026-02-20 13:40:12 +00:00
Sami Aario	de1a228b34	Candidate fix	2026-02-20 13:40:12 +00:00
Sami Aario	8b462b04ce	Clear load_tile_transpose_convert_with_offset	2026-02-20 13:40:12 +00:00
Sami Aario	3ec60914ad	Add include statements added by remod.py	2026-02-03 13:52:41 +00:00
Sami Aario	31c91a9535	Formatting changes	2026-02-03 13:52:41 +00:00
Sami Aario	ad2d10a633	Switch to an implementation of DetermineWarpPrecType that explicitly defines the A and B types - This is for improved clarity and finer control of the datatypes to use	2026-02-03 13:52:41 +00:00
Sami Aario	298fd29fba	Add and use load_tile_transpose_convert for mixed precision transpose loading	2026-02-03 13:52:41 +00:00
Sami Aario	7fef648bca	Refactor type conversions out of MakeBLdsBlockDescriptor, WIP!	2026-02-03 13:52:41 +00:00
Sami Aario	1b610f4aaf	Add type conversions to V4 pipeline, WIP!	2026-02-03 13:52:40 +00:00
Sami Aario	3a792017fb	Add functionality and tests for fp16 x fp8 and fp8 x fp16	2026-02-03 13:52:40 +00:00
Sami Aario	f8c4868a59	Add functionality and tests for bf16 x fp8 and fp8 x bf16	2026-02-03 13:52:40 +00:00
Sami Aario	3f4a85146c	Add MFMA warp gemm for float, float, float, 32, 32, 16	2026-02-03 13:52:40 +00:00
Sami Aario	7f22e8c66a	Add and use load_with_type_convert	2026-02-03 13:52:40 +00:00
Sami Aario	b41ed6e371	Introduce DetermineWarpPrecType for determining warp GEMM precision types	2026-02-03 13:52:40 +00:00
Sami Aario	f2fcc4a461	Add NumAccess as a template parameter to WarpGemmAttributeMfma::get_warp_dstr_encoding	2026-02-03 13:52:40 +00:00
Sami Aario	933e09f6c3	Rename the parameters of load_interleaved_pk_type and load_and_convert_tile	2026-02-03 13:52:40 +00:00
SamiAario-AMD	8c8715904e	Merge branch 'develop' into LWPCK-3549-cleanups	2026-02-03 13:28:08 +02:00
Max Podkorytov	3f04d27b68	Remove concrete performance numbers from BUILD_TIME_OPTIMIZATION.md (#3702) Replace specific benchmark numbers with qualitative descriptions since measurements vary across environments and may become outdated. Co-authored-by: Claude <noreply@anthropic.com>	2026-02-03 03:54:18 -07:00
Illia Silin	8b56ffb6ae	Fix one more lifetimebound error. (#3703) * fix staging compiler errors * fix clang format	2026-02-02 18:25:56 -08:00
Aviral Goel	3e77721755	feat: add split_k support for block scale gemm bquant mode. (#3653) * WIP: add splitk to bquant * feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types * chore: remove temporary test script * fix: incorrect tile window length for splitted bq tensor window * chore: improve comments * test: add unit tests to cover bquant splitk functionality * fix: conflict resolution by renaming variables	2026-02-02 14:41:53 -08:00
Zoltán Lakatos	301eb5cf08	Implement device grouped gemm fixed nk multi abd for rdna4 (#3619) * device struct implementation * added xdl grouped multi abd fixed nk testing * wmma implementation fixed * avoid unnecessary device mem allocation and code cleanups * cleanup instances definitions * wmma examples added * code cleanups * fix clang format * typo and compilation fixes related to reference gemm * fix compilation error due to std::remove_cvref_t * added missing hip_check_error includes * correction to example instances * review commentes addressed * removed split-k from testing * code formatting --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-02 13:58:11 -08:00
Jan Patrick Lehr	069500464d	[Compiler] Addressing new compiler warnings (#3640) * [Compiler] Addressing new compiler warnings Clang enables new lifetime warnings in production and we see build errors due to this with the staging compiler. The attributes added in this PR are suggested by the compiler. However, I'm not very familiar with the code base, so the changes may be incorrect. * Update some more instances * Adds file-level ignores via clang diagnostic pragma The number of instances was large, so I decided to use file-level scope to disable the warning via pragma clang diagnostic ignored. It also showed this warning coming from the gtest dependency. For that, I did add the respective command line flag to the CMake variables. I don't know if this is acceptable or not. * This adds the remaining instances For a build on gfx90a. * fix clang format * Adding couple more instances from gfx1200 build * Fixed another few instances --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-02 09:39:48 -08:00
Sami Aario	4eceb2fc69	Fix a build break	2026-02-02 15:24:35 +00:00
Sami Aario	aa247e2d63	Fix a build break	2026-02-02 14:30:39 +00:00
Sami Aario	70be645270	Fix a build break	2026-02-02 10:27:58 +00:00
Sami Aario	348b555cc3	Merge remote-tracking branch 'origin/develop' into LWPCK-3549-cleanups	2026-02-02 10:00:44 +00:00
ZheWang	e6bcd192d4	Mx fp6 flatmm (#3601) * add fp6 data-type and support sync/async dwordx3 load/store * clang-format * pre-commit * 1st commit * default mnk pass ut * fix a distrubution * fix * fix bdram distr * update * pass ut * improve perf * update * clean code * resolve copilot comment * reslove comment * clang-format --------- Co-authored-by: ZheWang <zhewan@amd.com>	2026-02-02 16:04:40 +08:00
Po Yen Chen	8c1788757a	[CK_TILE] Fix incompatible vector type arguments for the intrinsic calls (#3672) * Change call to the intrinsics * fix clang format * Undo changes under include/ck/utility * Use named variable as vector size --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-01-30 12:02:49 -08:00
ApoorvaKalyani	70d71b1514	Test fix for gemm_b_scale_xdl_v3. (#3674)	2026-01-30 10:34:54 -07:00
jiangyon.ren	4d2f8c111e	[CK_TILE][FMHA] Add sparse attention VSA (#3341) * add sparse attention VSA * fix the pre-commit * Add jenga test and pre-commit * add bf16 for vsa * add jenga support bf16 * remove lse arg * split kernel code to block & kernel * fix the pre-commit * fix the pre-commit * fix the copyrights * fix the copyright * fix the copyright & rename block to pipeline * fix the copyright and pipeline * remove lse & dropout & add fmt * fix the jenga&VSA code review * remove the useless code & resolved the comments * remove useless code * remove useless code * Clean up code * Remove more unused code * Re-format .hpp * Refactor codegen scripts --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2026-01-31 00:59:47 +08:00
Kiefer van Teutem	2377a62837	Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529) * Adding remaining flavors for grouped conv fwd As titled. Following variants are added: - grouped_conv2d_fwd_dynamic_op - grouped_conv3d_fwd_dynamic_op - grouped_conv3d_fwd_bilinear - grouped_conv3d_fwd_convscale - grouped_conv3d_fwd_convinvscale - grouped_conv3d_fwd_convscale_add - grouped_conv3d_fwd_convscale_relu - grouped_conv3d_fwd_scale - grouped_conv3d_fwd_combconvscale - grouped_conv3d_fwd_scaleadd_scaleadd_relu * Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3. * Do not build f8 / bf8 only flavor tests on RDNA3 * Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time. * Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream. * adding int8 and fp16 overloads to the elementwise operations * fixed copilot nits * Addressing review comments: - removed unnecessary examples for dynamic op - removed unnecessary conv specalizations for all the flavors - removed spurious bilinear and scale source files * clang-format * reduced no of tests --------- Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>	2026-01-30 17:02:14 +01:00
Erwin Terpstra	6a6177a246	[CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603) * chore: split block scale example instances in more separate files to speed up compile times * wip: fp4 scaffolding for abquant * feat: add fp4 decoding-while-loading to abquant pipeline * feat: add support for fp4 CPU verification in abquant * chore: add time tracking to reference calculation * feat: add a4w4 test for blockscale gemm * feat: optimize reference calculation by preconverting values to AccType * feat: add fp4 to fp8 look-up table * fix: reference to wrong ComputeDataType field in QuantProblem * feat: type utilities for determining MFMA compute types * feat: packed fp4 for abquant weight preshuffle * feat: add separate tests for a4w4 base case, padding and preshuffleB * fix: fp4 conversion on gfx950 attempting to use non-supported method * fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size * chore: add fp4 preshuffleb mode to block scale example * chore: sanity check for packed types being 1 byte * chore: clarify tensor dimension indices with constants * chore: replace traits check with specialized check for packed types * style: some minor refactoring and cleanup * fix: correct conversion table for FNUZ fp8 * chore: add fp4 instances to main abquant instances again * chore: use same initialization branch for int4 and fp4 * chore: add missing initialization for fp4 in block scale gemm example --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-30 04:40:50 -07:00
Zoltán Lakatos	565fea2645	fix undefined behaviour in softmax kernel (#3683) Co-authored-by: root <zoltan.lakatos@streamhpc.com>	2026-01-30 15:22:54 +08:00
MHYangAMD	6ff0737843	Fix redundant cast in model sensitive rmsnorm (#3681) * Fix redundant cast * Fix linting	2026-01-30 10:52:19 +08:00
Enrico Degregori	f16d9100e4	Multi AB support for wave transfer (#3578) * Add multi AB support to wave transfer * Improviments to multi ABD examples * Add instances and use intrawave v1 instead of interwave * Apply changes to other transfers * Wave transfer: add support for multiple internal vgpr buffers * Fix compilation error gfx11	2026-01-29 10:29:40 -08:00
Johannes Graner	fabac7e2c3	[Conv] Enable bwd weight splitk autodeduction with cap (#3656) * Enable bwd weight splitk autodeduction with cap * Fix error threshold calculations * Add missing logic to wmma multiple d kernel * Fix threshold calculation * Update test with new applicability	2026-01-29 17:40:28 +00:00
Khushbu Agarwal	9b168082b7	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and seperated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition	2026-01-28 19:45:09 -08:00
Jeff Huang	e3556fed04	Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657) - Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather - Add is_gather_dim() and get_gather_index() for multi-dim page lookup - Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1}) - Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread - Add static_assert for RowMajor VLayout requirement in batch prefill Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-01-29 07:18:41 +08:00
Bartłomiej Kocot	83b58bb0c3	Grouped Conv Bwd Weight Direct Load (#3648) * Grouped Conv Bwd Weight Direct Load * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp * Implement group merging for bwd_weight and add instances * Link direct load instances * builder fixes * fix * fixes * fix --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com>	2026-01-28 15:31:54 -06:00

1496 Commits