composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-24 06:44:36 +00:00

Author	SHA1	Message	Date
assistant-librarian[bot]	2dbf9c368b	Merge commit '51027474afe07ba069123f37798867270d59ac12' into develop	2026-01-14 00:40:04 +00:00
Thrupti Raj Lakshmana Gowda	183c01c8f1	[CK TILE ENGINE] CI fix for Basic Tile Engine (#3554 ) * memory op changes * memory op changes * Fixing TILE_ENGINE_BASIC in Tile Engine * Removing gfx90a from Tile Engine Run * [CK TILE ENGINE] increasing ci configs for BASIC case * Setting RUN_TILE_ENGINE_BASIC_TESTS to ON by default --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `51027474af`]	2026-01-13 16:20:30 -08:00
assistant-librarian[bot]	a648b6c373	Merge commit '00c46785a8a590bfe76b3fae20f23109a2685f4d' into develop	2026-01-13 18:17:38 +00:00
Thomas Ning	f444eab66c	Shuffle fix for gfx950 (#3491 ) * solve compiler issue * solve the gfx950 mfma shuffle regression * refactor jenkinsfile to handle arch name better * [CK TILE] set divisor to count of thread along k dimension * fix the compiler error * solve degradation * Finish the multiplies fix * fix the scales * solve compilation error * solve the composes * solve the error of tile sweeper * fix the test and example * fix for gfx950 --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `00c46785a8`]	2026-01-13 09:21:29 -08:00
assistant-librarian[bot]	97e2b52bf5	Merge commit '9908a87c311352057da5eed93271ce7ea575ad21' into develop	2026-01-13 16:17:05 +00:00
Ville Pietilä	4caaa64c39	[CK_BUILDER] Add bwd weight factories (#3509 ) * Add placeholder test. * Initial conv bwd weight factory. * Conv builder test refactoring. * Add missing pieces to bwd weight factory. * Improve compile time error message when no matching factory is found. * Use macro to ensure automatic matching between concepts and their string representations. * Improve compile time diagnostics. * Small improvements. * Improve missing member/wrong type compile-time errors. * Improve compile time diagnostics. * Concept bug fixes. * Remove debug assert. * Update algorithm signature diagnostics. * Factory bug fixes. * First functional version of bwd weight conv factory. * Refactor handling of GEMM-K batch template parameter in conv bwd weight factory. * Concept improvements. * Improve concept diagnostics. * Introduce a common size type for concepts. * Update compile-time diagnostics to use the size type. * Update conv specialization enum. * Fix fwd conv builder tests. * Fix smoke tests. * Separate bwd weight and bwd data tests into separate targets. * Clean-up CK Tile builder tests. * Add bwd weight XDL CShuffle V3 factory. * Build conv bwd weight v3 instances successfully. * Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Test fix. * Add instance traits for bwd weight algorithms. * Add unit tests for instance strings. * Build new instance traits unit tests but exclude WMMA for now. * Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Conv bwd weight DL factory. * Final implementation for bwd weight DL factory. * Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle * Treat ref algorithm the same way as real algorithms in the dispatcher. * Refactor large tensor support and WMMA configuration. * Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3. * Update Readme. * Fix WMMA bwd weight tests. 
* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3. * Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle. * Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 * Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weight convs. * Fix fwd factories after refactoring. * clang-format * Move compile-time diagnostics to a separate branch. * Fix ref algorithm dispatching. * Fix smoke tests. * clang-format * Fix factory for regular WMMA conv bwd weight. * Clarify builder Readme. * Remove obsolete test file. * Fix test after merge. * clang-format * Remove the C++26 extensions. * Unify conv elementwise ops and layout definitions for fwd and bwd directions. * Remove old layout and elementwise ops. * Unify handling of conv tensor types between fwd and bwd directions. * Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank. * Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms. * clang-format --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `9908a87c31`]	2026-01-13 18:12:38 +02:00
assistant-librarian[bot]	f9557c3692	Merge commit '710fa1fd3d317839ac9627751279f89ad610f20d' into develop	2026-01-13 15:17:57 +00:00
Po Yen Chen	83dac7e00f	fix incorrect List import in reduce_parameter.py (#3555 ) [ROCm/composable_kernel commit: `710fa1fd3d`]	2026-01-13 20:03:05 +05:30
assistant-librarian[bot]	25bf808899	Merge commit 'eb041079a36a767ccc8aa9a0a9d0e2822f352f03' into develop	2026-01-13 06:17:41 +00:00
Erwin Terpstra	18c8824e3c	Implement grouped gemm tile loop for RDNA4 (#3304 ) * feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3 [ROCm/composable_kernel commit: `eb041079a3`]	2026-01-13 07:14:23 +01:00
Jeff Huang	eb143eade0	[CK Tile] Fix FMHA LSE calculation and potential division by zero (#3326 ) This commit addresses numerical stability issues in the BlockFmhaPipelineQRKSVS pipeline when bias has -inf masking values: 1. Explicitly handle the case where the accumulated exponential sum (l) is zero. In this case, the LSE is now correctly set to negative infinity, preventing log(0) errors. 2. Extend the zero-check protection in the normalization step to cover the ELEMENTWISE_BIAS case, preventing potential division by zero. [ROCm/composable_kernel commit: `141f77aa12`]	2026-01-13 13:52:26 +08:00
assistant-librarian[bot]	dd7236189c	Merge commit 'c9f112b0267625016a58ce3465ee34232c85812b' into develop	2026-01-13 04:27:40 +00:00
Jeff Huang	908afb3a55	[FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545 ) - Enable page_size=1 support in batch prefill codegen (linear layout only). - Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly. - Relax `kPageBlockSize` alignment assertion for the page_size=1 case. [ROCm/composable_kernel commit: `c9f112b026`]	2026-01-13 12:04:43 +08:00
assistant-librarian[bot]	d05825d823	Merge commit 'a575acb245847d96d54c1e6d198748bda3e57952' into develop	2026-01-13 02:50:03 +00:00
ZheWang	91c829504a	fix mxfp8-gemm example failure (#3531 ) Co-authored-by: ZheWang <zhewan@amd.com> [ROCm/composable_kernel commit: `a575acb245`]	2026-01-13 10:26:45 +08:00
assistant-librarian[bot]	acd77f9c2c	Merge commit '5aaa0313503305ad697f6614836be87f8e0b281a' into develop	2026-01-12 18:17:03 +00:00
Aviral Goel	8dceee271e	WIP: extract MakeALdsDescriptor() from child to parent class for code readability (#3392 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `5aaa031350`]	2026-01-12 09:51:58 -08:00
Aviral Goel	23a1768487	refactor: remove Default scheduler implementation as it is not used anymore (#3542 ) * refactor: remove Default scheduler implementation as it is not used anymore * refactor: remove dead code from gemm universal kernel * chore: add descriptive comments about amd intrinsic hardware sync instructions * fix: label existing memory pipeline for aquant as intrawave [ROCm/composable_kernel commit: `e809861d49`]	2026-01-12 09:51:06 -08:00
assistant-librarian[bot]	bd27b4f097	Merge commit '18c2ff6019309d991c7f8d4d9c6f643191c28040' into develop	2026-01-12 11:13:21 +00:00
Johannes Graner	32e0beb399	[CK profiler] Perform verification on GPU when using GPU reference (#3482 ) * Simple verification kernel for ckProfiler * Verification kernel unit tests * Explicit synchronization * Address review comments [ROCm/composable_kernel commit: `18c2ff6019`]	2026-01-12 12:12:41 +01:00
assistant-librarian[bot]	d196ee4a2e	Merge commit '20f66c1e6b314a39533cac95b81e08f89645af2a' into develop	2026-01-12 09:19:02 +00:00
kabrahamAMD	529fbdc771	addressed review comments from PR3459 (#3526 ) Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `20f66c1e6b`]	2026-01-12 09:47:00 +01:00
Robin Voetter	61e6e155b0	ck-builder: tensor input/output reflection (#3536 ) This adds some utilities to automatically generate UniqueInputs, UniqueOutputs, alloc_inputs, alloc_outputs, and validate, based on a Inputs::reflect() and Outputs::reflect(). [ROCm/composable_kernel commit: `b352a68606`]	2026-01-12 09:45:53 +01:00
assistant-librarian[bot]	5de7a34a91	Merge commit '32408c8bc05b759ba62c2f97c9b7c3e808e2a6bc' into develop	2026-01-12 02:56:46 +00:00
yadaish	981c891757	moe fp8 blockscale use nt (#3524 ) * nt on fp8 blockscale * some improve and tests needs to be fixed * update * fix format * revert useless change * revert any change in amd_buffer_coherence [ROCm/composable_kernel commit: `32408c8bc0`]	2026-01-12 10:48:10 +08:00
assistant-librarian[bot]	96c8f16f1e	Merge commit '4216d43da86e08efad810671605cdb72a19dc026' into develop	2026-01-09 11:13:18 +00:00
damien-lejeune	58d8d793b1	Dlejeune/ck tile 2d multiple reductions (#3147 ) * WIP * Add Unit tests for the Multi Reduction Kernel * clang format * Rename multiblock to threadwise * Multiblock WIP * Fix multi reduce multi block unit tests * Multi Reduce Tile Engine: WIP * refactoring + try addressing precision error * Fix multiops examples * Cleanup * Clean up tile engine's reduce op * Update changelog * Fix remod/clang * Fix dates * Fix documentation & missing file * Fix comments * Use the update_tile api in the multi-block kernel * Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests * Add TileParitioner * Cleanup * Add warning when no data to process, in the example * Refactoring Reduce kernel Tile Partioner + cleanup * Move the tile partioner to its own file * Add missing includes * Fix copyright header with update_amd_copyright_headers.py * Fix change of interface in Reduce2dProblem --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `4216d43da8`]	2026-01-09 11:16:37 +01:00
assistant-librarian[bot]	f46a18694e	Merge commit 'e3884bbf0512f539a2ce0e1493e41fc19369911d' into develop	2026-01-08 09:17:23 +00:00
Robin Voetter	1a4deaded3	[CK_BUILDER] Debug utilities (#3528 ) * ck-builder: make toString to_string We are using snake case for CK-Builder * ck-builder: add debug.hpp with tensor descriptor printing function This adds some initial functionality to debug.hpp, a header which will be used to house some debug utilities. * ck-builder: abstract nd-iteration Abstracting this makes it easier to test, clearer, and allows us to use it elsewhere (such as in debug.hpp soon) * ck-builder: tensor printing * ck-builder: rename INT32 to I32 This makes it more in line with the other data type definitions. [ROCm/composable_kernel commit: `e3884bbf05`]	2026-01-08 10:14:13 +01:00
assistant-librarian[bot]	96eecd01e2	Merge commit '770a14494e944c803661c89575bf7be70fdbbfdf' into develop	2026-01-08 08:16:23 +00:00
Thrupti Raj Lakshmana Gowda	f8d1442908	Removing memop from chshuffle (#3530 ) [ROCm/composable_kernel commit: `770a14494e`]	2026-01-07 23:34:43 -08:00
assistant-librarian[bot]	cba48a5cab	Merge commit 'ee2c35b92db5ef4c4703935d203e9612e6b5f573' into develop	2026-01-08 07:16:26 +00:00
Johannes Graner	9d6add54e5	[CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169 ) * Take split_k into account when checking 2GB tensor limit. * Revert "Take split_k into account when checking 2GB tensor limit." This reverts commit `adf35c91be`. * Optimize grouped conv bwd wei split_k off calc (cherry picked from commit 2115642ee59050dabd81393c1b8f03b34adc05aa) * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp (cherry picked from commit 900d4d4b466f5730ae1189370d3c96267c35ea69) * Fix tensor descriptors and stride calculations * Don't miss half of the elements * Fix buffer size calculations * Disable hack if stride not divisible by k_batch * Clean up comments * Disallow hack in non-contiguous edge cases * Index -> Dim * Fix broken test * Refactor applicability checks into separate function * fix missed variable name * Fix variable name in info print * update V3 2GB check * No more regression, use templates instead * Code deduplication * Regression fix for cshuffle * arch-guarded atomic_add implementations for gfx11 * Similar for half(4\|8)_t as well * Only use both offset hacks at the same time * Revert "arch-guarded atomic_add implementations for gfx11" This reverts commit `3883fe6935`. This reverts commit `5311ec608d`. * Reapply "arch-guarded atomic_add implementations for gfx11" This reverts commit `1972adeddc`. * Only remove float4 atomic_add * Refactor to single flag * Consolidate template parameters * Consolidate flag in transformers --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `ee2c35b92d`]	2026-01-08 08:02:02 +01:00
Bartłomiej Kocot	5e0d3e77b9	[CK TILE] Fix grouped conv kernels splitk and double lds (#3527 ) [ROCm/composable_kernel commit: `bc497beffb`]	2026-01-08 07:59:38 +01:00
assistant-librarian[bot]	0c659dc743	Merge commit 'f449a5faaaf52a2194e82989bdb46b23392e97a3' into develop	2026-01-08 01:41:45 +00:00
Bartłomiej Kocot	dcc6ce0e22	Disable fp32 atomic adds on gfx11 (#3510 ) * Disable fp32 atomic adds on gfx11 * Fixes is supported [ROCm/composable_kernel commit: `f449a5faaa`]	2026-01-07 15:32:04 -08:00
assistant-librarian[bot]	a73a06fb1d	Merge commit 'aad4cf098511b3f58c5bd3c32e4534d438f7539c' into develop	2026-01-07 19:21:57 +00:00
Enrico Degregori	5a3fc30228	Wmma support for gemm_bias_add_reduce (#3316 ) * Add tests for gemm_bias_add_reduce * Initial working implementation * Generalize implementation of reduce epilogue * Add tests for all layouts * Add instances * Fix test archs * Fix xdl bug * Remove library/profiler duplications * Fix num_byted error profiler * Fix typos * Fix copyright [ROCm/composable_kernel commit: `aad4cf0985`]	2026-01-07 10:27:16 -08:00
Erwin Terpstra	2379b5e6e0	Implement grouped gemm fastgelu for RDNA4 (#3303 ) * Implement grouped gemm fastgelu for RDNA4 * chore: some cleanup and minor inconsistencies in grouped gemm profiler * chore: clarified logic and reporting of supported instance warnings [ROCm/composable_kernel commit: `f9c6ba0403`]	2026-01-07 10:20:44 -08:00
assistant-librarian[bot]	54e7d86ee2	Merge commit 'a7d6b1e7008c0b6e1af8a7d79389aefbdca4da65' into develop	2026-01-07 16:16:37 +00:00
John Shumway	a89756823c	Add unit test coverage for conversion to convolution traits (#3515 ) Our concept-based conversions are fragile and too complex. We want to refactor to straightforward functions for each instance trace class template. This change adds unit test coverage to make that refactoring safer. [ROCm/composable_kernel commit: `a7d6b1e700`]	2026-01-07 07:44:21 -08:00
Johannes Graner	acf98936bc	[CI, CK examples] Disable time_kernel for CI tests and examples (#3464 ) * Disable kernel timing in tests * default time_kernel = false in old CK examples [ROCm/composable_kernel commit: `0a474aa62f`]	2026-01-07 16:30:57 +01:00
assistant-librarian[bot]	850997ff67	Merge commit 'e8cc75aefbe365750cf79c1188014325578941d8' into develop	2026-01-07 15:15:08 +00:00
BrianHarrisonAMD	edc3e4a870	Enable offload-compress for Windows if available (#3521 ) [ROCm/composable_kernel commit: `e8cc75aefb`]	2026-01-07 07:05:03 -08:00
assistant-librarian[bot]	bb614ee8b2	Merge commit 'd7497d26948ca90d0224920472712e0f657fb744' into develop	2026-01-07 08:16:44 +00:00
Cong Ma	cdd9dafe6a	[CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512 ) Refactor function amd_buffer_load_invalid_element_return_zero to avoid the inefficient ASM code generated by the compiler. The compiler generates suboptimal assembly for the ternary operator, causing excessive VGPR usage. Tested compilers: - ROCm 7.0.1 - ROCm 7.1.1 Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `d7497d2694`]	2026-01-07 00:05:56 -08:00
assistant-librarian[bot]	9ec8eac079	Merge commit 'aaa35f0bbfa45dadc4380ddd6e0224668ddb97b4' into develop	2026-01-06 21:12:56 +00:00
Khushbu Agarwal	c33704febc	[CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and corresponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * initial commit with prints * debugging * fine-grained working * debugging medium grained * fixing the tile window * formatting * enabling prefill shapes * working prefill shapes * formatted * clean up * code cleanup * bug fix after merging with develop * clean up after merging with develop * added comments for the tile window and tile distribution encoding --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com> [ROCm/composable_kernel commit: `aaa35f0bbf`]	2026-01-06 12:46:59 -08:00
kyle-256	9489e197c3	[CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452 ) * update grouped_gemm blockwise kernel * update config * update kernel * update examples * remove test code for now * sync test files with origin/develop * update example * fix code lint * fix code-lint * update test code * run clang format * run pre-commit * update api [ROCm/composable_kernel commit: `76696ace44`]	2026-01-06 12:36:04 -08:00
kensclin	df198bd5af	[CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495 ) * [CK_TILE] add preshuffleB mode for ABQuant GEMM * fix precommit error * use template method call for cvt_scale_to_fp32 * fix precommit error * add test code * fix precommit error * switch abquant gemmconfig to default * Add changelog.md * fix precommit error * fix conflict [ROCm/composable_kernel commit: `2309c86054`]	2026-01-06 12:35:01 -08:00


3885 Commits