composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 02:02:46 +00:00

Author	SHA1	Message	Date
assistant-librarian[bot]	39405747ab	Merge commit '3900e1e7ceacfa32cb8d1522260ed30befd4dae3' into develop	2026-01-26 19:16:22 +00:00
Thomas Ning	8f972ba2d2	Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634 ) * Finished the work * Fix the pipeline [ROCm/composable_kernel commit: `3900e1e7ce`]	2026-01-26 10:29:28 -08:00
SamiAario-AMD	b07fbbc33a	Re enable f8 x bf8 tests on compv3 and compv4 (#3605 ) * Re-enable f8 x bf8 tests on CompV3 as they now pass * On CompV4, fp8 x bf8 tests now pass with K_BlockSize I32 * Add a changelog entry --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `834642202c`]	2026-01-26 10:23:26 -08:00
chris-tsiaousis-hpc	ea30b43692	Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617 ) * Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `917f35553a`]	2026-01-26 10:20:30 -08:00
assistant-librarian[bot]	06fb853279	Merge commit 'de59c0716c631edfa4742e4309ee11d4379ef6e8' into develop	2026-01-26 18:17:51 +00:00
Max Podkorytov	8ae166963e	Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585 ) This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases [ROCm/composable_kernel commit: `de59c0716c`]	2026-01-26 10:08:55 -08:00
Illia Silin	9c3cc098c4	add dockerfile for manylinux (#3651 ) [ROCm/composable_kernel commit: `054c437dec`]	2026-01-26 09:23:19 -08:00
assistant-librarian[bot]	8968bceee4	Merge commit '7ac379428408337a231a86f8a8b7353b5b45aa2d' into develop	2026-01-25 13:22:29 +00:00
Ville Pietilä	a622665d78	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639 ) Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `7ac3794284`]	2026-01-25 13:42:23 +01:00
assistant-librarian[bot]	6a21c125a0	Merge commit 'f5c2f09036cdc22dc8944719215dd47003c50a24' into develop	2026-01-24 00:38:47 +00:00
Emily Martins	d0b16dc545	[CK_TILE] Fix alignment in Stream-K workspace buffer (#3625 ) * Fix alignment issue in Stream-K workspace buffer In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests. * Compute reference GEMM on GPU for test verification to decrease testing time [ROCm/composable_kernel commit: `f5c2f09036`]	2026-01-23 16:14:22 -07:00
assistant-librarian[bot]	e2e058bcbc	Merge commit 'e1c46ff548cf7bc8b0e1b41a3d559f05317ec2da' into develop	2026-01-23 21:13:05 +00:00
chris-tsiaousis-hpc	e0af034e82	Remove code duplications in batched gemm wmma (#3580 ) * Moved device struct for batched gemm wmma to a common file Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Use the common device struct in the scaled batched gemm wmma implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: Remove unused includes and ambiguous comment Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved pointer offset calculation and gridwise argument to common struct This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common" Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `e1c46ff548`]	2026-01-23 12:39:03 -08:00
assistant-librarian[bot]	c244e1454a	Merge commit '67f0b74ec6687192fac14c359c57aca237d3cf2a' into develop	2026-01-23 17:17:05 +00:00
ltqin	2bf0f9a3fc	Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 )" (#3635 ) This reverts commit `c495edb11c`. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `67f0b74ec6`]	2026-01-23 09:03:22 -08:00
assistant-librarian[bot]	078593e052	Merge commit '2e08a7e5ab51b020c90008b45c75dc35c2ba426c' into develop	2026-01-23 12:19:23 +00:00
Wojciech Laskowski	a340b8f8fd	WMMA grouped conv fwd large tensor bias bnorm clamp (#3595 ) * Added bias_bnorm_clamp for WMMA conv fwd large tensor. Following operations are added for FP16/BF16 data type and NHWGCxGKYXC layout. - grouped_conv2d_fwd_bias_bnorm_clamp - grouped_conv3d_fwd_bias_bnorm_clamp * changed strategy to handle GemmArgs array * Adding generic instance * fixed last nits from reviewers and copilot [ROCm/composable_kernel commit: `2e08a7e5ab`]	2026-01-23 12:20:00 +01:00
Wojciech Laskowski	21ab5fbe49	WMMA grouped conv fwd large tensor extra flavors (#3582 ) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accomodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops [ROCm/composable_kernel commit: `81ee19bd2c`]	2026-01-23 12:19:51 +01:00
assistant-librarian[bot]	ebc79763fb	Merge commit '7b3db1a878181004fc5db7cdb82840623beaadb5' into develop	2026-01-23 10:14:37 +00:00
Bartłomiej Kocot	237d22d6ca	Grouped conv fwd direct load vector=2 (#3632 ) [ROCm/composable_kernel commit: `7b3db1a878`]	2026-01-23 10:29:59 +01:00
assistant-librarian[bot]	36253637d1	Merge commit 'de5a1d730dc77d1471ad53ca18dfd7c1474e9873' into develop	2026-01-23 07:17:14 +00:00
Po Yen Chen	c495edb11c	Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 ) This reverts commit `71e8734c32`. [ROCm/composable_kernel commit: `de5a1d730d`]	2026-01-22 21:21:19 -08:00
assistant-librarian[bot]	e0b24cbecc	Merge commit 'f30d04654e6bb9b064cf96c6bb4e3fff960efbd8' into develop	2026-01-23 00:39:51 +00:00
damien-lejeune	b0623aebc2	Add missing check target in reduce tile engine op (#3631 ) Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> [ROCm/composable_kernel commit: `f30d04654e`]	2026-01-22 16:06:02 -08:00
assistant-librarian[bot]	88d27d2141	Merge commit 'eb2dc8f466cd2978490ccc3ff794d898cad9535a' into develop	2026-01-22 23:13:55 +00:00
Vidyasagar Ananthan	4895336494	Speed up glob recurse. (#3626 ) [ROCm/composable_kernel commit: `eb2dc8f466`]	2026-01-22 14:44:47 -08:00
assistant-librarian[bot]	1654f02845	Merge commit 'b9bb1db5d932c4c0445994cfc1d37f66a3744659' into develop	2026-01-22 21:15:42 +00:00
arai713	2bef359e0e	Addition of Stream-K tests using Tile Engine (#3514 ) * Addition of Stream-K tests using Tile Engine This change adds an implementation for generating Stream-K tests using Tile Engine. This will generate various test executables for different combinations based on the config files. This addition has simple tests running for bf16 and fp16, with both atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation of Stream-K in Tile Engine. * integrating addition of tree reduction and editing the README * temporarily removing parallel and tree reduction from configs while bugs regarding them are being resolved [ROCm/composable_kernel commit: `b9bb1db5d9`]	2026-01-22 12:53:52 -08:00
assistant-librarian[bot]	84e961765d	Merge commit '31a35ecab4e403f63ec4b76f4a709c21172c39de' into develop	2026-01-22 18:17:10 +00:00
kensclin	81771f8b1e	GEMM Blockscale ABQuant Optimization (#3620 ) * GEMM Blockscale ABQuant Optimization * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix precommit error * clean * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> [ROCm/composable_kernel commit: `31a35ecab4`]	2026-01-22 09:39:38 -08:00
Vidyasagar Ananthan	8763bbf6cf	Adding dispatcher architecture (#3300 ) * WIP POC of dispatcher * Dispatcher python workflow setup. * Dispatcher cleanup and updates. Further dispatcher cleanup and updates. Build fixes Improvements and python to CK example Improvements to readme * Fixes to python paths * Cleaning up code * Improving dispatcher support for different arch Fixing typos * Fix formatting errors * Cleaning up examples * Improving codegeneration * Improving and fixing C++ examples * Adding conv functionality (fwd,bwd,bwdw) and examples. * Fixes based on feedback. * Further fixes based on feedback. * Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug. * Another round of improvements based on feedback. * Trimming out unnecessary code. * Fixing the multi-D implementation. * Using gpu verification for gemms and fixing convolutions tflops calculation. * Fix counter usage issue and arch filtering per ops. * Adding changelog and other fixes. * Improve examples and resolve critical bugs. * Reduce build time for python examples. * Fixing minor bug. * Fix compilation error. * Improve installation instructions for dispatcher. * Add docker based installation instructions for dispatcher. * Fixing arch-based filtering to match tile engine. * Remove dead code and fix arch filtering. * Minor bugfix. * Updates after rebase. * Trimming code. * Fix copyright headers. * Consolidate examples, cut down code. * Minor fixes. * Improving python examples. * Update readmes. * Remove conv functionality. * Cleanup following conv removable. [ROCm/composable_kernel commit: `9e049a32a1`]	2026-01-22 09:34:33 -08:00
assistant-librarian[bot]	5289967b9b	Merge commit '44f481a45ca75b234ba60fdc3dc68974b1b86164' into develop	2026-01-22 14:20:23 +00:00
Bartłomiej Kocot	6afa598838	[CK TILE] Fix basic gemm pipelines (#3611 ) * [CK TILE] Fix basic pipelines * fixes [ROCm/composable_kernel commit: `44f481a45c`]	2026-01-22 08:11:18 -06:00
assistant-librarian[bot]	872c034358	Merge commit '8daf6ea3026aebe3481792c03026692631059725' into develop	2026-01-22 09:19:13 +00:00
ApoorvaKalyani	513b14c5f2	Grouped conv_fwd_bias_bnorm_clamp instances and tests (#3525 ) * Added bias_bnorm_clamp instances. * fwd_bias_bnorm_clamp comp instances * fwd_bias_bnorm_mem_inter and mem_intra instances * fwd_bias_bnorm_merged_group_instances * fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances * Device level changes for fwd_bias_bnorm_clamp * Added the test to the regression test list. * Removed the part 2 and 2x instances * Removed the irrelevant checks in wmma * Refactored the instances to adapt to new device implementation * Updated the reference and include files * enabling tests * Added missing profiler * Added missing instance entry , deleted by mistake * Reduce bias bnorm clamp instances to only a single generic one. * Clean up cmakelists file * clang-format * Change bias bnorm clamp tests to use monotone initialization values to avoid tiny off-integer gemm results on RDNA3 from blowing up. * Renaming some instance lists and add functions to be more standardized. * Commented out non default instances. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `8daf6ea302`]	2026-01-22 09:53:59 +01:00
assistant-librarian[bot]	8f10da355a	Merge commit '0b13697a88e77a733d36b14353df1c0a7ae756df' into develop	2026-01-22 08:17:11 +00:00
Linjun-AMD	bdea62e96c	[CK_TILE][FMHA]Add new tile size for async (#3623 ) * Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)" This reverts commit `30ac278911`. * Add new tile_size for async pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0b13697a88`]	2026-01-22 16:07:14 +08:00
assistant-librarian[bot]	45e2275fc4	Merge commit 'dd0b4294afcf188f4a9154b7eea19f8e786c9539' into develop	2026-01-22 05:20:00 +00:00
ltqin	71e8734c32	Fp8 block scale quantization for fmha fwd (#3330 ) * add block scale parameters to kernel * add block scale to kernel * add smoke test * format * Revert "format" This reverts commit `356c3c9706`. * only format my code * format py * fix auto not allowd in function prototype * change instance tttt to ttff * fix structured binding issue * change s_acc elementwise op * async pipeline add block scale * add quantation P using shift exp2 * precompute (m - shift) once per row * change blk scale seqstrt ptr name * fix some name * fix for deduction guide * fix some comments * add P scale to qr_ksvs_pipeline * add comment to idx_identity * change the method of calculating descale block index * unify naming style: use block_scale_ as name prefix * unify naming style * update the CHANGELOG.md * Add FP8 block scale quantization support for FMHA forward kernel --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `dd0b4294af`]	2026-01-21 20:58:26 -08:00
assistant-librarian[bot]	4d2856612c	Merge commit '4c2c18ef486641d1493f3dc272a1e0e079676308' into develop	2026-01-22 02:55:52 +00:00
Michał Kulikowski	4077353b6a	[CK][Examples] Extending support for rdna3/4 part 4: (#3264 ) * [CK][Examples] Extending support for rdna3/4 part 4: -example_gemm_xdl_streamk -example_gemm_xdl_fp16_fp8_v3 -example_gemm_xdl_fp16_v3 Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> * [CK][Examples] Revert example\01_gemm\gemm_xdl_streamk parameters change. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> --------- Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `4c2c18ef48`]	2026-01-21 18:10:16 -08:00
assistant-librarian[bot]	aadd581b8d	Merge commit '1040d9b1f53945867d78d0bbcf03de65ee01aea3' into develop	2026-01-21 18:24:44 +00:00
Robin Voetter	2cff6c74fc	[CK_BUILDER] Replace reference conv with old ck implementation (#3604 ) * ck-builder: remove SPATIAL_DIM parameter from ConvTensorLayouts This information is already in the SIGNATURE, so its pointless to pass it separately. This streamlines the interface of those functions a bit. Also touches up the style of those files in general. * ck-builder: implement reference conv using old ck The old ck implementation is more featureful and better tested. * ck-builder: replace test_reference_execution reference with old ck This strips out the ck-tile gpu reference implementation completely. * ck-builder: clean up test_reference_execution - Remove unneccesary messages - Replace EXPECT_TRUE(true) with EXPECT_NO_THROW() [ROCm/composable_kernel commit: `1040d9b1f5`]	2026-01-21 19:18:47 +01:00
andrew clark	b66bbac9ea	Sanitizing URL-encoded characters from the image file name (#3622 ) [ROCm/composable_kernel commit: `0fbb3bb8c4`]	2026-01-21 11:00:53 -07:00
assistant-librarian[bot]	579d2eb5fb	Merge commit 'f41f37da969d8f0dbcf590b72e5ac8e74e8846b6' into develop	2026-01-21 16:34:17 +00:00
Yi DING	f7d5c3a34c	Add CMakePresets.json (#3284 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `f41f37da96`]	2026-01-21 08:04:24 -08:00
assistant-librarian[bot]	8fbde9114b	Merge commit 'fcc9372c009c8e0a23fece77b582da83b04a654f' into develop	2026-01-21 02:52:11 +00:00
Yi DING	f3b962ecc4	[CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615 ) [ROCm/composable_kernel commit: `fcc9372c00`]	2026-01-21 09:54:46 +08:00
assistant-librarian[bot]	b2c76ff10f	Merge commit 'd5ae81b2922773f7cdf4a02a2e1fd57d0e4df851' into develop	2026-01-20 22:14:29 +00:00
Erwin Terpstra	fa56471c91	Implement batched gemm add relu gemm add for rdna4 (#3391 ) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multipled * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `d5ae81b292`]	2026-01-20 13:06:59 -08:00
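
Commit 8ae166963e (#3585) above describes generating a sequence by first producing all indices in one step and then applying a functor through pack expansion, so the generator itself does not recurse. The following is a minimal sketch of that idea, not the code from the commit. It assumes a hypothetical `sequence` type and a hypothetical `double_it` functor, and it uses the standard `std::make_index_sequence` as a portable stand-in for the `__make_integer_seq` builtin named in the commit message. How deep `std::make_index_sequence` instantiates is up to the standard library implementation.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a compile-time integer sequence type.
template <int... Is>
struct sequence
{
};

// Hypothetical functor: maps index i to 2 * i.
struct double_it
{
    constexpr int operator()(int i) const { return 2 * i; }
};

namespace detail {

// Helper that ignores its argument and returns the constant V.
template <int V>
constexpr int always(std::size_t)
{
    return V;
}

template <typename F, typename Indices>
struct sequence_gen_impl;

// All indices arrive as one pack; the functor is applied to each index
// in a single pack expansion, with no recursion in this helper.
template <typename F, std::size_t... Ids>
struct sequence_gen_impl<F, std::index_sequence<Ids...>>
{
    using type = sequence<F{}(static_cast<int>(Ids))...>;
};

template <int V, typename Indices>
struct uniform_sequence_gen_impl;

// Same pattern, but every index is mapped to the constant V.
template <int V, std::size_t... Ids>
struct uniform_sequence_gen_impl<V, std::index_sequence<Ids...>>
{
    using type = sequence<always<V>(Ids)...>;
};

} // namespace detail

// sequence_gen_t<N, F> is sequence<F(0), F(1), ..., F(N-1)>.
template <std::size_t N, typename F>
using sequence_gen_t =
    typename detail::sequence_gen_impl<F, std::make_index_sequence<N>>::type;

// uniform_sequence_gen_t<N, V> is a sequence of N copies of V.
template <std::size_t N, int V>
using uniform_sequence_gen_t =
    typename detail::uniform_sequence_gen_impl<V, std::make_index_sequence<N>>::type;
```

With these definitions, `sequence_gen_t<4, double_it>` names the type `sequence<0, 2, 4, 6>`, and `uniform_sequence_gen_t<3, 7>` names `sequence<7, 7, 7>`. The functor must be default-constructible with a `constexpr` call operator so that it can be evaluated inside a template argument list.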


4099 Commits