composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-18 01:28:27 +00:00

Author	SHA1	Message	Date
assistant-librarian[bot]	9b81070cf3	[Conv] Add NumGroupsToMerge to BwdWeight type string (#4271 ) ## Proposed changes Add parameter to bwd weight V3 type string showing the number of groups to merge. This is required for MIOpen to be properly tuned since it uses type strings for performance database entries. In order to not break existing tuning databases, the parameter is added as a named suffix and only when group merging is enabled. --- 🔁 Imported from [ROCm/composable_kernel#3679](https://github.com/ROCm/composable_kernel/pull/3679) 🧑‍💻 Originally authored by @johannes-graner --------- Co-authored-by: Graner, Johannes <Johannes.Graner@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2026-02-11 10:07:53 +01:00
assistant-librarian[bot]	201fec5e8a	Add a README.md file to ck/library/util (#4277 ) I'm collecting information about our current testing (#3664). As part of this work I added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances. This readme contains internal code comments for CK developers and does not need ROCm documentation review. --- 🔁 Imported from [ROCm/composable_kernel#3665](https://github.com/ROCm/composable_kernel/pull/3665) 🧑‍💻 Originally authored by @shumway Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-10 21:26:45 +00:00
Bartłomiej Kocot	784a03af29	[CK] Fix grouped conv fwd transform for merged groups (#4399 ) ## Motivation [CK] Fix grouped conv fwd transform for merged groups for 1d and 3d. ## Technical Details After optimizations for 2d there is a lack of implementation for 1d and 3d ## Test Plan test_grouped_convnd_fwd ## Test Result pending CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-09 09:36:52 -06:00
assistant-librarian[bot]	bb15392230	[CK] Add fwd conv group merging to v3 conv instances (#4273 ) ## Proposed changes Added conv group merging to the (universal) V3 fwd conv pipeline. The new instance improves fwd conv performance when the number of input/output channels per group is low. On MI300 (`gfx942`) we get the following (CK prof command: Baseline → V3 group merging, in TFLOPS): `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1`: 3.86035 → 8.36796; `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1`: 10.1867 → 13.4677; `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1`: 11.7875 → 16.3657. --- 🔁 Imported from [ROCm/composable_kernel#3675](https://github.com/ROCm/composable_kernel/pull/3675) 🧑‍💻 Originally authored by @vpietila-amd --------- Co-authored-by: Ville Pietilä <> Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2026-02-08 12:34:59 +01:00
Enrico Degregori	f18a97a1f2	[CK] Workaround blockscale wp test failure (#4372 ) ## Motivation Workaround to fix blockscale wp test failure for pipeline v3	2026-02-06 16:09:08 -08:00
Illia Silin	2a054fc767	[CK] a bunch of CI fixes. (#4361 ) ## Motivation Fixing some of the CK CI issues ## Technical Details fixing paths to dockerfiles and scripts; moving codegen tests to separate stage (collides with main build since you must call cmake from same folder but different options); fixing a couple of clang compilation issues with staging compiler;	2026-02-05 20:06:57 -05:00
Illia Silin	aef327296e	Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 )" (#3705 ) This reverts commit 372a284890dc19cfd3c241c3e9a6076d35e843a5. [ROCm/composable_kernel commit: `569640dc70`]	2026-02-03 09:52:14 -08:00
Max Podkorytov	0f8c7cad09	Remove concrete performance numbers from BUILD_TIME_OPTIMIZATION.md (#3702 ) Replace specific benchmark numbers with qualitative descriptions since measurements vary across environments and may become outdated. Co-authored-by: Claude <noreply@anthropic.com> [ROCm/composable_kernel commit: `3f04d27b68`]	2026-02-03 03:54:18 -07:00
Illia Silin	a5a7527d76	Fix one more lifetimebound error. (#3703 ) * fix staging compiler errors * fix clang format [ROCm/composable_kernel commit: `8b56ffb6ae`]	2026-02-02 18:25:56 -08:00
Zoltán Lakatos	839a37780c	Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 ) * device struct implementation * added xdl grouped multi abd fixed nk testing * wmma implementation fixed * avoid unnecessary device mem allocation and code cleanups * cleanup instances definitions * wmma examples added * code cleanups * fix clang format * typo and compilation fixes related to reference gemm * fix compilation error due to std::remove_cvref_t * added missing hip_check_error includes * correction to example instances * review comments addressed * removed split-k from testing * code formatting --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `301eb5cf08`]	2026-02-02 13:58:11 -08:00
Jan Patrick Lehr	470f031e58	[Compiler] Addressing new compiler warnings (#3640 ) * [Compiler] Addressing new compiler warnings Clang enables new lifetime warnings in production and we see build errors due to this with the staging compiler. The attributes added in this PR are suggested by the compiler. However, I'm not very familiar with the code base, so the changes may be incorrect. * Update some more instances * Adds file-level ignores via clang diagnostic pragma The number of instances was large, so I decided to use file-level scope to disable the warning via pragma clang diagnostic ignored. It also showed this warning coming from the gtest dependency. For that, I did add the respective command line flag to the CMake variables. I don't know if this is acceptable or not. * This adds the remaining instances For a build on gfx90a. * fix clang format * Adding couple more instances from gfx1200 build * Fixed another few instances --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `069500464d`]	2026-02-02 09:39:48 -08:00
ApoorvaKalyani	55f0489b03	Test fix for gemm_b_scale_xdl_v3. (#3674 ) [ROCm/composable_kernel commit: `70d71b1514`]	2026-01-30 10:34:54 -07:00
Kiefer van Teutem	65c2e81817	Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529 ) * Adding remaining flavors for grouped conv fwd As titled. The following variants are added: - grouped_conv2d_fwd_dynamic_op - grouped_conv3d_fwd_dynamic_op - grouped_conv3d_fwd_bilinear - grouped_conv3d_fwd_convscale - grouped_conv3d_fwd_convinvscale - grouped_conv3d_fwd_convscale_add - grouped_conv3d_fwd_convscale_relu - grouped_conv3d_fwd_scale - grouped_conv3d_fwd_combconvscale - grouped_conv3d_fwd_scaleadd_scaleadd_relu * Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3. * Do not build f8 / bf8 only flavor tests on RDNA3 * Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time. * Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream. * adding int8 and fp16 overloads to the elementwise operations * fixed copilot nits * Addressing review comments: - removed unnecessary examples for dynamic op - removed unnecessary conv specializations for all the flavors - removed spurious bilinear and scale source files * clang-format * reduced number of tests --------- Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com> [ROCm/composable_kernel commit: `2377a62837`]	2026-01-30 17:02:14 +01:00
Zoltán Lakatos	e7483043e6	fix undefined behaviour in softmax kernel (#3683 ) Co-authored-by: root <zoltan.lakatos@streamhpc.com> [ROCm/composable_kernel commit: `565fea2645`]	2026-01-30 15:22:54 +08:00
Enrico Degregori	a07d76a460	Multi AB support for wave transfer (#3578 ) * Add multi AB support to wave transfer * Improvements to multi ABD examples * Add instances and use intrawave v1 instead of interwave * Apply changes to other transfers * Wave transfer: add support for multiple internal vgpr buffers * Fix compilation error gfx11 [ROCm/composable_kernel commit: `f16d9100e4`]	2026-01-29 10:29:40 -08:00
Johannes Graner	1998be34bf	[Conv] Enable bwd weight splitk autodeduction with cap (#3656 ) * Enable bwd weight splitk autodeduction with cap * Fix error threshold calculations * Add missing logic to wmma multiple d kernel * Fix threshold calculation * Update test with new applicability [ROCm/composable_kernel commit: `fabac7e2c3`]	2026-01-29 17:40:28 +00:00
Bartłomiej Kocot	c2892466a9	Grouped Conv Bwd Weight Direct Load (#3648 ) * Grouped Conv Bwd Weight Direct Load * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp * Implement group merging for bwd_weight and add instances * Link direct load instances * builder fixes * fix * fixes * fix --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com> [ROCm/composable_kernel commit: `83b58bb0c3`]	2026-01-28 15:31:54 -06:00
Robin Voetter	97d6e59580	[CK_BUILDER] Integrate CKB validation with CK verification (#3649 ) * ck-builder: tensor copy function This function copies one tensor to another, so that the memory layout can be changed between them. * ck-builder: fix ck::bhalf literals These types don't work properly. * ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it This reduces the amount of duplicated code a bit. * ck-builder: add flat tensor iterator This "iterator" type pretends to be a pointer, useful for passing tensors to functions expecting pointer-like types. * ck-builder: integrate validation with ck gpu verification By templating the gpu_verify function over iterators, we can use the new FlatTensorIterator to adapt the function to multi-dimensional tensors without changing either implementation too much. * ck-builder: add check_by_accumulations This changes the gpu_verification.hpp code to also accept "iterator" types for the relevant gpu_verify and gpu_reduce_max functions. * ck: fix test_gpu_verification GenerateRandomData for bhalf is_integer_it<bhalf_t> yields true, but it is not actually an integer. * ck: make gpu_verification kernels be proper persistent kernels Previously these were using a hardcoded value for the grid size. This commit changes that so that the grid size is automatically derived from the kernel's occupancy and the number of multiprocessors on the GPU. * ck: clean up gpu_verification.hpp using block_reduce This implements a small generic block reduce function, and rewrites the rest of gpu_verification.hpp using that function to clean it up a bit. * ck-builder: doc typos * ck-builder: update testing readme with validation interface. * ck-builder: rebase fixes + review comments * ck-builder: fix device integer generation with float types Passing bfloat here causes NaNs due to type_convert performing a bitcast. 
* ck: another bhalf_t bug CK expects that int-generation with ck::bhalf_t yields bhalf integers, not unsigned integers. This makes the logic of FillUniformRandInteger compatible with GeneratorTensor_2<InDataType>, however idiotic that may be. [ROCm/composable_kernel commit: `42048bdb7d`]	2026-01-28 17:41:02 +01:00
linqunAMD	e9af74cb84	[ck] add gridwise base class for all xdl kernels (#186 ) (#3544 ) 1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes. - To select the correct LDS layout and epilogue behavior, three additional parameters are added. - ForceNaiveLdsLayout: disable XOR based LDS layout when it is true - DirectLoad: the pipeline only uses direct load; we need to force the naive layout and ignore any padding on gfx9 - IsMxGemm: epilogue has two additional dimensions 2. Move all LDS descriptor layout related functions to base class, including - GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1 - GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock 3. Move several LDS related helper functions to base class, including - GetSharedMemoryNumberOfByte - GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1 - GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl 4. Move all c epilogue related code to base class, and 4 kinds of implementation are provided - RunEpilogueNoShuffle - RunEpilogue - RunMultiDEpilogue - RunMoeEpilogue [ROCm/composable_kernel commit: `23cefda140`]	2026-01-27 12:49:47 -08:00
Michał Kulikowski	8130aa058e	[CK]Refactoring threadwise_tensor_slice_transfer_v3r1.hpp (#3263 ) Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b737f1dee5`]	2026-01-27 10:48:16 -08:00
Illia Silin	71ac48d63a	fix some syntax errors (#3658 ) [ROCm/composable_kernel commit: `b26cb596b0`]	2026-01-27 09:59:39 -08:00
Max Podkorytov	078912ec20	Add build time optimization documentation (#3608 ) This document describes techniques for reducing C++ template instantiation overhead in the Composable Kernel codebase, including: - Replacing recursive templates with pack expansion (O(N) → O(1) depth) - Using named functors instead of lambdas to share instantiations - Replacing template recursion with constexpr loops - Using fold expressions for accumulation operations These techniques can significantly reduce build times for template-heavy code. [ROCm/composable_kernel commit: `b66597ed96`]	2026-01-27 06:07:27 -07:00
Johannes Graner	eb72f85509	[CK tests] Extend conv GPU reference (#3539 ) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible [ROCm/composable_kernel commit: `c190d8d61f`]	2026-01-27 09:49:42 +01:00
Enrico Degregori	f2c7d07666	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format [ROCm/composable_kernel commit: `2e49b6b2f7`]	2026-01-26 12:57:09 -08:00
yinglu	b980f0febe	ck: add CK_USE_GFX950 macro (#3636 ) [ROCm/composable_kernel commit: `8942a19d5e`]	2026-01-26 11:38:45 -08:00
chris-tsiaousis-hpc	4de19a1601	Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617 ) * Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `917f35553a`]	2026-01-26 10:20:30 -08:00
Max Podkorytov	bebf8c3720	Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585 ) This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases [ROCm/composable_kernel commit: `de59c0716c`]	2026-01-26 10:08:55 -08:00
Ville Pietilä	e587756695	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639 ) Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `7ac3794284`]	2026-01-25 13:42:23 +01:00
chris-tsiaousis-hpc	3c247733af	Remove code duplications in batched gemm wmma (#3580 ) * Moved device struct for batched gemm wmma to a common file Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Use the common device struct in the scaled batched gemm wmma implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: Remove unused includes and ambiguous comment Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved pointer offset calculation and gridwise argument to common struct This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `e1c46ff548`]	2026-01-23 12:39:03 -08:00
Wojciech Laskowski	ee595ee58a	WMMA grouped conv fwd large tensor extra flavors (#3582 ) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accommodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops [ROCm/composable_kernel commit: `81ee19bd2c`]	2026-01-23 12:19:51 +01:00
Erwin Terpstra	b079841b10	Implement batched gemm add relu gemm add for rdna4 (#3391 ) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multiple d * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input 
values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `d5ae81b292`]	2026-01-20 13:06:59 -08:00
music-dino	750bd72b3d	Batched gemm softmax gemm descriptor fix (#3564 ) * Add rocm to prefix path for codegen * Fix issue with c0_matrix_mask construction [ROCm/composable_kernel commit: `6300ad3c62`]	2026-01-20 07:25:30 -08:00
Wojciech Laskowski	6ad65bc855	WMMA support for batched_gemm_reduce (#3332 ) Summary: - added new device impl of Batched GEMM Reduce for WMMA - added instance library - added WMMA impl to the Batched GEMM Reduce tests [ROCm/composable_kernel commit: `b09121f860`]	2026-01-20 10:50:46 +01:00
Bartłomiej Kocot	85c5741492	[CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518 ) * [BUILDER] Add grouped conv fwd ck tile profiler * [CK TILE] Fix grouped conv kernels splitk and double lds * Updates * Fixes * Move to ckProfiler * Fixes * fix * fix * Change instances to empty list by default * fix * fix * Update grouped_convolution_signatures.hpp * Update grouped_convolution_forward_tile_algs.hpp * [CK TILE] Add grouped convolution forward tests (#3556) * [CK TILE] Add grouped convolution forward tests * fix jenkins * fixes * comments fixes * unit test * unit test fix * Move instances outside builder * fix includes * clang format fix * readme fix * fix includes * fixes [ROCm/composable_kernel commit: `0727e85e52`]	2026-01-19 22:29:01 -07:00
Erwin Terpstra	9c660bfbe3	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method [ROCm/composable_kernel commit: `fe40a5d139`]	2026-01-17 08:30:27 +01:00
logicat	fb918acff9	Remove unnecessary hip_fp16 include from stream_config (#3549 ) [ROCm/composable_kernel commit: `fec81109f1`]	2026-01-16 10:40:05 -08:00
Yung-sheng Tu	97f2fa2912	Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429 ) * add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples [ROCm/composable_kernel commit: `6df2d70143`]	2026-01-15 07:19:31 -08:00
John Shumway	753043b27a	[CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547 ) * Factor helpers out of conv_traits.hpp * Create a non-templated conv_traits struct * Migrate to new instance-specific instance_to_conv_traits functions * Clean up reflection concepts * Clean up ConvTraits helpers * Update testing for convolution traits This is a lot of cleanup on tests to have verbose coverage of feature extraction, explicit tests for each supported device kernel, and simple, readable test code. * Address reviewer comments and resolve merge conflict [ROCm/composable_kernel commit: `5122637215`]	2026-01-15 10:03:21 +01:00
Bartłomiej Kocot	8c72adabeb	Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566 ) [ROCm/composable_kernel commit: `a346cfa960`]	2026-01-14 12:37:12 -08:00
Bartłomiej Kocot	9aea6a52ed	Fix grouped conv bwd data wmma check (#3562 ) [ROCm/composable_kernel commit: `a07c8e38bd`]	2026-01-14 11:04:37 -08:00
Johannes Graner	b313b8eaea	[CK] Refactor GPU verification kernel to gather error stats on GPU (#3551 ) * Refactor GPU verification kernel to gather error stats on GPU * Check if result is all zero * non-negative error count doesn't need custom Atomics * Remove unnecessary AtomicMaxFloat function * Simpler warp reduction, remove passed flag * Move verification header to include * Fix header path in test * Fix block reduction loop [ROCm/composable_kernel commit: `f173642087`]	2026-01-14 16:04:50 +01:00
Enrico Degregori	ad907f8d54	Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465 ) - Add support for direct store in epilogue instead of cshuffle - Add padding support for wave transfer without transpose - Add wave transfer with interleaved layout to support direct store - Enable new functionalities on GEMMs - Add optional new functionality support for grouped convolution fwd - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed) [ROCm/composable_kernel commit: `693ff3bbb3`]	2026-01-14 11:02:19 +01:00
Ville Pietilä	e40687bfc3	[CK_BUILDER] Add bwd weight factories (#3509 ) * Add placeholder test. * Initial conv bwd weight factory. * Conv builder test refactoring. * Add missing pieces to bwd weight factory. * Improve compile time error message when no matching factory is found. * Use macro to ensure automatic matching between concepts and their string representations. * Improve compile time diagnostics. * Small improvements. * Improve missing member/wrong type compile-time errors. * Improve compile time diagnostics. * Concept bug fixes. * Remove debug assert. * Update algorithm signature diagnostics. * Factory bug fixes. * First functional version of bwd weight conv factory. * Refactor handling of GEMM-K batch template parameter in conv bwd weight factory. * Concept improvements. * Improve concept diagnostics. * Introduce a common size type for concepts. * Update compile-time diagnostics to use the size type. * Update conv specialization enum. * Fix fwd conv builder tests. * Fix smoke tests. * Separate bwd weight and bwd data tests into separate targets. * Clean-up CK Tile builder tests. * Add bwd weight XDL CShuffle V3 factory. * Build conv bwd weight v3 instances successfully. * Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Test fix. * Add instance traits for bwd weight algorithms. * Add unit tests for instance strings. * Build new instance traits unit tests but exclude WMMA for now. * Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Conv bwd weight DL factory. * Final implementation for bwd weight DL factory. * Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle * Treat ref algorithm the same way as real algorithms in the dispatcher. * Refactor large tensor support and WMMA configuration. * Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3. * Update Readme. * Fix WMMA bwd weight tests. 
* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3. * Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle. * Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 * Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weight convs. * Fix fwd factories after refactoring. * clang-format * Move compile-time diagnostics to a separate branch. * Fix ref algorithm dispatching. * Fix smoke tests. * clang-format * Fix factory for regular WMMA conv bwd weight. * Clarify builder Readme. * Remove obsolete test file. * Fix test after merge. * clang-format * Remove the C++26 extensions. * Unify conv elementwise ops and layout definitions for fwd and bwd directions. * Remove old layout and elementwise ops. * Unify handling of conv tensor types between fwd and bwd directions. * Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank. * Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms. * clang-format --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `9908a87c31`]	2026-01-13 18:12:38 +02:00
Erwin Terpstra	d69aeffd0d	Implement grouped gemm tile loop for RDNA4 (#3304 ) * feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3 [ROCm/composable_kernel commit: `eb041079a3`]	2026-01-13 07:14:23 +01:00
yadaish	684ebd42da	moe fp8 blockscale use nt (#3524 ) * nt on fp8 blockscale * some improve and tests needs to be fixed * update * fix format * revert useless change * revert any change in amd_buffer_coherence [ROCm/composable_kernel commit: `32408c8bc0`]	2026-01-12 10:48:10 +08:00
Johannes Graner	c427b9ba2a	[CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169 ) * Take split_k into account when checking 2GB tensor limit. * Revert "Take split_k into account when checking 2GB tensor limit." This reverts commit `adf35c91be`. * Optimize grouped conv bwd wei split_k off calc (cherry picked from commit 2115642ee59050dabd81393c1b8f03b34adc05aa) * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp (cherry picked from commit 900d4d4b466f5730ae1189370d3c96267c35ea69) * Fix tensor descriptors and stride calculations * Don't miss half of the elements * Fix buffer size calculations * Disable hack if stride not divisible by k_batch * Clean up comments * Disallow hack in non-contiguous edge cases * Index -> Dim * Fix broken test * Refactor applicability checks into separate function * fix missed variable name * Fix variable name in info print * update V3 2GB check * No more regression, use templates instead * Code deduplication * Regression fix for cshuffle * arch-guarded atomic_add implementations for gfx11 * Similar for half(4\|8)_t as well * Only use both offset hacks at the same time * Revert "arch-guarded atomic_add implementations for gfx11" This reverts commit `3883fe6935`. This reverts commit `5311ec608d`. * Reapply "arch-guarded atomic_add implementations for gfx11" This reverts commit `1972adeddc`. * Only remove float4 atomic_add * Refactor to single flag * Consolidate template parameters * Consolidate flag in transformers --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `ee2c35b92d`]	2026-01-08 08:02:02 +01:00
Bartłomiej Kocot	abecfaf3a2	Disable fp32 atomic adds on gfx11 (#3510 ) * Disable fp32 atomic adds on gfx11 * Fixes is supported [ROCm/composable_kernel commit: `f449a5faaa`]	2026-01-07 15:32:04 -08:00
Enrico Degregori	6eab5bea54	Wmma support for gemm_bias_add_reduce (#3316 ) * Add tests for gemm_bias_add_reduce * Initial working implementation * Generalize implementation of reduce epilogue * Add tests for all layouts * Add instances * Fix test archs * Fix xdl bug * Remove library/profiler duplications * Fix num_byted error profiler * Fix typos * Fix copyright [ROCm/composable_kernel commit: `aad4cf0985`]	2026-01-07 10:27:16 -08:00
Erwin Terpstra	d074af36c9	Implement grouped gemm fastgelu for RDNA4 (#3303 ) * Implement grouped gemm fastgelu for RDNA4 * chore: some cleanup and minor inconsistencies in grouped gemm profiler * chore: clarified logic and reporting of supported instance warnings [ROCm/composable_kernel commit: `f9c6ba0403`]	2026-01-07 10:20:44 -08:00
Estevan Vedovelli	32e805b853	Add support to gfx1153 and fix gfx115X WMMA config (#3496 ) * Support for gfx115X * Changes for gfx115X * Add gfx1153 * Update changelog --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `1224bc0a82`]	2026-01-05 10:03:30 -08:00

827 Commits