composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 02:02:46 +00:00

Author	SHA1	Message	Date
Zoltán Lakatos	e7483043e6	fix undefined behaviour in softmax kernel (#3683 ) Co-authored-by: root <zoltan.lakatos@streamhpc.com> [ROCm/composable_kernel commit: `565fea2645`]	2026-01-30 15:22:54 +08:00
MHYangAMD	24cf4cf9a8	Fix redundant cast in model sensitive rmsnorm (#3681 ) * Fix redundant cast * Fix linting [ROCm/composable_kernel commit: `6ff0737843`]	2026-01-30 10:52:19 +08:00
Enrico Degregori	a07d76a460	Multi AB support for wave transfer (#3578 ) * Add multi AB support to wave transfer * Improvements to multi ABD examples * Add instances and use intrawave v1 instead of interwave * Apply changes to other transfers * Wave transfer: add support for multiple internal vgpr buffers * Fix compilation error gfx11 [ROCm/composable_kernel commit: `f16d9100e4`]	2026-01-29 10:29:40 -08:00
Johannes Graner	1998be34bf	[Conv] Enable bwd weight splitk autodeduction with cap (#3656 ) * Enable bwd weight splitk autodeduction with cap * Fix error threshold calculations * Add missing logic to wmma multiple d kernel * Fix threshold calculation * Update test with new applicability [ROCm/composable_kernel commit: `fabac7e2c3`]	2026-01-29 17:40:28 +00:00
Khushbu Agarwal	68b475ad92	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629 ) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and separated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition [ROCm/composable_kernel commit: `9b168082b7`]	2026-01-28 19:45:09 -08:00
Jeff Huang	29c56b8aae	Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657 ) - Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather - Add is_gather_dim() and get_gather_index() for multi-dim page lookup - Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1}) - Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread - Add static_assert for RowMajor VLayout requirement in batch prefill Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `e3556fed04`]	2026-01-29 07:18:41 +08:00
Bartłomiej Kocot	c2892466a9	Grouped Conv Bwd Weight Direct Load (#3648 ) * Grouped Conv Bwd Weight Direct Load * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp * Implement group merging for bwd_weight and add instances * Link direct load instances * builder fixes * fix * fixes * fix --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com> [ROCm/composable_kernel commit: `83b58bb0c3`]	2026-01-28 15:31:54 -06:00
Robin Voetter	97d6e59580	[CK_BUILDER] Integrate CKB validation with CK verification (#3649 ) * ck-builder: tensor copy function This function copies one tensor to another, so that the memory layout can be changed between them. * ck-builder: fix ck::bhalf literals These types don't work properly. * ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it This reduces the amount of duplicated code a bit. * ck-builder: add flat tensor iterator This "iterator" type pretends to be a pointer, useful for passing tensors to functions expecting pointer-like types. * ck-builder: integrate validation with ck gpu verification By templating the gpu_verify function over iterators, we can use the new FlatTensorIterator to adapt the function to multi- dimensional tensors without changing either implementation too much. * ck-builder: add check_by_accumulations This changes the gpu_verification.hpp code to also accept "iterator" types for the relevant gpu_verify and gpu_reduce_max functions. * ck: fix test_gpu_verification GenerateRandomData for bhalf is_integer_it<bhalf_t> yields true, but it is not actually an integer. * ck: make gpu_verification kernels be proper persistent kernels Previously these were using a hardcoded value for the grid size. This commit changes that so that the grid size is automatically derived from the kernel's occupancy and the number of multiprocessors on the GPU. * ck: clean up gpu_verification.hpp using block_reduce This implements a small generic block reduce function, and rewrites the rest of gpu_verification.hpp using that function to clean it up a bit. * ck-builder: doc typos * ck-builder: update testing readme with validation interface. * ck-builder: rebase fixes + review comments * ck-builder: fix device integer generation with float types Passing bfloat here causes a nans due to type_convert performing a bitcast. * ck: another bhalf_t bug CK expects that int-generation with ck::bhalf_t yields bhalf integers, not unsigned integers. This makes the logic of FillUniformRandInteger compatible with GeneratorTensor_2<InDataType>, however idiotic that may be. [ROCm/composable_kernel commit: `42048bdb7d`]	2026-01-28 17:41:02 +01:00
Yi DING	bb0986e59e	[CK_TILE] ABQuant New Preshuffle (#3638 ) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `8e3d84aba3`]	2026-01-27 23:46:49 -08:00
damien-lejeune	373d8dd63d	[CK Tile] multi reduce improvements (#3607 ) * WIP: refactoring * Swap operation/data nested loops order * Improve memory coalescing * Add comments * Enforce same identity element for the reduce operations * Re-add compile time constant * Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> [ROCm/composable_kernel commit: `91e32f305f`]	2026-01-27 12:56:09 -08:00
linqunAMD	e9af74cb84	[ck] add gridwise base class for in all xdl kernel (#186 ) (#3544 ) 1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes. - to select correct LDS layout and epilogue behavior, three additional parameters are added. - ForceNaiveLdsLayout: disable XOR based LDS layout when it is true - DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9 - IsMxGemm: epilogue has two additional dimensions 2. Move all LDS descriptor layout related functions to base class, including - GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1 - GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock 3. Move several LDS related helper functions to base class, including - GetSharedMemoryNumberOfByte - GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1 - GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl 4. Move all c epilogue related code to base class, and 4 kinds of implementation are provided - RunEpilogueNoShuffle - RunEpilogue - RunMultiDEpilogue - RunMoeEpilogue [ROCm/composable_kernel commit: `23cefda140`]	2026-01-27 12:49:47 -08:00
Michał Kulikowski	8130aa058e	[CK]Refactoring threadwise_tensor_slice_transfer_v3r1.hpp (#3263 ) Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b737f1dee5`]	2026-01-27 10:48:16 -08:00
Illia Silin	71ac48d63a	fix some syntax errors (#3658 ) [ROCm/composable_kernel commit: `b26cb596b0`]	2026-01-27 09:59:39 -08:00
Max Podkorytov	078912ec20	Add build time optimization documentation (#3608 ) This document describes techniques for reducing C++ template instantiation overhead in the Composable Kernel codebase, including: - Replacing recursive templates with pack expansion (O(N) → O(1) depth) - Using named functors instead of lambdas to share instantiations - Replacing template recursion with constexpr loops - Using fold expressions for accumulation operations These techniques can significantly reduce build times for template-heavy code. [ROCm/composable_kernel commit: `b66597ed96`]	2026-01-27 06:07:27 -07:00
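The fold-expression technique named in the entry above can be illustrated with a minimal sketch. The names `sum_recursive` and `sum_fold` are hypothetical and are not taken from the Composable Kernel sources; the sketch only contrasts a recursive accumulation, which instantiates one template per element, with a C++17 fold expression that does the same work in a single instantiation.

```cpp
#include <cstddef>

// Recursive accumulation: one class template instantiation per element,
// so instantiation depth grows linearly with the pack size.
template <std::size_t... Is>
struct sum_recursive;

template <>
struct sum_recursive<>
{
    static constexpr std::size_t value = 0;
};

template <std::size_t I, std::size_t... Rest>
struct sum_recursive<I, Rest...>
{
    static constexpr std::size_t value = I + sum_recursive<Rest...>::value;
};

// Fold expression (C++17): the whole pack is summed in one instantiation.
// The explicit zero makes the empty pack well-formed.
template <std::size_t... Is>
constexpr std::size_t sum_fold()
{
    return (Is + ... + std::size_t{0});
}

// Both formulations agree; only the compile-time cost differs.
static_assert(sum_recursive<1, 2, 3, 4>::value == 10);
static_assert(sum_fold<1, 2, 3, 4>() == 10);
static_assert(sum_fold<>() == 0);
```

The build-time gain in a real codebase depends on how often and with how many distinct packs such helpers are instantiated, which this sketch does not measure.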
Bartłomiej Kocot	ab6bbbfee1	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624 ) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt [ROCm/composable_kernel commit: `3d67e6c492`]	2026-01-27 11:04:11 +02:00
Johannes Graner	eb72f85509	[CK tests] Extend conv GPU reference (#3539 ) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible [ROCm/composable_kernel commit: `c190d8d61f`]	2026-01-27 09:49:42 +01:00
Enrico Degregori	f2c7d07666	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format [ROCm/composable_kernel commit: `2e49b6b2f7`]	2026-01-26 12:57:09 -08:00
yinglu	b980f0febe	ck: add CK_USE_GFX950 macro (#3636 ) [ROCm/composable_kernel commit: `8942a19d5e`]	2026-01-26 11:38:45 -08:00
Aviral Goel	a26adffadf	feat: Add Interwave scheduler for aquant memory pipeline (#3540 ) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipeline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipeline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite [ROCm/composable_kernel commit: `b8751e505d`]	2026-01-26 11:27:42 -08:00
Thomas Ning	0983dea2be	Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634 ) * Finished the work * Fix the pipeline [ROCm/composable_kernel commit: `3900e1e7ce`]	2026-01-26 10:29:28 -08:00
chris-tsiaousis-hpc	4de19a1601	Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617 ) * Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `917f35553a`]	2026-01-26 10:20:30 -08:00
Max Podkorytov	bebf8c3720	Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585 ) This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases [ROCm/composable_kernel commit: `de59c0716c`]	2026-01-26 10:08:55 -08:00
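As a rough illustration of the single-step generation described in the entry above, the sketch below builds all indices at once with `std::make_index_sequence` and applies a functor through pack expansion. It is a simplification under stated assumptions: the helper names `generate` and `generate_impl` and the `Square` functor are hypothetical, and the result is a `std::array` of values rather than CK's type-level sequence. Standard library implementations commonly back `std::make_index_sequence` with a compiler builtin where one is available, but that is an implementation detail not shown here.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical named functor (a named type rather than a lambda).
struct Square
{
    constexpr std::size_t operator()(std::size_t i) const { return i * i; }
};

// Pack expansion applies f to every index in a single step, with no recursion.
template <typename F, std::size_t... Is>
constexpr std::array<std::size_t, sizeof...(Is)> generate_impl(F f, std::index_sequence<Is...>)
{
    return std::array<std::size_t, sizeof...(Is)>{f(Is)...};
}

// Hypothetical entry point: produce N values f(0), f(1), ..., f(N-1).
template <std::size_t N, typename F>
constexpr std::array<std::size_t, N> generate(F f)
{
    return generate_impl(f, std::make_index_sequence<N>{});
}

constexpr auto squares = generate<4>(Square{});
static_assert(squares[0] == 0);
static_assert(squares[3] == 9);
```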
Ville Pietilä	e587756695	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639 ) Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `7ac3794284`]	2026-01-25 13:42:23 +01:00
Emily Martins	b6f1e99074	[CK_TILE] Fix alignment in Stream-K workspace buffer (#3625 ) * Fix alignment issue in Stream-K workspace buffer In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests. * Compute reference GEMM on GPU for test verification to decrease testing time [ROCm/composable_kernel commit: `f5c2f09036`]	2026-01-23 16:14:22 -07:00
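The padding described in the entry above amounts to rounding the size of the flags prefix up to the next multiple of 128 bytes, so that the partials region starts on an aligned boundary. A minimal host-side sketch follows; `align_up` and `partials_offset` are hypothetical helper names, not functions from CK Tile, and the actual workspace layout code may differ.

```cpp
#include <cstddef>

// Alignment stated in the commit message.
constexpr std::size_t kWorkspaceAlignment = 128;

// Round bytes up to the next multiple of alignment (alignment must be nonzero).
constexpr std::size_t align_up(std::size_t bytes, std::size_t alignment)
{
    return (bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: byte offset at which partials begin, given the
// unpadded size of the flags prefix.
constexpr std::size_t partials_offset(std::size_t flag_bytes)
{
    return align_up(flag_bytes, kWorkspaceAlignment);
}

static_assert(partials_offset(1) == 128);
static_assert(partials_offset(128) == 128);
static_assert(partials_offset(129) == 256);
```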
chris-tsiaousis-hpc	3c247733af	Remove code duplications in batched gemm wmma (#3580 ) * Moved device struct for batched gemm wmma to a common file Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Use the common device struct in the scaled batched gemm wmma implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: Remove unused includes and ambiguous comment Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved pointer offset calculation and gridwise argument to common struct This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common" Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `e1c46ff548`]	2026-01-23 12:39:03 -08:00
ltqin	90b3476006	Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 )" (#3635 ) This reverts commit 723b7ce0be2884da131036301892bf9157f51876. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `67f0b74ec6`]	2026-01-23 09:03:22 -08:00
Wojciech Laskowski	ee595ee58a	WMMA grouped conv fwd large tensor extra flavors (#3582 ) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accommodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops [ROCm/composable_kernel commit: `81ee19bd2c`]	2026-01-23 12:19:51 +01:00
Po Yen Chen	4ded7e5984	Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 ) This reverts commit ceccf15275645cc64db0a4ae53f5a215c93a7969. [ROCm/composable_kernel commit: `de5a1d730d`]	2026-01-22 21:21:19 -08:00
kensclin	16e6a2c696	GEMM Blockscale ABQuant Optimization (#3620 ) * GEMM Blockscale ABQuant Optimization * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix precommit error * clean * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> [ROCm/composable_kernel commit: `31a35ecab4`]	2026-01-22 09:39:38 -08:00
Bartłomiej Kocot	9c3ab51d9b	[CK TILE] Fix basic gemm pipelines (#3611 ) * [CK TILE] Fix basic pipelines * fixes [ROCm/composable_kernel commit: `44f481a45c`]	2026-01-22 08:11:18 -06:00
Linjun-AMD	f6fac4cea6	[CK_TILE][FMHA]Add new tile size for async (#3623 ) * Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)" This reverts commit cfdad49edda4b2ccef92571f23646a8505bb2859. * Add new tile_size for async pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0b13697a88`]	2026-01-22 16:07:14 +08:00
ltqin	14254656f0	Fp8 block scale quantization for fmha fwd (#3330 ) * add block scale parameters to kernel * add block scale to kernel * add smoke test * format * Revert "format" This reverts commit `356c3c9706`. * only format my code * format py * fix auto not allowed in function prototype * change instance tttt to ttff * fix structured binding issue * change s_acc elementwise op * async pipeline add block scale * add quantization P using shift exp2 * precompute (m - shift) once per row * change blk scale seqstrt ptr name * fix some name * fix for deduction guide * fix some comments * add P scale to qr_ksvs_pipeline * add comment to idx_identity * change the method of calculating descale block index * unify naming style: use block_scale_ as name prefix * unify naming style * update the CHANGELOG.md * Add FP8 block scale quantization support for FMHA forward kernel --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `dd0b4294af`]	2026-01-21 20:58:26 -08:00
Yi DING	a0935f7669	[CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615 ) [ROCm/composable_kernel commit: `fcc9372c00`]	2026-01-21 09:54:46 +08:00
Erwin Terpstra	b079841b10	Implement batched gemm add relu gemm add for rdna4 (#3391 ) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multipled * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `d5ae81b292`]	2026-01-20 13:06:59 -08:00
Max Podkorytov	8b842250da	Add persistent async input scheduler for GEMM kernels (#3520 ) Add signal-based synchronization for persistent GEMM kernels where input data becomes available incrementally. Uses modulo wraparound (like PyTorch's AsyncMM) for chunk index calculation: chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks Key components: - PersistentAsyncInputScheduler struct with tiles_per_chunk_m, chunk_signals, tile_idx_pivot_m, and num_chunks fields - wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency - IsSupportedArgument validation for scheduler parameters - Example demonstrating async input scheduling with simulated producer - GTest unit tests covering all layout combinations [ROCm/composable_kernel commit: `91b4102a59`]	2026-01-20 10:37:09 -08:00
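The chunk index formula quoted in the entry above can be written out as a small function. This is a sketch only: the function name `chunk_index` and its flat parameter list are hypothetical, and the parameters merely mirror the symbols in the formula rather than the fields of the `PersistentAsyncInputScheduler` struct.

```cpp
#include <cstdint>

// chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
// Integer division groups consecutive tiles into chunks; the modulo wraps
// the chunk index around once it reaches num_chunks.
// Assumes tiles_per_chunk > 0 and num_chunks > 0.
constexpr std::uint32_t chunk_index(std::uint32_t tile_idx,
                                    std::uint32_t tile_idx_pivot,
                                    std::uint32_t tiles_per_chunk,
                                    std::uint32_t num_chunks)
{
    return ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks;
}

// With 4 tiles per chunk and 3 chunks:
static_assert(chunk_index(0, 0, 4, 3) == 0);  // tiles 0..3 map to chunk 0
static_assert(chunk_index(5, 0, 4, 3) == 1);  // tiles 4..7 map to chunk 1
static_assert(chunk_index(10, 2, 4, 3) == 0); // (10 + 2) / 4 = 3, wraps to 0
```

A nonzero pivot shifts which tile starts chunk 0, and the wraparound means a tile index past the last chunk maps back onto an earlier chunk's signal.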
Linjun-AMD	e227e837be	Revert "[CK_TILE][FMHA] Add new tile size for async (#3586 )" (#3613 ) This reverts commit 217ac48fd83deef3d0d5084815689e8c79958cc1. [ROCm/composable_kernel commit: `8f75869408`]	2026-01-20 09:40:54 -08:00
music-dino	750bd72b3d	Batched gemm softmax gemm descriptor fix (#3564 ) * Add rocm to prefix path for codegen * Fix issue with c0_matrix_mask construction [ROCm/composable_kernel commit: `6300ad3c62`]	2026-01-20 07:25:30 -08:00
Wojciech Laskowski	6ad65bc855	WMMA support for batched_gemm_reduce (#3332 ) Summary: - added new device impl of Batched GEMM Reduce for WMMA - added instance library - added WMMA impl to the Batched GEMM Reduce tests [ROCm/composable_kernel commit: `b09121f860`]	2026-01-20 10:50:46 +01:00
Bartłomiej Kocot	85c5741492	[CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518 ) * [BUILDER] Add grouped conv fwd ck tile profiler * [CK TILE] Fix grouped conv kernels splitk and double lds * Updates * Fixes * Move to ckProfiler * Fixes * fix * fix * Change instances to empty list by default * fix * fix * Update grouped_convolution_signatures.hpp * Update grouped_convolution_forward_tile_algs.hpp * [CK TILE] Add grouped convolution forward tests (#3556) * [CK TILE] Add grouped convolution forward tests * fix jenkins * fixes * comments fixes * unit test * unit test fix * Move instances outside builder * fix includes * clang format fix * readme fix * fix includes * fixes [ROCm/composable_kernel commit: `0727e85e52`]	2026-01-19 22:29:01 -07:00
Cong Ma	c42cd28370	[CK TILE] remove dependency on std chrono (#3599 ) * [CK TILE] remove dependency on std chrono * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0517d43d31`]	2026-01-19 15:31:02 -08:00
Linjun-AMD	ecda0fe2e9	[CK_TILE][FMHA] Add new tile size for async (#3586 ) * add new tile size for async Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f3aafb9555`]	2026-01-19 15:22:33 -08:00
Adam Osewski	a9ff38bc89	[CK_BUILDER] Convolution forward transfer concepts. (#3535 ) * Rename member variable to better reflect its actual meaning. * Add transfer checks for conv fwd xdl. * Validate tensor layouts & vector size conv fwd v3. * Add combined transfer concepts. * Add transfer concepts for conv fwd factories. * Fix clang format * Add helper instruction to get max mem vector instruction width. * Apply review comments. * Rename thread cluster access(->arrange) order concept * Fix merge artifacts. * Add generic access order limits into block transfer concept. [ROCm/composable_kernel commit: `1a6d1b59ef`]	2026-01-19 10:54:10 +01:00
Erwin Terpstra	9c660bfbe3	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method [ROCm/composable_kernel commit: `fe40a5d139`]	2026-01-17 08:30:27 +01:00
Cong Ma	487f1beee9	[CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584 ) [ROCm/composable_kernel commit: `f9104ef9b3`]	2026-01-16 15:27:39 -08:00
logicat	fb918acff9	Remove unnecessary hip_fp16 include from stream_config (#3549 ) [ROCm/composable_kernel commit: `fec81109f1`]	2026-01-16 10:40:05 -08:00
Yung-sheng Tu	97f2fa2912	Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429 ) * add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples [ROCm/composable_kernel commit: `6df2d70143`]	2026-01-15 07:19:31 -08:00
Estevan Vedovelli	09d084bfb4	Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `e30207985a`]	2026-01-15 09:35:24 -05:00
Jeff Huang	445ec888ba	[FMHA] Enable page size 16 for batch prefill kernel (#3568 ) * [FMHA] Enable page size 16 for batch prefill kernel * Refactor batch prefill KV offset logic to simplify template arguments - Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args. - Add static assert to forbid `page_size=1` with vectorized layout. [ROCm/composable_kernel commit: `993d3e2f0e`]	2026-01-15 22:11:44 +08:00
John Shumway	753043b27a	[CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547 ) * Factor helpers out of conv_traits.hpp * Create a non-templated conv_traits struct * Migrate to new instance-specific instance_to_conv_traits functions * Clean up reflection concepts * Clean up ConvTraits helpers * Update testing for convolution traits This is a lot of cleanup on tests to have verbose coverage of feature extraction, explicit tests for each supported device kernel, and simple, readable test code. * Address reviewer comments and resolve merge conflict [ROCm/composable_kernel commit: `5122637215`]	2026-01-15 10:03:21 +01:00
Bartłomiej Kocot	8c72adabeb	Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566 ) [ROCm/composable_kernel commit: `a346cfa960`]	2026-01-14 12:37:12 -08:00


1423 Commits