composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Enrico Degregori	968e54f90f	Multi AB support for wave transfer (#3578 ) * Add multi AB support to wave transfer * Improvements to multi ABD examples * Add instances and use intrawave v1 instead of interwave * Apply changes to other transfers * Wave transfer: add support for multiple internal vgpr buffers * Fix compilation error gfx11 [ROCm/composable_kernel commit: `f16d9100e4`]	2026-01-29 10:29:40 -08:00
Johannes Graner	e8bf2e1418	[Conv] Enable bwd weight splitk autodeduction with cap (#3656 ) * Enable bwd weight splitk autodeduction with cap * Fix error threshold calculations * Add missing logic to wmma multiple d kernel * Fix threshold calculation * Update test with new applicability [ROCm/composable_kernel commit: `fabac7e2c3`]	2026-01-29 17:40:28 +00:00
assistant-librarian[bot]	84daa4d305	Merge commit 'e33f15709f8c1e05f5056edc7295276e121dc253' into develop	2026-01-29 15:20:56 +00:00
Robin Voetter	4008976a26	ck-builder: fix test related to changed xdl bwd cshuf v3 interface (#3677 ) Force merging because I verified this fix manually: git checkout develop git pull ninja smoke-builder (failed to build, as expected) git checkout rvoetter/ckb-fix ninja smoke-builder (passed!) [ROCm/composable_kernel commit: `e33f15709f`]	2026-01-29 07:15:56 -08:00
assistant-librarian[bot]	961da131ed	Merge commit '9b168082b7aa19bcf50fd9991baf10a0c77d105b' into develop	2026-01-29 04:42:46 +00:00
Khushbu Agarwal	9fc9cc598f	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629 ) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and separated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition [ROCm/composable_kernel commit: `9b168082b7`]	2026-01-28 19:45:09 -08:00
assistant-librarian[bot]	308d69498c	Merge commit 'e3556fed0453e66cdebc5dad6b903f5e902cd9b4' into develop	2026-01-29 00:45:02 +00:00
Jeff Huang	4fb9c4b82a	Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657 ) - Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather - Add is_gather_dim() and get_gather_index() for multi-dim page lookup - Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1}) - Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread - Add static_assert for RowMajor VLayout requirement in batch prefill Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `e3556fed04`]	2026-01-29 07:18:41 +08:00
assistant-librarian[bot]	0c3ea9d826	Merge commit '83b58bb0c3ff12f426d45383900a6fd91b4116a1' into develop	2026-01-28 21:42:05 +00:00
Bartłomiej Kocot	017d96faaa	Grouped Conv Bwd Weight Direct Load (#3648 ) * Grouped Conv Bwd Weight Direct Load * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp * Implement group merging for bwd_weight and add instances * Link direct load instances * builder fixes * fix * fixes * fix --------- Co-authored-by: Graner, Johannes <johannes.graner@amd.com> [ROCm/composable_kernel commit: `83b58bb0c3`]	2026-01-28 15:31:54 -06:00
ltqin	ee4e216716	Fix block scale init value (#3666 ) * Make blockscale descale range adaptive to data type max value * format [ROCm/composable_kernel commit: `654bec3362`]	2026-01-28 12:37:15 -08:00
assistant-librarian[bot]	dbadcf487a	Merge commit '42048bdb7d8d931966af76c6dacfedce1c9da90a' into develop	2026-01-28 17:20:56 +00:00
Robin Voetter	7d1574a9ab	[CK_BUILDER] Integrate CKB validation with CK verification (#3649 ) * ck-builder: tensor copy function This function copies one tensor to another, so that the memory layout can be changed between them. * ck-builder: fix ck::bhalf literals These types don't work properly. * ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it This reduces the amount of duplicated code a bit. * ck-builder: add flat tensor iterator This "iterator" type pretends to be a pointer, useful for passing tensors to functions expecting pointer-like types. * ck-builder: integrate validation with ck gpu verification By templating the gpu_verify function over iterators, we can use the new FlatTensorIterator to adapt the function to multi- dimensional tensors without changing either implementation too much. * ck-builder: add check_by_accumulations This changes the gpu_verification.hpp code to also accept "iterator" types for the relevant gpu_verify and gpu_reduce_max functions. * ck: fix test_gpu_verification GenerateRandomData for bhalf is_integer_it<bhalf_t> yields true, but it is not actually an integer. * ck: make gpu_verification kernels be proper persistent kernels Previously these were using a hardcoded value for the grid size. This commit changes that so that the grid size is automatically derived from the kernel's occupancy and the number of multiprocessors on the GPU. * ck: clean up gpu_verification.hpp using block_reduce This implements a small generic block reduce function, and rewrites the rest of gpu_verification.hpp using that function to clean it up a bit. * ck-builder: doc typos * ck-builder: update testing readme with validation interface. * ck-builder: rebase fixes + review comments * ck-builder: fix device integer generation with float types Passing bfloat here causes a nans due to type_convert performing a bitcast. * ck: another bhalf_t bug CK expects that int-generation with ck::bhalf_t yields bhalf integers, not unsigned integers. This makes the logic of FillUniformRandInteger compatible with GeneratorTensor_2<InDataType>, however idiotic that may be. [ROCm/composable_kernel commit: `42048bdb7d`]	2026-01-28 17:41:02 +01:00
kabrahamAMD	140727e7c1	[CK_BUILDER] Add reflection for wmma and bwd weight instances to ck builder reflection (#3592 ) * added reflection for conv_fwd_multiple_d_wmma_cshuffle.hpp * added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle * added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle v3 * added reflection of max_transpose parameters * fix printing of std optional parameters * fix use of undefined ck::index * added conv traits for device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle * added xdl two stage instance to reflection * added additional variables * added reflection for grouped_conv_bwd_weight_multiple_d_wmma_cshuffle, _v3, grouped_conv_two_stage_wmma_cshuffle_v3, * added reflection for device_grouped_conv_bwd_weigh_wmma_cshuffle_v3 * added reflection for bwd_weight_wmma_cshuffle * added comments back in * add printed output for optional parameters * update README * fix typo * added num_gemm_k_prefetch_stage and small fixes * modified test string due to reflection of new parameter --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `d6cccf6093`]	2026-01-28 09:33:45 -07:00
assistant-librarian[bot]	78b36a13ab	Merge commit 'bc6083bdd466d1e060253e7a49626c923293c483' into develop	2026-01-28 15:18:44 +00:00
Johannes Graner	04ed7d9ba9	Update pytorch version in convolution dataset test generation (#3667 ) * Update torch version in dataset test gen [ROCm/composable_kernel commit: `bc6083bdd4`]	2026-01-28 07:38:10 -07:00
assistant-librarian[bot]	19a000f6c3	Merge commit '8e3d84aba3be5e851de5d6c6c3e9c08cadbce1da' into develop	2026-01-28 08:16:51 +00:00
Yi DING	83c5a3b025	[CK_TILE] ABQuant New Preshuffle (#3638 ) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `8e3d84aba3`]	2026-01-27 23:46:49 -08:00
assistant-librarian[bot]	d4b61e4db5	Merge commit '91e32f305fa4d809103431a81594c52240753d40' into develop	2026-01-27 22:14:22 +00:00
damien-lejeune	24d3cbc30d	[CK Tile] multi reduce improvements (#3607 ) * WIP: refactoring * Swap operation/data nested loops order * Improve memory coalescing * Add comments * Enforce same identity element for the reduce operations * Re-add compile time constant * Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> [ROCm/composable_kernel commit: `91e32f305f`]	2026-01-27 12:56:09 -08:00
linqunAMD	5713c658c6	[ck] add gridwise base class for all xdl kernels (#186 ) (#3544 ) 1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes. - to select the correct LDS layout and epilogue behavior, three additional parameters are added. - ForceNaiveLdsLayout: disable the XOR-based LDS layout when it is true - DirectLoad: the pipeline only uses direct load, so we need to force the naive layout and ignore any padding on gfx9 - IsMxGemm: the epilogue has two additional dimensions 2. Move all LDS descriptor layout related functions to the base class, including - GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1 - GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock 3. Move several LDS related helper functions to the base class, including - GetSharedMemoryNumberOfByte - GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1 - GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1 - GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl 4. Move all C epilogue related code to the base class; 4 kinds of implementation are provided - RunEpilogueNoShuffle - RunEpilogue - RunMultiDEpilogue - RunMoeEpilogue [ROCm/composable_kernel commit: `23cefda140`]	2026-01-27 12:49:47 -08:00
assistant-librarian[bot]	362767eba8	Merge commit 'b737f1dee5a097f8b62156335e21259d8dd2784c' into develop	2026-01-27 19:18:39 +00:00
Michał Kulikowski	bdc1f4846a	[CK]Refactoring threadwise_tensor_slice_transfer_v3r1.hpp (#3263 ) Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b737f1dee5`]	2026-01-27 10:48:16 -08:00
assistant-librarian[bot]	417ad9c7f1	Merge commit 'b26cb596b0cbea9f40ae36b3f245b5aa7120c5c9' into develop	2026-01-27 18:19:38 +00:00
Illia Silin	7dd38d592e	fix some syntax errors (#3658 ) [ROCm/composable_kernel commit: `b26cb596b0`]	2026-01-27 09:59:39 -08:00
assistant-librarian[bot]	21657a1f32	Merge commit '0cc83cb8e8c9d9d926469f862bc1272ef0cf0dc8' into develop	2026-01-27 16:17:03 +00:00
spolifroni-amd	4cdc3132b3	CK: removed the api reference (#3571 ) * removed the api reference * updating to the latest rocm-docs-core min version * fixed a formatting issue with buffer views * removed reference links from code snippets * removed reference links from code snippets --------- Co-authored-by: John Afaganis <john.afaganis@amd.com> [ROCm/composable_kernel commit: `0cc83cb8e8`]	2026-01-27 07:36:47 -08:00
assistant-librarian[bot]	fea562ee53	Merge commit 'b66597ed96180ce21e7e6a6678dfc232ed07c800' into develop	2026-01-27 14:20:24 +00:00
Max Podkorytov	dbb766d951	Add build time optimization documentation (#3608 ) This document describes techniques for reducing C++ template instantiation overhead in the Composable Kernel codebase, including: - Replacing recursive templates with pack expansion (O(N) → O(1) depth) - Using named functors instead of lambdas to share instantiations - Replacing template recursion with constexpr loops - Using fold expressions for accumulation operations These techniques can significantly reduce build times for template-heavy code. [ROCm/composable_kernel commit: `b66597ed96`]	2026-01-27 06:07:27 -07:00
assistant-librarian[bot]	d7af03f452	Merge commit '3d67e6c4927a9daea9076fab75b23fb44fdc22b1' into develop	2026-01-27 09:19:15 +00:00
Bartłomiej Kocot	42638c34b0	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624 ) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt [ROCm/composable_kernel commit: `3d67e6c492`]	2026-01-27 11:04:11 +02:00
Johannes Graner	8049ce9be4	[CK tests] Extend conv GPU reference (#3539 ) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible [ROCm/composable_kernel commit: `c190d8d61f`]	2026-01-27 09:49:42 +01:00
assistant-librarian[bot]	aa3b7866b0	Merge commit 'cc75948d1c7f732d102c8e31dc007a2ccd07761f' into develop	2026-01-27 01:42:04 +00:00
Robin Voetter	aa35585fc7	[CK_BUILDER] conv bwd weight testing (#3618 ) * ck-builder: restructure testing conv In order to prepare for bwd of conv testing, this commit moves some files and types around so that we can reuse ckt::Args for both forward and backwards convolution. * ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp This will allow us to more easily include fwd.hpp from backwards definitions, which is required for initializing bwd values. * ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3 Turns out that the supplied layout isn't actually supported... * ck-builder: ck and reference conv integration for bwd weight * ck-builder: ck bwd weight execution test * ck-builder: ckt::run support for ck-tile bwd weight * ck-builder: ck tile bwd weight execution test * ck-builder: extra debug printing in MatchesReference * ck-builder: make ckt::run return RunResult This type is more convenient than std::tuple, as it will allow us to use google test matchers with this in the future. * ck-builder: RunResult matcher Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error message about how and why running an algorithm failed. * ck-builder: doc fixes * ck-builder: add missing headers [ROCm/composable_kernel commit: `cc75948d1c`]	2026-01-26 23:50:15 +01:00
assistant-librarian[bot]	65be39bfd1	Merge commit '8654c0628f83261d3dd64cfb4ec80e9dd2b29fa5' into develop	2026-01-26 22:14:16 +00:00
Andrew Clark	2b90408685	Finished testing failure types. Removed testing code. [ROCm/composable_kernel commit: `8654c0628f`]	2026-01-26 15:09:49 -07:00
Andrew Clark	c2cfd318da	Removed working tests. Validating remaining tests. [ROCm/composable_kernel commit: `402f21d0a6`]	2026-01-26 15:09:49 -07:00
Andrew Clark	ec4a6be1ed	Removed working tests. Validating remaining tests. [ROCm/composable_kernel commit: `1397924c21`]	2026-01-26 15:09:49 -07:00
Andrew Clark	22abf1b0d9	Testing a pattern to support all text variations [ROCm/composable_kernel commit: `6c596b9553`]	2026-01-26 15:09:49 -07:00
Andrew Clark	c3c318c340	Removing working cases to test other failure examples [ROCm/composable_kernel commit: `58e1d03244`]	2026-01-26 15:09:49 -07:00
Andrew Clark	c490f137b3	Adding forcing failure to test notifications [ROCm/composable_kernel commit: `95768d1b22`]	2026-01-26 15:09:49 -07:00
Andrew Clark	9e7b7fe59a	Fixing Jenkinsfile too large error [ROCm/composable_kernel commit: `786965b95e`]	2026-01-26 15:09:49 -07:00
Andrew Clark	76b261ef00	Updating failure patterns to be more reliable and adding tests to verify they are caught in the logs [ROCm/composable_kernel commit: `42a731b791`]	2026-01-26 15:09:49 -07:00
John Shumway	1ea438b909	Add python analysis scripts for Clang's time trace (#3644 ) This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575). The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration. Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows. This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection. [ROCm/composable_kernel commit: `a213ce676b`]	2026-01-26 13:44:36 -08:00
assistant-librarian[bot]	63dde06485	Merge commit '2e49b6b2f79d5ab0fe2fca79812affd44de94db7' into develop	2026-01-26 21:13:59 +00:00
Enrico Degregori	6e95bf8179	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format [ROCm/composable_kernel commit: `2e49b6b2f7`]	2026-01-26 12:57:09 -08:00
assistant-librarian[bot]	1298575103	Merge commit 'bd5fec81afdb6df7f4637128a3ba86dbfd6bcca1' into develop	2026-01-26 20:15:40 +00:00
Thrupti Raj Lakshmana Gowda	7636e64d55	Removing [4,64,16] warp tile from Tile Engine (#3643 ) [ROCm/composable_kernel commit: `bd5fec81af`]	2026-01-26 11:56:06 -08:00
yinglu	1b369a210f	ck: add CK_USE_GFX950 macro (#3636 ) [ROCm/composable_kernel commit: `8942a19d5e`]	2026-01-26 11:38:45 -08:00
Aviral Goel	2a17f6e537	feat: Add Interwave scheduler for aquant memory pipeline (#3540 ) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipeline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipeline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite [ROCm/composable_kernel commit: `b8751e505d`]	2026-01-26 11:27:42 -08:00
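
Note on the "Add build time optimization documentation (#3608 )" entry above: its message lists replacing recursive templates with pack expansion and using fold expressions for accumulation. The snippet below is only a minimal, generic C++17 sketch of those two ideas; it is not taken from the Composable Kernel sources, and the names sum_recursive, sum_fold and sum_of_squares are hypothetical, chosen purely for illustration.

```cpp
#include <cstddef>
#include <utility>

// Recursive formulation: every call peels off one argument and instantiates
// another specialization, so instantiation depth grows with the pack size.
constexpr int sum_recursive() { return 0; }

template <typename T, typename... Ts>
constexpr int sum_recursive(T first, Ts... rest)
{
    return first + sum_recursive(rest...);
}

// Fold-expression formulation: the whole pack is accumulated inside a
// single instantiation, with no recursive chain of specializations.
template <typename... Ts>
constexpr int sum_fold(Ts... values)
{
    return (0 + ... + values);
}

// Pack expansion over an index sequence: the indices 0..N-1 are expanded
// in one step instead of recursing from N down to 0.
template <std::size_t... Is>
constexpr std::size_t sum_of_squares_impl(std::index_sequence<Is...>)
{
    return (std::size_t{0} + ... + (Is * Is));
}

template <std::size_t N>
constexpr std::size_t sum_of_squares()
{
    return sum_of_squares_impl(std::make_index_sequence<N>{});
}

// Both formulations agree; only the instantiation pattern differs.
static_assert(sum_recursive(1, 2, 3) == sum_fold(1, 2, 3));
```

Whether a rewrite like this measurably shortens a given build depends on the compiler and the code; a trace such as Clang's `-ftime-trace` output (see #3644 above) is one way to check.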


4099 Commits