composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-21 05:19:20 +00:00

Author	SHA1	Message	Date
Andrew Clark	b87853431e	Removing working cases to test other failure examples [ROCm/composable_kernel commit: `58e1d03244`]	2026-01-26 15:09:49 -07:00
Andrew Clark	8c91ce81bf	Adding forcing failure to test notifications [ROCm/composable_kernel commit: `95768d1b22`]	2026-01-26 15:09:49 -07:00
Andrew Clark	18e95f26aa	Fixing Jenkinsfile too large error [ROCm/composable_kernel commit: `786965b95e`]	2026-01-26 15:09:49 -07:00
Andrew Clark	e2f587ad01	Updating failure patterns to be more reliable and adding tests to verify they are caught in the logs [ROCm/composable_kernel commit: `42a731b791`]	2026-01-26 15:09:49 -07:00
John Shumway	7565ca5310	Add python analysis scripts for Clang's time trace (#3644 ) This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575). The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration. Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows. This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection. [ROCm/composable_kernel commit: `a213ce676b`]	2026-01-26 13:44:36 -08:00
assistant-librarian[bot]	63dde06485	Merge commit '2e49b6b2f79d5ab0fe2fca79812affd44de94db7' into develop	2026-01-26 21:13:59 +00:00
Enrico Degregori	f2c7d07666	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format [ROCm/composable_kernel commit: `2e49b6b2f7`]	2026-01-26 12:57:09 -08:00
assistant-librarian[bot]	1298575103	Merge commit 'bd5fec81afdb6df7f4637128a3ba86dbfd6bcca1' into develop	2026-01-26 20:15:40 +00:00
Thrupti Raj Lakshmana Gowda	ab65977dae	Removing [4,64,16] warp tile from Tile Engine (#3643 ) [ROCm/composable_kernel commit: `bd5fec81af`]	2026-01-26 11:56:06 -08:00
yinglu	b980f0febe	ck: add CK_USE_GFX950 macro (#3636 ) [ROCm/composable_kernel commit: `8942a19d5e`]	2026-01-26 11:38:45 -08:00
Aviral Goel	a26adffadf	feat: Add Interwave scheduler for aquant memory pipeline (#3540) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipeline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipeline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite [ROCm/composable_kernel commit: `b8751e505d`]	2026-01-26 11:27:42 -08:00
assistant-librarian[bot]	39405747ab	Merge commit '3900e1e7ceacfa32cb8d1522260ed30befd4dae3' into develop	2026-01-26 19:16:22 +00:00
Thomas Ning	0983dea2be	Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634 ) * Finished the work * Fix the pipeline [ROCm/composable_kernel commit: `3900e1e7ce`]	2026-01-26 10:29:28 -08:00
SamiAario-AMD	e01c295551	Re enable f8 x bf8 tests on compv3 and compv4 (#3605 ) * Re-enable f8 x bf8 tests on CompV3 as they now pass * On CompV4, fp8 x bf8 tests now pass with K_BlockSize I32 * Add a changelog entry --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `834642202c`]	2026-01-26 10:23:26 -08:00
chris-tsiaousis-hpc	4de19a1601	Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617 ) * Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `917f35553a`]	2026-01-26 10:20:30 -08:00
assistant-librarian[bot]	06fb853279	Merge commit 'de59c0716c631edfa4742e4309ee11d4379ef6e8' into develop	2026-01-26 18:17:51 +00:00
Max Podkorytov	bebf8c3720	Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585 ) This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases [ROCm/composable_kernel commit: `de59c0716c`]	2026-01-26 10:08:55 -08:00
Illia Silin	70ffbc577b	add dockerfile for manylinux (#3651 ) [ROCm/composable_kernel commit: `054c437dec`]	2026-01-26 09:23:19 -08:00
assistant-librarian[bot]	8968bceee4	Merge commit '7ac379428408337a231a86f8a8b7353b5b45aa2d' into develop	2026-01-25 13:22:29 +00:00
Ville Pietilä	e587756695	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639 ) Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `7ac3794284`]	2026-01-25 13:42:23 +01:00
assistant-librarian[bot]	6a21c125a0	Merge commit 'f5c2f09036cdc22dc8944719215dd47003c50a24' into develop	2026-01-24 00:38:47 +00:00
Emily Martins	b6f1e99074	[CK_TILE] Fix alignment in Stream-K workspace buffer (#3625 ) * Fix alignment issue in Stream-K workspace buffer In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests. * Compute reference GEMM on GPU for test verification to decrease testing time [ROCm/composable_kernel commit: `f5c2f09036`]	2026-01-23 16:14:22 -07:00
assistant-librarian[bot]	e2e058bcbc	Merge commit 'e1c46ff548cf7bc8b0e1b41a3d559f05317ec2da' into develop	2026-01-23 21:13:05 +00:00
chris-tsiaousis-hpc	3c247733af	Remove code duplications in batched gemm wmma (#3580 ) * Moved device struct for batched gemm wmma to a common file Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Use the common device struct in the scaled batched gemm wmma implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: Remove unused includes and ambiguous comment Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved pointer offset calculation and gridwise argument to common struct This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common" Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> [ROCm/composable_kernel commit: `e1c46ff548`]	2026-01-23 12:39:03 -08:00
assistant-librarian[bot]	c244e1454a	Merge commit '67f0b74ec6687192fac14c359c57aca237d3cf2a' into develop	2026-01-23 17:17:05 +00:00
ltqin	90b3476006	Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 )" (#3635 ) This reverts commit 723b7ce0be2884da131036301892bf9157f51876. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `67f0b74ec6`]	2026-01-23 09:03:22 -08:00
assistant-librarian[bot]	078593e052	Merge commit '2e08a7e5ab51b020c90008b45c75dc35c2ba426c' into develop	2026-01-23 12:19:23 +00:00
Wojciech Laskowski	a0445aff5f	WMMA grouped conv fwd large tensor bias bnorm clamp (#3595 ) * Added bias_bnorm_clamp for WMMA conv fwd large tensor. Following operations are added for FP16/BF16 data type and NHWGCxGKYXC layout. - grouped_conv2d_fwd_bias_bnorm_clamp - grouped_conv3d_fwd_bias_bnorm_clamp * changed strategy to handle GemmArgs array * Adding generic instance * fixed last nits from reviewers and copilot [ROCm/composable_kernel commit: `2e08a7e5ab`]	2026-01-23 12:20:00 +01:00
Wojciech Laskowski	ee595ee58a	WMMA grouped conv fwd large tensor extra flavors (#3582) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accommodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops [ROCm/composable_kernel commit: `81ee19bd2c`]	2026-01-23 12:19:51 +01:00
assistant-librarian[bot]	ebc79763fb	Merge commit '7b3db1a878181004fc5db7cdb82840623beaadb5' into develop	2026-01-23 10:14:37 +00:00
Bartłomiej Kocot	5c6c0f5ad1	Grouped conv fwd direct load vector=2 (#3632 ) [ROCm/composable_kernel commit: `7b3db1a878`]	2026-01-23 10:29:59 +01:00
assistant-librarian[bot]	36253637d1	Merge commit 'de5a1d730dc77d1471ad53ca18dfd7c1474e9873' into develop	2026-01-23 07:17:14 +00:00
Po Yen Chen	4ded7e5984	Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 ) This reverts commit ceccf15275645cc64db0a4ae53f5a215c93a7969. [ROCm/composable_kernel commit: `de5a1d730d`]	2026-01-22 21:21:19 -08:00
assistant-librarian[bot]	e0b24cbecc	Merge commit 'f30d04654e6bb9b064cf96c6bb4e3fff960efbd8' into develop	2026-01-23 00:39:51 +00:00
damien-lejeune	10e975e7af	Add missing check target in reduce tile engine op (#3631 ) Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> [ROCm/composable_kernel commit: `f30d04654e`]	2026-01-22 16:06:02 -08:00
assistant-librarian[bot]	88d27d2141	Merge commit 'eb2dc8f466cd2978490ccc3ff794d898cad9535a' into develop	2026-01-22 23:13:55 +00:00
Vidyasagar Ananthan	74e270abab	Speed up glob recurse. (#3626 ) [ROCm/composable_kernel commit: `eb2dc8f466`]	2026-01-22 14:44:47 -08:00
assistant-librarian[bot]	1654f02845	Merge commit 'b9bb1db5d932c4c0445994cfc1d37f66a3744659' into develop	2026-01-22 21:15:42 +00:00
arai713	bfa37887fb	Addition of Stream-K tests using Tile Engine (#3514 ) * Addition of Stream-K tests using Tile Engine This change adds an implementation for generating Stream-K tests using Tile Engine. This will generate various test executables for different combinations based on the config files. This addition has simple tests running for bf16 and fp16, with both atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation of Stream-K in Tile Engine. * integrating addition of tree reduction and editing the README * temporarily removing parallel and tree reduction from configs while bugs regarding them are being resolved [ROCm/composable_kernel commit: `b9bb1db5d9`]	2026-01-22 12:53:52 -08:00
assistant-librarian[bot]	84e961765d	Merge commit '31a35ecab4e403f63ec4b76f4a709c21172c39de' into develop	2026-01-22 18:17:10 +00:00
kensclin	16e6a2c696	GEMM Blockscale ABQuant Optimization (#3620 ) * GEMM Blockscale ABQuant Optimization * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix precommit error * clean * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> [ROCm/composable_kernel commit: `31a35ecab4`]	2026-01-22 09:39:38 -08:00
Vidyasagar Ananthan	7ce0127e8f	Adding dispatcher architecture (#3300 ) * WIP POC of dispatcher * Dispatcher python workflow setup. * Dispatcher cleanup and updates. Further dispatcher cleanup and updates. Build fixes Improvements and python to CK example Improvements to readme * Fixes to python paths * Cleaning up code * Improving dispatcher support for different arch Fixing typos * Fix formatting errors * Cleaning up examples * Improving codegeneration * Improving and fixing C++ examples * Adding conv functionality (fwd,bwd,bwdw) and examples. * Fixes based on feedback. * Further fixes based on feedback. * Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug. * Another round of improvements based on feedback. * Trimming out unnecessary code. * Fixing the multi-D implementation. * Using gpu verification for gemms and fixing convolutions tflops calculation. * Fix counter usage issue and arch filtering per ops. * Adding changelog and other fixes. * Improve examples and resolve critical bugs. * Reduce build time for python examples. * Fixing minor bug. * Fix compilation error. * Improve installation instructions for dispatcher. * Add docker based installation instructions for dispatcher. * Fixing arch-based filtering to match tile engine. * Remove dead code and fix arch filtering. * Minor bugfix. * Updates after rebase. * Trimming code. * Fix copyright headers. * Consolidate examples, cut down code. * Minor fixes. * Improving python examples. * Update readmes. * Remove conv functionality. * Cleanup following conv removable. [ROCm/composable_kernel commit: `9e049a32a1`]	2026-01-22 09:34:33 -08:00
assistant-librarian[bot]	5289967b9b	Merge commit '44f481a45ca75b234ba60fdc3dc68974b1b86164' into develop	2026-01-22 14:20:23 +00:00
Bartłomiej Kocot	9c3ab51d9b	[CK TILE] Fix basic gemm pipelines (#3611 ) * [CK TILE] Fix basic pipelines * fixes [ROCm/composable_kernel commit: `44f481a45c`]	2026-01-22 08:11:18 -06:00
assistant-librarian[bot]	872c034358	Merge commit '8daf6ea3026aebe3481792c03026692631059725' into develop	2026-01-22 09:19:13 +00:00
ApoorvaKalyani	ec0f5c82ca	Grouped conv_fwd_bias_bnorm_clamp instances and tests (#3525 ) * Added bias_bnorm_clamp instances. * fwd_bias_bnorm_clamp comp instances * fwd_bias_bnorm_mem_inter and mem_intra instances * fwd_bias_bnorm_merged_group_instances * fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances * Device level changes for fwd_bias_bnorm_clamp * Added the test to the regression test list. * Removed the part 2 and 2x instances * Removed the irrelevant checks in wmma * Refactored the instances to adapt to new device implementation * Updated the reference and include files * enabling tests * Added missing profiler * Added missing instance entry , deleted by mistake * Reduce bias bnorm clamp instances to only a single generic one. * Clean up cmakelists file * clang-format * Change bias bnorm clamp tests to use monotone initialization values to avoid tiny off-integer gemm results on RDNA3 from blowing up. * Renaming some instance lists and add functions to be more standardized. * Commented out non default instances. --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `8daf6ea302`]	2026-01-22 09:53:59 +01:00
assistant-librarian[bot]	8f10da355a	Merge commit '0b13697a88e77a733d36b14353df1c0a7ae756df' into develop	2026-01-22 08:17:11 +00:00
Linjun-AMD	f6fac4cea6	[CK_TILE][FMHA]Add new tile size for async (#3623 ) * Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)" This reverts commit cfdad49edda4b2ccef92571f23646a8505bb2859. * Add new tile_size for async pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0b13697a88`]	2026-01-22 16:07:14 +08:00
assistant-librarian[bot]	45e2275fc4	Merge commit 'dd0b4294afcf188f4a9154b7eea19f8e786c9539' into develop	2026-01-22 05:20:00 +00:00
ltqin	14254656f0	Fp8 block scale quantization for fmha fwd (#3330) * add block scale parameters to kernel * add block scale to kernel * add smoke test * format * Revert "format" This reverts commit `356c3c9706`. * only format my code * format py * fix auto not allowed in function prototype * change instance tttt to ttff * fix structured binding issue * change s_acc elementwise op * async pipeline add block scale * add quantization P using shift exp2 * precompute (m - shift) once per row * change blk scale seqstrt ptr name * fix some name * fix for deduction guide * fix some comments * add P scale to qr_ksvs_pipeline * add comment to idx_identity * change the method of calculating descale block index * unify naming style: use block_scale_ as name prefix * unify naming style * update the CHANGELOG.md * Add FP8 block scale quantization support for FMHA forward kernel --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `dd0b4294af`]	2026-01-21 20:58:26 -08:00
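
Note on b6f1e99074 (#3625): the commit message describes padding the flags prefix of the Stream-K workspace buffer so that its size is 128B-aligned, which places the partials region on a 128-byte boundary. The arithmetic can be sketched as below. This is only an illustration under stated assumptions: `round_up`, `WorkspaceLayout`, `make_workspace_layout`, and `kAlignment` are hypothetical names chosen here and are not taken from the CK Tile sources.

```cpp
#include <cstddef>

// Alignment quoted in the commit message (128 bytes). The constant name is hypothetical.
constexpr std::size_t kAlignment = 128;

// Round `bytes` up to the next multiple of `alignment` (alignment must be non-zero).
constexpr std::size_t round_up(std::size_t bytes, std::size_t alignment)
{
    return (bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical description of the workspace: flags first, then padding, then partials.
struct WorkspaceLayout
{
    std::size_t flags_bytes;     // bytes actually used by flags
    std::size_t partials_offset; // first byte of the partials region (128B-aligned)
    std::size_t total_bytes;     // size of the whole workspace buffer
};

// Compute the layout for a given flags size and partials size.
constexpr WorkspaceLayout make_workspace_layout(std::size_t flags_bytes,
                                                std::size_t partials_bytes)
{
    return WorkspaceLayout{flags_bytes,
                           round_up(flags_bytes, kAlignment),
                           round_up(flags_bytes, kAlignment) + partials_bytes};
}

static_assert(round_up(0, kAlignment) == 0, "empty prefix needs no padding");
static_assert(round_up(1, kAlignment) == 128, "one byte pads up to a full 128B block");
static_assert(round_up(128, kAlignment) == 128, "already aligned sizes are unchanged");
static_assert(round_up(129, kAlignment) == 256, "one byte over spills into the next block");
```

With a 40-byte flags prefix, for instance, the partials region in this sketch would start at byte 128 instead of byte 40, so no partials would share a 128-byte block with the flags.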


4050 Commits
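
Note on bebf8c3720 (#3585): the commit message describes generating all indices in a single step and then applying the functor via pack expansion, instead of recursing. A minimal sketch of that idea follows. It assumes `std::make_index_sequence` as a portable stand-in for the `__make_integer_seq` builtin named in the commit (standard libraries may implement it with such a builtin, but that is implementation-specific). The `sequence`, `sequence_gen_impl`, `sequence_gen_t`, `uniform_sequence_gen_t`, `square`, and `constant_functor` definitions here are simplified illustrations and not CK's actual code.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Simplified compile-time integer list (illustrative, not CK's definition).
template <int... Is>
struct sequence
{
};

// Primary template; specialized below on std::index_sequence.
template <typename F, typename Indices>
struct sequence_gen_impl;

// All indices arrive as one pack, so the functor is applied to every index
// in a single pack expansion rather than through recursive instantiation.
template <typename F, std::size_t... Is>
struct sequence_gen_impl<F, std::index_sequence<Is...>>
{
    using type = sequence<F{}(static_cast<int>(Is))...>;
};

// sequence_gen_t<N, F> is sequence<F(0), F(1), ..., F(N-1)>.
template <std::size_t N, typename F>
using sequence_gen_t = typename sequence_gen_impl<F, std::make_index_sequence<N>>::type;

// Example functor: maps index i to i * i.
struct square
{
    constexpr int operator()(int i) const { return i * i; }
};

// Functor that ignores the index and returns a constant value.
template <int Value>
struct constant_functor
{
    constexpr int operator()(int) const { return Value; }
};

// A sequence of N copies of Value, built on the same single-step mechanism.
template <std::size_t N, int Value>
using uniform_sequence_gen_t = sequence_gen_t<N, constant_functor<Value>>;

static_assert(std::is_same<sequence_gen_t<4, square>, sequence<0, 1, 4, 9>>::value,
              "functor is applied to indices 0..3");
static_assert(std::is_same<uniform_sequence_gen_t<3, 5>, sequence<5, 5, 5>>::value,
              "uniform sequence repeats the value");
static_assert(std::is_same<sequence_gen_t<0, square>, sequence<>>::value,
              "empty sequence edge case");
```

Merging sequences is a different problem: as the commit message notes, it needs the cumulative lengths of heterogeneous inputs, which is why it is reduced pairwise in a binary tree rather than in one expansion. That part is not sketched here.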