composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-30 19:57:40 +00:00

Author	SHA1	Message	Date
Sami Remes	6f90564708	fix formatting	2025-10-31 20:16:52 +00:00
Sami Remes	fe92102baf	add some documentation and 2d block scale example	2025-10-31 20:13:43 +00:00
Cong Ma	bcccafee40	Update tile distribution for 2D bquant	2025-10-30 22:55:43 -04:00
Sami Remes	68e41da5f2	fix formatting	2025-10-30 08:48:39 +00:00
Sami Remes	1290b1b28a	simplify conditions that are needed for tile distributions	2025-10-29 17:33:44 +00:00
Sami Remes	5e0a356e19	Remove commented code Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-29 11:19:02 +02:00
Sami Remes	e1475d4a52	one more fix to tile dstr, and revert debug initialization	2025-10-28 19:13:16 +00:00
Sami Remes	7c93551878	fix formatting	2025-10-28 18:55:51 +00:00
Sami Remes	e12ab566e8	Fix some issues from the merge	2025-10-28 18:55:17 +00:00
Sami Remes	a449728fdd	Merge remote-tracking branch 'origin/develop' into samremes/bmatrix_2d_blockscale	2025-10-28 17:49:53 +00:00
Mateusz Ozga	da4247a6df	[CK_TILE] Fixed multi-abd GEMM test, NaN problem (#2979 ) * Multi-ABD NaN problem * Rollback tests --------- Co-authored-by: root <root@splinter-126-008d.aus.dcgpu> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-10-28 15:53:36 +01:00
Aviral Goel	4368fd9f57	[CK_TILE] Add Bquant to Grouped Gemm (#3063 ) * update test cases * format codes * use GTEST_FAIL * add bquant to grouped_gemm * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * fix a bug in test_grouped_gemm_util * tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm * Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * feat: add bf8 support * chore: remove unnecessary decltype usage * chore: add default quant_mode to function signature as fallback * fix: pass correct runtime pipeline params in grouped_gemm bquant kernel Calculate has_hot_loop, num_loop, and tail_number on device side for each GEMM problem instead of using default values. This fixes incorrect results when different problems in the group have different K dimensions. * chore: set default quant mode in function signature * test: add additional test cases to cover edge case of no hotloop * chore: clang formatting --------- Co-authored-by: kyle-256 <Kyle.Zhao@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-28 10:20:24 -04:00
Khushbu Agarwal	b11f53a484	Fix quant scale matrix layout for block scale gemm (#3079 ) * Adding support for TiledPermuteN * Adding test * moving shuffle functions to common place * resolving commit hook * fix formatting	2025-10-27 13:56:07 -07:00
Johannes Graner	5c1974065e	[CK_TILE] Add conv fwd + bias + clamp example (#3012 ) * Implement argument passing to element-wise functions for fwd convolution * Add files for fwd + bias + clamp example * Implement Bias * Implement Clamp * Elementwise function composition * Composition unit test * Implement fwd + bias + clamp example * Simplify argument passing and composition * elfunc -> bias_and_clamp * Rename function to specify example * Move element-wise function instantiation to kernel * Make bias a runtime tensor * No ugly namespace aliasing * Initialize element-wise function on host * Remove function initialization helper, simplify Compose initialization * Remove unintended LSP compatibility patch * Clean up includes and unused code * Switch names in cshuffle epilogue * Move CDElementwise to conv traits * Re-add required include * Initialize bias in same way as other tensors * Better type specification for ds pointer * Disable 1D convolution * Add warning for non-group-constant bias	2025-10-27 18:43:09 +01:00
arai713	054fdb765c	[CK_TILE] Stream-K operator() Reboot (#3064) * Persistent Stream-K Kernel Implementation This change implements an operator() function in the reboot::StreamKKernel class that is enabled when the Persistent flag is set to true. In this case, the data-parallel portion and the Stream-K portion of the kernel are fully persistent. The changes were made in the reboot namespace. A future PR will remove the old Stream-K kernel class and remove the reboot namespace. * Unit Tests for Persistent Stream-K Kernel This change contains the initial test suite for the Persistent Stream-K Kernel. The files contain "reboot" in the name; a future PR will remove tests for the old Stream-K Kernel and remove the "reboot" naming. A future commit will add tests for the non-persistent kernel. Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase class. This allows us to estimate the number of accumulations done per macro tile in C to use during validation when computing relative and absolute tolerance. * Adding implementation for the Non-Persistent Stream-K kernel This code is adding the operator() function for the Non-Persistent Stream-K kernel. Persistency of the kernel is determined through a template argument. The Non-Persistent kernel will allocate additional workgroups for the data parallel section, leading to a different structure for processing the data parallel and Stream-K sections. There has been an addition to the TilePartitioner to get access to whether Persistent has been set to true or false in the StreamKKernel. * Adding in the tests for the Non-Persistent Stream-K kernel * Refactor Stream-K Reboot Unit Tests This commit makes the following changes: - Update test cases to determine M, N, and K based on the number of CUs. This ensures that each test case is one of Edge Case, SK Only, DP Only, or DP + 2 Tile SK regardless of the architecture. - Since the DP + 2 Tile SK test case takes long to run, this change moves this case into a separate .inc file and labels it as an extended test. - Since the extended test takes > 30 seconds to run, this test is added to the list of regression tests. * Fix spelling errors in comments for test cases Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Changes based on review Removed const volatile for typenames Set up alias for is_tuple_t Naming changes for clarity: GemmCommon -> BaseGemm Moved std::enable_if_t out of template parameters and changed to a return type for operator() Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance --------- Co-authored-by: Emily Martins <emily.martins@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-27 09:14:17 -07:00
Sami Remes	470d6e4df4	Merge remote-tracking branch 'origin/develop' into samremes/bmatrix_2d_blockscale	2025-10-27 15:17:14 +00:00
Sami Remes	2d86cd0081	fix formatting	2025-10-27 14:28:06 +00:00
Sami Remes	98deefac3e	Enable NWarps replication for bquant tile dstr	2025-10-27 14:09:07 +00:00
Sami Remes	37738e4cb8	Add more specialized tile distributions	2025-10-27 13:43:02 +00:00
Adam Osewski	f53d857b25	[CK_Builder] Add name member to unary elementwise ops & update builder traits. (#3093 ) * Add name member to unary elementwise ops. * Update elementwise_op_name to check for name attribute. * Require that the layout is derived from BaseTensorLayout struct.	2025-10-25 07:27:03 -07:00
Max Podkorytov	86d542f663	[CK-Tile][Async gemm] add missing sync and f8 inputs test cases (#3000 ) * add missing sync and f8 test cases * reformat test cases * comment failing cases * bump * reintroduce compv4 shapes	2025-10-24 12:16:01 -07:00
Khushbu Agarwal	0584399571	[CK_TILE] Adding support for TiledPermuteN on preshuffle Block Scale Gemm (#3019 ) * Adding support for TiledPermuteN * Adding test * resolving remod.py --------- Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>	2025-10-24 11:06:51 -07:00
Max Podkorytov	fdcc1f75c3	limit the rotating count to prevent oom (#3087 )	2025-10-24 08:55:34 -07:00
kyle-256	3c12a02827	[CK_TILE] add tensorwise quant in grouped gemm (#3007 ) * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * clang format * use GTEST_FAIL * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * change cmake * change code based on comments --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-10-24 07:41:54 -07:00
Gino Lu	bedade2572	[CK_TILE] Add fp4 warp gemm 16x16x128 (#2738 ) * first commit * fix format error * fix vec size error * fix clang format * fix type error * add interface in warp_gemm_impl * fix interface * fix bug * fix bug --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-10-23 10:55:51 -07:00
Qianfeng	fbd101b1ac	[CK_TILE] Fix in set_slice_tile (#2232 ) Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-10-23 10:34:02 -07:00
Haocong WANG	0d3860dfdb	[CKTILE] FMHA fwd trload lse fix (#3046 ) * enable storelse for fmha_fwd_trload kernel * fix lse in trload * fix the mask related bug	2025-10-23 09:33:33 +08:00
lalala-sh	211d64e18a	[CK_TILE] Update flatmm related kernels (#3022 ) --------- Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-10-22 22:36:11 +08:00
Johannes Graner	cbd1279ae6	[CK_TILE] Conv bwd splitN support (#3047 ) * Conv bwd splitN support * Adjust splitting calculations to lengths format * Prepare indexing for future splitK support	2025-10-22 13:34:06 +02:00
Sami Remes	d100ab690a	fix formatting	2025-10-22 11:12:39 +00:00
Sami Remes	f179a8a97b	remove commented code and enable all tests again	2025-10-22 11:09:21 +00:00
MHYangAMD	5a27a97391	Introduce tree reduction for BlockReduce2dCrossWarpSync (#2588 ) * Introduce tree reduction for BlockReduce2dCrossWarpSync * Rename original impl to BlockReduce2dLinearCrossWarpSync * Replace warp_size with get_warp_size() --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-10-22 14:41:35 +08:00
Sami Remes	bb52cd9889	Fix handling of n dim blocks in tile windows etc	2025-10-21 15:51:23 +00:00
Yashvardhan Agarwal	35754d2ec8	fix identity value of AbsMax (#3058 ) * fix identity value of AbsMax - Identity value of AbsMax should be 0 not numeric<T>::lowest() * Update include/ck_tile/core/utility/reduce_operator.hpp resolved comment Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> --------- Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2025-10-21 14:42:08 +02:00
Johannes Graner	4043401db1	Fix race conditions in ck_tile remod (#3061 )	2025-10-21 09:35:04 +02:00
Max Podkorytov	2570462ecf	[CK_TILE] Fix transpose_vectors for 2x2 8-bit tiles (#3042 ) fix transpose_vectors logic for 2x2 8-bit tiles add a test which goes through this code path. factor out constexpr'd cases into smaller functions. add inline docs about the data movement impact: gemms with 8-bit non-rcr inputs on gfx942	2025-10-20 13:40:44 -07:00
Sami Remes	36b88c665c	WIP	2025-10-20 15:42:39 +00:00
Gino Lu	fb1d090f3c	[CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044 ) * Patch for pk_fp4_raw_t buffer load and ref check	2025-10-20 14:47:04 +08:00
AviralGoelAMD	b03764ca5a	docs: add inline comments about flush_cache and rotating buffer	2025-10-17 12:56:47 -04:00
Yashvardhan Agarwal	889ffc0b1d	fix identity values in Max and AbsMax (#3048 ) - The identity value method returned the minimum positive number while we need the lowest number for Max and AbsMax operations	2025-10-17 09:49:21 -07:00
Emily Martins	352dee5225	Fix CK Tile Stream-K BF16 Validation Errors (#3039) Prior to this change, the number of accumulations passed into calculate_rtol_atol was 1. That said, in most cases, this is not correct when there are multiple workgroups contributing to the same macro tile in C. This change uses the function estimate_num_wgs_per_tile, which was extracted into a common file and generalized, to estimate the number of workgroups per macro tile. This estimate is passed into calculate_rtol_atol to ensure we get a better relative and absolute tolerance.	2025-10-17 09:33:38 -07:00
Johannes Graner	8a4cd32d86	Pre-commit in CI (#3029 ) * Pre-commit in CI * Specify python version, and install dos2unix for remod * Refactor remod hook to correctly install dependencies * Run pre-commit	2025-10-17 09:28:38 -07:00
Johannes Graner	d40b50b9d5	Update pre-commit to fixed versions, run remod for ck_tile (#2895 ) * Fix ruff linter errors * Fix remod dos2unix command * Clang format * Ignore utility in remod * Run remod * Specify clang-format version in pre-commit * Specify ruff version * Include PoolKernelArgs in reference_pool * Add calculate_total_elements to reference batched contraction * Fix calculate_total_elements declaration * Refactor remod pre-commit hook * Fix Aquant tests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-10-16 15:29:17 -07:00
Sami Remes	9988a46af2	WIP: trying to figure out tile dstr and/or indexing for scale matrix	2025-10-16 15:53:32 +00:00
Emily Martins	cb83d52301	Style updates and cleanup The following changes were made - Renamed iter to iter_start - Renamed tile_iter to tile_iter_start - Moved documentation from member variables to getters - Removed double underscore from extra_iters_before_me variable - Defined parent header in impl file - Removed unused includes	2025-10-16 08:47:06 -06:00
Astha	8f75d7cea6	Addition of the derived structs for the new Stream-K TilePartitioner There are 2 derived structs based on whether Stream-K is persistent or not. If it's persistent that means that both the data parallel and Stream-K sections are persistent. If it's non-persistent that means that only the Stream-K section is persistent, while the data parallel section will have separate workgroups allocated for it. Both structs will have a template argument for Persistent. The 2 derived classes will inherit common variables and functions from the Stream-K TilePartitioner base class. There are additional variables for the differing data parallel sections that will be added to each derived class, that are in charge of the indexing/bookkeeping for the data parallel sections. The only additional function that will differ between the 2 structs is GridSize(), as the non-persistent will allocate extra workgroups for data parallel. Unit tests for the derived structs are included.	2025-10-16 08:47:06 -06:00
Emily Martins	f87f768d16	Stream-K Tile Partitioner Base Class with Tests To better align with the original Stream-K paper, this change implements a new Stream-K tile partitioner base class. This class will handle the Stream-K setup that is common to both a persistent and non-persistent DP section. A later change will implement derived classes to handle the differences between persistent and non-persistent DP. This change also includes unit tests for the base tile partitioner.	2025-10-16 08:47:06 -06:00
Illia Silin	3348f01e6f	re-enable clang-format by default (#3030 ) * re-enable clang-format by default * fix clang format	2025-10-15 07:43:11 -07:00
felix	4c826abfff	Felix/opt sorting (#2902 ) * merge felix/sorting * opt moe sorting (#2822) * opt moe storing for 2k --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-10-15 09:24:03 +08:00
joyeamd	b9d74e7746	update s_barrier's logic in gfx12 architecture (#3003 ) change s_waitcnt's logic in gfx1250 change s_waitcnt's logic in gfx1250 update comment	2025-10-14 08:49:34 -07:00

504 Commits