composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-25 15:24:39 +00:00

Author	SHA1	Message	Date
Anton Gorenko	d0142f8223	[CK_TILE] FMHA Fix synchronization issue in FWD splitkv combine pipeline (#2934 ) * Fix validation of rotary embedding with time_kernel_ When rotary embedding is used, the appendkv kernel modifies the q tensor (multiple times when time_kernel_ is set). We need to reset the q buffer and rerun all kernels. * Fix synchronization issue in splitkv combine pipeline Different warps can read and then rewrite the same values of lse_acc_lds. Because warps can progress at different speeds, one warp can rewrite values that are still being read by another warp. Running the tests multiple times and, preferably, with multiple processes on the same GPU helps to trigger this issue: bin/test_ck_tile_fmha_fwd_fp16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure --gtest_filter="TestCkTileFmhaFwd/KV" [ROCm/composable_kernel commit: `c6bfd97c2d`]	2025-09-27 08:16:10 +05:00
emezh	daabe29bff	fix copy-paste bug in get_matrix_b; re-enable all tests in multi_abd (#2939 ) [ROCm/composable_kernel commit: `2aa06fbd45`]	2025-09-26 22:55:18 -04:00
assistant-librarian[bot]	088b4670ae	Merge commit 'ee9769616a51ed85edd8860fe5b976cec0cde037' into develop	2025-09-26 21:11:12 +00:00
lalala-sh	857566c8aa	fix wp gemm bug when permuteN is false (#2935 ) * fix wp gemm bug when permuteN is false * code clean --------- Co-authored-by: valarLip <340077269@qq.com> [ROCm/composable_kernel commit: `ee9769616a`]	2025-09-26 13:28:54 -07:00
assistant-librarian[bot]	dd38b01ac5	Merge commit 'a44bea45b205a84552e417a7b069d962d73c6cb1' into develop	2025-09-26 17:11:27 +00:00
Aviral Goel	5ebdd30e58	Integrate Multi D GEMMs into Grouped GEMMs along with unit tests (#2923 ) * feat(grouped_gemm_multi_d): add new example that integrates grouped_gemm and multi_d_gemm feature * feat: generalized grouped_gemm_kernel.hpp * feat: generalized grouped_gemm_kernel.hpp even further by removing hardcoded 0 * refactor: grouped_gemm_multi_d relies on grouped_gemm_kernel * tests(grouped_gemm): grouped_gemm test suite passes with minor adjustments * fix: segfault fix by passing correct parameters for d tensors * docs: add multi d info and trim down outdated content * tests: add unit tests for grouped_gemm_multi_d and minor changes in grouped_gemm related test for compatibility * style: clang format * fix: incorrect validation method and Dtensor layout in test suite [ROCm/composable_kernel commit: `a44bea45b2`]	2025-09-26 09:59:58 -07:00
assistant-librarian[bot]	77dcfaa687	Merge commit 'e40c0acef25cab3e6b2ac046e76886764fed0239' into develop	2025-09-26 16:13:26 +00:00
Geo Min	5f3f69dfc5	[TheRock CI] Adding MIOpen at HEAD (#2929 ) * Adding MIOpen at HEAD * Adding container and also adding CI run for .github paths * Adding correct flags * Adding patches * Adding exception for ck * rocm-libraries at new path * adding global safe dir * reorder * Fixing paths * Adding sharding [ROCm/composable_kernel commit: `e40c0acef2`]	2025-09-26 09:08:15 -07:00
rahjain-amd	8ad7f1b2ca	Disable Rapid Json to be used by Default (#2936 ) To enable the json dump we can now build with -DCK_ENABLE_JSON_DUMP=1 [ROCm/composable_kernel commit: `e92e69318e`]	2025-09-26 09:05:35 -07:00
Christopher Millette	659a331d36	Update CODEOWNERS [ROCm/composable_kernel commit: `f92b3c7a1e`]	2025-09-26 09:41:33 -06:00
assistant-librarian[bot]	f709601bbc	Merge commit '32773fe5cb176efd2fcbb361f183164fc6525d8a' into develop	2025-09-26 09:12:43 +00:00
Yi DING	5d7bc8b578	[CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918 ) [ROCm/composable_kernel commit: `32773fe5cb`]	2025-09-26 16:42:59 +08:00
assistant-librarian[bot]	11262543b7	Merge commit '518d24e6628eb0c91a56748d26ac8910813c8dcb' into develop	2025-09-26 05:13:10 +00:00
Jeff Huang	0957b78f76	Add sequence padding and variable length support in fmha (#2932 ) * [CK_TILE] Add sequence padding and variable length support in fmha (and v3) - Group Mode Padding: Introduces the `-s_qpad` argument to support physically padded layouts. Kernels now use padded start pointers (`seqstart_padded__ptr`) for memory addressing. - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens` arguments for efficient processing of variable-length sequences by passing cumulative effective lengths (`cu_seqlen__ptr`) to the kernel. - FMHA examples: Support padding and variable length in both group and batch mode. Dispatcher is updated as well (dispatch to kPadSeqLenK enabled pipeline). - New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well. These test cases and benchmarks specifically validate/benchmark the new padding and variable-length functionalities in both group and batch modes. * [CK_TILE] Fix build error in fmha unit tests * [CK_TILE] add mqa, gqa to sequence padding unit tests * [CK_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI * [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel [ROCm/composable_kernel commit: `518d24e662`]	2025-09-26 12:36:27 +08:00
assistant-librarian[bot]	19f49ee63e	Merge commit 'b0a2d99d100f2e4212ebbed080acb49a404035ab' into develop	2025-09-26 01:40:00 +00:00
kyle-256	3e6c83e13a	use inline function in hpp (#2922 ) [ROCm/composable_kernel commit: `b0a2d99d10`]	2025-09-25 18:29:26 -07:00
emezh	3c207a18b0	Verify `HostTensorDescriptor` when it is created (#2829 ) * add proper GEMM layout verification * Handle "auto" strides. CalculateStrides only called when tensor's strides are empty or all of them are <=0 (auto strides). CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims. ValidateStrides throws if any of the tensor's strides is <=0. profile_gemm_multiply_add updated to support "auto" strides for tensors. Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes) auto-strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1 Note, -1 should be deprecated (use 0 instead) explicit strides (same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128 explicit strides (not the same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 mix of explicit and auto strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0 invalid stride bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64 terminate called after throwing an instance of 'std::runtime_error' what(): Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1 Aborted (core dumped) * - add more names to ck::tensor_layout for easier namespace hierarchy checking - updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp * added handling of partially initialized strides for GEMM. fixed more tests. * clang-format and more fixes * replace long dash by a simple hyphen - causes build failure in CK codegen. * increase sizeof input, otherwise output size becomes zero or negative with large filter size * select stride based on layout * specify layout explicitly to avoid errors in HostTensorDescriptor creation * add validation for higher GEMM tensor dimensions.; Add docstring to `HostTensorDescriptor` * Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors * fix test (incl removing invalid config) * fix moe examples: - (in .cpp) add layout argument to non-2D tensors - (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index) * fix moe_gemm2 example. * fix profile and wmma examples * clean-up early mods for ckprofile. verified with: ``` ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 # ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128 # ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0 # ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0 # not implemented ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128 # ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138 # example_gemm_add_multiply_dl_fp16 example_gemm_add_multiply_xdl_fp16 # ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0 ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128 ``` * temporary skip first 8 test configs - they throw error * temporary skip first 8 test configs in wmma too - they throw error --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `db2524be2d`]	2025-09-25 18:22:13 -07:00
assistant-librarian[bot]	e575ac4332	Merge commit 'ec4d16b991d16379b785f61b0043ebcfa3fb0914' into develop	2025-09-25 23:11:46 +00:00
Illia Silin	4567c988ca	Enable CI on gfx1100 (#2930 ) * run CI on different versions of gfx11 * do not use gfx1151 systems [ROCm/composable_kernel commit: `ec4d16b991`]	2025-09-25 16:10:54 -07:00
assistant-librarian[bot]	b8448ab68d	Merge commit '8c1a95991330118930f23e6a2ba8e76068d8ca22' into develop	2025-09-25 18:15:45 +00:00
Illia Silin	a4f310c7b1	use default docker for build/test on gfx950 (#2928 ) [ROCm/composable_kernel commit: `8c1a959913`]	2025-09-25 10:40:45 -07:00
Cong Ma	578566f809	Congma/ck tile/remove cpp 20 code (#2873 ) * Remove C++20 code C++20 features should not be used in CK. Remove all C++20 code. * fix c++17 build * format * fix merge issue --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `a5d1e25ec7`]	2025-09-25 10:34:28 -07:00
Khushbu Agarwal	bb5eeef2af	Fix for Add the API to load SGPR (#2913 ) * Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)" This reverts commit `5cc40c160f`. * Fix: sgpr minor issue * cyclic dependency resolved * clang formatted * removing unused variable * clang formatted --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b56e5d1d79`]	2025-09-25 10:32:42 -07:00
Illia Silin	5a39b14c52	Add AITER test_mha_varlen (#2927 ) * add aiter test_mha_varlen * don't fail until all aiter test run * use the original way to run tests, just add new test [ROCm/composable_kernel commit: `64e61b8647`]	2025-09-25 10:00:20 -07:00
Illia Silin	80f0af1e91	fix clang format (#2926 ) [ROCm/composable_kernel commit: `9f6fc9fe09`]	2025-09-25 09:35:35 -07:00
assistant-librarian[bot]	0e513e86a4	Merge commit '929291741d44e05ab3b199f836d9be97c6e294f8' into develop	2025-09-25 15:27:24 +00:00
Jobbins	b7a9ea456b	[Jenkins] Remove 'Jenkins - ' prefix (#2920 ) The prefix makes the status updates from gitStatusWrapper distinct from those created by the Jenkins server, which creates duplicates. [ROCm/composable_kernel commit: `929291741d`]	2025-09-25 09:08:29 -06:00
assistant-librarian[bot]	9d8734c878	Merge commit 'ab22f91a7c63a34af3198411d064a760b1edebbc' into develop	2025-09-25 03:25:33 +00:00
ltqin	24a8daf662	fix fmha fwd kernel name (#2880 ) * fix fmha fwd kernel name * if the input and output types are the same, keep the original code [ROCm/composable_kernel commit: `ab22f91a7c`]	2025-09-24 20:00:10 -07:00
assistant-librarian[bot]	58b3560182	Merge commit 'df97a286d5486de76bcd2bd7c634b11287cd12ca' into develop	2025-09-25 01:39:57 +00:00
yinglu	c5fdba5a96	Conv:TF32: add more instances - 1 (#2867 ) * conv:tf32:add more instances * add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * remove gnhwc/ngchw/ngcdhw instances [ROCm/composable_kernel commit: `df97a286d5`]	2025-09-25 09:27:18 +08:00
assistant-librarian[bot]	3ef0545001	Merge commit 'f076f207ceb3d8199ddc8219a2859b38a63d3c5e' into develop	2025-09-24 20:12:53 +00:00
linqunAMD	0c45597a4e	[CK] Fix misc issues in CK examples (#2890 ) * [CK] Fix misc CK issues * revert fp8 change, it causes CI fail. * resubmit fp8 change [ROCm/composable_kernel commit: `f076f207ce`]	2025-09-24 11:28:20 -07:00
assistant-librarian[bot]	bef885dc89	Merge commit '8fe3838c65ab4c290423ff0e952e882c19e2c60d' into develop	2025-09-24 17:12:28 +00:00
Illia Silin	7e537fd72f	Upgrade to ROCm7.0.1 compiler. (#2909 ) * upgrade default docker to rocm7.0.1 * turn on build and test on gfx950 by default * use rocm-dev instead of rocm * link libhiprtc for codegen targets * resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards --------- Co-authored-by: Astha Rai <astha.rai713@gmail.com> [ROCm/composable_kernel commit: `8fe3838c65`]	2025-09-24 10:00:53 -07:00
assistant-librarian[bot]	95324c306e	Merge commit 'fe0a47a011c2adcb54dfc94a3029feb7b9980deb' into develop	2025-09-24 09:13:05 +00:00
Yi DING	02db6094b9	[CK_TILE] FMHA BWD Add D96 Instances (#2916 ) [ROCm/composable_kernel commit: `fe0a47a011`]	2025-09-24 17:04:23 +08:00
Johannes Graner	408b3945c3	[CK Tile] Implement Invoker pattern for remaining grouped convolution examples (#2894 ) * Invoker for grouped_conv_fwd * Invoker for grouped_conv_bwd_data * Fix incorrect out layout identifier [ROCm/composable_kernel commit: `15fff74503`]	2025-09-24 10:22:38 +02:00
assistant-librarian[bot]	dff23bcae1	Merge commit '68056847887d7479a6055db6579739f555348c69' into develop	2025-09-24 08:14:46 +00:00
Jingwei Liao	e868ffa390	add fmha dtype fp32 (#2914 ) [ROCm/composable_kernel commit: `6805684788`]	2025-09-24 15:28:39 +08:00
assistant-librarian[bot]	167e5ab3b5	Merge commit 'dcd33a6ecc30e18cc8491ed03926ab5ac8b6f1c3' into develop	2025-09-24 06:15:34 +00:00
Sami Remes	aac547782b	[CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903 ) * Fix issue with constexpr checks in scaling/cshuffle * Remove IsLoadableTile * Move amd_wave_read_first_lane before first usage [ROCm/composable_kernel commit: `dcd33a6ecc`]	2025-09-23 23:08:18 -07:00
Thomas Ning	8a563fc79d	Fix the gfx950 numerical errors (#2911 ) * Update grouped_gemm example and pipeline * Find the root cause: the transpose was not enabled correctly on gfx950 * Fix v3 pipeline, row and col major * Disable f8 datatype tests, which fail on gfx950 * fix the abd test by clearing the unsupported runtime argument --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com> [ROCm/composable_kernel commit: `b159841a06`]	2025-09-23 22:54:52 -07:00
assistant-librarian[bot]	a55a7e37ec	Merge commit 'f161b5b738781c71bd5f2c191561b81f679ba9ed' into develop	2025-09-23 23:11:18 +00:00
asleepzzz	5cc40c160f	Revert "[CK-Tile] Add the API to load SGPR (#2878 )" (#2904 ) This reverts commit `fb5e953a05`. [ROCm/composable_kernel commit: `f161b5b738`]	2025-09-23 14:33:51 -07:00
assistant-librarian[bot]	c39d5ca2c5	Merge commit '959df2a15563155329f1d77b2151c3744ff2d749' into develop	2025-09-23 17:11:10 +00:00
Haocong WANG	add2107be0	[FMHA FWD] gfx950 Accuracy enhancement & bug fix (#2900 ) * disable cast_tile_pk_fp16_fp32 on gfx950 * fix wrong encoding when hdim is not a power of 2 --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `959df2a155`]	2025-09-24 00:59:41 +08:00
assistant-librarian[bot]	7eedf242f1	Merge commit '7b16782d7cbf05be6d03d5c001081fad8df97919' into develop	2025-09-23 13:18:39 +00:00
Haocong WANG	0eede5af24	[CK_TILE] Fix fmha bwd (#2865 ) * Fix fmha bwd filter * remove unnecessary change * enable test cases --------- Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `7b16782d7c`]	2025-09-23 19:59:27 +08:00
assistant-librarian[bot]	a1c9274f98	Merge commit '2cbbf5dcb3bf315b9486a2c677ffcd6aa72b5298' into develop	2025-09-23 09:13:08 +00:00
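
Note on #2932 (variable-length batch mode): that entry describes passing cumulative effective lengths to the kernel. The sketch below only illustrates what such an array could look like. It assumes the common prefix-sum convention with a leading zero, which the log itself does not confirm, and the helper name `cumulative_seqlens` is hypothetical and not part of CK.

```python
from itertools import accumulate


def cumulative_seqlens(eff_lens):
    """Hypothetical helper: turn per-sequence effective lengths into prefix sums.

    Assumed convention (not confirmed by the commit log): the result starts
    with 0 and has len(eff_lens) + 1 entries.
    """
    if any(n < 0 for n in eff_lens):
        raise ValueError("effective lengths must be non-negative")
    return [0, *accumulate(eff_lens)]


cu = cumulative_seqlens([3, 5, 2])
# Consecutive differences recover the per-sequence effective lengths.
lengths_back = [b - a for a, b in zip(cu, cu[1:])]
```

Under this convention a single array carries both the per-sequence lengths (as differences) and the running total (as the last entry).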


3885 Commits
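
Note on #2829 (`HostTensorDescriptor` verification): the entry describes "auto" strides (empty, or all values <= 0) being computed from the lengths, column-major handling limited to the inner two dimensions, and an exception for invalid strides. The following is a minimal sketch of that behaviour as described in the log, not CK's implementation. The function name `resolve_strides`, the `"row"`/`"col"` layout strings, and the leading-dimension check are assumptions made for illustration, and the sketch does not cover the partially initialized strides the commit also handles.

```python
def resolve_strides(lens, strides, layout="row"):
    """Hypothetical helper: compute packed strides when none are given, then validate."""
    rank = len(lens)
    if not strides or all(s <= 0 for s in strides):
        # "Auto" strides: derive a packed layout from the lengths.
        order = list(range(rank - 1, -1, -1))  # fastest-varying dimension first
        if layout == "col" and rank >= 2:
            # Column-major is assumed to affect only the inner two dimensions.
            order[0], order[1] = order[1], order[0]
        strides = [0] * rank
        running = 1
        for d in order:
            strides[d] = running
            running *= lens[d]
    strides = list(strides)
    if len(strides) != rank or any(s <= 0 for s in strides):
        raise ValueError(f"invalid strides {strides} for lengths {lens}")
    if rank >= 2:
        # Assumed check: the leading stride must span one full inner row/column.
        lead, unit = (rank - 2, rank - 1) if layout == "row" else (rank - 1, rank - 2)
        if strides[lead] < lens[unit] * strides[unit]:
            raise ValueError(f"invalid strides {strides} for lengths {lens}")
    return strides


packed_row = resolve_strides([128, 128], [])            # auto strides, row-major
packed_col = resolve_strides([128, 128], [0, 0], "col")  # auto strides, column-major
padded_row = resolve_strides([128, 128], [130, 1])       # explicit, padded leading stride
```

With lengths `128 128` and explicit strides `64 1`, this sketch raises, mirroring the "Invalid strides for RowMajor" failure quoted in the commit message.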