composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-16 16:51:26 +00:00

Author	SHA1	Message	Date
Christopher Millette	659a331d36	Update CODEOWNERS [ROCm/composable_kernel commit: `f92b3c7a1e`]	2025-09-26 09:41:33 -06:00
Yi DING	5d7bc8b578	[CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918 ) [ROCm/composable_kernel commit: `32773fe5cb`]	2025-09-26 16:42:59 +08:00
Jeff Huang	0957b78f76	Add sequence padding and variable length support in fmha (#2932 ) * * [CK_TILE] Add sequence padding and variable length support in fmha (and v3) - Group Mode Padding: Introduces the `-s_qpad` argument to support physically padded layouts. Kernels now use padded start pointers (`seqstart_padded__ptr`) for memory addressing. - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens` arguments for efficient processing of variable-length sequences by passing cumulative effective lengths (`cu_seqlen__ptr`) to the kernel. - FMHA examples: Support padding and variable length both in group and batch mode. Dispatcher is updated as well (dispatch to kPadSeqLenK enabled pipeline). - New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that specifically validate/benchmark the new padding and variable-length functionalities in both group and batch modes. * [CK_TILE] Fix build error in fmha unit tests * [CK_TILE] add mqa, gqa to sequence padding unit tests * [CI_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI * [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel [ROCm/composable_kernel commit: `518d24e662`]	2025-09-26 12:36:27 +08:00
kyle-256	3e6c83e13a	use inline function in hpp (#2922 ) [ROCm/composable_kernel commit: `b0a2d99d10`]	2025-09-25 18:29:26 -07:00
emezh	3c207a18b0	Verify `HostTensorDescriptor` when it is created (#2829 ) * add proper GEMM layout verification * Handle "auto" strides. CalculateStrides only called when tensor's strides are empty or all of them are <=0 (auto strides). CalculateStrides now supports GEMM::ColumnsMajor order. The assumption is still that it applies only to the inner two dims. ValidateStrides throws if any of the tensor's strides is <=0. profile_gemm_multiply_add updated to support "auto" strides for tensors. Manual tests for profile_gemm_multiply_add (matrix B in Row and Col modes) auto-strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 -1 -1 -1 -1 -1 Note, -1 should be deprecated (use 0 instead) explicit strides (same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 128 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 128 128 128 128 128 explicit strides (not the same as auto) bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 bin/ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 mix of explicit and auto strides bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 128 128 128 128 0 invalid stride bin/ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 64 terminate called after throwing an instance of 'std::runtime_error' what(): Invalid strides for RowMajor: mLens: 128 128 , mStrides: 64 1 Aborted (core dumped) * - add more names to ck::tensor_layout for easier namespace hierarchy checking - updated convolutional layouts to use explicit ones or BaseConvolutionalLayout where it is not clear which layout to use (TBD) - see include/ck/library/utility/convolution_host_tensor_descriptor_helper.hpp * added handling of partially initialized strides for GEMM. fixed more tests. * clang-format and more fixes * replace long dash by a simple hyphen - causes build failure in CK codegen. * increase sizeof input, otherwise output size becomes zero or negative with large filter size * select stride based on layout * specify layout explicitly to avoid errors in HostTensorDescriptor creation * add validation for higher GEMM tensor dimensions.; Add docstring to `HostTensorDescriptor` * Not clear why permute test in test/permute_scale/test_permute_scale.cpp uses a lot of invalid strides. Setting layout to BypassLayoutVerification to avoid a lot of errors * fix test (incl removing invalid config) * fix moe examples: - (in .cpp) add layout argument to non-2D tensors - (in .hpp) fix asserts/failures that show up in Debug mode, specifically addressing 2D tensor by a single index (and 3D tensor by 2d index) * fix moe_gemm2 example. * fix profile and wmma examples * clean-up early mods for ckprofile. verified with: ``` ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 0 0 0 0 0 ckProfiler gemm_multiply_add 0 0 1 1 0 1 128 128 128 130 132 134 136 138 ckProfiler gemm_multiply_add 0 1 1 1 0 1 128 128 128 130 132 134 136 138 # ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 1 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 2 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 3 1 2 0 1 128 128 128 0 0 0 ckProfiler gemm_fastgelu 1 0 1 2 0 1 128 128 128 128 128 128 # ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 0 0 0 0 # ckProfiler gemm_add_relu 0 1 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 2 1 1 0 1 128 128 128 0 0 0 0 # not implemented # ckProfiler gemm_add_relu 0 3 1 1 0 1 128 128 128 0 0 0 0 # not implemented ckProfiler gemm_add_relu 0 0 1 1 0 1 128 128 128 128 128 128 128 # ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 1 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 2 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 3 1 1 0 0 128 128 128 0 0 0 0 0 ckProfiler gemm_add_relu_add_layernorm 1 0 1 1 0 0 128 128 128 130 132 134 136 138 # example_gemm_add_multiply_dl_fp16 example_gemm_add_multiply_xdl_fp16 # ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 0 0 0 ckProfiler gemm_blockscale_wp 7 1 1 1 1 0 1 128 128 128 128 128 128 ``` * temporary skip first 8 test configs - they throw error * temporary skip first 8 test configs in wmma too - they throw error --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `db2524be2d`]	2025-09-25 18:22:13 -07:00
Illia Silin	4567c988ca	Enable CI on gfx1100 (#2930 ) * run CI on different versions of gfx11 * do not use gfx1151 systems [ROCm/composable_kernel commit: `ec4d16b991`]	2025-09-25 16:10:54 -07:00
Illia Silin	a4f310c7b1	use default docker for build/test on gfx950 (#2928 ) [ROCm/composable_kernel commit: `8c1a959913`]	2025-09-25 10:40:45 -07:00
Cong Ma	578566f809	Congma/ck tile/remove cpp 20 code (#2873 ) * Remove C++20 code C++20 features should not be used in CK. Remove all C++20 code. * fix c++17 build * format * fix merge issue --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `a5d1e25ec7`]	2025-09-25 10:34:28 -07:00
Khushbu Agarwal	bb5eeef2af	Fix for Add the API to load SGPR (#2913 ) * Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)" This reverts commit `5cc40c160f`. * Fix: sgpr minor issue * cyclic dependency resolved * clang formatted * removing unused variable * clang formatted --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `b56e5d1d79`]	2025-09-25 10:32:42 -07:00
Illia Silin	5a39b14c52	Add AITER test_mha_varlen (#2927 ) * add aiter test_mha_varlen * don't fail until all aiter test run * use the original way to run tests, just add new test [ROCm/composable_kernel commit: `64e61b8647`]	2025-09-25 10:00:20 -07:00
Illia Silin	80f0af1e91	fix clang format (#2926 ) [ROCm/composable_kernel commit: `9f6fc9fe09`]	2025-09-25 09:35:35 -07:00
Jobbins	b7a9ea456b	[Jenkins] Remove 'Jenkins - ' prefix (#2920 ) The prefix is causing the status updates from gitStatusWrapper to be unique to the status updates that are created by the Jenkins server, which creates duplicates [ROCm/composable_kernel commit: `929291741d`]	2025-09-25 09:08:29 -06:00
ltqin	24a8daf662	fix fmha fwd kernel name (#2880 ) * fix fmha fwd kernel name * if the input and output types are the same, keep the original code [ROCm/composable_kernel commit: `ab22f91a7c`]	2025-09-24 20:00:10 -07:00
yinglu	c5fdba5a96	Conv:TF32: add more instances - 1 (#2867 ) * conv:tf32:add more instances * add instances of device_grouped_conv_fwd_xdl_f32_comp_instances * add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances * add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances * remove gnhwc/ngchw/ngcdhw instances [ROCm/composable_kernel commit: `df97a286d5`]	2025-09-25 09:27:18 +08:00
linqunAMD	0c45597a4e	[CK] Fix misc issues in CK examples (#2890 ) * [CK] Fix misc CK issues * revert fp8 change, it causes CI fail. * resubmit fp8 change [ROCm/composable_kernel commit: `f076f207ce`]	2025-09-24 11:28:20 -07:00
Illia Silin	7e537fd72f	Upgrade to ROCm7.0.1 compiler. (#2909 ) * upgrade default docker to rocm7.0.1 * turn on build and test on gfx950 by default * use rocm-dev instead of rocm * link libhiprtc for codegen targets * resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards --------- Co-authored-by: Astha Rai <astha.rai713@gmail.com> [ROCm/composable_kernel commit: `8fe3838c65`]	2025-09-24 10:00:53 -07:00
Yi DING	02db6094b9	[CK_TILE] FMHA BWD Add D96 Instances (#2916 ) [ROCm/composable_kernel commit: `fe0a47a011`]	2025-09-24 17:04:23 +08:00
Johannes Graner	408b3945c3	[CK Tile] Implement Invoker pattern for remaining grouped convolution examples (#2894 ) * Invoker for grouped_conv_fwd * Invoker for grouped_conv_bwd_data * Fix incorrect out layout identifier [ROCm/composable_kernel commit: `15fff74503`]	2025-09-24 10:22:38 +02:00
Jingwei Liao	e868ffa390	add fmha dtype fp32 (#2914 ) [ROCm/composable_kernel commit: `6805684788`]	2025-09-24 15:28:39 +08:00
Sami Remes	aac547782b	[CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903 ) * Fix issue with constexpr checks in scaling/cshuffle * Remove IsLoadableTile * Move amd_wave_read_first_lane before first usage [ROCm/composable_kernel commit: `dcd33a6ecc`]	2025-09-23 23:08:18 -07:00
Thomas Ning	8a563fc79d	Fix the gfx950 numerical errors (#2911 ) * Update grouped_gemm example and pipeline * find the root cause error in did not enable the transpose in gfx950 correctly * Fix v3 pipeline, row and col major * Disable f8 datatype tests, it fails on gfx950 * fix the abd test by clear the runtime argument unsupported --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com> [ROCm/composable_kernel commit: `b159841a06`]	2025-09-23 22:54:52 -07:00
asleepzzz	5cc40c160f	Revert "[CK-Tile] Add the API to load SGPR (#2878 )" (#2904 ) This reverts commit `fb5e953a05`. [ROCm/composable_kernel commit: `f161b5b738`]	2025-09-23 14:33:51 -07:00
Haocong WANG	add2107be0	[FMHA FWD] gfx950 Accuracy enhancement & bug fix (#2900 ) * disable cast_tile_pk_fp16_fp32 on gfx950 * fix wrong encoding when hdim is not exponentiation of 2 --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `959df2a155`]	2025-09-24 00:59:41 +08:00
Haocong WANG	0eede5af24	[CK_TILE] Fix fmha bwd (#2865 ) * Fix fmha bwd filter * remove unnecessary change * enable test cases --------- Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `7b16782d7c`]	2025-09-23 19:59:27 +08:00
Thomas Ning	fb5e953a05	[CK-Tile] Add the API to load SGPR (#2878 ) * Have a workable version for SGPR * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * substitute with the new sgpr read api * update the CHANGELOG * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. * change to static for logic * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. [ROCm/composable_kernel commit: `2cbbf5dcb3`]	2025-09-23 01:23:56 -07:00
Haocong WANG	d85ca87d97	[CK_TILE] FMHA FWD bug fix (#2888 ) * tempsave debug * fix the bug in fmha fwd_kernel * Remove unnecessary changes * Fix the buggy part * remove fmha fwd known failure cases [ROCm/composable_kernel commit: `b6e8994386`]	2025-09-23 15:00:46 +08:00
Yi DING	bfa145c418	FMHA BWD Avoid SetZero (#2799 ) [ROCm/composable_kernel commit: `ad259eeae2`]	2025-09-23 14:37:48 +08:00
Enrico Degregori	12225ce645	Wmma support for multiple ABD GEMM (#2803 ) * multi_abd wmma support: - Add multiple A and B support to multiple D implementation (gridwise level) - Add multi_abd GEMM (device level) - Add instances (xdl parity) - Add tests (both xdl and wmma) - Add examples - Add ckProfiler support (both xdl and wmma) * Fix bug in device print function * Fix unused template parameter * Fix batched gemm for multiABD gridwise implementation * Fix gemm_universal_reduce with multiABDs gridwise implementation --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3d29bff2f0`]	2025-09-22 18:49:06 -07:00
Max Podkorytov	799dc99e55	fixup build for #2871 when multiple device targets are used (#2885 ) [ROCm/composable_kernel commit: `de47ae2fdf`]	2025-09-22 08:02:41 -07:00
jakpiase	30403d077b	[CK_TILE] Add conv bwd weight two stage support (#2855 ) * resolved conflicts * add conv bwd weight twostage * fix one file * fixes after review * fixes * fixes * Fix --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `624c46866e`]	2025-09-22 15:31:25 +02:00
Sami Remes	8d2a444c55	[CK_TILE] Tensor-wise scaled quant gemm kernel (#2846 ) * rename gemm_group_quant to gemm_quant * Add TensorWise quant mode * Cshuffle epilogue tests with tensor scaling * Add tensor quant to example * Don't use readfirstlane for reading scales - doesn't work for some reason * Add to changelog * revert include - from a merge problem? * revert common.hpp include * revert host.hpp include * remove unused utility function * rename quant pipeline problem * refactor quant tests * remove aquant utils * use TEST_F * fix all tests by changing gemm config * Use typed tests * fix copyright [ROCm/composable_kernel commit: `4363a82bd6`]	2025-09-19 16:52:35 -07:00
Illia Silin	ee43f0f0be	Revert "[CK_TILE] Add sequence padding and variable length support in fmha (a…" (#2883 ) This reverts commit `7ede589f4b`. [ROCm/composable_kernel commit: `b765fe78f3`]	2025-09-19 08:15:02 -07:00
Bartłomiej Kocot	38e1718bda	Disable bwd weight split-k autodeduce for single stage kernels (#2856 ) * Disable bwd weight split-k autodeduce for single stage kernels * update interface tests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `29446da1d5`]	2025-09-19 16:27:50 +02:00
Yi DING	240de6ee26	[CK_TILE] FMHA BWD Fix Decode Accuracy (#2881 ) * [CK_TILE] FMHA BWD Fix Decode Accuracy * use s_waitcnt utils [ROCm/composable_kernel commit: `6cf3fdd21c`]	2025-09-19 21:45:02 +08:00
Jeff Huang	7ede589f4b	[CK_TILE] Add sequence padding and variable length support in fmha (a… (#2851 ) * [CK_TILE] Add sequence padding and variable length support in fmha (and v3) - Group Mode Padding: Introduces the `-s_qpad` argument to support physically padded layouts. Kernels now use padded start pointers (`seqstart_padded__ptr`) for memory addressing. - Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens` arguments for efficient processing of variable-length sequences by passing cumulative effective lengths (`cu_seqlen__ptr`) to the kernel. - FMHA examples: Support padding and variable length both in group and batch mode. Dispatcher is updated as well (dispatch to kPadSeqLenK enabled pipeline). - New padding test cases: Add padding test cases to `smoke_test_fwd.sh`, and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well. These test cases and benchmarks that specifically validate/benchmark the new padding and variable-length functionalities in both group and batch modes. * [CK_TILE] Fix build error in fmha unit tests --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `86dd59cd01`]	2025-09-19 17:36:49 +08:00
Anton Gorenko	a64deec3ba	[CK_TILE] FMHA Fix synchronization issues in BWD pipelines (#2876 ) * Run ctest with --output-on-failure * Fix synchronization issues in bwd pipelines The bwd kernel reuses the same area of LDS for ds (SGrad), bias and dbias (BiasGrad). This means that there must be block_sync_lds between loading one tensor and storing another to the same area. Heavy instructions like MFMA/WMMA and global loads are executed between reuses of the same memory so in MOST cases loading is finished by all warps before storing is started. However, sometimes warps progress at different speeds. Running the tests multiple times and, preferably, with multiple processes on the same GPU helps to trigger this issue: bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure [ROCm/composable_kernel commit: `2aec38f9ec`]	2025-09-19 11:34:45 +05:00
ltqin	fd80c78f50	Add input fp8 and output bf16 attention (#2726 ) * change host using fp16 to check * fp8 to fp8 compare * rewrite input parameters * add not squant * remove some output code * for scale = 1 * format * saturates only for fp8 * add fp8bf16 data type * add fp8bf16 data type * fix test fp8 code * add run_fp8bf16_tests * change fmha fwd example parameter(adding fp8bf16) * Support fp8bf16 for Aiter * Support aiter fp8bf16 in c++ * fix comment about fp8 in readme.md * add fp8fp32 * add fp8fp32 test * remove range_q etc. * format * fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type * add fp8bf16 to data_type function * change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8 * format * reset atol for fp8 * fix bug for atol --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `dd249f1cd6`]	2025-09-19 14:26:43 +08:00
Max Podkorytov	a00705b4fd	poc convert fnuz fp8 to non-native dtype similar to ocp (#2871 ) [ROCm/composable_kernel commit: `e469fee046`]	2025-09-18 22:51:01 -07:00
SamiAario-AMD	30b63f4c04	Add gemm weight preshuffle pk_int_t support (#2858 ) * Factor out the three separate copies of load_interleaved_pk_type into a common utility class * Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the weight preshuffle GEMM example * Remove a duplicate function * Add support for B tensor type pk_int4_t for the weight preshuffle GEMM, with tests included * I4 support introduced more failing test cases that mirror the existing ones for F8 * Simplify the check for which tests to skip (they all have F8 as A tensor type) * Add a changelog entry * add the test for v2 wp pipeline, polish the code, add the support of int4 for v2 wp pipeline * have a workable version for atomic add * Revert "have a workable version for atomic add" This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `47cd0d5cff`]	2025-09-18 21:26:10 -07:00
Mateusz Ozga	64e1f86daf	[CK_TILE] Multiple-ABD GEMM example (#2788 ) * Multi ABD - initial commit * Clang-foramt fix * block gemm, unify the name of CDataType * Apply chnages to mem-pipeline * Rollback prefix for DType and Layout * Gemm Kernel Basic, rename * WMMA config * Grouped GEMM * Clang-format * Dropout, name * Review v2 * Move element_wise fn to unnary, remov old ones fn * clang-format * Fix issue review * WP operator adjust to universal gemm * v2 prepare * Remove unused comment * Remove vectorsize * Rollback * Adjust pipeline for abd * Shuffle argument * CI-fail fix quant * Fix ag_br pipeline * Failing tests * Typo * Single argument support [ROCm/composable_kernel commit: `30ab1d6a71`]	2025-09-19 01:14:11 +02:00
Rostyslav Geyyer	83e2403545	Fix UB caused by reinterpret_cast (#2849 ) * Use bit_cast instead of reinterpret_cast to avoid UB * Apply same fix in ck_tile [ROCm/composable_kernel commit: `14bbc545ea`]	2025-09-18 07:12:37 -07:00
Yi DING	8bc9d6226d	[CK_TILE] FMHA Test Ignore Known Errors (#2872 ) [ROCm/composable_kernel commit: `7ee7915e94`]	2025-09-18 16:51:21 +08:00
aledudek	a9d74c3208	[CK_TILE] Fix batched_gemm tests for gfx950 (#2869 ) [ROCm/composable_kernel commit: `427dca076b`]	2025-09-17 16:43:41 -07:00
yinglu	19463895a8	TF32 POC in Conv3d on MI30x platform #2763 (second attempt) (#2852 ) * Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)" This reverts commit `954db22b39`. * fix compile error on gf12x * only run tf32 example on gfx942 * only build tf32 instance on gfx942 * ckProfiler:only support tf32 in gfx942 * delete unuseful messages [ROCm/composable_kernel commit: `dd7af118d7`]	2025-09-17 14:50:15 -07:00
Aviral Goel	a7a7fa13bb	build(grouped_gemm): added appropriate compiler flag to resolve numerical error for fp8 on gfx950 (#2868 ) [ROCm/composable_kernel commit: `7c934b72ab`]	2025-09-17 11:04:21 -07:00
Michał Kulikowski	5334a45c0e	[CK][Examples] - fixing grouped_conv_bwd_weight command parser. (#2840 ) -added parameter to change group count for grouped_gemm examples. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> [ROCm/composable_kernel commit: `5c4f52a02a`]	2025-09-17 10:39:48 -07:00
pmaybank	377a3da125	[CK_TILE] Add support for gfx12 in tile_engine for GEMM benchmarking (#2802 ) * initial work on adding support of gfx12 in tile_engine for GEMM benchmarking * add stage("Run TILE_ENGINE_GEMM Tests on gfx1201") to Jenkins config * make tile_[m/n/k] validation arch dependent [ROCm/composable_kernel commit: `592d73ad73`]	2025-09-17 17:59:01 +01:00
Gino Lu	f9660c00dc	[CK_TILE] Refine pk_fp4's fill, pack, and unpack (#2845 ) * fix bug * let pack/unpack return pk_fp4_t * fix clang-format [ROCm/composable_kernel commit: `c2997f2b7f`]	2025-09-17 10:54:06 +08:00
Aviral Goel	0fb1cfa4b7	fix(grouped_gemm): pipeline selection when tail_num varies per group and leads to numerical error (#2863 ) * fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num * WIP: add temp config to stress test numerical error correction * refactor: remove comments [ROCm/composable_kernel commit: `db79fad16f`]	2025-09-16 18:43:19 -07:00
Wojciech Laskowski	302398f3fd	Added wmma support for gemm quantization: (#2841 ) - profiler for gemm quantization for DL/XDL - tests for gemm quantization for DL/XDL - implementation for gemm quantization for WMMA - profiler/tests for gemm qunatization for WMMA Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `f97b2a3f5d`]	2025-09-16 16:23:29 -07:00

2391 Commits