composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
Aviral Goel	ee7a68b10f	chore(copyright): update copyright header for include directory (#3293) [ROCm/composable_kernel commit: `de6466481f`]	2025-11-26 11:00:05 -07:00
Po Yen Chen	7713c5071b	[CK_TILE] Share partition index across threads and specify offset in load_tile()/async_load_tile()/load_tile_transpose() (#2905 ) * Allow sharing partition index across threads * Fix typo PartitoinIndex -> PartitionIndex * Remove C++20 'requires' usages * Add missing template arguments * Fix load_tile() overload ambiguity issue * Use SFINAE to exclude invalid arguments * Add additional offset parameter to the async_load_tile() * Remove async_load_tile() default argument to avoid ambiguity * Extract tile_window coordinate compute logic as method * Use warp-shared LDS base address in tile_window::async_load() * Add constraint to tile_window::load() templates * Fix wrong type traits is_class_v<> usages * Add missing constraint to async_load_tile() * Add missing tile_window::load() overload * Add more constraint to avoid load_tile() call ambiguity * Rename ParitionIndex as ReplacementPartitionIndex * Update pre_computed_warp_coords_ in move_extended() * Fix inconsistency between template parameters and documentation * Allow specifying pre-computed parition index * Add type straits is_sequence<> & is_tile_distribution<> * Add type straits is_tensor_view<> * Add type constraints to make_tile_window() templates * Allow passing partition_index to set_tile_if() * Allow specifying partition_index to store_tile() * Add missing template parameter of replace_bottom_tensor_view() * Allow passing partition_index to Default2DEpilogue * Make get_partition_index() public * Add _with_offset() postfix to avoid resolution error * Remove ReplacementPartitionIndex template param * Add missing comments * Add load_tile_transpose_with_offset() overload [ROCm/composable_kernel commit: `40d2ed0f2a`]	2025-11-12 10:26:14 +08:00
Ville Pietilä	4694b1b4a7	[CK_TILE] Improve grouped conv kernel name generation (#3028) * Improve the grouped conv kernel name generation in CK Tile. * Fix building CShuffle epilogue tests. --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `9ee9f4d2a3`]	2025-10-30 13:19:07 +01:00
Johannes Graner	3b8e9864c6	[CK_TILE] Add conv fwd + bias + clamp example (#3012 ) * Implement argument passing to element-wise functions for fwd convolution * Add files for fwd + bias + clamp example * Implement Bias * Implement Clamp * Elementwise function composition * Composition unit test * Implement fwd + bias + clamp example * Simplify argument passing and composition * elfunc -> bias_and_clamp * Rename function to specify example * Move element-wise function instantiation to kernel * Make bias a runtime tensor * No ugly namespace aliasing * Initialize element-wise function on host * Remove function initialization helper, simplify Compose initialization * Remove unintended LSP compatibility patch * Clean up includes and unused code * Switch names in cshuffle epilogue * Move CDElementwise to conv traits * Re-add required include * Initialize bias in same way as other tensors * Better type specification for ds pointer * Disable 1D convolution * Add warning for non-group-constant bias [ROCm/composable_kernel commit: `5c1974065e`]	2025-10-27 18:43:09 +01:00
lalala-sh	63e0a73bd3	[CK_TILE] Update flatmm related kernels (#3022) --------- Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `211d64e18a`]	2025-10-22 22:36:11 +08:00
Thomas Ning	be09203966	[CK Tile] CShuffle Tile Permute N all warp compatible (#2966) * solve the hard_code issue of kM2 * clang format [ROCm/composable_kernel commit: `b4a4aa2b64`]	2025-10-03 09:46:13 -07:00
Cong Ma	578566f809	Congma/ck tile/remove cpp 20 code (#2873 ) * Remove C++20 code C++20 features should not be used in CK. Remove all C++20 code. * fix c++17 build * format * fix merge issue --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `a5d1e25ec7`]	2025-09-25 10:34:28 -07:00
Sami Remes	aac547782b	[CK_TILE] Fix cshuffle epilogue issue with IsLoadableTile (#2903) * Fix issue with constexpr checks in scaling/cshuffle * Remove IsLoadableTile * Move amd_wave_read_first_lane before first usage [ROCm/composable_kernel commit: `dcd33a6ecc`]	2025-09-23 23:08:18 -07:00
Sami Remes	8d2a444c55	[CK_TILE] Tensor-wise scaled quant gemm kernel (#2846 ) * rename gemm_group_quant to gemm_quant * Add TensorWise quant mode * Cshuffle epilogue tests with tensor scaling * Add tensor quant to example * Don't use readfirstlane for reading scales - doesn't work for some reason * Add to changelog * revert include - from a merge problem? * revert common.hpp include * revert host.hpp include * remove unused utility function * rename quant pipeline problem * refactor quant tests * remove aquant utils * use TEST_F * fix all tests by changing gemm config * Use typed tests * fix copyright [ROCm/composable_kernel commit: `4363a82bd6`]	2025-09-19 16:52:35 -07:00
Mateusz Ozga	64e1f86daf	[CK_TILE] Multiple-ABD GEMM example (#2788 ) * Multi ABD - initial commit * Clang-foramt fix * block gemm, unify the name of CDataType * Apply chnages to mem-pipeline * Rollback prefix for DType and Layout * Gemm Kernel Basic, rename * WMMA config * Grouped GEMM * Clang-format * Dropout, name * Review v2 * Move element_wise fn to unnary, remov old ones fn * clang-format * Fix issue review * WP operator adjust to universal gemm * v2 prepare * Remove unused comment * Remove vectorsize * Rollback * Adjust pipeline for abd * Shuffle argument * CI-fail fix quant * Fix ag_br pipeline * Failing tests * Typo * Single argument support [ROCm/composable_kernel commit: `30ab1d6a71`]	2025-09-19 01:14:11 +02:00
Cong Ma	9b65e9ec43	[CK TILE GEMM] set correct value to TiledMMAPermuteN_ (#2839) - TiledMMAPermuteN_ should be set to true when config if GemmConfigPreshufflePrefill [ROCm/composable_kernel commit: `e5d73da2da`]	2025-09-13 20:54:08 -07:00
lalala-sh	f8c8263798	[CK_TILE] Add permuteN optimization to remove lds operation in c_shuffle (#2764) * permuteN optimization to remove lds operation in c_shuffle * add the change log --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `75570d0fa8`]	2025-09-08 22:02:48 -07:00
Thomas Ning	8e30f972f6	[FIX] fix on fmha_bwd (#2784 ) * fix on fmha_bwd * Add 'const' to the Default2DEpilogue call operator * Fix more calls to Default2DEpilogue --------- Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `42a43d1523`]	2025-09-08 14:31:27 +08:00
Sami Remes	abf4f7a7b2	[CK_TILE] Row/Col quant gemm (#2729 ) * Add cshuffle epilogue test * add the poc implementation to the epilogue and tests * refactor cshuffle epilogue * WIP: adding tensor/tile usage to scale_tile * fix usage of tile_elementwise_inout * add gemm_quant_kernel for generalizing gemm quant kernel * Add problem specific to different quants, add QuantType to Traits * Add quant_type to quant_kernel template parameters * Create aq/bq_block_windows and views depending on QuantType * Use tile windows as inputs in cshuffle epilogue * Fix some issues in epilogue * initial new example code for new general gemm quant kernel test * Fix issues in kernel * Add verification check for rowcol Quantmode * use AccDataType instead of AQ in pipeline * fix aquant preshuffle * fix formatting * some cleanup * remove gemm_aquant_basic.cpp * remove gemm_aquant_kernel.hpp * fix tests for the renamed quant kernel * fix formatting * clean example files * fix some merge conflicts * fix preshufflequant rename issue * fix some templates after merging with develop * fix test preshuffle parameter * fix formatting * Unify bquant kernel to the common quant kernel * remove bquant kernel also from common header * fix formatting * clean up commented code * fix formatting config hpp * fix merge mistake * Non-const for movable windows * fix formatting * Fix grammar in README Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Remove #include<bit> and clean up example * fix strides * Add some descriptions for move_windows --------- Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `c6010f2953`]	2025-09-04 16:17:12 -07:00
msaffari-amd	c48a0b3a9b	refactor: use snake_case naming in ck_tile/core components (#2766) [ROCm/composable_kernel commit: `47d020a993`]	2025-09-03 09:34:11 +02:00
Mateusz Ozga	e4010d5ea1	[CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752 ) * Init commit * Quick fix, CI fails * Remove CDElementWise * Add CDEELementWise --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `0758883fa4`]	2025-08-28 12:45:50 -07:00
John Afaganis	cef79d5f82	Revert "[CK-TILE] Default epilogue, adding support for D (#2629)" (#2746) This reverts commit `92037686ae`. [ROCm/composable_kernel commit: `508e7912f9`]	2025-08-26 09:48:49 -07:00
Mateusz Ozga	92037686ae	[CK-TILE] Default epilogue, adding support for D (#2629 ) * Extend 2d-epilogue, D support * Added tests & update * Remove unused attribute * Extend tests --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `d43228fbca`]	2025-08-25 19:29:35 -07:00
linqunAMD	615ca9842d	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `9fcc1ee9fd`]	2025-08-18 10:08:31 -07:00
Tianyuan Wu	abb90422b4	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5. 
* Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `68134b60e4`]	2025-08-15 16:22:27 -07:00
Yi DING	19ef22e567	[CK_TILE] FMHA BWD Decode Pipeline (#2643) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 [ROCm/composable_kernel commit: `8e1eb0c1ee`]	2025-08-12 17:02:52 +08:00
Illia Silin	24f228df3b	upgrade from clang-format-12 to clang-format-18 (#2568) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config [ROCm/composable_kernel commit: `504b101da3`]	2025-07-28 11:34:07 -07:00
Cong Ma	29e1e00edd	ck_tile kernel for gemm with groupwise quantized A tensor (#2473 ) * ck_tile kernel for gemm with groupwise quantized A or B tensor. This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. 1. fp8, fp8 -> f32 2. bf8, bf8 -> f32 3. i4, fp8 -> f32 4. i4, bf8 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers. * [Block Scale Gemm] Only run gemm quant examples on __gfx94__ - Only run gemm quant examples on __gfx94__ for usage of `v_cvt_pk_fp8_f32` - Format the code * [Block Scale Gemm] Remove Bquant Gemm BlockScale This cleanup is in preparation for future development of bquant. By isolating Aquant-related code, we can streamline the codebase and make it easier to add and maintain bquant functionality in subsequent updates. * [Block Scale Gemm] Format code with clang-format-12 The latest clang-format (v19) in ROCm 7.0 generate different result than clang-format-12 which is used in CK CI. Format code with clang-format-12 for consistency. * [Block Scale Gemm] Split the k direction loop - Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp to make the logic clearer. - Disable C transposition. 
* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm * [Block Scale Gemm] Update copyright * test * Add TailHandler * Move TileDistributionEncodingPatternAQ * Refactor * refactor * fix bug * fix bug * help solve the PR comment * Format the code * [Block Scale Gemm] Add unit tests * [Block Scale Gemm] Add support to 16x16x32 MFMA - Add support to 16x16x32 MFMA - Fix a bug when exchange data crossing lanes --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com> Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `e62710e461`]	2025-07-23 00:10:16 -07:00
Bartłomiej Kocot	29cfe38b42	[CK TILE] Grouped Convolution Forward Kernel (#2188) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes [ROCm/composable_kernel commit: `cebdee4d9e`]	2025-06-20 15:44:36 -07:00
Max Podkorytov	7c10189a27	Reland fix default epilogue (#2367 ) * Revert "Revert "Fix default epilogue (#2358)" (#2364)" This reverts commit `f85c70b31e`. * add operator() with old signature [ROCm/composable_kernel commit: `11eb9f1c77`]	2025-06-19 10:39:30 -07:00
Thomas Ning	f85c70b31e	Revert "Fix default epilogue (#2358)" (#2364) This reverts commit `b29e3830a6`. [ROCm/composable_kernel commit: `64a2fda713`]	2025-06-17 22:43:05 -07:00
Max Podkorytov	b29e3830a6	Fix default epilogue (#2358 ) * [ck-tile] fix default epilogue in gemm universal * argument validation needs vector size D * operator() needs to specify dram windows * copy/paste from cshuffle epilogue * clang-format * mark unused argument --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cd606f72c1`]	2025-06-17 17:30:21 -07:00
Mateusz Ozga	6b3ddd0e23	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a refrence function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails [ROCm/composable_kernel commit: `bd96ac9742`]	2025-06-13 19:39:11 +02:00
kylasa	afbc0625f4	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build. [ROCm/composable_kernel commit: `5f1ad09b61`]	2025-06-12 18:24:02 -07:00
Thomas Ning	46624a1abd	Epilogue cshuffle Improvement (#2312 ) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues * solve the previous PR error, refine the code * Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Comment addressed * handling tile_engine failing case * handling tile_engine failing case --------- Co-authored-by: joyeamd <John.Ye@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: khushbu agarwal <khuagarw@amd.com> [ROCm/composable_kernel commit: `06e0b8436c`]	2025-06-10 22:44:50 -07:00
Khushbu Agarwal	7f9758a186	[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262 ) * debugging issue * debugging issue * debugging * debugging * reverting debugging code * clang formatted * updating default_config.json * fix ci failure * clang formatted [ROCm/composable_kernel commit: `59a85cb4bc`]	2025-06-03 20:16:10 -07:00
Illia Silin	27e480ff4c	Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185)" (#2260) This reverts commit `aef91ee587`. [ROCm/composable_kernel commit: `4e561af18c`]	2025-05-29 16:22:16 -07:00
joyeamd	aef91ee587	add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue (#2185 ) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues [ROCm/composable_kernel commit: `fd6a859b44`]	2025-05-29 14:31:14 +02:00
jakpiase	4e5e1d469c	fix for default epilogue (#2167) [ROCm/composable_kernel commit: `cb07ad84d5`]	2025-05-07 10:46:53 -07:00
jakpiase	b77cfe1ad5	[CK_TILE] Remove scratch usage from universal gemm (#2001 ) * moves kbatch condition outside of kernel * add reviewer comments * fixes * fix tests * fixes after review --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `0bcb804ad0`]	2025-05-05 18:46:44 +02:00
Khushbu Agarwal	7795e976da	Support for MFMA_16x16x128 for fp8/bf8 (#2125 ) * Adding 16x16x128 support for gfx950 * Support for fp8 and bf8 * fix input arguments for MFMA scale instruction * clang-formatted * Fixes for lwpck-3145 (#2138) * Fix lds tile & cmake dep & default epilogue * Fallback BTypeToUse to ADataType in WOQ cases * reverting instance json file * reverting instance json file --------- Co-authored-by: Yi DING <yi.ding@amd.com> [ROCm/composable_kernel commit: `d107f3c3a5`]	2025-04-28 18:19:50 -07:00
ruanjm	ce1d20c2c6	[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861 ) * 50ms -> 28ms * Fix bug in non fuse_add_store cases * Fine tuned setting for 2 pass pipeline * adjust workload * remove unnecessary change * add layernorm * Adding output quant and unquant results at the same time. * fix test * fix format * tune for cases 128x640 and 128x1024 * bug ifx [ROCm/composable_kernel commit: `d49abdaa87`]	2025-03-25 20:09:45 +08:00
kylasa	676d236a5e	Addressing (Post Merge) code review comments for PR 1845 (#1883 ) * Addressing code review comments. * Addressing code review comments. * Reorganized code for better readability. * add ck_tile gemms for new types in CI * fix jenkins syntax * fix script syntax * Add the test cases back * Address the review comments * Address review comments * clang format * Solve the merging issues * Addressed the comments * clang format --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `66c5f5b0b6`]	2025-03-06 11:40:30 -08:00
kylasa	0aee5c2d16	Support for dtypes (fp8, bf8, bf16 and fp16) for the ck_tile/03_gemm example. (#1845 ) * Support bf16/fb8/bf8 datatypes for ck_tile/gemm * remove commented out code. * Addressing code review comments and enabling universal_gemm for all the supported data types. * Merge conflict resolution. * Solve the memory pipeline compilation error. Merge with the new change of CShuffle * finish the feature, pass the tests * Fix the pipeline and add the benchmark script for other data types --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `ab5d027866`]	2025-02-07 15:07:06 -07:00
Bartłomiej Kocot	5becfaa24f	[CK TILE] Implement cschuflle algorithm (#1842 ) * [CK TILE] Implement cschuflle algorithm * Rebase * Vector store size fixes * fixes * Fixes * fixes * fmha fix * fixes * fixes of fixes [ROCm/composable_kernel commit: `25e2e0f04a`]	2025-01-30 11:57:39 +01:00
ruanjm	e540af436e	[CK_TILE] Add Various Fusion Functions to RMSNorm (#1802 ) * Add shortcut to RMSNorm * Modify test for adding shortcut for RMSNorm * Add fused parameter into tests * 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp * 1. Supports various stride and percisions. * Add support of Epilogue * Add fuse and epilogue support to rmsnorm ref * Modify rmsnorm example * Refactor tests/examples * Bug fix for newly added tests/examples * Bug fix for new tests 2 * Modify smoke test scripts remove dbg code * Supports non-smooth dyanmic quant * Update Rmsnorm2dFwd::GetName() * rename xscale and prec_sx to smoothscale and prec_sm Bug fix after rename Remove files * change example_rmsnorm2d_fwd.cpp * update performance calculator * Fix issue in two-pass when fuse add is enabled * Remove comment of beta --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `04dd314883`]	2025-01-15 10:23:48 +08:00
Bartłomiej Kocot	a5a7f2675f	[CK TILE] GEMM and Batched GEMM SplitK support (#1724 ) * [CK TILE] Add split K support in GEMM * Updates * Fixes * rebase * fix * Fix * fixes * support for batched gemm [ROCm/composable_kernel commit: `af66494880`]	2024-12-28 14:40:17 +01:00
carlushuang	232d1462a1	[CK_TILE] layernorm have more accurate residual (#1623 ) * more accurate residual * modify comment * Fix literal case in README.md --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `cb6c5d39dc`]	2024-11-02 13:30:16 +08:00
carlushuang	38d6f8a8e2	[CK_TILE] layernorm support fused-quant/fused-add (#1604 ) * add prenorm/postnorm support, refactor using generate.py * update README * update README * fix format * update some description and fix format * update format * format * use non-raw for loading * format and update n4096 * dynamic-quant ready * update readme * support fused dynamic-quant * update fused-quant, with smooth * update README * update args * update some based on comment [ROCm/composable_kernel commit: `c3a4800c5f`]	2024-10-31 14:54:53 +08:00
Thomas Ning	0d711b3edf	Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535 ) * ake the cshuffle compilable * modify Mhe reference on gpu and cpu. Correaccess of cshuffle * fix the cpu reference code * Complete the in tile shuffle logic * restructure the kernel template input * change the naming pattern of ck_tile gemm pipeline * Re-format files using remod.py * Solve the fmha conflict with gemm * Comment Addressed from Carlus --------- Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `6f27bc9872`]	2024-10-10 18:02:22 +08:00
carlushuang	bb5dbf2f06	introducing ck_tile! (#1216 ) * enable gfx940 * switch between intrinsic mfma routines on mi100/200 and mi300 * fix mfma_int8 on MI300 * disable 2 int8 examples on MI300 * Update cmake-ck-dev.sh * restore gitignore file * modify Jenkinsfile to the internal repo * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * initial enablement of gfx950 * fix clang format * disable examples 31 and 41 int8 on gfx950 * add code * fix build wip * fix xx * now can build * naming * minor fix * wip fix * fix macro for exp2; fix warpgemm a/b in transposedC * unify as tuple_array * Update the required Python version to 3.9 * Update executable name in test scripts * re-structure tuple/array to avoid spill * Merge function templates * Fix format * Add constraint to array<> ctor * Re-use function * Some minor changes * remove wrong code in store_raw() * fix compile issue in transpose * Rename enum Rename 'cood_transform_enum' to 'coord_transform_enum' * let more integral_constant->constant, and formating * make sure thread_buffer can be tuple/array * temp fix buffer_store spill * not using custom data type by default, now we can have ISA-level same code as opt_padding * fix compile error, fp8 not ready now * fix fp8 duplicated move/shift/and/or problem * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode * fix scratch in fp8 kernel * update some readme * fix merge from upstream * sync with upstream * sync 
upstream again * sync 22 * remove unused * fix clang-format * update README of ck_tile example * fix several issue * let python version to be 3.8 as minimal * remove ck_tile example from default cmake target like all/install/check * remove mistake * 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg * fix some bug in group-mode masking and codegen. update README * F8 quantization for FMHA forward (#1224) * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline * Add element function to fmha api * Adjust P elementwise function * Fix bug of elementwise op, our elementwise op is not inout * Add some elementwise op, prepare to quantization * Let generate.py can generate different elementwise function * To prevent compiler issue, remove the elementwise function we have not used. * Remove f8 pipeline, we should share the same pipeline even in f8 * Remove remove_cvref_t * Avoid warning * Fix wrong fp8 QK/KV block gemm setting * Check fp8 rounding error in check_err() * Set fp8 rounding error for check_err() * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode * 1. codgen the f8 api and kernel 2. 
f8 host code * prevent warning in filter mode * Remove not-in-use elementwise function kargs * Remove more not-in-use elementwise function kargs * Small refinements in C++ source files * Use conditional_t<> to simplify code * Support heterogeneous argument for binary function types * Re-use already-existing scales<> functor template * Fix wrong value produced by saturating * Generalize the composes<> template * Unify saturates<> implementation * Fix type errors in composes<> * Extend less_equal<> * Reuse the existing template less_equal<> in check_err() * Add equal<float> & equal<double> * Rename check_err() parameter * Rename check_err() parameter * Add FIXME comment for adding new macro in future * Remove unnecessary cast to void * Eliminate duplicated code * Avoid dividing api pool into more than 2 groups * Use more clear variable names * Use affirmative condition in if stmt * Remove blank lines * Donot perfect forwarding in composes<> * To fix compile error, revert generate.py back to `4439cc107d` * Fix bug of p element function * Add compute element op to host softmax * Remove element function in api interface * Extract user parameter * Rename pscale and oscale variable * rename f8 to fp8 * rename more f8 to fp8 * Add pipeline::operator() without element_functor * 1. Remove deprecated pipeline enum 2. Refine host code parameter * Use quantization range as input * 1. Rename max_dtype to dtype_max. 2. Rename scale to scale_s 3.Add init description * Refine description * prevent early return * unify _squant kernel name in cpp, update README * Adjust the default range. 
* Refine error message and bias range * Add fp8 benchmark and smoke test * fix fp8 swizzle_factor=4 case --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `db376dd8a4`]	2024-04-15 19:27:12 -05:00

46 Commits