composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 05:01:25 +00:00

Author	SHA1	Message	Date
Tianyuan Wu	68134b60e4	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit `07a79e797d`. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit `ceee918007`. * Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit `7a7241085e`. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-08-15 16:22:27 -07:00
Gino Lu	e5623d3825	fix wrong nan producion. (#2640 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-08-14 15:12:31 +08:00
Haocong WANG	05a6e92705	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-08-13 14:57:43 +08:00
asleepzzz	5b39de4bb6	Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 ) This reverts commit `b7322a521a`.	2025-08-12 20:27:10 +08:00
Haocong WANG	b7322a521a	Optimize fmha fwd decode & prefill for gfx950 (#2641 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-08-12 19:43:14 +08:00
Yi DING	4fde1646e5	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628 ) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp	2025-08-12 11:11:55 +08:00
Yashvardhan Agarwal	191c62967b	Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 ) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format	2025-08-11 15:01:33 +02:00
Gino Lu	5d6d236b25	Add e8m0 scaled convert into CK_TILE (#2617 ) * first commit * remove redundent code * modify according to comments. * fix type_convert error with scaled_type_convert	2025-08-07 21:37:28 +08:00
Yi DING	b0a97498b0	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad	2025-08-07 21:24:43 +08:00
Sami Remes	ffdee5e774	[CK_TILE] Enable printing more structures in CK-Tile (#2443 ) * Add more printing to core cktile * Revert other changes in static encoding pattern * Refactor to using a free print() function * Remove loops and print just the containers * Print tuple with better formatting, fix sequence compilation * Add some tests for print utility * Add print utility header * Print for static_encoding_pattern * add buffer_view printing * Align vector_traits * Fix formatting * Lower-case enum strings Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Remove empty comment lines * Fix test with lower-case too * Reduce repeated code in print tests, move helper function closer to type definition, test X&Y * Add test_print_common.hpp * add print.hpp in core.hpp --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-08-07 15:45:27 +03:00
Yashvardhan Agarwal	4750b293fe	General 2D Reduction Kernel (#2535 ) * General 2D Reduction Kernel * Move the reduction kernel from the example * Split the code and add the necessary policy, problem, shape files as per ck_tile convention * Add/modify the headers * Modified the example to work with the 'new' kernel * Added tests for the kernel * N-D refernce reduce * Added support for N-D input with transform to 2D * Added padding to support various input sized tensors * Bug fix in the thread buffer constructor * Some comments to explain the reduce2d block kernel * comments resolution * clang-format * comments resolution * clang-format * clang-format * comments resolution * clang-format	2025-08-06 15:36:59 +02:00
Adam Osewski	2622ff06cb	Remove unused lds direct load instruction. (#2573 ) This functionality is replaced by amd_async_buffer_load Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-06 15:16:12 +02:00
Illia Silin	788e8a878e	update the switch condition for buffer built-ins (#2602 )	2025-08-01 14:30:07 -07:00
Anton Gorenko	7b074249f4	[CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571 ) * Add tests for host convesion f32/f16 to f8 * Add tests for host convesion from f8 to f32/f16 * Fix UB and corner cases in f32/f16 to/from f8 conversion * There are UBs when very small values are converted to f8: bitshifts can be larger that type width. Using unsigned long long does not help because exponent_diff >= 64 in such cases. This causes that values like 2.117582368e-22 are converted to non-zero f8 in host validation of FMHA tests, test_f8 crashes with segfault in completely irrelevant code like GTest internals or produces non-deterministic results etc. * Fix FNUZ conversion to return NaN for NaN inputs. * Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16 conversion. * Replace some magic numbers with values from numeric_traits * Build tests only on devices supporting the type	2025-07-31 09:54:17 +05:00
Gino Lu	b25d512e8a	add constexpr to pk_fp4::pack/unpack() (#2586 )	2025-07-30 10:29:04 -04:00
Thomas Ning	9d4b494f07	Expand the bandwidth of direct_global_to_lds for gfx950 (#2576 ) * Expand the bandwidth of direct_global_to_lds for gfx950 * clang-format * fix the remod.py and script for clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-28 23:56:53 -07:00
Yi DING	1926cd0cb8	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130 ) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-29 09:31:14 +08:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
Gheorghe-Teodor Bercea	cbfa62e4b6	Refactor async loads to work on all GPUs (#2545 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-26 22:04:59 -07:00
Mateusz Ozga	b507d889c1	[CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520 ) * Init commit new API * apply clang-format * PreShuffle preapring * Apply Preshuffle condition to universal_gemm * Fix: convert size_t to index_t * Review changes * Mode 100755 -> 100644 --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-07-24 20:39:56 +02:00
Yashvardhan Agarwal	606b0cc947	[CK_TILE] Support for elementwise kernel (#2246 ) * Elementwise kernel implementation Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: yashagar <yashagar@amd.com> * Elementwise with generalized nDims * Adding the n-ary input tensor feature * Generalize dimensions on top of inputs * Add TFLOPS + remove std usage for tuples * 1D basecase optimization * Cleanup code + refactoring to a common interface * Generalize to unary and add an example * Cleanup, refactoring and commenting * Suggestions for LWPCK-3170: elementwise kernel improvements * Clang-format: remod.py * Replace InputTensorType with XDataType as the type of input_tensors * Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs * Move examples to folder 19_elementwise * Add missing copyright headers and fix some existing ones * Replace an assert with throw std::runtime_error in elementwise example * Avoid reading the output by using make_static_distributed_tensor for y_tile * Removed two unused includes * No need to move windows to the next block when each workgroup processes a single tile * Only copy input tensors to the device * Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example * Adding output strides to the kernel, transposition example and update the other examples * Changes made by remod.py * Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp * Reuse generic reference binary/unary operation in examples + refactoring the transpose reference * Fix comments in elementwise_example.cpp - Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses - ElementWiseTraits was renamed to ElementWiseShape - Adopt suggestions made by Copilot when prompted to check for factual or typographical errors * Simplify CMakeLists.txt and remove the unused variables this uncovers * Rename a file and fix some copyright statements * Changes made by script/clang-format-overwrite.sh * Add basic unit test for ElementWiseKernel * Remove left-over uninformative comment in apply unit test * Changes made by clang-format-overwrite.sh * fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view * Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp * Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array * Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally * Move examples to folder 20_elementwise * Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK * Fix CLang formating * Bump up the elementwise example folder number * Elementwise: add padding + minor cleanup * Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view * Add isSupportedArg to Elementwise kernel + addapt example and unit tests * Fix clang-format on the unit test file --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-07-24 11:21:45 +02:00
Haocong WANG	a5fdc663c8	fix async copytest bug (#2509 ) * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * change bit width check * refactor macros into constexpr functions which still get inlined * wrap s_waitcnt api * parameterize test * cleanup * cleanup fp8 stub * add fp8 test cases; todo which input parameters are valid? * replace n for fp8 in test cases * add large shapes; fp8 fails again * change input init * test sync/async * time the test * clang-format test * use float instead of bfloat to cover a 4-byte type * fix logic - arg sections should be 'or'd * make block_sync_lds_direct_load interface similar to old ck * fix a few comment typos * name common shapes * revert the example to original logic of not waiting lds * clang-format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-23 00:14:02 -07:00
Cong Ma	e62710e461	ck_tile kernel for gemm with groupwise quantized A tensor (#2473 ) * ck_tile kernel for gemm with groupwise quantized A or B tensor. This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. 1. fp8, fp8 -> f32 2. bf8, bf8 -> f32 3. i4, fp8 -> f32 4. i4, bf8 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers. * [Block Scale Gemm] Only run gemm quant examples on __gfx94__ - Only run gemm quant examples on __gfx94__ for usage of `v_cvt_pk_fp8_f32` - Format the code * [Block Scale Gemm] Remove Bquant Gemm BlockScale This cleanup is in preparation for future development of bquant. By isolating Aquant-related code, we can streamline the codebase and make it easier to add and maintain bquant functionality in subsequent updates. * [Block Scale Gemm] Format code with clang-format-12 The latest clang-format (v19) in ROCm 7.0 generate different result than clang-format-12 which is used in CK CI. Format code with clang-format-12 for consistency. * [Block Scale Gemm] Split the k direction loop - Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp to make the logic clearer. - Disable C transposition. * [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm * [Block Scale Gemm] Update copyright * test * Add TailHandler * Move TileDistributionEncodingPatternAQ * Refactor * refactor * fix bug * fix bug * help solve the PR comment * Format the code * [Block Scale Gemm] Add unit tests * [Block Scale Gemm] Add support to 16x16x32 MFMA - Add support to 16x16x32 MFMA - Fix a bug when exchange data crossing lanes --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com> Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-07-23 00:10:16 -07:00
Yi DING	f1d8ad2818	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>	2025-07-16 23:56:22 -07:00
Gino Lu	141bf2d54d	[CK_TILE] Add pk_fp4 data type (#2422 ) * [draft] Add pk_fp4 and test * Add hw conversion for fp4 * Refine test code and pk_fp4 constructor. * fix test indent * modify according to comment. * fix clang-format * modify according comments. --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-07-14 20:35:06 +08:00
Illia Silin	e033a1b4bf	fix compilation errors with clang20 (#2464 )	2025-07-07 19:40:30 -07:00
Thomas Ning	f240ae3248	Enable Async Copy for MI355 (#2425 ) * add for async load builtin * add async load api * fix some compiling errors * fix a compiling error * fix some compiling errors * add a pipeline which copies from v4 * add a new pipeline for async load * fix some compiling errors * add async load tests * fix some issues in async load * fix * fix async inline assembly * fix async inline assembly * add ignore header file * comment some not gfx950 codes * comment some not gfx950 codes * fix a error * update async load apis * fix lds descriptor * fix a compiling error * fix some compiling errors * fix a descriptor issue * update lds descriptor * change async pipeline's tile distribution pattern from thread to warp * fix clang format * update async policy * fix a CRTP issue * fix a typo error * change lds layout * fix some sync issues * improve codes * delete the async test * fix a commented format issue * avoid compiling device functions when compile host * make gemm run * add the copy kernel support * finish the feature * Address comment * add the support for buffer_builtin * solved the merging problem * Comment Addressed --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: joyeamd <John.Ye@amd.com>	2025-07-07 10:08:49 -07:00
carlushuang	0aecb5ab68	default skip y point to r (#2457 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-06 23:54:34 -07:00
carlushuang	a8742f7e31	[CK_TILE][CORE] enhance slice_tile api (#2430 ) * support slice cross p * fix some bug in y_len * more case * fix a bug when R exist * support -1 to hint end of current length * format * change commit	2025-07-06 20:13:12 -07:00
Max Podkorytov	158ddeb8ce	[CK-TILE] File-level documentation for static encoding pattern (#2433 ) * add file-level comment * Finished the write-up --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-07-04 02:26:18 -07:00
Thomas Ning	e03293ebce	[CK Tile] Int8 Support on CK Tile GEMM (#2267 ) * updates to support int8 in 03_gemm example * added comments, using aliases, helper functions * test(gemm_universal): add test cases for int8 gemm pipeline * fix(test_gemm): fix for failing test unit test for int8 * test(ck_tile): add int8 unit test for gemm universal * refactor(gemm_universal): GPU reference verification for GEMM code improved * style(gemm_universal): removed extra comments and did clang format * merging recent changes to universal gemm to tile_engine * ck tile engine integration work * feat(tile_engine): add int8 support to tile engine ops/gemm * feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8 * style: Format code with clang-format-12 * refactor(tile_engine): address review comments * style: removed unhelpful comments & unused variables. * build: tile engine uses default config * feat: add int8 support for CK_TILE GEMM * style: added trailing commas to codegen_utils.py * refactor: tile engine * refactor: formatting and code review * refactor: code formatting for python files * fix: suppress build warning * add support for gfx950 * refactor:KWarpTile size in gemms util * Fix the branch and wrap up the k warp tile * Add bf8 integration * refactor: clang format and rebase --------- Co-authored-by: zjli2013 <leezhengjiang@gmail.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>	2025-06-25 08:20:35 -07:00
Yi DING	b8212864cf	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-06-24 01:33:31 +08:00
Po Yen Chen	7d669440a6	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors	2025-06-23 12:29:15 +08:00
Thomas Ning	107e3623c7	Transpose builtin macro defense (#2374 ) * add the macro defense * add the static assert check	2025-06-20 11:24:54 -07:00
joyeamd	a2f01141aa	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-06-18 01:28:34 -07:00
Satyanvesh Dittakavi	4c57157d50	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-06-17 11:54:30 -07:00
ruanjm	b34c234f51	Add support for specifying valid flag when fetching elements for tile_scatter_gather (#2332 ) * Add support for specifying valid flag when fetching elements for tile_scatter_gather Add constexpr for operator[] of TrueGenerator * Use different path when valid is enabled	2025-06-16 17:17:03 +08:00
Mateusz Ozga	bd96ac9742	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a refrence function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails	2025-06-13 19:39:11 +02:00
kylasa	5f1ad09b61	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build.	2025-06-12 18:24:02 -07:00
Thomas Ning	f59b8c7d3d	OCP FP8 Macro restructure (#2331 ) * solved the problem	2025-06-12 09:46:33 -07:00
John Afaganis	6635d1bb88	Remove usage of 'warpSize' variable as it has been deprecated (#2295 ) * SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround * SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo --------- Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com>	2025-06-10 07:34:54 -07:00
MHYangAMD	9fcf21a4ec	Fix fmha fwd precision issue on MI3XX series (#2285 ) * Fix fmha fwd precision issue on MI3XX series For fmha fwd fp16 cases, we found that using impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero. For examaple, fixing K,V to be all 1, and Q is random, which outputs are expected to be all 1. But we found that it would have some incorrect outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 = 0.0005 < 0.001) Thus, ck do not report this error. * Add option to switch rtn/rtz for fmha fwd	2025-06-10 15:03:23 +08:00
Sami Remes	1c6f83df6c	[CK_TILE] Tileloop persistent gemm - resubmit (#2299 ) * Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit `233e274077`. * Add missing header for kentry --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-06 14:18:49 -07:00
valarLip	8482977a37	extend buffer load to support load 32 bf16/fp16 at same time (#2291 )	2025-06-06 17:21:19 +08:00
Illia Silin	233e274077	Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191 )" (#2293 ) This reverts commit `ffb52783d0`.	2025-06-05 09:24:00 -07:00
Sami Remes	ffb52783d0	[CK_TILE] Tile loop persistent gemm kernel (#2191 ) * Implement tile loop persistent gemm kernel * Enable timing * Add tests for persistent gemm * Fix formatting * Fix gemm_basic * Rename True/False to Persistent/NonPersistent * Use only one set of layouts for persistent tests * Fix gemm example persistent template parameter * Fix formatting	2025-06-04 11:46:28 +03:00
valarLip	0fdbf6bcd1	extend buffer load for fp16/bf16x16 (#2270 ) * extend buffer load for fp16/bf16x16 * format	2025-06-02 10:29:54 +08:00
Illia Silin	8146e471f1	fix the buffer intrinsic names for clang >=20 (#2228 )	2025-05-23 14:58:25 -07:00
Illia Silin	1b846143c6	Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192 )" (#2227 ) This reverts commit `58f9e9ffbc`.	2025-05-22 15:41:17 -07:00
Aviral Goel	534d4594d0	Refactor tile_window.hpp, tile_window_linear.hpp into a CK Tile Hierarchy (#2214 ) * window_origin variable now in base class * abstracted more functions * consolidated tile_window_static_distribution and tile_window_static_lengths * clang format * skeleton code for tile_window and tile_window_linear consolidation * more abstraction * moved variables from child to parent * clang format * removed comments * removed debug code * removed debug code * abstracting traits WIP * consolidated traits * removed comments and clang formatted	2025-05-21 23:28:00 -07:00

1 2 3

128 Commits