composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 18:17:44 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	d5ffc2b1b9	Add Grouped Convolution and GEMM documentation (#1719 ) * Add Grouped Convolution docs * Add gemm docs * Update docs * fix [ROCm/composable_kernel commit: `85d6fcd30a`]	2025-02-04 16:41:49 +01:00
Bartłomiej Kocot	3cddb295b1	Fix duplication of pk_add_f16 symbols (#1858 ) [ROCm/composable_kernel commit: `11e4082dd8`]	2025-02-04 14:42:11 +01:00
Bartłomiej Kocot	5835ed012d	Fix pk_int4 cast and add pk_int4 dtype in ck tile (#1854 ) * Fix pk_int4 cast and add pk_int4 dtype in ck tile * fixes * Improvements * fix typo [ROCm/composable_kernel commit: `9ee69dd297`]	2025-02-04 10:32:07 +01:00
arai713	1ca4ad2739	Codegen hipRTC compilation (#1579 ) * updating codegen build for MIOpen access: adding .cmake for codegen component * updating CMake * adding in header guards for some headers due to issues with hiprtc compilation in MIOpen * some more header guards * putting env file in header guard * cleaning up some includes * updated types file for hiprtc purposes * fixed types file: bit-wise/memcpy issue * updating multiple utility files to deal with standard header inclusion for hiprtc * added some more header guards in the utility files, replacing some standard header functionality * added some more header guards * fixing some conflicts in utility files, another round of header guards * fixing errors in data type file * resolved conflict errors in a few utility files * added header guards/replicated functionality in device files * resolved issues with standard headers in device files: device_base and device_grouped_conv_fwd_multiple_abd * resolved issues with standard headers in device files: device_base.hpp, device_grouped_conv_fwd_multiple_abd.hpp, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp * added header guards for gridwise gemm files: gridwise_gemm_multiple_abd_xdl_cshuffle.hpp and gridwise_gemm_multiple_d_xdl_cshuffle.hpp * fixed issue with numerics header, removed from transform_conv_fwd_to_gemm and added to device_column_to_image_impl, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3, device_image_to_column_impl * replaced standard header usage and added header guards in block to ctile map and gridwise_gemm_pipeline_selector * resolved errors in device_gemm_xdl_splitk_c_shuffle files in regards to replacement of standard headers in previous commit * added replicated functionality for standard header methods in utility files * replaced standard header functionality in threadwise tensor slice transfer files and added header guards in element_wise_operation.hpp * temp fix for namespace 
error in MIOpen * remove standard header usage in codegen device op * removed standard header usage in elementwise files, resolved namespace errors * formatting fix * changed codegen argument to ON for testing * temporarily removing codegen compiler flag for testing purposes * added codegen flag again, set default to ON * set codegen flag default back to OFF * replaced enable_if_t standard header usage in data_type.hpp * added some debug prints to pinpoint issues in MIOpen * added print outs to debug in MIOpen * removed debug print outs from device op * resolved stdexcept include error * formatting fix * adding includes to new fp8 file to resolve ck::enable_if_t errors * made changes to amd_wave_read_first_lane * updated functionality in type utility file * fixed end of file issue * resovled errors in type utility file, added functionality to array utility file * fixed standard header usage replication in data_type file, resolves error with failing examples on navi3x * formatting fix * replaced standard header usage in amd_ck_fp8 file * added include to random_gen file * removed and replicated standard header usage from data_type and type_convert files for fp8 changes * replicated standard unsigned integer types in random_gen * resolved comments from review: put calls to reinterpret_cast for size_t in header guards * updated/added copyright headers * removed duplicate header * fixed typo in header guard * updated copyright headers --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `2e3183af4f`]	2025-01-31 09:48:39 -08:00
Adam Osewski	e029eaedae	[CK Tile] Spatially local GEMM tile partitioner. (#1843 ) * Add spatially local tile partitioner * Use 1D Grid size & create partitioner object. * Docs & use 1D partitioner in example. * Clang format. * Change kernel grid size Now: X is the # of output C-tiles, Y is the batch count Z is the splitK * Formatting & more doc. * Clang format. * Fix batched gemm test. Use 1d partitioner. * Move condition. * FIx ctor. * clang-format. [ROCm/composable_kernel commit: `ce448002ee`]	2025-01-31 00:10:16 +01:00
Bartłomiej Kocot	4f2c699f90	[CK TILE] Implement cschuflle algorithm (#1842 ) * [CK TILE] Implement cschuflle algorithm * Rebase * Vector store size fixes * fixes * Fixes * fixes * fmha fix * fixes * fixes of fixes [ROCm/composable_kernel commit: `25e2e0f04a`]	2025-01-30 11:57:39 +01:00
fangche123	e4d8548dc5	add batched_transpose implement (#1660 ) * add batched_transpose implement --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `c5fff071e5`]	2025-01-28 16:22:02 -08:00
darren-amd	54ae306398	Change flag to CK_GFX90A_DENORM_WORKAROUND (#1817 ) * Change flag from CK_WORKAROUND_DENORM_FIX to CK_GFX90A_DENORM_WORKAROUND for more clarity. Also changed the definition macros to be more clear. [ROCm/composable_kernel commit: `d6a4605e1c`]	2025-01-28 09:58:39 -05:00
Andriy Roshchenko	78f8490cb6	Add OCP FP8 support in CK_TILE (#1829 ) * Add OCP FP8 to CK_TILE * Validate OCP FP8 in FMHA FWD under VALID=1 [ROCm/composable_kernel commit: `35aebe5936`]	2025-01-27 11:59:49 -07:00
Adam Osewski	89093ac431	[CK-Tile] Enable vectorized reads on all layouts & improve perf. (#1835 ) * Refactor universal gemm policy. * Adapt example to refactor changes. * Introduce static encoding pattern * Adding shuffled encoding patterns. * Fix err in reverse tuple. * Add transpose_tile2d * Small refactoring + doc * Enable reading on contiguous dimension in all layouts. * Transpose A/B register tile if needed for comp v3 pipeline. * Take contiguous dim size when calculating dram vector load size. * A/B smem pack size taken from WarpGemm attributes * Update B LDS layout and setup tile distribution pattern at class level. * Fix static assert. * Fix errors in examples. * Formatting & fix IsTranspose * Fix VectorSize & refactor. * Add error loging messages. * Fix VecLoadSize and TranspseC for mem pipeline. * Update unit-tests & disable mem pipeline. * Clang format * Update include/ck_tile/core/tensor/tile_window.hpp Co-authored-by: jakpiase <jakub.piasecki@amd.com> * Fix compilation and reviewers comments. * Refactor unit-test. Fallback to non-universal gemm. Need to use GemmPipelineAGmemBGmemCRegV1 for now, since GemmKernel is now supporting also non-K major vector reads. --------- Co-authored-by: jakpiase <jakub.piasecki@amd.com> [ROCm/composable_kernel commit: `39dc25a9b8`]	2025-01-27 16:37:19 +01:00
ruanjm	dce7207ece	Implement fp8 quant for layernorm and rmsnorm (#1814 ) [ROCm/composable_kernel commit: `64d5c4d6cb`]	2025-01-24 16:40:43 +08:00
carlushuang	ae3d2e47c4	[CK_TILE] not using structures under ck_tile/ops for ck_tile/host (#1834 ) * not using structures under ck_tile/ops for ck_tile/host * update as constexpr function * Rename fn * Update other examples. --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com> [ROCm/composable_kernel commit: `5b9b083dbc`]	2025-01-24 15:35:54 +08:00
carlushuang	74b2592535	add fp8 as dst (#1830 ) [ROCm/composable_kernel commit: `052a72655c`]	2025-01-22 17:34:27 +08:00
Mateusz Ozga	91965f3411	Simplify static_cast if-lands (#1828 ) [ROCm/composable_kernel commit: `3db77bc4f2`]	2025-01-21 23:23:19 +01:00
Mateusz Ozga	b63e4bc4b8	CK-Tile Grouped GEMM refactor and post PR fixes (#1756 ) * Grouped gemm simple code refactor * Offset invoker * Invoke generic Run, and replace name of parrtitioner variable * Tests fix type * Removed namespaces * Add template param to avoid implicit cast * Remove generic function * Constant value * underline enum to int16_t * Generalize partitioner function * Remove whitespaces * Rename function * Using support * Clang-format * Clang-format * Fn-partitioner description fn * Typo * Typo 2 * Better description * Better description * Refactor after review * Use ctr instead of set fn * Inovke ctr and typo * Comments * Remove unnecessary comment * Review, remove modulo [ROCm/composable_kernel commit: `3c93d3c444`]	2025-01-21 21:06:10 +01:00
lucbruni-amd	80a206156b	Add CK_TIME_KERNEL as toggleable CMake Variable (#1794 ) * Disable CK_TIME_KERNEL by Default, Add as CMake Variable * Enable CK_TIME_KERNEL by Default, Maintaining CMake Variable Functionality. * Fix build error. [ROCm/composable_kernel commit: `3fb2f5acc7`]	2025-01-20 07:09:19 -08:00
Mingtao Gu	462b62be6b	fix a bug for int4 scale weight only kernel (#1820 ) Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `86d1b46aa6`]	2025-01-19 11:18:18 +08:00
Bartłomiej Kocot	6472bdb4ed	[CK_TILE] Add error threshold calculation for gemm examples (#1821 ) [ROCm/composable_kernel commit: `bdddf1eace`]	2025-01-18 01:01:52 +01:00
Bartłomiej Kocot	e65a010b5e	Fix and optimize dynamic unary elementwise (#1818 ) * Fix and optimize dynamic unary elementwise * fix [ROCm/composable_kernel commit: `1519ce91a3`]	2025-01-16 13:48:39 -08:00
carlushuang	2fec988802	[CK_TILE] Fix mock token id, support g1u1/g1u0 through same inline code block (#1808 ) * fix mock token id * prepare host for g1u1 * reformat inline-asm * restructure uk_0 * restructure gate_up * done * change default to init=1 * update readme * fix a bug in interleave pipeline * rcp for silu [ROCm/composable_kernel commit: `1ff50e78c6`]	2025-01-16 17:51:10 +08:00
Bartłomiej Kocot	2c4a1cce43	Add rounding for float to bf16 conversion as default (#1812 ) * Add rounding for float to bf16 conversion * Add bhalf test * Add inf test bhalf * Refactor * update cmake * Fixes [ROCm/composable_kernel commit: `7790e8c3f7`]	2025-01-15 16:41:21 +01:00
ruanjm	9f9eddd0cf	[CK_TILE] Add Various Fusion Functions to RMSNorm (#1802 ) * Add shortcut to RMSNorm * Modify test for adding shortcut for RMSNorm * Add fused parameter into tests * 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp * 1. Supports various stride and percisions. * Add support of Epilogue * Add fuse and epilogue support to rmsnorm ref * Modify rmsnorm example * Refactor tests/examples * Bug fix for newly added tests/examples * Bug fix for new tests 2 * Modify smoke test scripts remove dbg code * Supports non-smooth dyanmic quant * Update Rmsnorm2dFwd::GetName() * rename xscale and prec_sx to smoothscale and prec_sm Bug fix after rename Remove files * change example_rmsnorm2d_fwd.cpp * update performance calculator * Fix issue in two-pass when fuse add is enabled * Remove comment of beta --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `04dd314883`]	2025-01-15 10:23:48 +08:00
Thomas Ning	70e79bc56f	CK Tile GEMM CICD fixed & register block method refactor (#1776 ) * refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm * Finished the 2x2 warp gemm policy and the block selection mechanism * Clang format * address poyen's comment * Address feedbacks * Fixed the compilation issue * Change the function name [ROCm/composable_kernel commit: `5d671a5fc4`]	2025-01-13 13:10:44 +08:00
Qianfeng	3cc02417a9	Update for fmha_fwd qs_ks_vs pipeline (#1810 ) * Update for fmha_fwd qs_ks_vs pipeline * Remove _builtin_amdgcn_sched_barrier(0) * Move p_compute to p converting earlier for trying to increase vgprs re-using * Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation * Re-add __builtin_amdgcn_sched_barrier(0) --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `3d50f57f43`]	2025-01-13 12:43:05 +08:00
Bartłomiej Kocot	da7d6023cf	Grouped convolution backward weight special vector size loads (#1772 ) * Grouped convolution backward weight special vector size loads * Instnaces and tests * Fixes * Add 7 and 13 special cases * fix comments * Fix * Fix2 * fixes * fix atomic add bf16 [ROCm/composable_kernel commit: `fd46a01d8b`]	2025-01-10 22:02:30 +08:00
darren-amd	6bc57cf274	Disable building DPP kernels by default (#1804 ) * Disable building DPP kernels by default * Disable building dpp instances, examples, or tests if DPP_KERNELS is not set * Add new DPP_KERNELS flag to readme [ROCm/composable_kernel commit: `26b3829c02`]	2025-01-08 13:50:42 -05:00
Max Podkorytov	4c98908e17	mark unused args [ROCm/composable_kernel commit: `ad697c78ac`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	d7a2a81051	run clang-format -style=file [ROCm/composable_kernel commit: `a2e6ad62e2`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	e1896982b5	run clang-format==12 [ROCm/composable_kernel commit: `aa59ecaa22`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	715635839b	update comment in the policy [ROCm/composable_kernel commit: `82fb3f84fb`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	00c32ecda2	update qsksvs comment [ROCm/composable_kernel commit: `4daa82b451`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	099e23be84	remove dead code [ROCm/composable_kernel commit: `66c5b715c9`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	63cc962000	clang-format and remove dead code [ROCm/composable_kernel commit: `edb78a4729`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	25fdfe3df8	roll back splitkv [ROCm/composable_kernel commit: `60113859fa`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	d3d53433aa	update qsksvs pipeline [ROCm/composable_kernel commit: `bfc997a7e6`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	1d7c38642c	qsksvs pipeline changes to mirror qrksvs [ROCm/composable_kernel commit: `f7942b993c`]	2025-01-08 10:09:54 -08:00
AMD-dteng	12103d0f17	enable bias feature that add bias before adding residual (for rtpllm project) (#1741 ) * 1. enable bias feature that add bias before adding residual; 2. change block size from 128->64 when m<64 in fp16 * delete comment * 1.remove fmha change 2.change buffer name from bias to xbias * Now bias can be used independently from fadd * change kbias to kxbias --------- Co-authored-by: feli <felix.li@amd.com> [ROCm/composable_kernel commit: `d5c8a334ca`]	2025-01-08 17:51:06 +08:00
Po Yen Chen	ff1d9f88fa	[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789 ) * Update license year * Add initial code to override decode problem * Fix splitkv traits/args overriding error * Reshape and transpose lse for decode * Remove debug code * Prettify example code * Use better function name * Add kMergeNumHeadGroupsSeqLenQ flag Kernel user can use this switch to turn on/off optimization for some problem sizes * Add missing flag declarations * Default turn off kMergeNumHeadGroupsSeqLenQ in codegen * Group similar statements together * Remove assumption of seqlen_q=1 * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel * Run kMergeNumHeadGroupsSeqLenQ=true kernels when need * Fix group mode block skip logics * Undo changes of normal fwd kernel * Update in GridSize() and using GridSize() for splitkv kernel (#1799) --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com> [ROCm/composable_kernel commit: `24b12d04af`]	2025-01-07 18:49:24 +08:00
Bartłomiej Kocot	a064792e96	Fix universal gemm profiler for pk_i4_t (#1790 ) * Fix universal gemm profiler for pk_i4_t * fix [ROCm/composable_kernel commit: `888317e698`]	2025-01-04 14:01:33 +01:00
Illia Silin	68c7f53cb1	terminology clean-up (#1792 ) [ROCm/composable_kernel commit: `8ea375bb58`]	2025-01-03 16:38:22 -08:00
carlushuang	60e814a3ba	[CK_TILE]naive attn support FP8 KVCache quant (#1747 ) * quant * fix bug * simple smoothquant after softmax * update kv-quant * update stride * fix fp8-pertoken-kvcache * update int8/fp8 quant support --------- Co-authored-by: so <a.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `6df5fe2ad8`]	2025-01-03 18:43:07 +08:00
Mingtao Gu	d4a8c6c2ed	Implement the fp16xint4 scale weight only kernel for Ali (#1786 ) * enable int4 scale (weight only) kernel * format some files * Add unit test for int4 weight only * fixed and formatted code * fixed * formated * formated * fixed * fixed a bug in the ckProfiler, and formatted the code --------- Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `4f62f6e9b7`]	2025-01-03 18:35:21 +08:00
feli	5ce28a1d13	Ck tile/layernorm: implement naive reduce, opt performance (#1784 ) * add no welford * enable output raw * raw of int8 * fix build * fix smoke test err * [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate * [cktile]layernorm, fix err commit files and remove uselss * fix quant 8192 err & change norm_reduce class and file name --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: carlushuang <carlus.huang@amd.com> [ROCm/composable_kernel commit: `4bc610416a`]	2025-01-03 14:28:59 +08:00
Muhammed Emin Ozturk	222b1d6b48	BF16 GEMM Stream-K (#1541 ) * initial * Cmake file * successfull compilation but validation failed * Cmake * update * gpu validation * gemm universal * gemm universal sk update * sk bf16 universal instance * gemm_universal_streamk.hpp * only build for gfx94 * Cmakelist * profiler update, bf16 sk only works at gfx42 * clang * clang * clang all * no need flags * cmake script * delete comment * gemm universal sk fix * clang * profiler fix * clang * update * update * delete comment * code formatting * cmake * fix instance * clang * argument supported * argument supported and clang * update * fix * removing unnecessary comments * clang formatting * Update library/src/tensor_operation_instance/gpu/CMakeLists.txt Co-authored-by: afagaj <john.afaganis@gmail.com> * CopyRight Comment 2025 * clang reformatting * copy right 2025 --------- Co-authored-by: Emin Ozturk <ozturk.27@osu.edu> Co-authored-by: root <root@ctr-ubbsmc16.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund> Co-authored-by: root <root@splinter-126-wr-d3.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund> Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu> Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund> Co-authored-by: afagaj <john.afaganis@gmail.com> [ROCm/composable_kernel commit: `9e95d54cd2`]	2025-01-02 10:30:04 -08:00
Adam Osewski	ac74520ff6	Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 ) * add a prototype of int4 * clean * debug * clean * clean * move packed into dynamic_buffer * fixed coord reset * add fast pki4 to half conversion * fix * fixed reference and host_tensor * fixed tensor init * format * debug i4_to_f16_convert * format * fixed splitk * weight permute * add b tile permute * clean * weight permute with splitki * format * improve weight layout * add and_or_b32 * fixed splitk crush * add permute switch as a template * recover v3r1 * clean * failure with intrawave v2 * fixed * fixed * add ckProfiler * add bfp16 support * add bf16 example * fixed int4 to bhalf_t conversion * format * fixed int4 to bf16 conversion * clean * add instances for mem * clean * fixed host tensor size * fixed * debug * fixed * add pk_i4_t as a struct * fix * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * revert * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed comments * revert * clean * revert * revert * fixed * Update CMakeLists.txt * Update 
script/cmake-ck-dev.sh Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update CMakeLists.txt Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed * fixed * fixed * revert * revert * add comments * format * fixed assert * fixed * Fix I4 define in ckProfiler * Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: mtgu0705 <mtgu@amd.com> [ROCm/composable_kernel commit: `1d8e4ec2ce`]	2025-01-02 11:48:06 +08:00
Qianfeng	8c1883a424	Remove using partitioner for all fmha kernels (#1778 ) * Remove using tile partitioner for fmha_fwd_kernel * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels * Remove using tile partitioner for fmha_fwd_appendkv kernel * Unify the format of GetTileIndex [ROCm/composable_kernel commit: `4e076909b6`]	2024-12-29 14:29:56 +08:00
Bartłomiej Kocot	7fbc8a9ac1	[CK TILE] GEMM and Batched GEMM SplitK support (#1724 ) * [CK TILE] Add split K support in GEMM * Updates * Fixes * rebase * fix * Fix * fixes * support for batched gemm [ROCm/composable_kernel commit: `af66494880`]	2024-12-28 14:40:17 +01:00
carlushuang	4c4be7b14f	[CK_TILE] optimize moe-sorting kernel (#1771 ) * opt moe sorting * remove commented code [ROCm/composable_kernel commit: `3d15f364b3`]	2024-12-23 10:59:02 +08:00
Illia Silin	c369965615	fix typo for CK_USE_OCP_FP8 (#1769 ) [ROCm/composable_kernel commit: `07339c7383`]	2024-12-20 07:52:24 -08:00
carlushuang	0d16f9b5c7	hot-fix (#1768 ) [ROCm/composable_kernel commit: `1c45ca35dd`]	2024-12-20 16:40:45 +08:00

651 Commits