composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-18 01:28:27 +00:00

Author	SHA1	Message	Date
Po Yen Chen	b62e551ccb	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403 ) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs() [ROCm/composable_kernel commit: `50fad03524`]	2025-06-25 15:19:21 +08:00
Yi DING	cba904aeff	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `b8212864cf`]	2025-06-24 01:33:31 +08:00
Max Podkorytov	eb96164495	Update for xformers (#2372 ) * update api * update kernel api * clang-format [ROCm/composable_kernel commit: `0366fb2abc`]	2025-06-22 00:28:30 -07:00
Satyanvesh Dittakavi	a4517b0a9d	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
MHYangAMD	7cb7aa8e75	Fix fmha fwd precision issue on MI3XX series (#2285 ) * Fix fmha fwd precision issue on MI3XX series For fmha fwd fp16 cases, we found that using impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero. For examaple, fixing K,V to be all 1, and Q is random, which outputs are expected to be all 1. But we found that it would have some incorrect outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 = 0.0005 < 0.001) Thus, ck do not report this error. * Add option to switch rtn/rtz for fmha fwd [ROCm/composable_kernel commit: `9fcf21a4ec`]	2025-06-10 15:03:23 +08:00
Po Yen Chen	4fc501ba2b	[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask (#2209 ) * Assign block indices reversely if kHasMask=true * Assign block indices reversely for splitkv kernel [ROCm/composable_kernel commit: `c42b957d65`]	2025-05-27 10:58:58 +08:00
Zzz9990	3991de1cdc	[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221 ) * fix splitkv compiler issue since lse is used to select kernel instances * bypass seqlen == 1 * add chunked prefill into mha varlen This reverts commit 5ba7148f7b34b1b438e9806748211c153ee8c433. * skip compile when receipt 2-4 and add comments * fix --------- Co-authored-by: fsx950223 <fsx950223@outlook.com> [ROCm/composable_kernel commit: `ece38b9d7a`]	2025-05-26 19:17:18 +08:00
Po Yen Chen	2655746971	[CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations (#2198 ) * Write soft-sign in inline asm * Change tile idx computation * Add macro to turn off soft-sign asm opt * Use simple for loop to avoid register spill * Only do block id transform for masking cases [ROCm/composable_kernel commit: `791802b381`]	2025-05-16 15:14:46 +08:00
Po Yen Chen	c3f74a83b3	[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163 ) * hack for cap logits * fix bug * Re-format files * Allow specifying logits_soft_cap through APIs * Support turn on/off logits_soft_cap in async pipeline * Do not generate non-verified kernels * Align receipt used in Aiter * Sync logits soft-capping across pipelines * Re-enable some hdim pipelines * fix perf * Add attention variant for logits_soft_cap * Add newline at end-of-file * Fix performance * Add comment to explain logits_soft_cap pre-processing * Unify code * Unify floating-point literal style * Use class data member to slience the compilation error * [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133) * Send 'mask' along with variant params to the LogitsMask() * Send block indices to the variant * Add indices parameters in variant interface * Fix fmha bwd codegen error * Allow switch logits_soft_cap impl * Eliminate register spills * Fix compilation errors * Fix wrong LSE * Fix LSE for splitkv kernel * Sync splitkv pipeline changes * Add batch_prefill kernel/pipeline * Fix codegen error * Undo changes in CMakeLists.txt * Merge pipeline filtering check * Use different code path if kHasLogitsSoftCap=false * Remove [[maybe_unused]] attribute * Use pre-existing compile-time flag to instantiate templates * Sync pipeline changes * Update CHANGELOG.md --------- Co-authored-by: Bernard <bernaliu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com> [ROCm/composable_kernel commit: `2920604786`]	2025-05-13 12:19:25 +08:00
slippedJim	cca9cca699	add fmha fwd splitkv receipt for aiter c++ api (#2068 ) * add s_randval for c++ api * Fix bug of bias in splitkv --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `5f885d2b7a`]	2025-04-10 23:21:13 +08:00
rocking	2b657d9a2c	Reduce redundant space in bias tensor (#2024 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `8a20b62e91`]	2025-03-28 21:58:06 +08:00
rocking	8be61cfc9d	Sync the kname with instance name (#1989 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `b819c217e4`]	2025-03-20 00:06:45 +08:00
carlushuang	5293517d0a	Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" … (#1971 ) * Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit a222573537b139c8a4f6b870910bdd487935efa8. * fix codegen problem * Update config.hpp --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3e81279d26`]	2025-03-13 11:41:39 +08:00
Illia Silin	c62042b022	Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 ) This reverts commit 4399ad790293bb76e4cf89f333017cf06a6d3e82. [ROCm/composable_kernel commit: `8cbcd3e0d0`]	2025-03-11 10:40:18 -07:00
carlushuang	569e9892a6	[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 ) * support hdim=192/128 pair * remove useless print * update [ROCm/composable_kernel commit: `7a93b16ff6`]	2025-03-11 21:07:40 +08:00
Qianfeng	7771db8ecf	Ck tile/complete k prefetch (#1941 ) * Re-implement qr_ks_vs_async pipeline by using kLoadOnce * Remove last block_sync_lds() in the loop * Tiny adjustment in qr_ks_vs_async pipeline for better performance * Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline * Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline * Use un-rolled gemm for Gemm-0 * Use k0_loops small tile load/store to replace the big tile load/store for K * Remove the commented lines in qx_ks_vs_custom_policy.hpp * Tune the prefetching of V in qr_ks_vs_async pipeline * Move the codes for storing the first v_lds tile some later * Let BlockDropout reuse LDS with V * Switch to separate code blocks according to iteration index * Interleave code blocks for better performance * Move clear_tile(s_acc) for better interleaving * Move code interleaving * Use MakeQDramTileDistribution for q_dram_window * Roll-back to load Q directly from global memory instead of using LDS as intermediary stop * Let V reuse the LDS of K * Use array of tiles to represent Q in vgprs * Use QLoadOnce == false for qr_ks_vs_async pipeline * Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline * Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile() * Move the definition of v_tiles out from the loop * Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile() * Remove using KLoadOnce in qx_ks_vs_custom_policy * Remove un-used get_slice_tile() call * Move the code line of clear_tile(s_acc) * Tune the lines of codes to make them more tidy * Re-arrange the codes before the main-loop * Add comments * Unify the alignment to be 8 for Q/K/V Lds decriptors * Tuning to K pre-loading * Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false * Adjust the pipeline codes * Use NumPrefetchV to separate from NumVLdsBuffers * Tune the location of a scheduler barrier code line * Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths * Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel * Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline * Move the location for store_tile() of first v_tile * Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline * Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<> * Try to fix old bugs in qx_ks_vs_custom_policy * Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler * Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy * Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV) * Tiny Fix (#1888) * Ck tile/paged attention workaround (#1894) * Correction in GetRangeAlongX() * Work-around to solve the failures in test_paged_attention_ck in xformers * Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline * Remove one call of move_tile_window for q_dram_window * Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers() * Tiny fix in qr_ks_vs_whole_k_prefetch pipeline * Adjust the location of codes for storing the first V tile to LDS * Tiny fix and add comments * Change GetSmemKPackK size to improve performance * Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy * Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access * Adjustment in intermediate iteration codes for tiny performance improvement * Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation * Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue * Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer() * Remove useless AsyncopyV * Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used * Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy * Recover the qr_ks_vs_async pipeline * Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline * Revert "Try to fix old bugs in qx_ks_vs_custom_policy" This reverts commit `39b82ca194`. * Tiny fix with regard to whole_k_prefetch pipeline compiling * Update kPadSeqLenK setting in fmha_fwd_kernel * Use q_element_func and k_element_func * Use single q_tile rather than multiple sliced q_tiles * Codes refine according to the comments * Re-format one file * Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true [ROCm/composable_kernel commit: `4f54fa3058`]	2025-03-07 14:19:51 +08:00
Qianfeng	d12c6417a0	Ck tile/paged attention workaround (#1894 ) * Correction in GetRangeAlongX() * Work-around to solve the failures in test_paged_attention_ck in xformers [ROCm/composable_kernel commit: `a3757a5f9c`]	2025-02-17 14:29:25 +08:00
Qianfeng	cec8025ec4	Tiny Fix (#1888 ) [ROCm/composable_kernel commit: `4cfb24feb6`]	2025-02-14 12:44:32 +08:00
Qianfeng	3cc02417a9	Update for fmha_fwd qs_ks_vs pipeline (#1810 ) * Update for fmha_fwd qs_ks_vs pipeline * Remove _builtin_amdgcn_sched_barrier(0) * Move p_compute to p converting earlier for trying to increase vgprs re-using * Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation * Re-add __builtin_amdgcn_sched_barrier(0) --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `3d50f57f43`]	2025-01-13 12:43:05 +08:00
Max Podkorytov	4c98908e17	mark unused args [ROCm/composable_kernel commit: `ad697c78ac`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	d7a2a81051	run clang-format -style=file [ROCm/composable_kernel commit: `a2e6ad62e2`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	e1896982b5	run clang-format==12 [ROCm/composable_kernel commit: `aa59ecaa22`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	715635839b	update comment in the policy [ROCm/composable_kernel commit: `82fb3f84fb`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	00c32ecda2	update qsksvs comment [ROCm/composable_kernel commit: `4daa82b451`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	099e23be84	remove dead code [ROCm/composable_kernel commit: `66c5b715c9`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	63cc962000	clang-format and remove dead code [ROCm/composable_kernel commit: `edb78a4729`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	25fdfe3df8	roll back splitkv [ROCm/composable_kernel commit: `60113859fa`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	d3d53433aa	update qsksvs pipeline [ROCm/composable_kernel commit: `bfc997a7e6`]	2025-01-08 10:09:54 -08:00
Max Podkorytov	1d7c38642c	qsksvs pipeline changes to mirror qrksvs [ROCm/composable_kernel commit: `f7942b993c`]	2025-01-08 10:09:54 -08:00
Po Yen Chen	ff1d9f88fa	[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789 ) * Update license year * Add initial code to override decode problem * Fix splitkv traits/args overriding error * Reshape and transpose lse for decode * Remove debug code * Prettify example code * Use better function name * Add kMergeNumHeadGroupsSeqLenQ flag Kernel user can use this switch to turn on/off optimization for some problem sizes * Add missing flag declarations * Default turn off kMergeNumHeadGroupsSeqLenQ in codegen * Group similar statements together * Remove assumption of seqlen_q=1 * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel * Run kMergeNumHeadGroupsSeqLenQ=true kernels when need * Fix group mode block skip logics * Undo changes of normal fwd kernel * Update in GridSize() and using GridSize() for splitkv kernel (#1799) --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com> [ROCm/composable_kernel commit: `24b12d04af`]	2025-01-07 18:49:24 +08:00
Qianfeng	8c1883a424	Remove using partitioner for all fmha kernels (#1778 ) * Remove using tile partitioner for fmha_fwd_kernel * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels * Remove using tile partitioner for fmha_fwd_appendkv kernel * Unify the format of GetTileIndex [ROCm/composable_kernel commit: `4e076909b6`]	2024-12-29 14:29:56 +08:00
Po Yen Chen	f5c4569acd	[CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705 ) * Add check for zero values * Add static assertions * Remove invalid option '-e' in smoke_test.sh * Use correct path of smoke_test.sh * Avoid zero-sized shared memory array * Add warning comment * Replace expr by integer_divide_ceil() call * Use more readable constant names * Write down assumption as static assertion * Add more diagnostic error messages * Fix wrong BlockWarps when using default pipeline policy * Add more static assertions for A LDS desc * Allow using vector size < 8 for data type fp16/bf16 * Align vector size between DRAM dist & LDS desc * Remove no-longer used func decl * Fix wrong displayed piepline name * Undo policy template changes for tile_example_gemm_basic * Add missing space and make error message stands out * Unify print precision * Add missing include directive <iomanip> * Replace constant 64 by get_warp_size() call * Replace constant 128 by named variable: BankLength * Add kAMBlock/kBNBlock attributes * Allow usig different A/B warp dist for multiple blocks * Add helper function to get warp dist encodings * Add 4x64x4 fp16 warp gemm attribute impl * Complete the A/B warp dist encoding logic * Fix wrong thread mapping for C matrix * Use smaller vector size for small tile * Add static assert to block unsupported warp gemm impl * Extract common code out as helper method * Add 4x64x16 fp16 warp gemm type alias * Add comment to warning developers * Undo WarpGemmAtrributeMfma<> changes * Use more clear static assertion error message * Add trivial wrapper to get warp dstr encodings * Only transpose warp gemm result if it's square * Fix compilation error * Support multi-block warp gemm (on N direction) * Remove duplicated code * Fix output encoding of warp gemm * Fix wrong shape of WarpGemmAtrributeMfmaIterateK<> * Remove unused code * Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4 * Add type config for bf16_t * Add 4x64x16 bf16 warp gemm * Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution * Add 64x4x4 fp16/bf16 warp gemm impl * Add 64x4x16 fp16/bf16 warp gemm * Add static assertion for better error diagnostic * Get Q dram dstr directly form block gemm * Add missing header: fused_moe.hpp * Allow specifying different warp-gemm for gemm0 & gemm1 * Store P matrix into LDS before gemm1 * Fix inconsistant kernel name * Remove constraint on gemm0 & gemm1 block warps * Remove unsupported vector size from checking list * Allow using 4x64x16 warp gemm for gemm0 * Finish policy customization * Finish pipeline modification F# * Use block warps in codegen * Fix wrong rank of m_lds_window origin * Use better distributed tensor * Make P-store earlier * Remove duplicated experssions * Remove unnecessary tile window * Create new files for new splitkv pipeline * Separate old/new pipeline codegen logic * Sync changes form develop * Undo gemm kernel/pipeline changes * Undo gemm example changes * Remove blank lines * Fix typo * Use new warp gemm interface * Fix link error * Fix wrong pipeline tag * Fix more link error * Avoid unnecessary padding * Always use vector load for K * Padding on fastest dimension when necessary * Force padding Q on hdim_q * Set high dimension padding flag to false * Re-format headers * Use warps=<1, 4, 1> for both gemm0 & gemm1 * Fix complilation errors * Remove m/l shuffle logics * Ignore duplicate data when write lse_acc * Use gemm0 block warps as lds tile width * Remove hard-coded numbers * Fix wrong distribution width * Remove unnecessary code * Add s_barrier before writing to LDS * Store Q into LDS before gemm0 * Fix wrong Q tile size * Use simple Q lds descriptor for debuging * Use more realistic Q lds descriptor * Add comment & use better variable name * Make Q lds space not overlapped with others * Remove unnecessary block_tile_reduce_sync() call * Move Q load statements * Move block_sync_lds() right before use * Re-order instructions * Remove necessary lambda expression * Use 8 threads on kMaxSplits direction while doing reduction * Tiny correction for using 8 threads on kMaxSplits direction for combine kernel * Padding num_split direction of o_acc tile window to 4x * Update splitkv combine pipeline design * Add kN1 back to splitkv combine pipeline problem * Fix compilation errors * Add missing template parameter * Fix wrong splitkv combine kernel name * Fix wrong origin * Fix wrong LDS descriptor shape * Fix sync & reduction logics * Remove unnecessary static assertions * Extract tile size computation logics * Make sure we can reuse padding flags in combine kernels * Rename variables * Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<> * Remove unnecessary static assertion * Fix function name typo * Add constraint on kN1 template parameter * Hide K tile loading latency in earlier iteration * Fix wrong splitkv kernel name * Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction * Rename pipeline * Fix wrong pipeline name attribute * Add GetAlignmentQ() for NWarpSShuffle pipeline * Separate Q tile into dram tile & register tile concepts * Remove non-squre warp gemm transpose c type alias * Fallback tile size changes for fmha fwd splitkv * Remove redundant change * Refine naming for the S tile * Use better naming of the S tile dstr (read from lds) * Share Q lds with K lds * Tiny change * Fix with using static_for for passing CI checking --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com> [ROCm/composable_kernel commit: `37cdbf4f0e`]	2024-12-20 14:41:01 +08:00
Po Yen Chen	44b0100283	Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725 ) [ROCm/composable_kernel commit: `58e7f37fc8`]	2024-12-06 12:59:58 +08:00
Po Yen Chen	6c27be75ae	[CK_TILE] Use 'false' for highest dimension padding flags (#1716 ) * Use 'false' for highest dimension padding flags * Update padding flag of bias [ROCm/composable_kernel commit: `126ce85aa1`]	2024-12-04 15:59:58 +08:00
Po Yen Chen	b0cfd7f12e	[CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688 ) * Allow getting batch size from splitkv tile partitioner * Fix wrong paged-kvcache impl for group mode * Fix wrong example code for page-kvcache * Undo changes in fmha_fwd.cpp * Always use 2D block table * Add is_gappy kernel argument for paged-kvcache The is_gappy argument is used for differentiating seqstart_k_ptr usage in flash-attention & xformers * Remove out-of-date comments * Remove no-longer used method * Fix wrong # page-block calculation * Fix wrong comment --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com> [ROCm/composable_kernel commit: `cf2d635ea2`]	2024-11-26 20:37:54 +08:00
carlushuang	8acce2dee1	[CK_TILE] fused-moe first version (#1634 ) * moe pipeline * update code * compile OK * update * update cpu reference * update pipeline_gemm0 * compiler ok * update pipeline * rename to ex pipeline * block-asm * update * update * update first gemm ok * compute correct * update file structure * update README * update * update * update code * update API * return unsupport case * add comment * update readme * update * uncomment * update * fix build err --------- Co-authored-by: valarLip <340077269@qq.com> [ROCm/composable_kernel commit: `440e28b08f`]	2024-11-26 11:14:56 +08:00
Po Yen Chen	f81addbe42	[CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689 ) * Fix mis-matched tuple<> elem types * Rename MakeKargs() as MakeKargsImpl() --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com> [ROCm/composable_kernel commit: `645fe812f6`]	2024-11-25 15:30:35 +08:00
Qianfeng	e58d441138	Change in fwd-splitkv kernel to support num_splits=1 case (#1690 ) * Change in fwd-splitkv kernel to support num_splits=1 case * Update in codegen fwd-splitkv to make num_splits > 1 cases pass * Specify instance traits in dispatch * Fix link error for fp8 kernels --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `ce2bdf42a9`]	2024-11-25 12:31:38 +08:00
schung-amd	47b06431d5	[CK_TILE] MakeKargs overloads for backward compatibility (#1681 ) * Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void, void> to preserve functionality of code currently passing in list initializers or tuples. * Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void, void> to preserve functionality of code currently passing in list initializers or tuples. * Re-format files using ck_tile remod.py --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `ff92222f93`]	2024-11-23 06:51:35 +08:00
Po Yen Chen	ed8630416e	[CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678 ) * Generate group mode paged-attn kernel * Enable paged-kvcache + group mode support * Add missing header: fused_moe.hpp * Add comment to explain kernel arg usage * Make error message more clear * Add comment for confusing data member names * Add more comment for confusing variable names * Fix typo in option description [ROCm/composable_kernel commit: `fb1ccfa9df`]	2024-11-21 14:53:10 +08:00
Thomas Ning	6bafdd985c	[CK Tile] Improve the Layout, Padding, and Alignment features of CK Tile GEMM (#1651 ) * Finished the feature * Modified the test file * Test case update * addresss comment * Addressed the review comment * Fixed the CI error [ROCm/composable_kernel commit: `2b6458ddf2`]	2024-11-11 18:08:25 -08:00
Po Yen Chen	f541d4382f	Return nullptr when block index is invalid (#1649 ) [ROCm/composable_kernel commit: `13332998a4`]	2024-11-11 09:28:32 +08:00
Bartłomiej Kocot	724312aea3	Remove virtual destructors from unary ops (#1610 ) * Remove virtual destructors from unary ops * Fixes * Fixes * clang format fixes [ROCm/composable_kernel commit: `9a8a52130d`]	2024-10-30 17:42:50 +01:00
Qianfeng	1fc6f29f35	[CK_TILE] Add fmha fwd headdim96 support (#1608 ) * Add ceil_to_qualified_tile_length() * Rename kK0BlockLength to kQKHeaddim * Add kSubQKHeaddim concept to support headdim96 * Fix in math.hpp to avoid using __half interfaces * Add LdsBufferSequence instance for headdim96 * Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing * Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time * Reformat one file * Fix text alignment in fmha_fwd_splitkv.py --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `8632221814`]	2024-10-30 14:03:16 +08:00
carlushuang	a6b29fcb01	topk_softmax (#1592 ) * topk_softmax * remove some file * fix atomix linear_offset * address various comment, and change sfc get_index api to static(tuple) [ROCm/composable_kernel commit: `b098b71b05`]	2024-10-26 23:52:49 +08:00
Po Yen Chen	c37f8d33f4	[CK_TILE] More fmha splitkv optimizations (#1588 ) * Use pre-defined constants for readability * Use vector write for o_acc tensor * Remove no-longer used policy method * Deprecate no-longer used policy/pipeline * Specify gemm0/gemm1 block warps separately in codegen * Fix wrong ps_idx creation logic * Add single-warp block gemm * Supoprt single-warp gemm0 * Make MakeCBlockTile() as static method * Use MakeCBlockTile() to get underlying tile distribution * Use kNumGemm1Warps to compute # threads for gemm1 * Put normal case in the if clause * Refine fmha splitkv block mapping * Refine & fix the lse_acc/o_acc layout * Fix wrong LDS size for K tile * Use kK0=64 for hdim=128,256 fmha splitkv kernels * Use kK1=64 for hdim=32,64,128 fmha splitkv kernels * Undo kK0/kK1 changes * Use more reasonable GetAlignmentV() computation * Using store_tile() in fmha splitkv kernel epilogue [ROCm/composable_kernel commit: `54f0e6f4bb`]	2024-10-26 18:35:45 +08:00
Po Yen Chen	8cc793d26a	[CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577 ) * Use smaller width for lse_accum dist tensor * Update pipeline comment * Fix wrong distribution for lse_accum * Remove duplicate dim in lse_accum dist encoding * Decide fmha splitkv combine kernel kBlockSize by kM0 * Remove assumption of MPerThread=1 * Add log<4> & log<8> specialization * Enlarge occupancy array * Fix vector size for small tile * Add support for kMaxSplits=8 * Re-format gemm.hpp * Use 16x16x16 warp gemm for fwd_splitkv * Centralize policy code changes * Leave fp8/bf8 tile settings unchanged [ROCm/composable_kernel commit: `95e722a3b3`]	2024-10-21 10:52:11 +08:00
Qianfeng	2587dc434b	[CK_TILE] Improve headdim96 performance for fmha-bwd (#1573 ) * Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96 * Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor * Fix in bwd_piple_default_policy * Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines * Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim * Update to hd96 tile settings * Add smoke test scripts for fmha-bwd hd96 * Revert "Add smoke test scripts for fmha-bwd hd96" This reverts commit `7ca7e1a93d`. * Remove hd96 tile settings in fmha_bwd codegen to save compiling * Fix lost code line in bwd_pipeline_default_policy * Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits * Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor * tiny adjustments --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: danyao12 <Dan.Yao@amd.com> [ROCm/composable_kernel commit: `14c3cfb1c6`]	2024-10-16 18:14:32 +08:00
Thomas Ning	53df021563	decouple the calling from gemm_pipeline (#1571 ) * decouple the calling from gemm_pipeline * clang format [ROCm/composable_kernel commit: `35c1777d59`]	2024-10-14 13:59:26 +08:00
Thomas Ning	99640365d8	Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535 ) * ake the cshuffle compilable * modify Mhe reference on gpu and cpu. Correaccess of cshuffle * fix the cpu reference code * Complete the in tile shuffle logic * restructure the kernel template input * change the naming pattern of ck_tile gemm pipeline * Re-format files using remod.py * Solve the fmha conflict with gemm * Comment Addressed from Carlus --------- Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `6f27bc9872`]	2024-10-10 18:02:22 +08:00

1 2

72 Commits