composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-04 21:51:28 +00:00

Author	SHA1	Message	Date
aska-0096	b4640a9de6	merge fa_decode pipeline into fmha_fwd api	2025-08-08 05:46:18 +00:00
aska-0096	0d12fc944f	Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA	2025-08-04 10:27:42 +00:00
aska-0096	746f4ccb99	Load Q through lds, implement xor;	2025-08-04 06:49:01 +00:00
aska-0096	2d4e73d2b4	small refactor	2025-08-01 10:44:54 +00:00
aska-0096	a28b6e67fe	upgrade prefill pipeline; simple iglp; consistent data produce and consume order	2025-07-31 10:25:37 +00:00
aska-0096	75cba48682	enable larger tile size; upgrade xor pattern	2025-07-31 05:13:27 +00:00
aska-0096	69890afc98	remove all lds bankconflict with xor layouts	2025-07-30 12:25:33 +00:00
aska-0096	8dacc35c4c	enable prefill overload operator().	2025-07-30 03:51:06 +00:00
aska-0096	13bcc913de	fix the lds alignment caused performance regression	2025-07-25 07:10:01 +00:00
aska-0096	14e0ab70c6	tempsave. asynccopy+trload sanity checked	2025-07-22 08:04:05 +00:00
aska-0096	afd96d8180	compile pass	2025-07-18 10:04:34 +00:00
aska-0096	5616551115	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-07-18 05:17:27 +00:00
Linjun-AMD	095393276a	h_dim256 fmha use async_qr pipeline (#2510 )	2025-07-18 09:59:38 +08:00
aska-0096	7e330553dc	Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline	2025-07-17 07:24:32 +00:00
slippedJim	05b65d0c7c	update (#2519 )	2025-07-17 15:24:19 +08:00
Andres Lugo	aadeffde18	Update FMHA recipe for Pytorch SDPA integration (#2480 ) * Add receipts in splitk and appendk * remove grouped * Remove logits --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-10 09:00:23 -07:00
aska-0096	18669925cc	temp save, change all instance to 1wave	2025-07-10 04:29:33 +00:00
Po Yen Chen	ad9863fe05	[CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402 ) * Wrap tile size mapping as class method * Warp pipeline generating as class method * Add constraint as kernel dispatching criteria * Support mutltiple tile size for a (hdim, hdim_v) combination * Use smaller tile size if CU utilization is low * Use integar as the key of the tile size map * Fix type error * Simply override parent class method return value * Add attribute to eliminate warnging * Allow using environment variables to turn on/off custom factory * Unify param naming style * Add missing HIP runtime include directive * Fix os.environ.get() usage	2025-07-09 22:01:33 +08:00
aska-0096	18686cfe5b	tempsave, fmha_decode	2025-07-08 08:37:20 +00:00
Haocong WANG	5557eadce6	[CK TILE] Fix FA build filter (#2369 ) * Fix for fwd/bwd kernel build filter * fix bwd code * cmake depends & bwd filter order fix * revert unexpected reformat * Avoid change fmha bwd filter order for downstream compatibility * Revert unexpected changes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-07-08 10:42:07 +08:00
ltqin	9f4c5d7372	ck tile pagedkv prefill (#2405 ) * add prefetching physical block id for pagedkv * start add pagedkv prefill * rename pipeline * add kernel for pagedkv * add an init version pagedkv prefill * fix redefine issue * add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args * generate dispatch code * add body generating code * comipling pass * remove dropout from pagedkv * set lse to false in generating code * start changing qr kernel to pagedkv * init version of kernerl with pagedkv * change names of file that are generated * chang host validation for pagedkv prefill * using iglp to change blockgemm * add kernel files to op head file * show parameters * rewrite print parameter fun * add fwd * remove default parameter of GridSize * format * fix nhead issue and add seqlen_k_ptr to batch mode * format code * remove no-longer used code * format * fix some comments --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-07 16:16:54 +08:00
Yi DING	b8212864cf	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-06-24 01:33:31 +08:00
aska-0096	47565f21a5	temp save, waiting for debug	2025-06-21 15:02:57 +00:00
slippedJim	57f497452a	remove restriction of group mode hd192 no lse (#2252 ) Co-authored-by: Jim <jimguo12@amd.com>	2025-05-30 10:14:21 +08:00
Po Yen Chen	28cd0dffc9	[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251 ) * Add constraint on traits/tile/pipeline * Use kM0=128 if max_seqlen_q == 8192 * Re-format codegen script * Remove redundant attr name postix * Fix import error: default field in dataclass * Use kK0=64 & kK1=64 to hide latency * Use CU utilization to decide tile size	2025-05-29 18:36:33 +09:00
Zzz9990	ece38b9d7a	[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221 ) * fix splitkv compiler issue since lse is used to select kernel instances * bypass seqlen == 1 * add chunked prefill into mha varlen This reverts commit `aa9847e42d`. * skip compile when receipt 2-4 and add comments * fix --------- Co-authored-by: fsx950223 <fsx950223@outlook.com>	2025-05-26 19:17:18 +08:00
Po Yen Chen	8cb0474b3d	Use only qr_async pipeline for batch_prefill (#2195 )	2025-05-15 11:47:29 -07:00
Po Yen Chen	2920604786	[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163 ) * hack for cap logits * fix bug * Re-format files * Allow specifying logits_soft_cap through APIs * Support turn on/off logits_soft_cap in async pipeline * Do not generate non-verified kernels * Align receipt used in Aiter * Sync logits soft-capping across pipelines * Re-enable some hdim pipelines * fix perf * Add attention variant for logits_soft_cap * Add newline at end-of-file * Fix performance * Add comment to explain logits_soft_cap pre-processing * Unify code * Unify floating-point literal style * Use class data member to slience the compilation error * [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133) * Send 'mask' along with variant params to the LogitsMask() * Send block indices to the variant * Add indices parameters in variant interface * Fix fmha bwd codegen error * Allow switch logits_soft_cap impl * Eliminate register spills * Fix compilation errors * Fix wrong LSE * Fix LSE for splitkv kernel * Sync splitkv pipeline changes * Add batch_prefill kernel/pipeline * Fix codegen error * Undo changes in CMakeLists.txt * Merge pipeline filtering check * Use different code path if kHasLogitsSoftCap=false * Remove [[maybe_unused]] attribute * Use pre-existing compile-time flag to instantiate templates * Sync pipeline changes * Update CHANGELOG.md --------- Co-authored-by: Bernard <bernaliu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-05-13 12:19:25 +08:00
Po Yen Chen	3d4d70d2fc	Avoid using store_tile_raw() for fp32 tensors (#2072 )	2025-04-26 23:07:41 -07:00
joyeamd	41541aff7a	SWDEV-52596 for hdim=256, when use splitkv pipeline, two new pipelines need to be added (#2126 )	2025-04-25 16:31:09 +08:00
rocking	02ce6d39ea	Only generate specific hdim (#2120 )	2025-04-24 18:52:58 +08:00
joyeamd	94d47b1680	fmha hdim256 vectorize improve (#2086 ) For hdim 256, will not have vectorized buffer load when seqlen % 256 != 0 and hdim % 256 = 0; this commit tries to solve this condition.	2025-04-16 09:21:04 +08:00
slippedJim	5f885d2b7a	add fmha fwd splitkv receipt for aiter c++ api (#2068 ) * add s_randval for c++ api * Fix bug of bias in splitkv --------- Co-authored-by: rocking <ChunYu.Lai@amd.com>	2025-04-10 23:21:13 +08:00
slippedJim	5a22b61de5	Add new receipt (#2055 )	2025-04-07 14:18:01 +08:00
carlushuang	6c08c5c46d	add mask support in hdim=192/128 (#1999 )	2025-03-21 18:28:43 +08:00
carlushuang	e3c9886cdf	[CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982 ) * return value with macro and revert the return value * [CK-TILE] no-macro launch api solution (#1992) * no-macro solution * address -Wcomma --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-03-20 11:00:29 -07:00
rocking	b819c217e4	Sync the kname with instance name (#1989 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-03-20 00:06:45 +08:00
carlushuang	3e81279d26	Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" … (#1971 ) * Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit `8cbcd3e0d0`. * fix codegen problem * Update config.hpp --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-13 11:41:39 +08:00
Illia Silin	8cbcd3e0d0	Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 ) This reverts commit `7a93b16ff6`.	2025-03-11 10:40:18 -07:00
carlushuang	7a93b16ff6	[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 ) * support hdim=192/128 pair * remove useless print * update	2025-03-11 21:07:40 +08:00
Max Podkorytov	9e132eb77c	refactor ck-tile kernel launch (#1925 )	2025-03-07 08:29:40 -08:00
rocking	faa2235dad	explicit show no feature in kernel name (#1920 )	2025-02-28 14:23:30 +08:00
slippedJim	a9bcd3c98d	make fmha bwd api template for v2 & v3 (#1918 ) * use template fmha_bwd function * update --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-02-27 19:26:19 +08:00
rocking	e9ee568681	Apply filter to every kernel in the codgen of FMHA (#1911 ) * add receipt for fwd * Add receipt for bwd * Use kernel name to avoid more receipt * apply filter to every kernel	2025-02-26 20:20:29 +08:00
rocking	e4358c01d9	only output the deterministic bwd kernel for aiter (#1903 ) * only output the deterministic kernel * Add comment	2025-02-20 04:27:01 +08:00
rocking	f0d49d14fc	Add receipt 10~12 for codegen of aiter integration (#1877 ) * Add receipt for aiter integration * update receipt * Add hdim 96 instances * Revert "Add hdim 96 instances" This reverts commit `f339449f54`.	2025-02-19 09:01:08 +08:00
Andres Lugo	8086bbe3a7	Add receipt 4 option to codegen (#1875 ) * Add receipt 4 option to codegen * Remove repeated code * Review comments	2025-02-11 10:11:46 -08:00
Po Yen Chen	24b12d04af	[CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789 ) * Update license year * Add initial code to override decode problem * Fix splitkv traits/args overriding error * Reshape and transpose lse for decode * Remove debug code * Prettify example code * Use better function name * Add kMergeNumHeadGroupsSeqLenQ flag Kernel user can use this switch to turn on/off optimization for some problem sizes * Add missing flag declarations * Default turn off kMergeNumHeadGroupsSeqLenQ in codegen * Group similar statements together * Remove assumption of seqlen_q=1 * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel * Run kMergeNumHeadGroupsSeqLenQ=true kernels when need * Fix group mode block skip logics * Undo changes of normal fwd kernel * Update in GridSize() and using GridSize() for splitkv kernel (#1799) --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>	2025-01-07 18:49:24 +08:00
Qianfeng	4e076909b6	Remove using partitioner for all fmha kernels (#1778 ) * Remove using tile partitioner for fmha_fwd_kernel * Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels * Remove using tile partitioner for fmha_fwd_appendkv kernel * Unify the format of GetTileIndex	2024-12-29 14:29:56 +08:00
Po Yen Chen	4c2eff023a	Correct the dtype checking logics (#1775 )	2024-12-25 23:57:28 +08:00

63 Commits