aska-0096
96d24497f5
fix conflict. disable all v-col instance for fmha fwd
2025-08-12 04:02:41 +00:00
aska-0096
1716171be4
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-12 03:52:34 +00:00
Yi DING
4fde1646e5
[CK_TILE] FMHA BWD Optimization For GFX950 (#2628)
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
2025-08-12 11:11:55 +08:00
aska-0096
498d234ab8
change the warp setting for hdim32 fmha fwd
2025-08-11 15:37:37 +00:00
aska-0096
8c101ccb88
fix bug on non-gfx950
2025-08-08 18:35:53 +00:00
aska-0096
78edd7303b
bug fix, clang format;
2025-08-08 09:04:02 +00:00
aska-0096
6bb57c2c57
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-08 07:50:12 +00:00
aska-0096
1ecee378d5
remove unnecessary files; rename some files
2025-08-08 06:19:31 +00:00
aska-0096
b4640a9de6
merge fa_decode pipeline into fmha_fwd api
2025-08-08 05:46:18 +00:00
Yi DING
b0a97498b0
[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550)
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
2025-08-07 21:24:43 +08:00
Yi DING
15e8b6ccf7
[CK_TILE] Fix FMHA qr_async causing errors in FA (#2627)
2025-08-06 20:04:23 +08:00
aska-0096
0d12fc944f
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
2025-08-04 10:27:42 +00:00
aska-0096
746f4ccb99
Load Q through lds, implement xor;
2025-08-04 06:49:01 +00:00
aska-0096
2d4e73d2b4
small refactor
2025-08-01 10:44:54 +00:00
aska-0096
a28b6e67fe
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
2025-07-31 10:25:37 +00:00
aska-0096
75cba48682
enable larger tile size; upgrade xor pattern
2025-07-31 05:13:27 +00:00
aska-0096
69890afc98
remove all lds bank conflicts with xor layouts
2025-07-30 12:25:33 +00:00
aska-0096
8dacc35c4c
enable prefill overload operator().
2025-07-30 03:51:06 +00:00
Yi DING
1926cd0cb8
[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
...
* Fix shuffle_tile
* Add fmha bwd d160
* CHANGELOG
* Use static_cast
* Update
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-29 09:31:14 +08:00
Andres Lugo
7fe50dc3da
Remove filter for only batch on receipt 4 (#2574)
...
Re-enable group mode instances for the PyTorch receipt and resolve linker errors for torch SDPA
2025-07-28 14:53:24 -07:00
rocking
b36e0b029f
[CK_TILE][FMHA] Uncomment all the headdim, use optdim to control (#2539)
...
* uncomment all the headdim, use optdim to control
* change default back to -1
* uncomment splitkv instance
* Fix typo in receipt 4 for appendkv
* support optdim for bwd, splitkv and appendkv
* Fix 192 key error
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
2025-07-28 17:16:32 +08:00
aska-0096
13bcc913de
fix the lds alignment caused performance regression
2025-07-25 07:10:01 +00:00
Illia Silin
1b6f024836
refactor fmha_bwd.py (#2546)
2025-07-23 09:09:56 -07:00
aska-0096
14e0ab70c6
tempsave. asynccopy+trload sanity checked
2025-07-22 08:04:05 +00:00
aska-0096
afd96d8180
compile pass
2025-07-18 10:04:34 +00:00
aska-0096
5616551115
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-07-18 05:17:27 +00:00
Linjun-AMD
095393276a
h_dim256 fmha use async_qr pipeline (#2510)
2025-07-18 09:59:38 +08:00
aska-0096
7e330553dc
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
2025-07-17 07:24:32 +00:00
slippedJim
05b65d0c7c
update (#2519)
2025-07-17 15:24:19 +08:00
Andres Lugo
aadeffde18
Update FMHA recipe for PyTorch SDPA integration (#2480)
...
* Add receipts in splitk and appendk
* remove grouped
* Remove logits
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-07-10 09:00:23 -07:00
aska-0096
18669925cc
temp save, change all instance to 1wave
2025-07-10 04:29:33 +00:00
Po Yen Chen
ad9863fe05
[CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402)
...
* Wrap tile size mapping as class method
* Warp pipeline generating as class method
* Add constraint as kernel dispatching criteria
* Support multiple tile sizes for a (hdim, hdim_v) combination
* Use smaller tile size if CU utilization is low
* Use integer as the key of the tile size map
* Fix type error
* Simply override parent class method return value
* Add attribute to eliminate warning
* Allow using environment variables to turn on/off custom factory
* Unify param naming style
* Add missing HIP runtime include directive
* Fix os.environ.get() usage
2025-07-09 22:01:33 +08:00
aska-0096
18686cfe5b
tempsave, fmha_decode
2025-07-08 08:37:20 +00:00
Haocong WANG
5557eadce6
[CK TILE] Fix FA build filter (#2369)
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
ltqin
9f4c5d7372
ck tile pagedkv prefill (#2405)
...
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* compiling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernel with pagedkv
* change names of file that are generated
* change host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-07 16:16:54 +08:00
Yi DING
b8212864cf
[CK_TILE] FMHA Support hdim_v as a Multiple of 32 (#2114)
...
* 160+192
* Add splitkv d160
* cleanup
* fix
* Add change log
* Fix CHANGELOG
* Use static_cast
* Update ignored instance
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-06-24 01:33:31 +08:00
aska-0096
47565f21a5
temp save, waiting for debug
2025-06-21 15:02:57 +00:00
slippedJim
57f497452a
remove restriction of group mode hd192 no lse (#2252)
...
Co-authored-by: Jim <jimguo12@amd.com>
2025-05-30 10:14:21 +08:00
Po Yen Chen
28cd0dffc9
[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251)
...
* Add constraint on traits/tile/pipeline
* Use kM0=128 if max_seqlen_q == 8192
* Re-format codegen script
* Remove redundant attr name postfix
* Fix import error: default field in dataclass
* Use kK0=64 & kK1=64 to hide latency
* Use CU utilization to decide tile size
2025-05-29 18:36:33 +09:00
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
...
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d.
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Po Yen Chen
8cb0474b3d
Use only qr_async pipeline for batch_prefill (#2195)
2025-05-15 11:47:29 -07:00
Po Yen Chen
2920604786
[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
...
* hack for cap logits
* fix bug
* Re-format files
* Allow specifying logits_soft_cap through APIs
* Support turning logits_soft_cap on/off in async pipeline
* Do not generate non-verified kernels
* Align receipt used in Aiter
* Sync logits soft-capping across pipelines
* Re-enable some hdim pipelines
* fix perf
* Add attention variant for logits_soft_cap
* Add newline at end-of-file
* Fix performance
* Add comment to explain logits_soft_cap pre-processing
* Unify code
* Unify floating-point literal style
* Use class data member to silence the compilation error
* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133)
* Send 'mask' along with variant params to the LogitsMask()
* Send block indices to the variant
* Add indices parameters in variant interface
* Fix fmha bwd codegen error
* Allow switch logits_soft_cap impl
* Eliminate register spills
* Fix compilation errors
* Fix wrong LSE
* Fix LSE for splitkv kernel
* Sync splitkv pipeline changes
* Add batch_prefill kernel/pipeline
* Fix codegen error
* Undo changes in CMakeLists.txt
* Merge pipeline filtering check
* Use different code path if kHasLogitsSoftCap=false
* Remove [[maybe_unused]] attribute
* Use pre-existing compile-time flag to instantiate templates
* Sync pipeline changes
* Update CHANGELOG.md
---------
Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
Po Yen Chen
3d4d70d2fc
Avoid using store_tile_raw() for fp32 tensors (#2072)
2025-04-26 23:07:41 -07:00
joyeamd
41541aff7a
SWDEV-52596 for hdim=256, when using splitkv pipeline, two new pipelines need to be added (#2126)
2025-04-25 16:31:09 +08:00
rocking
02ce6d39ea
Only generate specific hdim (#2120)
2025-04-24 18:52:58 +08:00
joyeamd
94d47b1680
fmha hdim256 vectorize improve (#2086)
...
For hdim 256, buffer loads are not vectorized when seqlen % 256 != 0 and hdim % 256 == 0; this commit tries to address that case.
2025-04-16 09:21:04 +08:00
slippedJim
5f885d2b7a
add fmha fwd splitkv receipt for aiter c++ api (#2068)
...
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
2025-04-10 23:21:13 +08:00
slippedJim
5a22b61de5
Add new receipt (#2055)
2025-04-07 14:18:01 +08:00
carlushuang
6c08c5c46d
add mask support in hdim=192/128 (#1999)
2025-03-21 18:28:43 +08:00
carlushuang
e3c9886cdf
[CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
...
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992)
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-03-20 11:00:29 -07:00