aska-0096
3b9fb6af38
Remove unnecessary changes
2025-08-08 08:08:03 +00:00
aska-0096
6bb57c2c57
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-08 07:50:12 +00:00
aska-0096
1ecee378d5
remove unnecessary files; rename some files
2025-08-08 06:19:31 +00:00
aska-0096
b4640a9de6
merge fa_decode pipeline into fmha_fwd api
2025-08-08 05:46:18 +00:00
Yi DING
b0a97498b0
[CK_TILE] FMHA BWD Remove Unnecessary Padding ( #2550 )
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
2025-08-07 21:24:43 +08:00
aska-0096
414cad667b
Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
2025-08-05 07:23:51 +00:00
aska-0096
0d12fc944f
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
2025-08-04 10:27:42 +00:00
aska-0096
4f31847de1
add vmcnt guard before load ktile
2025-08-04 10:02:17 +00:00
aska-0096
746f4ccb99
Load Q through lds, implement xor;
2025-08-04 06:49:01 +00:00
aska-0096
2d4e73d2b4
small refactor
2025-08-01 10:44:54 +00:00
aska-0096
a28b6e67fe
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
2025-07-31 10:25:37 +00:00
aska-0096
75cba48682
enable larger tile size; upgrade xor pattern
2025-07-31 05:13:27 +00:00
aska-0096
69890afc98
remove all lds bankconflict with xor layouts
2025-07-30 12:25:33 +00:00
aska-0096
8dacc35c4c
enable prefill overload operator().
2025-07-30 03:51:06 +00:00
Illia Silin
49723e94bb
fix the clang-format ( #2578 )
2025-07-28 20:49:55 -07:00
Yi DING
1926cd0cb8
[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 ( #2130 )
...
* Fix shuffle_tile
* Add fmha bwd d160
* CHANGELOG
* Use static_cast
* Update
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
2025-07-29 09:31:14 +08:00
Illia Silin
504b101da3
upgrade from clang-format-12 to clang-format-18 ( #2568 )
...
* upgrade to clang-format-18
* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
shay-li77
8ae528a1b4
fix mha bwd dbias random mismatch ( #2570 )
...
* fix mha bwd dbias random mismatch
* formatting code
2025-07-28 14:39:31 +08:00
liang
d2459878cf
reorder grid dim schedule ( #2533 )
...
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-07-26 02:46:55 +08:00
aska-0096
13bcc913de
fix the lds alignment caused performance regression
2025-07-25 07:10:01 +00:00
aska-0096
af28123cec
remove unnecessary features
2025-07-23 09:05:57 +00:00
aska-0096
14e0ab70c6
tempsave. asynccopy+trload sanity checked
2025-07-22 08:04:05 +00:00
aska-0096
1b468bac0b
tempsave, trload+asyncload done
2025-07-21 05:55:55 +00:00
aska-0096
afd96d8180
compile pass
2025-07-18 10:04:34 +00:00
aska-0096
5616551115
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-07-18 05:17:27 +00:00
aska-0096
ae39c84f55
tempsave
2025-07-18 05:16:39 +00:00
aska-0096
94b6430489
temp save
2025-07-17 10:06:09 +00:00
aska-0096
7e330553dc
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
2025-07-17 07:24:32 +00:00
Po Yen Chen
722c22fb15
Revert "Eliminate warning caused by failed to meet occupancy requirement ( #2389 )" ( #2514 )
...
This reverts commit b2dea90116 .
2025-07-17 10:09:01 +08:00
Qianfeng
45904b8fd7
Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) ( #2487 )
...
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* i_nhead_ conversion type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com >
2025-07-11 18:14:47 +08:00
aska-0096
18669925cc
temp save, change all instance to 1wave
2025-07-10 04:29:33 +00:00
shay-li77
d814fefe18
support y-direction step length greater than 1 for SimplifiedGenericAttentionMask ( #2338 )
...
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
2025-07-09 23:18:55 +08:00
aska-0096
18686cfe5b
tempsave, fmha_decode
2025-07-08 08:37:20 +00:00
Haocong WANG
5557eadce6
[CK TILE] Fix FA build filter ( #2369 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
2025-07-08 10:42:07 +08:00
Po Yen Chen
b2dea90116
Eliminate warning caused by failed to meet occupancy requirement ( #2389 )
...
Co-authored-by: felix <felix.li@amd.com >
2025-07-08 09:17:25 +08:00
ltqin
9f4c5d7372
ck tile pagedkv prefill ( #2405 )
...
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* comipling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernerl with pagedkv
* change names of file that are generated
* chang host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-07-07 16:16:54 +08:00
Po Yen Chen
50fad03524
[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface ( #2403 )
...
* Rename batch_prerfill interface
* Add min_seqlen_q parameter in MakeKargs()
2025-06-25 15:19:21 +08:00
Yi DING
b8212864cf
[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 ( #2114 )
...
* 160+192
* Add splitkv d160
* cleanup
* fix
* Add change log
* Fix CHANGELOG
* Use static_cast
* Update ignored instance
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
2025-06-24 01:33:31 +08:00
Max Podkorytov
0366fb2abc
Update for xformers ( #2372 )
...
* update api
* update kernel api
* clang-format
2025-06-22 00:28:30 -07:00
aska-0096
47565f21a5
temp save, waiting for debug
2025-06-21 15:02:57 +00:00
aska-0096
4bd5fd4a3c
fix bwd code
2025-06-18 07:27:24 +00:00
aska-0096
69809d9513
Fix for fwd/bwd kernel build filter
2025-06-18 06:38:03 +00:00
Satyanvesh Dittakavi
4c57157d50
Do not use warpSize as compile time constant as it is removed ( #2320 )
...
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin <Illia.Silin@amd.com >
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com >
2025-06-17 11:54:30 -07:00
MHYangAMD
9fcf21a4ec
Fix fmha fwd precision issue on MI3XX series ( #2285 )
...
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero.
For examaple, fixing K,V to be all 1, and Q is random, which outputs are
expected to be all 1. But we found that it would have some incorrect
outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 =
0.0005 < 0.001) Thus, ck do not report this error.
* Add option to switch rtn/rtz for fmha fwd
2025-06-10 15:03:23 +08:00
Po Yen Chen
c42b957d65
[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask ( #2209 )
...
* Assign block indices reversely if kHasMask=true
* Assign block indices reversely for splitkv kernel
2025-05-27 10:58:58 +08:00
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q ( #2221 )
...
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d .
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com >
2025-05-26 19:17:18 +08:00
Po Yen Chen
791802b381
[CK_TILE] fMHA batch_prefill block index & logits soft-capping optimizations ( #2198 )
...
* Write soft-sign in inline asm
* Change tile idx computation
* Add macro to turn off soft-sign asm opt
* Use simple for loop to avoid register spill
* Only do block id transform for masking cases
2025-05-16 15:14:46 +08:00
Po Yen Chen
2920604786
[CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines ( #2163 )
...
* hack for cap logits
* fix bug
* Re-format files
* Allow specifying logits_soft_cap through APIs
* Support turn on/off logits_soft_cap in async pipeline
* Do not generate non-verified kernels
* Align receipt used in Aiter
* Sync logits soft-capping across pipelines
* Re-enable some hdim pipelines
* fix perf
* Add attention variant for logits_soft_cap
* Add newline at end-of-file
* Fix performance
* Add comment to explain logits_soft_cap pre-processing
* Unify code
* Unify floating-point literal style
* Use class data member to slience the compilation error
* [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133 )
* Send 'mask' along with variant params to the LogitsMask()
* Send block indices to the variant
* Add indices parameters in variant interface
* Fix fmha bwd codegen error
* Allow switch logits_soft_cap impl
* Eliminate register spills
* Fix compilation errors
* Fix wrong LSE
* Fix LSE for splitkv kernel
* Sync splitkv pipeline changes
* Add batch_prefill kernel/pipeline
* Fix codegen error
* Undo changes in CMakeLists.txt
* Merge pipeline filtering check
* Use different code path if kHasLogitsSoftCap=false
* Remove [[maybe_unused]] attribute
* Use pre-existing compile-time flag to instantiate templates
* Sync pipeline changes
* Update CHANGELOG.md
---------
Co-authored-by: Bernard <bernaliu@amd.com >
Co-authored-by: coderfeli <coderfeli@163.com >
2025-05-13 12:19:25 +08:00
slippedJim
5f885d2b7a
add fmha fwd splitkv receipt for aiter c++ api ( #2068 )
...
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
2025-04-10 23:21:13 +08:00
rocking
8a20b62e91
Reduce redundant space in bias tensor ( #2024 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-03-28 21:58:06 +08:00