aska-0096
0810799e25
refactor blockgemm change, isolate to v2;
2025-08-12 14:25:50 +00:00
aska-0096
75f6f6bac4
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-12 09:04:41 +00:00
Yi DING
8e1eb0c1ee
[CK_TILE] FMHA BWD Decode Pipeline (#2643)
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
2025-08-12 17:02:52 +08:00
aska-0096
96d24497f5
fix conflict. disable all v-col instance for fmha fwd
2025-08-12 04:02:41 +00:00
aska-0096
1716171be4
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-12 03:52:34 +00:00
Yi DING
4fde1646e5
[CK_TILE] FMHA BWD Optimization For GFX950 (#2628)
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
2025-08-12 11:11:55 +08:00
aska-0096
efb8549279
fix bug
2025-08-08 17:53:19 +00:00
aska-0096
729e8785fb
fix bugs
2025-08-08 15:42:15 +00:00
aska-0096
78edd7303b
bug fix, clang format;
2025-08-08 09:04:02 +00:00
aska-0096
3b9fb6af38
Remove unnecessary changes
2025-08-08 08:08:03 +00:00
aska-0096
6bb57c2c57
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-08-08 07:50:12 +00:00
aska-0096
1ecee378d5
remove unnecessary files; rename some files
2025-08-08 06:19:31 +00:00
aska-0096
b4640a9de6
merge fa_decode pipeline into fmha_fwd api
2025-08-08 05:46:18 +00:00
Yi DING
b0a97498b0
[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550)
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
2025-08-07 21:24:43 +08:00
aska-0096
414cad667b
Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
2025-08-05 07:23:51 +00:00
aska-0096
0d12fc944f
Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
2025-08-04 10:27:42 +00:00
aska-0096
4f31847de1
add vmcnt guard before load ktile
2025-08-04 10:02:17 +00:00
aska-0096
746f4ccb99
Load Q through lds, implement xor;
2025-08-04 06:49:01 +00:00
aska-0096
2d4e73d2b4
small refactor
2025-08-01 10:44:54 +00:00
aska-0096
a28b6e67fe
upgrade prefill pipeline; simple iglp; consistent data produce and consume order
2025-07-31 10:25:37 +00:00
aska-0096
75cba48682
enable larger tile size; upgrade xor pattern
2025-07-31 05:13:27 +00:00
aska-0096
69890afc98
remove all lds bankconflict with xor layouts
2025-07-30 12:25:33 +00:00
aska-0096
8dacc35c4c
enable prefill overload operator().
2025-07-30 03:51:06 +00:00
Illia Silin
49723e94bb
fix the clang-format (#2578)
2025-07-28 20:49:55 -07:00
Yi DING
1926cd0cb8
[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
...
* Fix shuffle_tile
* Add fmha bwd d160
* CHANGELOG
* Use static_cast
* Update
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-29 09:31:14 +08:00
Illia Silin
504b101da3
upgrade from clang-format-12 to clang-format-18 (#2568)
...
* upgrade to clang-format-18
* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
shay-li77
8ae528a1b4
fix mha bwd dbias random mismatch (#2570)
...
* fix mha bwd dbias random mismatch
* formatting code
2025-07-28 14:39:31 +08:00
liang
d2459878cf
reorder grid dim schedule (#2533)
...
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-26 02:46:55 +08:00
aska-0096
13bcc913de
fix the lds alignment caused performance regression
2025-07-25 07:10:01 +00:00
aska-0096
af28123cec
remove unnecessary features
2025-07-23 09:05:57 +00:00
aska-0096
14e0ab70c6
tempsave. asynccopy+trload sanity checked
2025-07-22 08:04:05 +00:00
aska-0096
1b468bac0b
tempsave, trload+asyncload done
2025-07-21 05:55:55 +00:00
aska-0096
afd96d8180
compile pass
2025-07-18 10:04:34 +00:00
aska-0096
5616551115
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa
2025-07-18 05:17:27 +00:00
aska-0096
ae39c84f55
tempsave
2025-07-18 05:16:39 +00:00
aska-0096
94b6430489
temp save
2025-07-17 10:06:09 +00:00
aska-0096
7e330553dc
Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline
2025-07-17 07:24:32 +00:00
Po Yen Chen
722c22fb15
Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389)" (#2514)
...
This reverts commit b2dea90116.
2025-07-17 10:09:01 +08:00
Qianfeng
45904b8fd7
Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
...
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* i_nhead_ conversion type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com>
2025-07-11 18:14:47 +08:00
aska-0096
18669925cc
temp save, change all instance to 1wave
2025-07-10 04:29:33 +00:00
shay-li77
d814fefe18
support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
...
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
2025-07-09 23:18:55 +08:00
aska-0096
18686cfe5b
tempsave, fmha_decode
2025-07-08 08:37:20 +00:00
Haocong WANG
5557eadce6
[CK TILE] Fix FA build filter (#2369)
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
Po Yen Chen
b2dea90116
Eliminate warning caused by failed to meet occupancy requirement (#2389)
...
Co-authored-by: felix <felix.li@amd.com>
2025-07-08 09:17:25 +08:00
ltqin
9f4c5d7372
ck tile pagedkv prefill (#2405)
...
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* compiling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernel with pagedkv
* change names of file that are generated
* change host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-07 16:16:54 +08:00
Po Yen Chen
50fad03524
[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403)
...
* Rename batch_prerfill interface
* Add min_seqlen_q parameter in MakeKargs()
2025-06-25 15:19:21 +08:00
Yi DING
b8212864cf
[CK_TILE] FMHA Support hdim_v as a Multiple of 32 (#2114)
...
* 160+192
* Add splitkv d160
* cleanup
* fix
* Add change log
* Fix CHANGELOG
* Use static_cast
* Update ignored instance
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-06-24 01:33:31 +08:00
Max Podkorytov
0366fb2abc
Update for xformers (#2372)
...
* update api
* update kernel api
* clang-format
2025-06-22 00:28:30 -07:00
aska-0096
47565f21a5
temp save, waiting for debug
2025-06-21 15:02:57 +00:00
aska-0096
4bd5fd4a3c
fix bwd code
2025-06-18 07:27:24 +00:00