Optimize fmha fwd decode & prefill for gfx950 (#2641)

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-04-20 14:59:17 +00:00

* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the performance regression caused by LDS alignment

* enable prefill overload operator().

* remove all LDS bank conflicts with XOR layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce; disabled because it prevents packed math from co-executing in FA

* Add XOR fold strategy for hdim<128; perf dropped, so it is disabled by default pending further perf debugging

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove unnecessary change

* fix clang-format with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format
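
The XOR-layout items in the list above refer to a common technique for avoiding LDS bank conflicts. The following is a minimal host-side sketch of the general idea, not the layout used in this commit: the bank count (`kBanks = 32`), the tile shape, and the helper names `linear_index`, `xor_index`, and `max_bank_conflict` are all assumptions chosen for illustration. It models a tile of 4-byte words and counts how many rows land in the same bank when every row reads the same logical column.

```cpp
// Illustrative model only: kBanks, kRows, kCols are assumed values,
// not properties of gfx950 or of the composable_kernel layouts.
constexpr int kBanks = 32; // assumed number of 4-byte LDS banks
constexpr int kRows  = 32; // rows in the modelled tile
constexpr int kCols  = 32; // 4-byte words per row

// Plain row-major word index: every row maps column c to the same bank.
constexpr int linear_index(int row, int col) { return row * kCols + col; }

// XOR-swizzled word index: the column is XORed with the row number,
// so a fixed logical column lands in a different bank for each row.
constexpr int xor_index(int row, int col) { return row * kCols + (col ^ (row % kCols)); }

// Returns the worst-case number of rows that hit one bank when all
// rows access the same logical column `col`.
template <typename IndexFn>
constexpr int max_bank_conflict(IndexFn index, int col)
{
    int hits[kBanks] = {};
    int worst        = 0;
    for(int row = 0; row < kRows; ++row)
    {
        const int bank = index(row, col) % kBanks;
        ++hits[bank];
        if(hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}

// Compile-time checks of the model.
static_assert(max_bank_conflict(linear_index, 5) == kRows, "all rows share one bank");
static_assert(max_bank_conflict(xor_index, 5) == 1, "each row uses its own bank");
```

In this model a column-wise access serializes completely under the row-major layout and is conflict-free under the XOR layout. A real kernel must apply the same XOR on both the store and the load side, and the right swizzle depends on vector width and hardware bank organization.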

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

Author: Haocong WANG
Date: 2025-08-12 19:43:14 +08:00
Committed by: GitHub
Parent: c0c2ded566
Commit: b7322a521a

31 changed files with 3533 additions and 627 deletions

									
include/ck_tile/ops/elementwise/unary_element_wise_operation.hpp (7 lines changed)
												
@@ -330,13 +330,6 @@ struct PassThrough
         y = type_convert<float>(x);
     }
 
-    template <>
-    CK_TILE_HOST_DEVICE void
-    operator()<ck_tile::bf16_t, ck_tile::fp16_t>(ck_tile::bf16_t& y, const ck_tile::fp16_t& x) const
-    {
-        y = type_convert<ck_tile::bf16_t>(x);
-    }
-
     template <>
     CK_TILE_HOST_DEVICE void operator()<float, ck_tile::fp16_t>(float& y,
                                                                 const ck_tile::fp16_t& x) const
