* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* i_nhead_ conversion type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com>
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* comipling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernerl with pagedkv
* change names of file that are generated
* chang host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero.
For examaple, fixing K,V to be all 1, and Q is random, which outputs are
expected to be all 1. But we found that it would have some incorrect
outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 =
0.0005 < 0.001) Thus, ck do not report this error.
* Add option to switch rtn/rtz for fmha fwd
* Write soft-sign in inline asm
* Change tile idx computation
* Add macro to turn off soft-sign asm opt
* Use simple for loop to avoid register spill
* Only do block id transform for masking cases
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce
* Remove last block_sync_lds() in the loop
* Tiny adjustment in qr_ks_vs_async pipeline for better performance
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline
* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline
* Use un-rolled gemm for Gemm-0
* Use k0_loops small tile load/store to replace the big tile load/store for K
* Remove the commented lines in qx_ks_vs_custom_policy.hpp
* Tune the prefetching of V in qr_ks_vs_async pipeline
* Move the codes for storing the first v_lds tile some later
* Let BlockDropout reuse LDS with V
* Switch to separate code blocks according to iteration index
* Interleave code blocks for better performance
* Move clear_tile(s_acc) for better interleaving
* Move code interleaving
* Use MakeQDramTileDistribution for q_dram_window
* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop
* Let V reuse the LDS of K
* Use array of tiles to represent Q in vgprs
* Use QLoadOnce == false for qr_ks_vs_async pipeline
* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline
* Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile()
* Move the definition of v_tiles out from the loop
* Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile()
* Remove using KLoadOnce in qx_ks_vs_custom_policy
* Remove un-used get_slice_tile() call
* Move the code line of clear_tile(s_acc)
* Tune the lines of codes to make them more tidy
* Re-arrange the codes before the main-loop
* Add comments
* Unify the alignment to be 8 for Q/K/V Lds decriptors
* Tuning to K pre-loading
* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false
* Adjust the pipeline codes
* Use NumPrefetchV to separate from NumVLdsBuffers
* Tune the location of a scheduler barrier code line
* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths
* Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel
* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline
* Move the location for store_tile() of first v_tile
* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline
* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>
* Try to fix old bugs in qx_ks_vs_custom_policy
* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler
* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy
* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)
* Tiny Fix (#1888)
* Ck tile/paged attention workaround (#1894)
* Correction in GetRangeAlongX()
* Work-around to solve the failures in test_paged_attention_ck in xformers
* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline
* Remove one call of move_tile_window for q_dram_window
* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()
* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline
* Adjust the location of codes for storing the first V tile to LDS
* Tiny fix and add comments
* Change GetSmemKPackK size to improve performance
* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy
* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access
* Adjustment in intermediate iteration codes for tiny performance improvement
* Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation
* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue
* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()
* Remove useless AsyncopyV
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used
* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy
* Recover the qr_ks_vs_async pipeline
* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline
* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"
This reverts commit 39b82ca194.
* Tiny fix with regard to whole_k_prefetch pipeline compiling
* Update kPadSeqLenK setting in fmha_fwd_kernel
* Use q_element_func and k_element_func
* Use single q_tile rather than multiple sliced q_tiles
* Codes refine according to the comments
* Re-format one file
* Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true
* Update for fmha_fwd qs_ks_vs pipeline
* Remove _builtin_amdgcn_sched_barrier(0)
* Move p_compute to p converting earlier for trying to increase vgprs re-using
* Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation
* Re-add __builtin_amdgcn_sched_barrier(0)
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Update license year
* Add initial code to override decode problem
* Fix splitkv traits/args overriding error
* Reshape and transpose lse for decode
* Remove debug code
* Prettify example code
* Use better function name
* Add kMergeNumHeadGroupsSeqLenQ flag
Kernel user can use this switch to turn on/off optimization for
some problem sizes
* Add missing flag declarations
* Default turn off kMergeNumHeadGroupsSeqLenQ in codegen
* Group similar statements together
* Remove assumption of seqlen_q=1
* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel
* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel
* Run kMergeNumHeadGroupsSeqLenQ=true kernels when need
* Fix group mode block skip logics
* Undo changes of normal fwd kernel
* Update in GridSize() and using GridSize() for splitkv kernel (#1799)
---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
* Remove using tile partitioner for fmha_fwd_kernel
* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels
* Remove using tile partitioner for fmha_fwd_appendkv kernel
* Unify the format of GetTileIndex