* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm
* Some helper functions for persistent kernel case
* Get max occupancy grid using device properties
* Implement tile loop in main entry point to grouped gemm
* Enable GridSize() on device
* Handle offset tile index using real current block index
* Add persistent kernel choice to grouped gemm example
* Use a for-loop for iterating over the group
* Reduce VGPR spills by early-exit
* Enable persistent kernel choice in grouped_gemm example
* Add persistent kernel option to grouped_gemm test
* Fix formatting with remod.py
* Remove GridUpdateBlocks as blocks are now iteratively computed
* Add comment about VGPR spilling
* Fix formatting
* Use CK_TILE_HOST instead of __host__
* Enable all Row/Col combinations in grouped gemm unit test
* Add some KBatch=2 cases to grouped gemm tests
* Fix SplitK for grouped gemm
* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm
* Add type traits
* Split examples to regular and tileloop
* Formatting
* Use hipExtStreamGetCUMask to get current active CUs for the given stream
* Align test and example kernel config, and disable validation for splitk repeats
* Remove debug options from CMakeLists.txt
* Separate the code paths for persistent/non-persistent in test
* Fix formatting
* Address review comments
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Write soft-sign in inline asm
* Change tile idx computation
* Add macro to turn off soft-sign asm opt
* Use simple for loop to avoid register spill
* Only do block id transform for masking cases
* Adding validation for tile sizes
* Add architecture in config, and shuffle lines of code in warp_gemm.hpp
* Enable MFMA for gfx950, and invalid tile handling
* Shared Memory for single data point
* CKTile Transpose vectorize CP1
* CKTile Transpose vectorize CP2
* CKTile Transpose vectorize CP2.1
* fixed the compile error of the transpose tile 2d
* Have the correct result for the current test sample
* Changes to printing tensor
* fp8 support added
* Debugging for transpose
* solving the corner issue
* Changed padding flag
* Intermideate Debugging
* Intermidiate Debugging
* Intermediate Debugging
* Finished debugging of the transpose op
* Code Cleanup
* Adding edge case smoke tests
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Addressing Review Comment
* Addressing Comments
* Addressing Comments
* Measuring Perf Tests
* Code Cleanup
* Changlog
* Added the running iterations
* clang format
* Fix the changelog
* Fix the compilation error
* change the printing factor
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
* Changes for updating tile distribution for shuffle and transpose
* Fixed swizzle and transpose, removed comments
* clang formatted
* Adding support for bf16 type
* Addressing review comments
* sync with function interface of cshuffleepiloge,fix flatmm build fail
* move code from solin/flatmm which add mfma16*16*32fp8 and optimize flatmm
---------
Co-authored-by: solin <bingzhou@amd.com>
* added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline
* added type traits instead of enum to tile_window() and tile_window_linear() with debug comments
* removed comments, added documentation and clang format
* fixed computev4 policy bug for lds swizzle
* added swizzle for input matrix B
* Improved ComputeV4 policy and pipeline by swizzling A and B
* consolidated LDS descriptor functions in parent struct
* add ck tile examples to package
* Update jenkinsfile
* fix for jenkinsfile
* fix for building ck tile code on non gfx9
* compile ck tile examples only for gfx94
* include ck tile examples in all target
* fix for basic gemm UseStructuredSparsity
* Update CMakeLists.txt
* Update gemm_pipeline_problem.hpp
* add targets to rocm install
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* fix error while testing new command
* Finished the feature of new mfma 16*16*32
* Addressed the comment
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug ifx
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce
* Remove last block_sync_lds() in the loop
* Tiny adjustment in qr_ks_vs_async pipeline for better performance
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline
* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline
* Use un-rolled gemm for Gemm-0
* Use k0_loops small tile load/store to replace the big tile load/store for K
* Remove the commented lines in qx_ks_vs_custom_policy.hpp
* Tune the prefetching of V in qr_ks_vs_async pipeline
* Move the codes for storing the first v_lds tile some later
* Let BlockDropout reuse LDS with V
* Switch to separate code blocks according to iteration index
* Interleave code blocks for better performance
* Move clear_tile(s_acc) for better interleaving
* Move code interleaving
* Use MakeQDramTileDistribution for q_dram_window
* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop
* Let V reuse the LDS of K
* Use array of tiles to represent Q in vgprs
* Use QLoadOnce == false for qr_ks_vs_async pipeline
* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline
* Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile()
* Move the definition of v_tiles out from the loop
* Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile()
* Remove using KLoadOnce in qx_ks_vs_custom_policy
* Remove un-used get_slice_tile() call
* Move the code line of clear_tile(s_acc)
* Tune the lines of codes to make them more tidy
* Re-arrange the codes before the main-loop
* Add comments
* Unify the alignment to be 8 for Q/K/V Lds decriptors
* Tuning to K pre-loading
* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false
* Adjust the pipeline codes
* Use NumPrefetchV to separate from NumVLdsBuffers
* Tune the location of a scheduler barrier code line
* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths
* Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel
* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline
* Move the location for store_tile() of first v_tile
* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline
* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>
* Try to fix old bugs in qx_ks_vs_custom_policy
* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler
* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy
* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)
* Tiny Fix (#1888)
* Ck tile/paged attention workaround (#1894)
* Correction in GetRangeAlongX()
* Work-around to solve the failures in test_paged_attention_ck in xformers
* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline
* Remove one call of move_tile_window for q_dram_window
* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()
* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline
* Adjust the location of codes for storing the first V tile to LDS
* Tiny fix and add comments
* Change GetSmemKPackK size to improve performance
* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy
* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access
* Adjustment in intermediate iteration codes for tiny performance improvement
* Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation
* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue
* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()
* Remove useless AsyncopyV
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used
* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy
* Recover the qr_ks_vs_async pipeline
* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline
* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"
This reverts commit 39b82ca194.
* Tiny fix with regard to whole_k_prefetch pipeline compiling
* Update kPadSeqLenK setting in fmha_fwd_kernel
* Use q_element_func and k_element_func
* Use single q_tile rather than multiple sliced q_tiles
* Codes refine according to the comments
* Re-format one file
* Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true