* [CK_TILE] add more strides for layernorm to support non-contiguous tensors
* align CK coding style
* extend strides to layernorm example
* clang-format...
* optimize small N case using vec io and rcp div
* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass
* [Ck_tile] fix blockSize compute in Generic2dBlockShape
* [Ck_tile] fix kfastfdiv template style
* [Ck_tile] layernorm, fix style in review
---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
* CK-Tile GEMM with memory bound pipeline.
* Memory bound gemm pipeline.
* Fix not closed namespace.
* Block gemm mem pipeline draft.
* Do not use ck_tile:: within ck_tile namespace.
* Refactoring & Move Layout info to pipeline problem.
* Get hot loop and TailNum information before launching kernel.
* Fixes in pipeline.
* Add comment to load_tile_raw and change variable naming style.
* Few small changes & formatting.
* Do not use macro.
* Add gtests.
* Use AccDataType for Output of MFMA instruction.
* Formatting.
* Refactor gemm examples.
* Switch over to current block gemm.
* Use currently available pipeline policy.
* Refactoring and review comments.
* Fixes after merge.
* Add missing include.
* Add load tile overload which accepts output tensor as parameter.
* This gives an 8% perf boost at the cost of using more registers.
* Rename example.
* Small changes.
* Fix compilation err and lower K.
* Support different layouts for A/B
* Fix vector size for different layouts.
* Rename Alignment into VectorSize
* Unblock tests.
* Add reduce2d new api
* Prevent users from using cross-warp reduction
* Fix bug of std calculation
* Add rmsnorm2d
* Add rmsnorm small example
* Remove static assert to prevent compile fail
* Add script to test performance and correctness
* Add missing cmake change
* refine naming
* refine example of rmsnorm
* Fix bug of rmsnorm
* Refine naming
* Fix cmake
* clang format
* Refine pipeline name
* Add add_rmsnorm2d_rdquant kernel
* Add reduce op
* host verification
* Fix bug of one pass pipeline
* Refine tile size
* Add two pass pipeline
* Rename two pass to three pass
* Fix bug of kSaveX == false
* Add instance library
* Add test script
* Fix bug of x verification
* Add save_x to trait
* Add README
* Move reduce2d into reduce folder
* Fix bug of welford when number of m warp > 1
* remove redundant comment
* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant
* clang format and add missing header
* Add host validation of add + layernorm2d + rsquant
* Revert "Add host validation of add + layernorm2d + rsquant"
This reverts commit 936cb45797.
* Remove deprecated flag
* Add ceil_to_qualified_tile_length()
* Rename kK0BlockLength to kQKHeaddim
* Add kSubQKHeaddim concept to support headdim96
* Fix in math.hpp to avoid using __half interfaces
* Add LdsBufferSequence instance for headdim96
* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing
* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time
* Reformat one file
* Fix text alignment in fmha_fwd_splitkv.py
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Use pre-defined constants for readability
* Use vector write for o_acc tensor
* Remove no-longer used policy method
* Deprecate no-longer used policy/pipeline
* Specify gemm0/gemm1 block warps separately in codegen
* Fix wrong ps_idx creation logic
* Add single-warp block gemm
* Support single-warp gemm0
* Make MakeCBlockTile() as static method
* Use MakeCBlockTile() to get underlying tile distribution
* Use kNumGemm1Warps to compute # threads for gemm1
* Put normal case in the if clause
* Refine fmha splitkv block mapping
* Refine & fix the lse_acc/o_acc layout
* Fix wrong LDS size for K tile
* Use kK0=64 for hdim=128,256 fmha splitkv kernels
* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels
* Undo kK0/kK1 changes
* Use more reasonable GetAlignmentV() computation
* Using store_tile() in fmha splitkv kernel epilogue
* port layernorm
* change warp_welford.hpp
* Update warpshuffle
* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()
* refine welford max count calculation
* unify layernorm api
* Rename file
* Remove save mean and inv std
* Revert "refine welford max count calculation"
This reverts commit 022365802b.
* Fix order of parameter
* refine welford max count calculation again
* Remove fp32 instances
* Fix bug of padding
* refactor api
* Support bf16
* Extract common function
* Refine arg of operator()
* Add kMThreadPerBlock to template parameter
* clang format
* Refine variable name
* Refine file name
* remove redundant line
* refactor layernorm2d pipeline and add block-per-block utility
* fix name
* rename more
* add more block-per-tile instance
* remove duplicated define
* update instance for 2048, 1024 case
* support up to 2048 now
* opt loading
* add n1536
* Add two pass pipeline
* format
* Fix incorrect type
* parallel compilation
* Use smaller N
* fix 2p pass
* Support Repeat_M in distribution
* Refine naming
* Add reduce example
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96
* Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor
* Fix in bwd_pipeline_default_policy
* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines
* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim
* Update to hd96 tile settings
* Add smoke test scripts for fmha-bwd hd96
* Revert "Add smoke test scripts for fmha-bwd hd96"
This reverts commit 7ca7e1a93d.
* Remove hd96 tile settings in fmha_bwd codegen to save compiling
* Fix lost code line in bwd_pipeline_default_policy
* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits
* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor
* tiny adjustments
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>
* Make the cshuffle compilable
* modify the reference on gpu and cpu. Correct access of cshuffle
* fix the cpu reference code
* Complete the in tile shuffle logic
* restructure the kernel template input
* change the naming pattern of ck_tile gemm pipeline
* Re-format files using remod.py
* Solve the fmha conflict with gemm
* Comment Addressed from Carlus
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
* Simplify the codes in splitkv_combine pipeline
* Always set kPadSeqLenK=true for fmha splitkv kernels
* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Fix compile error
* Add one pass pipeline
* Extract creating tile_window to operator()
* clang format
* reduce duplicated code
* do not hardcode
* Support padding in layernorm
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Adding seed and offset pointer support to the philox random number generator.
* Separating seed and offset pointer checks with different condition statements.
* Changes include adding support for device seed and offset pointers; a union is used to store seed/offset values and device pointers to minimize device SGPRs.
* Correcting a typo in the readme file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Use same layout for o_acc and o tensor
* Use better param names in partitioner
* Remove redundant kargs 'max_seqlen_q'
* Use better param names in splitkv kernel
* Add comment for additional kernel arguments
* Sync empty loop early return logics between pipelines
* Pass more arguments to cmake in scripts
* Align backslashes
* Fix wrong o_acc tensor view strides
* Change o_acc layout if o_perm=0
* Handle whole row masked via attn_bias
* Use vector width = 1 for o_acc
* Use more even split sizes
* Finished the feature of gpu verification
* Add the ck_tile_gemm test to the CI/CD
* add the include of tensor_layout in reference_gemm
* Comment Addressed
* split ck_tile fmha and gemm tests into separate stages
* restructure the reference gemm
* restructure a new reference_gemm api that could read the device mem
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
* Fix: Clang Format, API fixed from fmha
* fix with better naming convention
* revert back the pipeline code of fmha
* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.
* clang format with the reference_gemm file
* convert the clang format with the remod.py
* Changed the format and variable name of the kernel gemm_shape and partitioner
---------
Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
* Use dictionary to config all the functions
* Add init codegen logic for fmha fwd appendkv
* Call HIP_CHECK_ERROR() macro to get real source info
* Set up meaningful arguments
* Sync kernel name with the codegen
* Add knew/vnew tensors to the kernel argument
* Fix wrong K values after appending
* Fix vnew append error
* Extract common logics
* Fix Vnew tile dstr for row major case
* Conditionally add fwd_splitkv API in fmha_fwd example
* Conditionally add call to fmha_fwd_splitkv()
* Remove "EXAMPLE_" prefix of cmake variables
* Register API handlers automatically
* Early return if 0 < s_k_new is not supported
* Show message if we are ignoring option
* Unify CMakeLists.txt coding style
* Set num_splits=1 if split-kv is not supported
* Add length/stride getters for HostTensor
* Add RoPE example utilities
* Add reference_rotary_position_embedding() (not implemented)
* Finish reference_rotary_position_embedding() impl
* Fix typo of HostTensor<>::get_length()
* Fix compilation errors
* Fix wrong answer when interleaved=false
* Fix wrong answer when interleaved=true
* Append K/V in the host verification code
* Simplify K appending logics
* Simplify v_host_ref definition
* Reduce input/output dimensions
* Rename function: add "batched" prefix
* Apply RoPE on host side
* Rename RoPE utility function
* Fix wrong tensor size
* Avoid invoking deprecated method 'find_module'
* Pass RoPE kernel args
* Create Rotary Cos/Sin tile windows in kernel
* Add compute data type alias for RoPE
* Randomly generate seqlen_knew if needed
* Fix seqlen_knew enabling check logic
* Add minimum seqlen_k to generate compliant kvcache
* Fix compilation error in debug mode
* Fix wrong boundaries
* Fix wrong seqlen_k for kvcache
* Rename variables used in distribution encoding
* Fix rotary cos/sin tensor/tile size
* Add constraint to the rotary_dim option
* Remove unused inner namespace
* Add dram distribution for rotary_cos/rotary_sin (interleaved)
* Only apply interleaved RoPE on Knew for now
* Fix wrong thread starting offset
* Instantiate multiple kernels for RoPE approaches
* Clean-up pipeline
* Fix error in RoPE host reference
* Handle RoPE half-rotated logics
* Support 8x rotary_dim under half-rotated RoPE
* Add comment
* Apply elementwise function to the loaded tiles
* Unify parameter/variable naming style
* Remove constness from q_ptr
* Add code blocks for q_tile
* Apply RoPE to q_tile
* Remove debug print code in kernel
* Fix wrong knew/vnew appending positions
* Use better naming for tile indices
* Add make_tile_window() for adding distribution only
* Skip code if # of blocks is more than needed
* Move thread locating logics into policy
* Remove always true static_assert()
* Rename header
* Rename RotaryEmbeddingEnum
* Extract rotary embedding logic out
* Re-order parameters
* Align naming of some tile size constants
* Rename more tile size constants
* Fix wrong grid size
* Fix wrong shape of knew_host/vnew_host
* Fix wrong index into knew_host/vnew_host
* Fix wrong rotary_cos/rotary_sin memory size for Q
* Extract Q/Knew vector size to helper methods
* Use different rotary_cos/rotary_sin distr for Q/Knew
* Update host/device specifiers
* Fix wrong data type for Q rotary_cos/rotary_sin
* Remove RoPEComputeDataType type alias
* Shift rotary_cos/rotary_sin by cache_seqlen_k
* Add comment for why I just use 't' for all padding flags
* Align commit message to the real comment
* Fix wrong pipeline
* Rename utility function
* Disable host verification if API not exist
* Fix wrong rope key for fp8 pipeline
* Allow applying RoPE only on Q (without append KV)
* Add append-kv smoke tests
* Remove debug statements
* Remove more debug statements
* Re-arrange the 'set +x' command
* Remove no-longer used method in pipeline
* Add missing init code
* Refine pipeline padding settings
* Enlarge rotary_dim limit (8 -> 16)
* Enlarge KPerThread for rotary_interleaved=false
* Update rotary_dim range in smoke_test_fwd.sh
* Add template argument 'kIsPagedKV' for splitkv kernels
* Launch splitkv kernel if given page_block_size
* Fix wrong kernel name
* Fix seqlen_k_min for pre-fill case (1 -> 0)
* Add copy_const<> type trait
* Add another make_tile_window()
* Introduce 'TileWindowNavigator' types
* Simplify TileWindowNavigator interfaces
* Fix tile window navigation bugs
* Disable calling fmha_fwd()
* Remove unnecessary data members
* Simplify more make_tile_window() overloads
* Move V tile through TileWindowNavigator
* Fix uneven split checking logic
* Move code after decide seqlen_q/seqlen_k
* Make sure we always start reading complete tile
* Use 128 as minimum page_block_size
* Fix wrong origin for bias
* Add batch_stride_k/batch_stride_v in group mode
* Unify origin
* Add missing kernel arguments for group mode
* Add paged-kv codegen logic for appendkv kernels
* Add block_table kernel args for appendkv kernel
* Add tile navigators to the appendkv kernel
* Fix wrong tensor descriptor lengths
* Pass re-created tile window to pipeline
* Fix wrong strides for appendkv kernel
* Allow transit tile_window to another page-block
* Handle cross-page-block write
* Do not perform write again if already in last page-block
* Always add fmha_fwd() api
* Add missing group mode argument
* Remove debug macro usages
* Rename option s_k_new to s_knew
* Separate splitkv/non-splitkv args/traits
* Remove fmha_fwd_dispatch()
* Fix compilation errors
* Remove dropout code in splitkv kernel
* Allow problem types without defining kHasDropout attr
* Use generic lambda to init traits objects
* Separate more non-splitkv & splitkv traits/args
* Display more info for specific kernels
* Show more detailed warning message
* Rename 'max_num_blocks' to 'max_num_page_blocks'
* Remove no-longer used pipeline files
* Wrap code by #if directives
* Move functors to the beginning of validation code
* Use generic lambda to init all the api traits/args
* Fix wrong seqlen for kvcache
* Add missing comment
* Rename TileWindowNavigator to PageBlockNavigator
* Only expose necessary methods (not attributes)
* Re-order pipeline parameters
* Refine smoke_test_fwd.sh
* Fix wrong argument count
* Make tile window directly via PageBlockNavigator
* Remove unused template parameter
* Remove group mode from appendkv kernel
* Fix skcheck logic
* Fix wrong syntax in skcheck expr
* Use meaningful options in smoke test
* Remove options
* Fix formatting
* Fix more format
* Re-organize bash functions
* Pass cache_batch_idx to kernels
* Support cache_batch_idx in example
* Fix compilation error
* Add more appendkv test
* Add more case for appendkv
* Fix nonexistent attribute
* Remove 0 < seqlen_knew constraint
* Clarify the case in warning message
* Remove macro checking
* Force batch mode when invoking appendkv & splitkv apis
* Fix mode overriding logics
* Fix wrong parameter name
* Randomize seqlen_k if use kvcache
* Use randomized seqlen_k for kvcache
* Avoid using too small rotary_cos & rotary_sin
* Rename parameter
* Add seqlen_q & seqlen_k rules
* Add comment
* Add more comments
* Fix compilation errors
* Fix typo in comment
* Remove type argument
* Avoid seqlen_k=0 for kvcache
* Revert "Avoid seqlen_k=0 for kvcache"
This reverts commit 21c4df89e4.
* Fix wrong uneven split checking logics
* Only randomize kvcache seqlen_k if 1 < batch
* Return earlier if split is empty
* Revert "Only randomize kvcache seqlen_k if 1 < batch"
This reverts commit b9a4ab0d7e.
* Re-order seqlen_k_start adjustment logics
* Fix compilation errors
* Re-format script
* Find executable from folder automatically
* Fix kvcache seqlen_k generating logic
* Make comment more clear
* Fix wrong knew/vnew appending logic on host
* Add s_barrier to sync threads
* Revert "Add s_barrier to sync threads"
This reverts commit d3f550f30c.
* Support only using 1 row of rotary_cos/rotary_sin
* Rotate Q in different way
* Unify tensor view creation logics
* Fix wrong argument
* Add mask to switch how we use the rotary_cos/sin
* Move attr from traits to problem
* Move has_mask to fmha_fwd_appendkv_args
* Support use uint32_t as SAD operand in Alibi<>
* Use sad_u32() in splitkv kernels
* Store tensor views in PageBlockNavigator
* Use stored tensor view to update tile windows
* Enlarge tensor view size
* Remove debug code
* Fix wrong tensor view size
* Wrap tensor view into PageBlockNavigator
* Add DataType member to PageBlockNavigator
* Remove unnecessary member functions
* Refine macro use
* Fix typo
* Add blank line between directives and actual code
* Re-format files
* Remove type in comment
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* wa prec, remove sgpr offset for inline asm
* macro for set tile
* ignore unused param if no kernel instances in host API
* fix more prec issue
* cache buffer resource
* fix
* support pre-nop
* clear tile by vector type members
* add workaround to reduce scratch memory
* conditionally enable workaround code
* enable workaround start from certain build version
* fallback set_tile() implementation from certain build version
* undo template argument changes
* put dummy asm in load_raw()
* fix comments, refactor s_nop inside buffer_load
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
* Add layernorm2d forward
* Refine file path
* clang format
* Exclude ck_tile op from all
* use add_executable instead
* refactor layernorm2d_fwd example
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
* Add NullBlockDropout to be used when kHasDropout is false
* Change to BlockDropout::Run() for forward to reduce conditional checks
* Re-format files
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>