* CK-tile: implement grouped gemm
* Workspace is allocated by the user and passed to the function
* Adapt tests to the new API design
* Unify GemTransKernelArgs, removing N0 param
* Add 1 to dim3 in partitioner
* Typo: gem -> gemm
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* add interwave scheduler for gemm mem pipeline
* Fix merge artifacts.
* Refactor unit tests.
* Switch to interwave scheduler for mem example
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* Block universal gemm.
* Universal block gemm with interwave scheduler - draft.
* Refactoring
* Move a/b_warp_tiles into BlockGemmImpl
* set BlockGemmImpl as a class member
* Change tile size to better suit memory-bound cases.
* Introduce kKPerThread to WarpGemm
* Add documentation comment.
* Fix Interwave scheduler block gemm.
* Add compute/memory friendly tile configuration.
* Clean
* New tile configurations in gemm mem example.
* Add more static checks and fix loop order in block gemm.
* Add more static checks and use warp gemm mfma dispatcher.
* Add default scheduler block gemm.
* Remove logging in example.
* Change in fwd-splitkv kernel to support num_splits=1 case
* Update in codegen fwd-splitkv to make num_splits > 1 cases pass
* Specify instance traits in dispatch
* Fix link error for fp8 kernels
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Add overloads for MakeKargs
Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples.
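The overload idea above can be sketched roughly as follows. This is an illustrative stand-in, not CK-tile's actual fmha kernel-argument code: `Kargs` and the core factories are hypothetical, and only the tuple-forwarding pattern reflects the commit's description.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical kernel-argument struct; the real MakeKargs targets CK-tile's
// fmha kargs, which hold many more fields.
struct Kargs {
    bool seed_is_device_ptr;
    uint64_t seed;        // valid when !seed_is_device_ptr
    uint64_t offset;      // valid when !seed_is_device_ptr
    const void* seed_ptr;   // valid when seed_is_device_ptr
    const void* offset_ptr; // valid when seed_is_device_ptr
};

// Core factory: host-side immediate seed/offset values.
inline Kargs MakeKargs(uint64_t seed, uint64_t offset)
{
    return Kargs{false, seed, offset, nullptr, nullptr};
}

// Core factory: device pointers holding seed/offset.
inline Kargs MakeKargs(const void* seed_ptr, const void* offset_ptr)
{
    return Kargs{true, 0, 0, seed_ptr, offset_ptr};
}

// Tuple overloads keep existing call sites compiling: code that passed
// list-initializers or tuples keeps working and forwards to the core factories.
inline Kargs MakeKargs(std::tuple<uint64_t, uint64_t> t)
{
    return MakeKargs(std::get<0>(t), std::get<1>(t));
}

inline Kargs MakeKargs(std::tuple<void*, void*> t)
{
    return MakeKargs(static_cast<const void*>(std::get<0>(t)),
                     static_cast<const void*>(std::get<1>(t)));
}
```
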
* Re-format files using ck_tile remod.py
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Generate group mode paged-attn kernel
* Enable paged-kvcache + group mode support
* Add missing header: fused_moe.hpp
* Add comment to explain kernel arg usage
* Make error message more clear
* Add comment for confusing data member names
* Add more comment for confusing variable names
* Fix typo in option description
* [CK_TILE] add more strides for layernorm to support non-contiguous tensors
* align with CK coding style
* extend strides to layernorm example
* clang-format...
* optimize small N case using vector IO and rcp div
* [Ck_tile] layernorm, add param to control fastdiv; update generated code, tests pass
* [Ck_tile] fix blockSize compute in Generic2dBlockShape
* [Ck_tile]fix kfastfdiv template style
* [Ck_tile] layernorm, fix style issues from review
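The "rcp div" optimization mentioned above can be sketched as below: instead of dividing every element by the row length, compute one reciprocal up front and multiply. The function and names here are illustrative, not CK-tile's actual layernorm code.

```cpp
#include <cmath>

// Center a row of length n around its mean, using one reciprocal
// (the "rcp") instead of n divisions. On GPUs the reciprocal maps to a
// fast hardware instruction, which is the point of the optimization.
inline void subtract_mean(float* x, int n)
{
    float sum = 0.f;
    for (int i = 0; i < n; ++i)
        sum += x[i];
    const float rcp_n = 1.f / static_cast<float>(n); // computed once
    const float mean  = sum * rcp_n;                 // multiply, not divide
    for (int i = 0; i < n; ++i)
        x[i] -= mean;
}
```
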
---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
* CK-Tile GEMM with memory bound pipeline.
* Memory bound gemm pipeline.
* Fix unclosed namespace.
* Block gemm mem pipeline draft.
* Do not use ck_tile:: within ck_tile namespace.
* Refactoring & Move Layout info to pipeline problem.
* Get hot loop and TailNum information before launching kernel.
* Fixes in pipeline.
* Add comment to load_tile_raw and change variable naming style.
* Few small changes & formatting.
* Do not use macro.
* Add gtests.
* Use AccDataType for Output of MFMA instruction.
* Formatting.
* Refactor gemm examples.
* Switch over to current block gemm.
* Use currently available pipeline policy.
* Refactoring and review comments.
* Fixes after merge.
* Add missing include.
* Add load tile overload which accepts output tensor as parameter.
* This gives an 8% perf boost at the cost of using more registers.
* Rename example.
* Small changes.
* Fix compilation error and lower K.
* Support different layouts for A/B
* Fix vector size for different layouts.
* Rename Alignment to VectorSize
* Unblock tests.
* Add reduce2d new api
* Prevent users from using cross-warp reduction
* Fix bug in std calculation
* Add rmsnorm2d
* Add rmsnorm small example
* Remove static assert to prevent compile fail
* Add script to test performance and correctness
* Add missing cmake change
* refine naming
* refine example of rmsnorm
* Fix bug in rmsnorm
* Refine naming
* Fix cmake
* clang format
* Refine pipeline name
* Add add_rmsnorm2d_rdquant kernel
* Add reduce op
* host verification
* Fix bug in one-pass pipeline
* Refine tile size
* Add two pass pipeline
* Rename two pass to three pass
* Fix bug when kSaveX == false
* Add instance library
* Add test script
* Fix bug in x verification
* Add save_x to trait
* Add README
* Move reduce2d into reduce folder
* Fix Welford bug when the number of M warps > 1
* remove redundant comment
* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant
* clang format and add missing header
* Add host validation of add + layernorm2d + rsquant
* Revert "Add host validation of add + layernorm2d + rsquant"
This reverts commit 936cb45797.
* Remove deprecated flag
* Add ceil_to_qualified_tile_length()
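A minimal sketch of what `ceil_to_qualified_tile_length()` plausibly does, assuming it rounds a requested length (e.g. head dim 96) up to the smallest supported tile length; the list of qualified lengths here is an illustrative assumption, not CK-tile's actual set.

```cpp
#include <cstdint>

// Round `len` up to the smallest "qualified" (supported) tile length.
// The candidate list below is hypothetical, chosen only to illustrate
// how headdim 96 would map to its own tile length instead of 128.
inline int32_t ceil_to_qualified_tile_length(int32_t len)
{
    constexpr int32_t qualified[] = {32, 64, 96, 128, 256};
    for (int32_t q : qualified)
        if (len <= q)
            return q;
    return -1; // no qualified tile length large enough
}
```
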
* Rename kK0BlockLength to kQKHeaddim
* Add kSubQKHeaddim concept to support headdim96
* Fix in math.hpp to avoid using __half interfaces
* Add LdsBufferSequence instance for headdim96
* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing
* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time
* Reformat one file
* Fix text alignment in fmha_fwd_splitkv.py
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Use pre-defined constants for readability
* Use vector write for o_acc tensor
* Remove no-longer used policy method
* Deprecate no-longer used policy/pipeline
* Specify gemm0/gemm1 block warps separately in codegen
* Fix wrong ps_idx creation logic
* Add single-warp block gemm
* Support single-warp gemm0
* Make MakeCBlockTile() as static method
* Use MakeCBlockTile() to get underlying tile distribution
* Use kNumGemm1Warps to compute # threads for gemm1
* Put normal case in the if clause
* Refine fmha splitkv block mapping
* Refine & fix the lse_acc/o_acc layout
* Fix wrong LDS size for K tile
* Use kK0=64 for hdim=128,256 fmha splitkv kernels
* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels
* Undo kK0/kK1 changes
* Use more reasonable GetAlignmentV() computation
* Using store_tile() in fmha splitkv kernel epilogue
* port layernorm
* change warp_welford.hpp
* Update warpshuffle
* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()
* refine welford max count calculation
* unify layernorm api
* Rename file
* Remove save mean and inv std
* Revert "refine welford max count calculation"
This reverts commit 022365802b.
* Fix order of parameter
* refine welford max count calculation again
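The Welford refinements above concern running mean/variance updates and merging partial results (e.g. across warps). A generic sketch of the technique, with illustrative names rather than CK-tile's actual API:

```cpp
#include <cmath>

// Welford's online algorithm: push() updates mean/variance one sample at a
// time; merge() combines two partial results (Chan et al. pairwise update),
// which is what a cross-warp/cross-thread reduction needs.
struct Welford {
    double mean  = 0.0;
    double m2    = 0.0; // running sum of squared deviations
    int    count = 0;

    void push(double x)
    {
        ++count;
        double delta = x - mean;
        mean += delta / count;
        m2   += delta * (x - mean);
    }

    void merge(const Welford& o)
    {
        if (o.count == 0)
            return;
        double delta = o.mean - mean;
        int n = count + o.count;
        mean += delta * o.count / n;
        m2   += o.m2 + delta * delta * static_cast<double>(count) * o.count / n;
        count = n;
    }

    double variance() const { return count ? m2 / count : 0.0; }
};
```
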
* Remove fp32 instances
* Fix bug in padding
* refactor api
* Support bf16
* Extract common function
* Refine arg of operator()
* Add kMThreadPerBlock to template parameter
* clang format
* Refine variable name
* Refine file name
* remove redundant line
* refactor layernorm2d pipeline and add block-per-block utility
* fix name
* rename more
* add more block-per-tile instance
* remove duplicated define
* update instance for 2048, 1024 case
* support up to 2048 now
* opt loading
* add n1536
* Add two pass pipeline
* format
* Fix incorrect type
* parallel compilation
* Use smaller N
* fix two-pass pipeline
* Support Repeat_M in distribution
* Refine naming
* Add reduce example
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96
* Remove the use of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor
* Fix in bwd_pipeline_default_policy
* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines
* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim
* Update to hd96 tile settings
* Add smoke test scripts for fmha-bwd hd96
* Revert "Add smoke test scripts for fmha-bwd hd96"
This reverts commit 7ca7e1a93d.
* Remove hd96 tile settings in fmha_bwd codegen to save compiling
* Fix lost code line in bwd_pipeline_default_policy
* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits
* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor
* tiny adjustments
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>
* Make the cshuffle compilable
* Modify the reference on GPU and CPU; correct access of cshuffle
* fix the cpu reference code
* Complete the in tile shuffle logic
* restructure the kernel template input
* change the naming pattern of ck_tile gemm pipeline
* Re-format files using remod.py
* Solve the fmha conflict with gemm
* Address review comments from Carlus
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
* Simplify the code in splitkv_combine pipeline
* Always set kPadSeqLenK=true for fmha splitkv kernels
* Change Oacc alignment and TileDistribution to better adapt to tile sizes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Fix compile error
* Add one pass pipeline
* Extract creating tile_window to operator()
* clang format
* reduce duplicated code
* do not hardcode
* Support padding in layernorm
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Add seed and offset pointer support to the Philox random number generator.
* Separate seed and offset pointer checks into different condition statements.
* Add support for device seed and offset pointers; a union stores either the seed/offset values or the device pointers, to minimize device SGPR usage.
* Correct a typo in the README file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
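The templated-union idea above can be sketched as below. The names are illustrative, not CK-tile's actual types: the point is that an immediate value and a device pointer share storage, so the kernel-argument footprint (and thus SGPR usage) is the same either way.

```cpp
#include <cstdint>

// One union template covers both forms of an argument: a host-provided
// immediate, or a device pointer the kernel dereferences at runtime.
// Both members occupy the same storage, so kargs stay the same size.
template <typename T>
union ValueOrPointer {
    T value;      // host-side immediate
    const T* ptr; // device pointer to the value
};

struct DropoutKargs {
    ValueOrPointer<uint64_t> drop_seed;
    ValueOrPointer<uint64_t> drop_offset;
    bool is_device_ptr; // selects which union member is active
};

// Kernel-side read: pick the active member based on the flag.
inline uint64_t read(const ValueOrPointer<uint64_t>& v, bool is_device_ptr)
{
    return is_device_ptr ? *v.ptr : v.value;
}
```
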
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Use same layout for o_acc and o tensor
* Use better param names in partitioner
* Remove redundant karg 'max_seqlen_q'
* Use better param names in splitkv kernel
* Add comment for additional kernel arguments
* Sync empty loop early return logics between pipelines
* Pass more arguments to cmake in scripts
* Align backslashes
* Fix wrong o_acc tensor view strides
* Change o_acc layout if o_perm=0
* Handle fully masked rows via attn_bias
* Use vector width = 1 for o_acc
* Use more even split sizes
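The "more even split sizes" idea can be sketched as below: distribute a total tile count across `num_splits` so sizes differ by at most one, instead of giving every split a ceil-sized chunk and leaving the last one short. Names are illustrative, not CK-tile's actual splitkv code.

```cpp
// Number of tiles assigned to split `split_idx` when `total_tiles` are
// distributed across `num_splits`: the first `total % num_splits` splits
// get one extra tile, so no split is more than one tile larger than another.
inline int split_size(int total_tiles, int num_splits, int split_idx)
{
    int base = total_tiles / num_splits;
    int rem  = total_tiles % num_splits;
    return base + (split_idx < rem ? 1 : 0);
}
```

For example, 10 tiles over 4 splits yields sizes 3, 3, 2, 2 rather than 3, 3, 3, 1.
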