* Update license year
* Add initial code to override decode problem
* Fix splitkv traits/args overriding error
* Reshape and transpose lse for decode
* Remove debug code
* Prettify example code
* Use better function name
* Add kMergeNumHeadGroupsSeqLenQ flag
Kernel user can use this switch to turn on/off optimization for
some problem sizes
* Add missing flag declarations
* Default turn off kMergeNumHeadGroupsSeqLenQ in codegen
* Group similar statements together
* Remove assumption of seqlen_q=1
* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel
* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel
* Run kMergeNumHeadGroupsSeqLenQ=true kernels when need
* Fix group mode block skip logics
* Undo changes of normal fwd kernel
* Update in GridSize() and using GridSize() for splitkv kernel (#1799)
---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
* add no welford
* enable output raw
* raw of int8
* fix build
* fix smoke test err
* [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate
* [cktile]layernorm, fix err commit files and remove uselss
* fix quant 8192 err & change norm_reduce class and file name
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>