* Fix text alignment of ArgParser::print()
* Update example README files
* Clarify make-ck-dev.sh <arch> usage
* Only keep some of the argument from '-?' output
* Undo command line output changes in README
* Only keep existing argument on doc and update description
* Fix text alignment
* Make cmake-ck-*.sh compatible with 'sh' command
* Simplify the codes in splitkv_combine pipeline
* Always set kPadSeqLenK=true for fmha splitkv kernels
* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Fix compile error
* Add one pass pipeline
* Extract creating tile_window to operator()
* clang format
* reduce duplicated code
* do not hardcode
* Support padding in layernorm
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Adding seed and offset pointer support to the philox random number generator.
* Separating seed and offset pointer checks with different condition statements.
* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.
* Correcting a typo in the readme file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Without this change, the following diagnostic is generated:
a template argument list is expected after a name prefixed by the template
keyword [-Wmissing-template-arg-list-after-template-kw]
See C++17 spec [temp.names] p5.
* Use same layout for o_acc and o tensor
* Use better param names in partitioner
* Remove redundant kargs 'max_seqlen_q'
* Use better param names in splitkv kernel
* Add comment for additional kernel arguments
* Sync empty loop early return logics between pipelines
* Pass more arguments to cmake in scripts
* Align backslashes
* Fix wrong o_acc tensor view strides
* Change o_acc layout if o_perm=0
* Handle whole row masked via attn_bias
* Use use vector width = 1 for o_acc
* Use more even split sizes
* Finished the feature of gpu verification
* Add the ck_tile_gemm test in the CI CD
* add the include of tensor_layou in reference_gemm
* Comment Addressed
* split ck_tile fhma and gemm tests into separate stages
* restructure the reference gemm
* restructure a new reference_gemm api that could read the device mem
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Legacy support: customized filesystem
* Update cmakefile for python alternative path
* fix build issues
* CK has no boost dependency
* More fixes to issues found on legay systems
* fix clang format issue
* Check if blob is correctly generated in cmake
* fix the python issues
* add a compiler flag for codegen when using alternative python
* use target_link_options instead of target_compile_options
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>