* Use pre-defined constants for readability
* Use vector write for o_acc tensor
* Remove no-longer used policy method
* Deprecate no-longer used policy/pipeline
* Specify gemm0/gemm1 block warps separately in codegen
* Fix wrong ps_idx creation logic
* Add single-warp block gemm
* Supoprt single-warp gemm0
* Make MakeCBlockTile() as static method
* Use MakeCBlockTile() to get underlying tile distribution
* Use kNumGemm1Warps to compute # threads for gemm1
* Put normal case in the if clause
* Refine fmha splitkv block mapping
* Refine & fix the lse_acc/o_acc layout
* Fix wrong LDS size for K tile
* Use kK0=64 for hdim=128,256 fmha splitkv kernels
* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels
* Undo kK0/kK1 changes
* Use more reasonable GetAlignmentV() computation
* Using store_tile() in fmha splitkv kernel epilogue
* Calculate generic relative threshold pool3dfwd
* Calculate absolute error threshold pool3d fwd
* Generic threshold calculation take max input for relative error pool3dfwd
* Remove max possible value for error calculation at runtime
* Remove debug print in pool3dfwd
* Pool3d fwd adjusted types in generic threshold calculation
* Generic threshold calculation take into account number of accumulations and accdatatype
* Generic threshold fix final error formula
* Generic threshold calculation - num of accs fix
* Generic threshold calculation - adjust absolute error
* Generic threshold calculation - OutDataType in absolute error
* port layernorm
* change warp_welford.hpp
* Update warpshuffle
* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()
* refine welford max count calculation
* unify layernorm api
* Rename file
* Remove save mean and inv std
* Revert "refine welford max count calculation"
This reverts commit 022365802b.
* Fix order of parameter
* refine welford max count calculation again
* Remove fp32 instances
* Fix bug of padding
* refactor api
* Support bf16
* Extract common function
* Refine arg of operator()
* Add kMThreadPerBlock to template parameter
* clang format
* Refine variable name
* Refine file name
* remove redundant line
* refactor layernorm2d pipeline and add block-per-block utility
* fix name
* rename more
* add more block-per-tile instance
* remove duplicated define
* update instance for 2048, 1024 case
* support up to 2048 now
* opt loading
* add n1536
* Add two pass pipeline
* format
* Fix incorrect type
* parallel compilation
* Use smaller N
* fix 2p pass
* Support Repeat_M in distribution
* Refine nameing
* Add reduce example
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
* The draft on ckProfiler instance add
* support the ck profiler instance with same data types
* add a small feature on the M and N variable switch.
* Partially solve the incorrect result problem
* fix based on ci cd
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96
* Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor
* Fix in bwd_piple_default_policy
* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines
* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim
* Update to hd96 tile settings
* Add smoke test scripts for fmha-bwd hd96
* Revert "Add smoke test scripts for fmha-bwd hd96"
This reverts commit 7ca7e1a93d.
* Remove hd96 tile settings in fmha_bwd codegen to save compiling
* Fix lost code line in bwd_pipeline_default_policy
* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits
* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor
* tiny adjustments
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>
* Build codegen as standalone
* Add exception for device tests
* Use local filesystem header
* add a codegen test CI stage and daily build
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Add non_native_vector_type
* Add a test
* Add non-native vector type
* Fix CTOR
* Fix non-native vector type of 1
* Fix CTORs
* Use vector_type to cover non-native implementation as well
* Update the test
* Format
* Format
* Fix copyright years
* Remove BoolVecT so far
* Add AsType test cases
* Update assert error message
* Remove redundant type
* Update naming
* Add complex half type with tests
* Add tests for vector reshaping
* Add missing alignas
* Update test/data_type/test_custom_type.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Compare custom types to built-in types
* Add default constructor test
* Add an alignment test
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* ake the cshuffle compilable
* modify Mhe reference on gpu and cpu. Correaccess of cshuffle
* fix the cpu reference code
* Complete the in tile shuffle logic
* restructure the kernel template input
* change the naming pattern of ck_tile gemm pipeline
* Re-format files using remod.py
* Solve the fmha conflict with gemm
* Comment Addressed from Carlus
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
* Add a gpu gemm reference kernel
* Switch to gpu reference in gemm examples
* Remove redundant arguments
* Update all related examples
* Update more examples
* Try less threads per block
* Try even less threads per block
* Add support for all matrix layouts
* Increase block size
* Clean up
* Remove hardcoded strides
* Clean up
* Try a column-major case
* Revert back to row-major
* Run both CPU and GPU veriffication
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Fix text alignment of ArgParser::print()
* Update example README files
* Clarify make-ck-dev.sh <arch> usage
* Only keep some of the argument from '-?' output
* Undo command line output changes in README
* Only keep existing argument on doc and update description
* Fix text alignment
* Make cmake-ck-*.sh compatible with 'sh' command
* Simplify the codes in splitkv_combine pipeline
* Always set kPadSeqLenK=true for fmha splitkv kernels
* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Fix compile error
* Add one pass pipeline
* Extract creating tile_window to operator()
* clang format
* reduce duplicated code
* do not hardcode
* Support padding in layernorm
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Adding seed and offset pointer support to the philox random number generator.
* Separating seed and offset pointer checks with different condition statements.
* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.
* Correcting a typo in the readme file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Without this change, the following diagnostic is generated:
a template argument list is expected after a name prefixed by the template
keyword [-Wmissing-template-arg-list-after-template-kw]
See C++17 spec [temp.names] p5.
* Use same layout for o_acc and o tensor
* Use better param names in partitioner
* Remove redundant kargs 'max_seqlen_q'
* Use better param names in splitkv kernel
* Add comment for additional kernel arguments
* Sync empty loop early return logics between pipelines
* Pass more arguments to cmake in scripts
* Align backslashes
* Fix wrong o_acc tensor view strides
* Change o_acc layout if o_perm=0
* Handle whole row masked via attn_bias
* Use use vector width = 1 for o_acc
* Use more even split sizes