* fix bug in loops that need use local tokens to compute
* support extra chain local_token
* update
* update
* refine some main
* update
* support dispatch_policy
* fix 15 example
* Initial commit
* Adding new tile partitioner to flatmm
* intermediate changes
* debugging kernels
* Updating flatmm example to universal gemm example
* updated flatmm kernel to run via gemmKernel
* update universal gemm to incorporate flatmm
* debug
* Fix flatmm call
* Fixing other kernels and tests for API changes
* clang formatted
* fixing gemm tests
* added test for flatmm and simplify kernel arguments
* adding flatmm test
* fix test for flatmm
* simplify gemm kernel with flatmm
* remove flatmm related files
* addressing review comments and code clean up
* resolving empty file
* resolving empty file
* clang formatted
* addressing review comments
* enable persistent kernel for flatmm
* reverted the removed files for flatmm
* reverted the removed files for flatmm
* changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example
* some more renames
* clang formatted
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* i_nhead_ conversion type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com>
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
* add for async load builtin
* add async load api
* fix some compiling errors
* fix a compiling error
* fix some compiling errors
* add a pipeline which copies from v4
* add a new pipeline for async load
* fix some compiling errors
* add async load tests
* fix some issues in async load
* fix
* fix async inline assembly
* fix async inline assembly
* add ignore header file
* comment some not gfx950 codes
* comment some not gfx950 codes
* fix a error
* update async load apis
* fix lds descriptor
* fix a compiling error
* fix some compiling errors
* fix a descriptor issue
* update lds descriptor
* change async pipeline's tile distribution pattern from thread to warp
* fix clang format
* update async policy
* fix a CRTP issue
* fix a typo error
* change lds layout
* fix some sync issues
* improve codes
* delete the async test
* fix a commented format issue
* avoid compiling device functions when compile host
* make gemm run
* add the copy kernel support
* finish the feature
* Address comment
* add the support for buffer_builtin
* solved the merging problem
* Comment Addressed
---------
Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>
* add prefetching physical block id for pagedkv
* start add pagedkv prefill
* rename pipeline
* add kernel for pagedkv
* add an init version pagedkv prefill
* fix redefine issue
* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args
* generate dispatch code
* add body generating code
* comipling pass
* remove dropout from pagedkv
* set lse to false in generating code
* start changing qr kernel to pagedkv
* init version of kernerl with pagedkv
* change names of file that are generated
* chang host validation for pagedkv prefill
* using iglp to change blockgemm
* add kernel files to op head file
* show parameters
* rewrite print parameter fun
* add fwd
* remove default parameter of GridSize
* format
* fix nhead issue and add seqlen_k_ptr to batch mode
* format code
* remove no-longer used code
* format
* fix some comments
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* updates to support int8 in 03_gemm example
* added comments, using aliases, helper functions
* test(gemm_universal): add test cases for int8 gemm pipeline
* fix(test_gemm): fix for failing test unit test for int8
* test(ck_tile): add int8 unit test for gemm universal
* refactor(gemm_universal): GPU reference verification for GEMM code improved
* style(gemm_universal): removed extra comments and did clang format
* merging recent changes to universal gemm to tile_engine
* ck tile engine integration work
* feat(tile_engine): add int8 support to tile engine ops/gemm
* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8
* style: Format code with clang-format-12
* refactor(tile_engine): address review comments
* style: removed unhelpful comments & unused variables.
* build: tile engine uses default config
* feat: add int8 support for CK_TILE GEMM
* style: added trailing commas to codegen_utils.py
* refactor: tile engine
* refactor: formatting and code review
* refactor: code formatting for python files
* fix: suppress build warning
* add support for gfx950
* refactor:KWarpTile size in gemms util
* Fix the branch and wrap up the k warp tile
* Add bf8 integration
* refactor: clang format and rebase
---------
Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
* [CK_TILE] Refine fp8 in flatmm
1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp defintion to a separate struct
* fix clang format
* fix clang format
* keep default bhavior unchanged (warp tile = 16x16)
* fix tile engine build error
* fix a typo in codegen_utils.py
* address review comments
* address review comments
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>