Dan Yao
79a5d9c10c
[CK_TILE] FA bwd kernels optimization ( #1397 )
...
* tmp save
* fix batch deterministic bugs
* fix group deterministic bugs
* codegen update
* reorder files
* bias support
* hd256 bias support
* bwd smoke test update
* simplify convert dq
* fix hd256 dropout scratch
* do{}while() -> while(){}
* comments
* remove FmhaBwdTilePartitioner
* save clear_tile
* refactor dropout
* code cleanup
* code cleanup
* comments
* fix epilogue problem
* fix fwd dropout
* group convert_dq opt
* fix dq alignment
* Do not store storerandval in bwd for flash attention integration
* fix hd32 error and boost performance
* revert
* Remove duplicated WarpGemm definitions in the policy file
* dropout patch for mrepeat 16*16
* code sync up
* dq_acc stride
* dq_acc stride stuff
* codegen update
* fwd dropout revert
* fix hd128 scratches and boost performance
* receipt 3 for simplified smoke test
* more strides for fa integration
* fix hd64 scratches and boost performance
* non-iglp pipeline for headdim padding cases
* dpad same as dvpad for flash attention integration
* unpadded lse&d for group mode
* Support unpad layout for group lse
* Support unpad lse layout for splitkv
* Fix stride for splitkv kernel
* fix unpadded lse issue in fwd splitkv
* comment
* solve lds read&write conflicts
* rename
* bias rename
* tile index revert
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: rocking <ChunYu.Lai@amd.com >
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com >
2024-08-16 13:40:10 -07:00
carlushuang
b3f86e79dd
workaround rocm-6.2 compiler issue ( #1421 )
2024-07-31 16:03:59 +08:00
carlushuang
8182976c37
[CK_TILE] wa prec, remove sgpr offset for inline asm ( #1356 )
...
* wa prec, remove sgpr offset for inline asm
* macro for set tile
* ignore unused param if no kernel instances in host API
* fix more prec issue
* cache buffer resource
* fix
* support pre-nop
* clear tile by vector type members
* add workaround to reduce scratch memory
* conditionally enable workaround code
* enable workaround start from certain build version
* fallback set_tile() implementation from certain build version
* undo template argument changes
* put dummy asm in load_raw()
* fix comments, refactor s_nop inside buffer_load
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com >
2024-07-08 11:09:55 -07:00
Po Yen Chen
0cb2e06ddc
[CK_TILE] fmha forward split-kv + combine kernels ( #1338 )
...
* FA fwd dropout
* FA bwd
* epilogue reuse
* CMakeLists update
* [CK_TILE] support alibi (#1269 )
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
* now fwd/bwd can build
* bwd alibi
* add bwd validation stream_config
* update generated filenames
* update bwd kernel launch
* CK_TILE_HOST_DEVICE in philox
* Transpose -> transpose
* format
* format
* format
* Generate the instance for FA required
* format
* fix error in WarpGemm
* Add num_splits option and dummy split-kv api method
* Generate fmha_fwd_splitkv()
* Add SplitKV kernel codegen logics
* Add SplitKV combine kernel codegen logics
* Fix mismatched return type
* Clean-up code
* Replace sentinel value before storing
* Fix wrong layout of LSE/LSEacc/Oacc
* Format codes
* Fix o_acc memory error
* Fix wrong kBlockSize used in policy
* Reduce # of combine kernels
* Fix split-kv combine kernel name
* Fix wrong LDS indexing logics
* Fix wrong loop counter step logic
* Undo vector size changes
* Remove no-longer used field
* Remove in-consistent comment
* Remove debug statements in example
* Remove more debug statements
* Add constness to local variables
* Clearn up generate.py
* Fix unstable clang-format comment
* Remove unused include directive
* Use shorter template parameter name
* Enable non-split-kv blobs
* Update license date
* Print num_splits conditionally
* Undo disabling data types
* Remove unnessary tile size for fp8
* Fix wrong pipeline args for fp8
* Fix example output format
* Remove more debug code in combine pipeline
* Add stride kernel arguments for LSE/O acc workspace
* Re-order split-kv pipeline call operator arguments
* Pass LSE/O strides in kernel argument
* Re-order pipeline call operator arguments
* Use tensor_descriptor to locate LSEacc elements
* Support providing invalid element for tensor view
* Set invalid element value for LSEacc tensor view
* Remove hand-written store_tile() code
* Remove necessary value-overwrite logic
* Add transposed lds descriptor
* Support load_tile() for tile_window_with_static_lengths<>
* Undo removing necessary value-overwrite logic
* Use read descriptor to locate lds elements
* Simplify pipeline source code
* Add constraint to kMaxSplits
* Default use kMaxSplits=64 in generate.py
* Revert "Add constraint to kMaxSplits"
This reverts commit 0a2132d758 .
* Revert "Default use kMaxSplits=64 in generate.py"
This reverts commit c7d9c80b77 .
* Decide alignment by the padding parameter
* Remove no-longer used utility functions
* Remove not-working code
* Add comment & remove no-longer used code
* Fix computation errors
* Add heuristic to override num_splits option
* Add constraint to kMaxSplits
* Fix compilation error
* Clean up pipeline code
* Wrap pointer access as lambda function
* Rename confusing methods
* Use kLogMasSplits as template parameter
* Finish splitkv combine kernel codegen
* Update kMaxSplits limit
* Use smaller kM0 for splitkv combine kernel
* Ignore droupout flag in splitkv pipeline
* Unify flag usage
* Add back flag kStoreLSE
* Merge lambda calls in pipeline
* Fix compilation errors
* Avoid all empty splits
* Always check for empty loop in splitkv pipelines
* Re-order parameters
* Remove redundant p_drop option check
* Add traits/problem for fwd splitkv kernel
* Conditionally enable uneven split boundary checks
* Add comment for the splitkv traits field
* Change even split criteria
* Re-order statements
* Refine occupancy value for hdim=128&256
* Refine occupancy value for hdim=32&64
* Remove redundant kernel argument
* Separate fmha bwd codegen logics
* Separate fmha fwd codegen logics
* Remove redundant direction parameter in fwd&bwd codegen logics
* Support generate multiple APIs for an example
* Let 'api' an alias of 'direction' option
* Remove choices for the 'direction' option
* Use dictionary to config all the functions
* Move fmha splitkv codegen logics to other file
* Add fwd_splitkv api for tile_example_fmha_fwd
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com >
Co-authored-by: rocking <ChunYu.Lai@amd.com >
Co-authored-by: Jing Zhang <jizhan@amd.com >
2024-06-26 17:41:15 +08:00