PoYen, Chen
ca4b208b60
Fix wrong grid size
2024-07-23 14:20:52 +00:00
PoYen, Chen
52b47810bb
Rename more tile size constants
2024-07-23 09:30:05 +00:00
PoYen, Chen
99c1d463de
Align naming of some tile size constants
2024-07-23 09:24:38 +00:00
PoYen, Chen
ce5e0f1d67
Re-order parameters
2024-07-23 09:02:41 +00:00
PoYen, Chen
fb80c7b2cb
Extract rotary embedding logic out
2024-07-23 08:51:59 +00:00
PoYen, Chen
2192bbc68a
Rename RotaryEmbeddingEnum
2024-07-23 07:50:50 +00:00
PoYen, Chen
d4606cf3c3
Rename header
2024-07-23 07:45:25 +00:00
PoYen, Chen
b275732128
Remove always true static_assert()
2024-07-23 07:25:50 +00:00
PoYen, Chen
eb649a2f25
Move thread locating logics into policy
2024-07-23 07:21:20 +00:00
PoYen, Chen
0e5cb6f913
Skip code if # of block is more than needed
2024-07-23 06:53:24 +00:00
PoYen, Chen
0925c0e941
Use better naming for tile indices
2024-07-23 06:40:53 +00:00
PoYen, Chen
bc7c7ee0c5
Fix wrong knew/vnew appending positions
2024-07-23 04:46:53 +00:00
PoYen, Chen
56df4d6397
Remove debug print code in kernel
2024-07-23 04:01:55 +00:00
PoYen, Chen
48c70720b5
Apply RoPE to q_tile
2024-07-23 03:54:11 +00:00
PoYen, Chen
e88253a2f4
Add code blocks for q_tile
2024-07-23 03:28:40 +00:00
PoYen, Chen
1dbed18555
Remove constness from q_ptr
2024-07-23 03:11:31 +00:00
PoYen, Chen
c26c60db4c
Unify parameter/variable naming style
2024-07-23 02:59:17 +00:00
PoYen, Chen
c0bc097758
Apply elementwise function to the loaded tiles
2024-07-23 02:50:07 +00:00
PoYen, Chen
df352f955a
Add comment
2024-07-23 02:31:45 +00:00
PoYen, Chen
d1ecfdc700
Support 8x rotary_dim under half-rotated RoPE
2024-07-23 02:19:16 +00:00
PoYen, Chen
631f29d527
Handle RoPE half-rotated logics
2024-07-22 08:50:03 +00:00
PoYen, Chen
01865d2ae4
Clean-up pipeline
2024-07-22 03:14:10 +00:00
PoYen, Chen
fffd6799e6
Instantiate multiple kernels for RoPE approaches
2024-07-20 02:28:21 +00:00
PoYen, Chen
27b5141706
Fix wrong thread starting offset
2024-07-18 20:02:06 +00:00
PoYen, Chen
23450526c0
Only apply interleaved RoPE on Knew for now
2024-07-18 19:42:14 +00:00
PoYen, Chen
85bfed07fa
Add dram distribution for rotary_cos/rotary_sin (interleaved)
2024-07-18 09:11:22 +00:00
PoYen, Chen
99f863e4cd
Fix rotary cos/sin tensor/tile size
2024-07-16 06:31:17 +00:00
PoYen, Chen
b32fd8d3f4
Rename variables used in distribution encoding
2024-07-16 06:27:28 +00:00
PoYen, Chen
879710a495
Fix wrong seqlen_k for kvcache
2024-07-16 03:42:51 +00:00
PoYen, Chen
b0925bb7f6
Create Rotary Cos/Sin tile windows in kernel
2024-07-14 23:47:40 +00:00
PoYen, Chen
391210ed9e
Pass RoPE kernel args
2024-07-14 23:18:32 +00:00
PoYen, Chen
b34ddf5f71
Merge remote-tracking branch 'origin/feature/cond-add-splitkv' into feature/fmha-fwd-appendkv
2024-07-12 04:42:45 +00:00
PoYen, Chen
8c733fb3be
Fix compilation errors
2024-07-10 10:53:58 +00:00
carlushuang
8182976c37
[CK_TILE] wa prec, remove sgpr offset for inline asm (#1356)
* wa prec, remove sgpr offset for inline asm
* macro for set tile
* ignore unused param if no kernel instances in host API
* fix more prec issue
* cache buffer resource
* fix
* support pre-nop
* clear tile by vector type members
* add workaround to reduce scratch memory
* conditionally enable workaround code
* enable workaround start from certain build version
* fallback set_tile() implementation from certain build version
* undo template argument changes
* put dummy asm in load_raw()
* fix comments, refactor s_nop inside buffer_load
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-07-08 11:09:55 -07:00
PoYen, Chen
1c070380fa
Merge branch 'feature/cond-add-splitkv' into feature/fmha-fwd-appendkv
2024-07-08 07:13:34 +00:00
PoYen, Chen
34a3ff849f
Fix Vnew tile dstr for row major case
2024-06-27 09:55:35 +00:00
PoYen, Chen
c40c1daff0
Extract common logics
2024-06-26 18:02:28 +00:00
PoYen, Chen
8fb567c286
Fix vnew append error
2024-06-26 17:00:07 +00:00
Po Yen Chen
0cb2e06ddc
[CK_TILE] fmha forward split-kv + combine kernels (#1338)
* FA fwd dropout
* FA bwd
* epilogue reuse
* CMakeLists update
* [CK_TILE] support alibi (#1269)
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* now fwd/bwd can build
* bwd alibi
* add bwd validation stream_config
* update generated filenames
* update bwd kernel launch
* CK_TILE_HOST_DEVICE in philox
* Transpose -> transpose
* format
* format
* format
* Generate the instance for FA required
* format
* fix error in WarpGemm
* Add num_splits option and dummy split-kv api method
* Generate fmha_fwd_splitkv()
* Add SplitKV kernel codegen logics
* Add SplitKV combine kernel codegen logics
* Fix mismatched return type
* Clean-up code
* Replace sentinel value before storing
* Fix wrong layout of LSE/LSEacc/Oacc
* Format codes
* Fix o_acc memory error
* Fix wrong kBlockSize used in policy
* Reduce # of combine kernels
* Fix split-kv combine kernel name
* Fix wrong LDS indexing logics
* Fix wrong loop counter step logic
* Undo vector size changes
* Remove no-longer used field
* Remove inconsistent comment
* Remove debug statements in example
* Remove more debug statements
* Add constness to local variables
* Clean up generate.py
* Fix unstable clang-format comment
* Remove unused include directive
* Use shorter template parameter name
* Enable non-split-kv blobs
* Update license date
* Print num_splits conditionally
* Undo disabling data types
* Remove unnecessary tile size for fp8
* Fix wrong pipeline args for fp8
* Fix example output format
* Remove more debug code in combine pipeline
* Add stride kernel arguments for LSE/O acc workspace
* Re-order split-kv pipeline call operator arguments
* Pass LSE/O strides in kernel argument
* Re-order pipeline call operator arguments
* Use tensor_descriptor to locate LSEacc elements
* Support providing invalid element for tensor view
* Set invalid element value for LSEacc tensor view
* Remove hand-written store_tile() code
* Remove necessary value-overwrite logic
* Add transposed lds descriptor
* Support load_tile() for tile_window_with_static_lengths<>
* Undo removing necessary value-overwrite logic
* Use read descriptor to locate lds elements
* Simplify pipeline source code
* Add constraint to kMaxSplits
* Default use kMaxSplits=64 in generate.py
* Revert "Add constraint to kMaxSplits"
This reverts commit 0a2132d758.
* Revert "Default use kMaxSplits=64 in generate.py"
This reverts commit c7d9c80b77.
* Decide alignment by the padding parameter
* Remove no-longer used utility functions
* Remove not-working code
* Add comment & remove no-longer used code
* Fix computation errors
* Add heuristic to override num_splits option
* Add constraint to kMaxSplits
* Fix compilation error
* Clean up pipeline code
* Wrap pointer access as lambda function
* Rename confusing methods
* Use kLogMaxSplits as template parameter
* Finish splitkv combine kernel codegen
* Update kMaxSplits limit
* Use smaller kM0 for splitkv combine kernel
* Ignore dropout flag in splitkv pipeline
* Unify flag usage
* Add back flag kStoreLSE
* Merge lambda calls in pipeline
* Fix compilation errors
* Avoid all empty splits
* Always check for empty loop in splitkv pipelines
* Re-order parameters
* Remove redundant p_drop option check
* Add traits/problem for fwd splitkv kernel
* Conditionally enable uneven split boundary checks
* Add comment for the splitkv traits field
* Change even split criteria
* Re-order statements
* Refine occupancy value for hdim=128&256
* Refine occupancy value for hdim=32&64
* Remove redundant kernel argument
* Separate fmha bwd codegen logics
* Separate fmha fwd codegen logics
* Remove redundant direction parameter in fwd&bwd codegen logics
* Support generate multiple APIs for an example
* Let 'api' an alias of 'direction' option
* Remove choices for the 'direction' option
* Use dictionary to config all the functions
* Move fmha splitkv codegen logics to other file
* Add fwd_splitkv api for tile_example_fmha_fwd
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-06-26 17:41:15 +08:00
PoYen, Chen
4e6c28522c
Fix wrong K values after appending
2024-06-25 10:12:13 +00:00
PoYen, Chen
1ac17dae50
Add knew/vnew tensors to the kernel argument
2024-06-25 07:56:36 +00:00
PoYen, Chen
344902732a
Sync kernel name with the codegen
2024-06-24 14:50:25 +00:00
PoYen, Chen
bace0e5df0
Add init codegen logic for fmha fwd appendkv
2024-06-24 12:33:51 +00:00
rocking
cb13839425
layernorm2d forward (#1339)
* Add layernorm2d forward
* Refine file path
* clang format
* Exclude ck_tile op from all
* use add_executable instead
* refactor layernorm2d_fwd example
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-06-24 08:45:52 +08:00
Dan Yao
1da802bdf2
Fix FA bwd alibi+causal NaN errors (#1352)
* fix bwd alibi nan error
* fix datatype
---------
Co-authored-by: danyao12 <danyao12>
2024-06-20 09:50:53 -05:00
Qianfeng
e3f44659cf
Fix in dropout lambda to avoid the compiling issue on some docker/compiler envs (#1350)
2024-06-20 11:36:42 +08:00
Qianfeng
1973903f49
Hacking ck_tile fmha Dropout facility (#1344)
* Add NullBlockDropout to be used when kHasDropout is false
* Change to BlockDropout::Run() for forward to reduce conditional checks
* Re-format files
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-06-19 10:37:22 +08:00
carlushuang
17ed368f58
[CK_TILE][FA] using pk f16_f32 (#1343)
* [CK_TILE][FA] using pk f16_f32
* correct an error
2024-06-17 17:16:46 +08:00
Dan Yao
2cab8d39e3
CK Tile FA Training kernels (#1286)
* FA fwd dropout
* FA bwd
* epilogue reuse
* CMakeLists update
* [CK_TILE] support alibi (#1269)
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* now fwd/bwd can build
* bwd alibi
* add bwd validation stream_config
* update generated filenames
* update bwd kernel launch
* CK_TILE_HOST_DEVICE in philox
* Transpose -> transpose
* format
* format
* format
* Generate the instance for FA required
* format
* fix error in WarpGemm
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-06-04 13:12:45 -05:00
carlushuang
5055b3bdcb
[CK_TILE] support group from cmdline (#1295)
* support cmdline seqlen decode
* silent print
* update readme
* update kernel launch 3d
* update tile partitioner
* fix spill for bf16
* modify based on comment
* modify payload_t
* fix bug for alibi mode
* fix alibi test err
* refactor kernel launch, support select timer
* add missing file
* remove useless code
* add some comments
2024-05-28 11:13:21 +08:00