Commit Graph

56 Commits

Author SHA1 Message Date
PoYen, Chen
1dbed18555 Remove constness from q_ptr 2024-07-23 03:11:31 +00:00
PoYen, Chen
c26c60db4c Unify parameter/variable naming style 2024-07-23 02:59:17 +00:00
PoYen, Chen
c0bc097758 Apply elementwise function to the loaded tiles 2024-07-23 02:50:07 +00:00
PoYen, Chen
df352f955a Add comment 2024-07-23 02:31:45 +00:00
PoYen, Chen
d1ecfdc700 Support 8x rotary_dim under half-rotated RoPE 2024-07-23 02:19:16 +00:00
PoYen, Chen
631f29d527 Handle RoPE half-rotated logics 2024-07-22 08:50:03 +00:00
PoYen, Chen
1136e6b560 Fix error in RoPE host reference 2024-07-22 08:39:09 +00:00
PoYen, Chen
01865d2ae4 Clean-up pipeline 2024-07-22 03:14:10 +00:00
PoYen, Chen
fffd6799e6 Instantiate multiple kernels for RoPE approaches 2024-07-20 02:28:21 +00:00
PoYen, Chen
27b5141706 Fix wrong thread starting offset 2024-07-18 20:02:06 +00:00
PoYen, Chen
23450526c0 Only apply interleaved RoPE on Knew for now 2024-07-18 19:42:14 +00:00
PoYen, Chen
85bfed07fa Add dram distribution for rotary_cos/rotary_sin (interleaved) 2024-07-18 09:11:22 +00:00
PoYen, Chen
39ef09bb23 Remove unused inner namespace 2024-07-18 09:10:51 +00:00
PoYen, Chen
99f863e4cd Fix rotary cos/sin tensor/tile size 2024-07-16 06:31:17 +00:00
PoYen, Chen
b32fd8d3f4 Rename variables used in distributio encoding 2024-07-16 06:27:28 +00:00
PoYen, Chen
879710a495 Fix wrong seqlen_k for kvcache 2024-07-16 03:42:51 +00:00
PoYen, Chen
b0925bb7f6 Create Rotary Cos/Sin tile windows in kernel 2024-07-14 23:47:40 +00:00
PoYen, Chen
391210ed9e Pass RoPE kernel args 2024-07-14 23:18:32 +00:00
PoYen, Chen
44c9bacff7 Rename function: add "batched" prefix 2024-07-12 06:51:31 +00:00
PoYen, Chen
ff75eff3bf Reduce input/output dimensions 2024-07-12 06:49:43 +00:00
PoYen, Chen
b34ddf5f71 Merge remote-tracking branch 'origin/feature/cond-add-splitkv' into feature/fmha-fwd-appendkv 2024-07-12 04:42:45 +00:00
PoYen, Chen
ee365bbc66 Fix wrong answer when interleaved=true 2024-07-11 00:26:18 +00:00
PoYen, Chen
52da00acd6 Fix wrong answer when interleaved=false 2024-07-10 12:50:00 +00:00
PoYen, Chen
8c733fb3be Fix compilation errors 2024-07-10 10:53:58 +00:00
PoYen, Chen
03b6d99be0 Fix typo of HostTensor<>::get_length() 2024-07-10 09:33:15 +00:00
PoYen, Chen
9d29311da0 Finish reference_rotary_position_embedding() impl 2024-07-10 09:16:54 +00:00
PoYen, Chen
f2d28e8ab4 Add reference_rotary_position_embedding() (not implemented) 2024-07-09 05:22:08 +00:00
PoYen, Chen
2e164f1b79 Add length/stride getters for HostTensor 2024-07-09 05:20:04 +00:00
carlushuang
8182976c37 [CK_TILE] wa prec, remove sgpr offset for inline asm (#1356)
* wa prec, remove sgpr offset for inline asm

* macro for set tile

* ignore unused param if no kernel instances in host API

* fix more prec issue

* cache buffer resource

* fix

* support pre-nop

* clear tile by vector type members

* add workaround to reduce scratch memory

* conditionally enable workaround code

* enable workaround start from certain build version

* fallback set_tile() implementation from certain build version

* undo template argument changes

* put dummy asm in load_raw()

* fix comments, refactor s_nop inside buffer_load

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-07-08 11:09:55 -07:00
PoYen, Chen
1c070380fa Merge branch 'feature/cond-add-splitkv' into feature/fmha-fwd-appendkv 2024-07-08 07:13:34 +00:00
PoYen, Chen
34a3ff849f Fix Vnew tile dstr for row major case 2024-06-27 09:55:35 +00:00
Illia Silin
941d1f7ce0 Merging the gfx12 code into public repo. (#1362) 2024-06-27 00:33:34 -07:00
PoYen, Chen
c40c1daff0 Extract common logics 2024-06-26 18:02:28 +00:00
PoYen, Chen
8fb567c286 Fix vnew append errro 2024-06-26 17:00:07 +00:00
Po Yen Chen
a32b1bc647 Replace hipDeviceSynchronize() by hipStreamSynchronize(stream) calls (#1359) 2024-06-26 22:04:52 +08:00
Po Yen Chen
0cb2e06ddc [CK_TILE] fmha forward split-kv + combine kernels (#1338)
* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

* Add num_splits option and dummy split-kv api method

* Generate fmha_fwd_splitkv()

* Add SplitKV kernel codegen logics

* Add SplitKV combine kernel codegen logics

* Fix mismatched return type

* Clean-up code

* Replace sentinel value before storing

* Fix wrong layout of LSE/LSEacc/Oacc

* Format codes

* Fix o_acc memory error

* Fix wrong kBlockSize used in policy

* Reduce # of combine kernels

* Fix split-kv combine kernel name

* Fix wrong LDS indexing logics

* Fix wrong loop counter step logic

* Undo vector size changes

* Remove no-longer used field

* Remove in-consistent comment

* Remove debug statements in example

* Remove more debug statements

* Add constness to local variables

* Clearn up generate.py

* Fix unstable clang-format comment

* Remove unused include directive

* Use shorter template parameter name

* Enable non-split-kv blobs

* Update license date

* Print num_splits conditionally

* Undo disabling data types

* Remove unnessary tile size for fp8

* Fix wrong pipeline args for fp8

* Fix example output format

* Remove more debug code in combine pipeline

* Add stride kernel arguments for LSE/O acc workspace

* Re-order split-kv pipeline call operator arguments

* Pass LSE/O strides in kernel argument

* Re-order pipeline call operator arguments

* Use tensor_descriptor to locate LSEacc elements

* Support providing invalid element for tensor view

* Set invalid element value for LSEacc tensor view

* Remove hand-written store_tile() code

* Remove necessary value-overwrite logic

* Add transposed lds descriptor

* Support load_tile() for tile_window_with_static_lengths<>

* Undo removing necessary value-overwrite logic

* Use read descriptor to locate lds elements

* Simplify pipeline source code

* Add constraint to kMaxSplits

* Default use kMaxSplits=64 in generate.py

* Revert "Add constraint to kMaxSplits"

This reverts commit 0a2132d758.

* Revert "Default use kMaxSplits=64 in generate.py"

This reverts commit c7d9c80b77.

* Decide alignment by the padding parameter

* Remove no-longer used utility functions

* Remove not-working code

* Add comment & remove no-longer used code

* Fix computation errors

* Add heuristic to override num_splits option

* Add constraint to kMaxSplits

* Fix compilation error

* Clean up pipeline code

* Wrap pointer access as lambda function

* Rename confusing methods

* Use kLogMasSplits as template parameter

* Finish splitkv combine kernel codegen

* Update kMaxSplits limit

* Use smaller kM0 for splitkv combine kernel

* Ignore droupout flag in splitkv pipeline

* Unify flag usage

* Add back flag kStoreLSE

* Merge lambda calls in pipeline

* Fix compilation errors

* Avoid all empty splits

* Always check for empty loop in splitkv pipelines

* Re-order parameters

* Remove redundant p_drop option check

* Add traits/problem for fwd splitkv kernel

* Conditionally enable uneven split boundary checks

* Add comment for the splitkv traits field

* Change even split criteria

* Re-order statements

* Refine occupancy value for hdim=128&256

* Refine occupancy value for hdim=32&64

* Remove redundant kernel argument

* Separate fmha bwd codegen logics

* Separate fmha fwd codegen logics

* Remove redundant direction parameter in fwd&bwd codegen logics

* Support generate multiple APIs for an example

* Let 'api' an alias of 'direction' option

* Remove choices for the 'direction' option

* Use dictionary to config all the functions

* Move fmha splitkv codegen logics to other file

* Add fwd_splitkv api for tile_example_fmha_fwd

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-06-26 17:41:15 +08:00
PoYen, Chen
4e6c28522c Fix wrong K values after appending 2024-06-25 10:12:13 +00:00
PoYen, Chen
1ac17dae50 Add knew/vnew tensors to the kernel argument 2024-06-25 07:56:36 +00:00
PoYen, Chen
344902732a Sync kernel name with the codegen 2024-06-24 14:50:25 +00:00
PoYen, Chen
342c8cf01d Call HIP_CHECK_ERROR() macro to get real source info 2024-06-24 14:33:09 +00:00
PoYen, Chen
bace0e5df0 Add init codegen logic for fmha fwd appendkv 2024-06-24 12:33:51 +00:00
rocking
cb13839425 layernorm2d forward (#1339)
* Add layernorm2d forward

* Refind file path

* clang format

* Exclude ck_tile op from all

* use add_executable instead

* refactor layernorm2d_fwd example

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-06-24 08:45:52 +08:00
carlushuang
fa129c1a5d WA for rocm-6.2+ s constrait for buffer resource (#1346)
* WA for rocm-6.2+ s constrait for buffer resource

* add missing memory clobber
2024-06-21 11:00:13 -05:00
Dan Yao
1da802bdf2 Fix FA bwd alibi+causal NaN errors (#1352)
* fix bwd alibi nan error

* fix datatype

---------

Co-authored-by: danyao12 <danyao12>
2024-06-20 09:50:53 -05:00
Qianfeng
e3f44659cf Fix in dropout lambda to avoid the compiling issue on some docker/compiler envs (#1350) 2024-06-20 11:36:42 +08:00
Qianfeng
1973903f49 Hacking ck_tile fmha Dropout facility (#1344)
* Add NullBlockDropout to be used when kHasDropout is false

* Change to BlockDropout::Run() for forward to reduce conditional checkings

* Re-format files

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-06-19 10:37:22 +08:00
carlushuang
17ed368f58 [CK_TILE][FA] using pk f16_f32 (#1343)
* [CK_TILE][FA] using pk f16_f32

* correct a error
2024-06-17 17:16:46 +08:00
Qianfeng
37a347e380 Fix to the using of static_for in amd_buffer_addressing.hpp (#1337)
* Add insert_dummy_dep_per_dword over-loading for length 64

* Fix insert_dummy_dep_per_dword and remove over-loading for length 64

* Remove blank lines

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-06-13 16:12:20 +08:00
Dan Yao
2cab8d39e3 CK Tile FA Training kernels (#1286)
* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-06-04 13:12:45 -05:00
carlushuang
5055b3bdcb [CK_TILE] support group from cmdline (#1295)
* support cmdline seqlen decode

* silent print

* update readme

* update kernel launch 3d

* update tile partitioner

* fix spill for bf16

* modify based on comment

* modify payload_t

* fix bug for alibi mode

* fix alibi test err

* refactor kernel launch, support select timer

* add missing file

* remove useless code

* add some comments
2024-05-28 11:13:21 +08:00