746 Commits

Author SHA1 Message Date
Po Yen Chen
95e722a3b3 [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577)
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged
2024-10-21 10:52:11 +08:00
Qianfeng
14c3cfb1c6 [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573)
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96

* Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor

* Fix in bwd_piple_default_policy

* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines

* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim

* Update to hd96 tile settings

* Add smoke test scripts for fmha-bwd hd96

* Revert "Add smoke test scripts for fmha-bwd hd96"

This reverts commit 7ca7e1a93d.

* Remove hd96 tile settings in fmha_bwd codegen to save compiling

* Fix lost code line in bwd_pipeline_default_policy

* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits

* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor

* tiny adjustments

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>
2024-10-16 18:14:32 +08:00
Bartłomiej Kocot
d02a92cc0d [CK_TILE] Add block universal gemm pipeline policy (#1557)
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* fixeS
2024-10-15 13:53:41 +02:00
Po Yen Chen
9868fd0245 Apply ROCm 6.2 WA to ROCm 6.3 and later (#1563) 2024-10-15 18:02:41 +08:00
Thomas Ning
35c1777d59 decouple the calling from gemm_pipeline (#1571)
* decouple the calling from gemm_pipeline

* clang format
2024-10-14 13:59:26 +08:00
Thomas Ning
6f27bc9872 Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535)
* ake the cshuffle compilable

* modify Mhe reference on gpu and cpu. Correaccess of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-10-10 18:02:22 +08:00
Po Yen Chen
0c094daa7e [CK_TILE] Update example README files & fix script compatibility issue (#1548)
* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the argument from '-?' output

* Undo command line output changes in README

* Only keep existing argument on doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command
2024-10-08 10:45:12 +08:00
Qianfeng
74d68e3b99 [CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549)
* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-08 10:44:34 +08:00
Bartłomiej Kocot
cc8f466a7e [CK_TILE] Fix conv param multiple definition (#1550)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-07 15:21:21 +02:00
rocking
0023f01ab0 [Ck tile] Support layernorm one pass (#1512)
* Fix compile error

* Add one pass pipeline

* Extract creating tile_window to operator()

* clang format

* reduce duplicated code

* do not hardcode

* Support padding in layernorm

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-07 14:25:53 +08:00
kylasa
c24fae2346 Adding seed and offset pointer support to the philox random number generator. (#1523)
* Adding seed and offset pointer support to the philox random number generator.

* Separating seed and offset pointer checks with different condition statements.

* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.

* Correcting a typo in the readme file

* Re-format files using remod.py

* Use STL type for API parameters

* Use simpler struct design for drop_seed & drop_offset

* Undo unnecessary changes

* Sync kargs style for fmha_fwd.hpp/.cpp

* Use templated union to reduce code

* Use structured binding to make code more readable

---------

Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-05 02:48:47 +08:00
Illia Silin
8e4c3fb1bc [CK_TILE] add missing vector header (#1537)
* add missing vector header

* Re-format header using remod.py

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-10-01 07:58:20 -07:00
Po Yen Chen
a1c07e8d91 [CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527)
* Use same layout for o_acc and o tensor

* Use better param names in partitioner

* Remove redundant kargs 'max_seqlen_q'

* Use better param names in splitkv kernel

* Add comment for additional kernel arguments

* Sync empty loop early return logics between pipelines

* Pass more arguments to cmake in scripts

* Align backslashes

* Fix wrong o_acc tensor view strides

* Change o_acc layout if o_perm=0

* Handle whole row masked via attn_bias

* Use use vector width = 1 for o_acc

* Use more even split sizes
2024-10-01 22:13:52 +08:00
Bartłomiej Kocot
de3e3b6424 [CK_TILE] Image to Column kernel (#1532)
* [CK_TILE] Image to Column kernel

* Fixes

* Vector loads and stores

* Fixes

* Fixes

* change test dir name
2024-09-27 22:57:38 +02:00
Dan Yao
9d69a099a4 [CK_TILE] Fix compiler related FA bwd issues (#1530)
* add barriers

* tail bias barriers

* adjust bf16/hd256 tol

* continue adjust bf16/hd256 tol
2024-09-26 12:18:39 -07:00
Illia Silin
42e6dceacc Fix compilation errors with Clang20.0. (#1533)
* fix clang20 compilation errors for gfx90a

* fix clang20 compilation errors for gfx11 targets
2024-09-25 13:45:38 -07:00
Po Yen Chen
770d2b7725 Early return if seqlen_k=0 on group mode (#1524) 2024-09-22 20:05:58 +08:00
Thomas Ning
694c300145 Ck tile gemm padding dim (#1516)
* Support the N dimension padding

* Finished the padding feature for different dimension of K
2024-09-18 11:32:29 -07:00
Thomas Ning
844f5a1712 Ck tile GPU verification sample develop & Add the CK TILE GEMM to the CI/CD test (#1505)
* Finished the feature of gpu verification

* Add the ck_tile_gemm test in the CI CD

* add the include of tensor_layou in reference_gemm

* Comment Addressed

* split ck_tile fhma and gemm tests into separate stages

* restructure the reference gemm

* restructure a new reference_gemm api that could read the device mem

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-09-14 21:08:40 +08:00
Dan Yao
d09572e8c2 [CK_TILE] FA bwd repair (#1502)
* fix fa bwd

* revert kernelBlockSize in gemm_kernel.hpp
2024-09-10 10:45:32 -07:00
Thomas Ning
caacd38830 Ck tile gemm example (#1488)
* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout

* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.

* Fix: Clang Format, API fixed from fmha

* fix with better naming convention

* revert back the pipeline code of fmha

* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.

* clang format with the reference_gemm file

* convert the clang format with the remod.py

* Changed the format and variable name of the kernel gemm_shape and partitioner

---------

Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
2024-09-07 16:23:32 +08:00
Dan Yao
b8addae293 [CK_TILE] float -> bf16 inline asm rtn (#1482)
* asm rtn

* add asm rtn macro

* reorder macro

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-08-30 15:38:09 +08:00
Po Yen Chen
461ec98d78 Enable scratch memory workaround on ROCm 6.2 (#1486)
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-08-30 10:40:00 +08:00
Po Yen Chen
c156989298 [CK_TILE] Add PagedAttention kernels (#1387)
* Use dictionary to config all the functions

* Add init codegen logic for fmha fwd appendkv

* Call HIP_CHECK_ERROR() macro to get real source info

* Setup meaningfull arguments

* Sync kernel name with the codegen

* Add knew/vnew tensors to the kernel argument

* Fix wrong K values after appending

* Fix vnew append errro

* Extract common logics

* Fix Vnew tile dstr for row major case

* Conditionally add fwd_splitkv API in fmha_fwd example

* Conditionally add call to fmha_fwd_splitkv()

* Remove "EXAMPLE_" prefix of cmake variables

* Regsiter API handlers automatically

* Early return if 0 < s_k_new is not supported

* Show message if we are ignoring option

* Unify CMakeLists.txt coding style

* Set num_splits=1 if split-kv is not supported

* Add length/stride getters for HostTensor

* Add RoPE example utilities

* Add reference_rotary_position_embedding() (not implemented)

* Finish reference_rotary_position_embedding() impl

* Fix typo of HostTensor<>::get_length()

* Fix compilation errors

* Fix wrong answer when interleaved=false

* Fix wrong answer when interleaved=true

* Append K/V in the host verification code

* Simplify K appending logics

* Simplify v_host_ref definition

* Reduce input/output dimensions

* Rename function: add "batched" prefix

* Apply RoPE on host side

* Rename RoPE utility function

* Fix wrong tensor size

* Avoid invoking deprecated method 'find_module'

* Pass RoPE kernel args

* Create Rotary Cos/Sin tile windows in kernel

* Add compute data type alias for RoPE

* Randomly generate seqlen_knew if needed

* Fix seqlen_knew enabling check logic

* Add minimum seqlen_k to generate compliance kvcache

* Fix compilation error in debug mode

* Fix wrong boundaries

* Fix wrong seqlen_k for kvcache

* Rename variables used in distributio encoding

* Fix rotary cos/sin tensor/tile size

* Add constraint to the rotary_dim option

* Remove unused inner namespace

* Add dram distribution for rotary_cos/rotary_sin (interleaved)

* Only apply interleaved RoPE on Knew for now

* Fix wrong thread starting offset

* Instantiate multiple kernels for RoPE approaches

* Clean-up pipeline

* Fix error in RoPE host reference

* Handle RoPE half-rotated logics

* Support 8x rotary_dim under half-rotated RoPE

* Add comment

* Apply elementwise function to the loaded tiles

* Unify parameter/variable naming style

* Remove constness from q_ptr

* Add code blocks for q_tile

* Apply RoPE to q_tile

* Remove debug print code in kernel

* Fix wrong knew/vnew appending positions

* Use better naming for tile indices

* Add make_tile_window() for adding distribution only

* Skip code if # of block is more than needed

* Move thread locating logics into policy

* Remove always true static_assert()

* Rename header

* Rename RotaryEmbeddingEnum

* Extract rotary embedding logic out

* Re-order parameters

* Align naming of some tile size constants

* Rename more tile size constants

* Fix wrong grid size

* Fix wrong shape of knew_host/vnew_host

* Fix wrong index into knew_host/vnew_host

* Fix wrong rotary_cos/rotary_sin memory size for Q

* Extract Q/Knew vector size to helper methods

* Use different rotary_cos/rotary_sin distr for Q/Knew

* Update host/device specifiers

* Fix wrong data type for Q rotary_cos/rotary_sin

* Remove RoPEComputeDataType type alias

* Shift rotary_cos/rotary_sin by cache_seqlen_k

* Add comment for why I just 't' for all padding flags

* Align commit message to the real comment

* Fix wrong pipeline

* Rename utility function

* Disable host verification if API not exist

* Fix wrong rope key for fp8 pipeline

* Allow only apply RoPE on Q (without append KV)

* Add append-kv smoke tests

* Remove debug statements

* Remove more debug statements

* Re-arrange the 'set +x' command

* Remove no-longer used method in pipeline

* Add missing init code

* Refine pipeline padding settings

* Enlarge rotary_dim limit (8 -> 16)

* Enlarge KPerThread for rotary_interleaved=false

* Update rotary_dim range in smoke_test_fwd.sh

* Add template argument 'kIsPagedKV' for splitkv kernels

* Launch splitkv kernel if given page_block_size

* Fix wrong kernel name

* Fix seqlen_k_min for pre-fill case (1 -> 0)

* Add copy_const<> type trait

* Add another make_tile_window()

* Introduce 'TileWindowNavigator' types

* Simplify TileWindowNavigator interfaces

* Fix tile window navigation bugs

* Disable calling fmha_fwd()

* Remove ununnecessary data members

* Simplify more make_tile_window() overloads

* Move V tile through TileWindowNavigator

* Fix uneven split checking logic

* Move code after decide seqlen_q/seqlen_k

* Make sure we always start reading complete tile

* Use 128 as minimus page_block_size

* Fix wrong origin for bias

* Add batch_stride_k/batch_stride_v in group mode

* Unify origin

* Add missing kernel arguments for group mode

* Add paged-kv codegen logic for appendkv kernels

* Add block_table kernel args for appendkv kernel

* Add tile navigators to the appendkv kernel

* Fix wrong tensor descriptor lengths

* Pass re-created tile window to pipeline

* Fix wrong strides for appendkv kernel

* Allow transit tile_window to another page-block

* Handle cross-page-block write

* Donot perform write again if already in last page-block

* Always add fmha_fwd() api

* Add missing group mode argument

* Remove debug macro usages

* Rename option s_k_new to s_knew

* Separate splitkv/non-splitkv args/traits

* Remove fmha_fwd_dispatch()

* Fix compilation errors

* Remove dropout code in splitkv kernel

* Allow problem types without define kHasDropout attr

* Use generic lambda to init traits objects

* Separate more non-splitkv & splitkv traits/args

* Display more info for specific kernels

* Show more detailed warning message

* Rename 'max_num_blocks' to 'max_num_page_blocks'

* Remove no-longer used pipeline files

* Wrap code by #if directives

* Move functors to the begining of validation code

* Use generic lambda to init all the api traits/args

* Fix wrong seqlen for kvcache

* Add missing comment

* Rename TileWindowNavigator to PageBlockNavigator

* Only expose necessary methods (not attributes)

* Re-order pipeline paremeters

* Refine smoke_test_fwd.sh

* Fix wrong arugment count

* Make tile window directly via PageBlockNavigator

* Remove unused template paremeter

* Remove group mode from appendkv kernel

* Fix skcheck logic

* Fix wrong syntax in skcheck expr

* Use meaningful options in smoke test

* Remove options

* Fix formatting

* Fix more format

* Re-organize bash functions

* Pass cache_batch_idx to kernels

* Support cache_batch_idx in example

* Fix compilation error

* Add more appendkv test

* Add more case for appendkv

* Fix unexisted attribute

* Remove 0 < seqlen_knew constraint

* Clarify the case in warning message

* Remove macro checking

* Force batch mode when invoking appendkv & splitkv apis

* Fix mode overriding logics

* Fix wrong parameter name

* Randomize seqlen_k if use kvcache

* Use randomized seqlen_k for kvcache

* Avoid using too small rotary_cos & rotary_sin

* Rename parameter

* Add seqlen_q & seqlen_k rules

* Add comment

* Add more comments

* Fix compilation errors

* Fix typo in comment

* Remove type argument

* Avoid seqlen_k=0 for kvcache

* Revert "Avoid seqlen_k=0 for kvcache"

This reverts commit 21c4df89e4.

* Fix wrong uneven split checking logics

* Only randomize kvcache seqlen_k if 1 < batch

* Return earlier if split is empty

* Revert "Only randomize kvcache seqlen_k if 1 < batch"

This reverts commit b9a4ab0d7e.

* Re-order seqlen_k_start adjustment logics

* Fix compilation errors

* Re-format script

* Find executable from folder automatically

* Fix kvcache seqlen_k generating logic

* Make comment more clear

* Fix wrong knew/vew appending logic on host

* Add s_barrier to sync threads

* Revert "Add s_barrier to sync threads"

This reverts commit d3f550f30c.

* Support only using 1 row of rotary_cos/rotary_sin

* Rotate Q in different way

* Unify tensor view creation logics

* Fix wrong argument

* Add mask to switch how we use the rotary_cos/sin

* Move attr from traits to problem

* Move has_mask to fmha_fwd_appendkv_args

* Support use uint32_t as SAD operand in Alibi<>

* Use sad_u32() in splitkv kernels

* Store tensor views in PageBlockNavigator

* Use stored tensor view to update tile windows

* Enlarge tensor view size

* Remove debug code

* Fix wrong tensor view size

* Wrap tensor view into PageBlockNavigator

* Add DataType member to PageBlockNavigator

* Remove unnecessary member functions

* Refind macro use

* Fix typo

* Add blank line between directives and actual code

* Re-format files

* Remove type in comment

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
2024-08-28 20:50:43 +08:00
Dan Yao
79a5d9c10c [CK_TILE] FA bwd kernels optimization (#1397)
* tmp save

* fix batch deterministic bugs

* fix group deterministic bugs

* codegen update

* reorder files

* bias support

* hd256 bias support

* bwd smoke test update

* simplify convert dq

* fix hd256 dropout scratch

* do{}while() -> while(){}

* comments

* remove FmhaBwdTilePartitioner

* save clear_tile

* refactor dropout

* code cleanup

* code cleanup

* comments

* fix epilogue problem

* fix fwd dropout

* group convert_dq opt

* fix dq alignment

* Do not store storerandval in bwd for flash attention integration

* fix hd32 error and boost performance

* revert

* Remove duplicated WarpGemm definitions in the policy file

* dropout patch for mrepeat 16*16

* code sync up

* dq_acc stride

* dq_acc stride stuff

* codegen update

* fwd dropout revert

* fix hd128 scratches and boost performance

* receipt 3 for simplified smoke test

* more strides for fa integration

* fix hd64 scratches and boost performance

* non-iglp pipeline for headdim padding cases

* dpad same as dvpad for flash attention integration

* unpadded lse&d for group mode

* Support unpad layout for group lse

* Support unpad lse layout for splitkv

* Fix stride for splitkv kernel

* fix unpadded lse issue in fwd splitkv

* comment

* solve lds read&write conflicts

* rename

* bias rename

* tile index revert

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2024-08-16 13:40:10 -07:00
carlushuang
b3f86e79dd workaround rocm-6.2 compiler issue (#1421) 2024-07-31 16:03:59 +08:00
Qianfeng
ee768148f0 Replace the using of __expf by __ocml_exp_f32 to work-around the test_softmax_rank4 failure (#1394) 2024-07-17 09:15:05 -07:00
carlushuang
8182976c37 [CK_TILE] wa prec, remove sgpr offset for inline asm (#1356)
* wa prec, remove sgpr offset for inline asm

* macro for set tile

* ignore unused param if no kernel instances in host API

* fix more prec issue

* cache buffer resource

* fix

* support pre-nop

* clear tile by vector type members

* add workaround to reduce scratch memory

* conditionally enable workaround code

* enable workaround start from certain build version

* fallback set_tile() implementation from certain build version

* undo template argument changes

* put dummy asm in load_raw()

* fix comments, refactor s_nop inside buffer_load

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-07-08 11:09:55 -07:00
Illia Silin
941d1f7ce0 Merging the gfx12 code into public repo. (#1362) 2024-06-27 00:33:34 -07:00
Po Yen Chen
a32b1bc647 Replace hipDeviceSynchronize() by hipStreamSynchronize(stream) calls (#1359) 2024-06-26 22:04:52 +08:00
Po Yen Chen
0cb2e06ddc [CK_TILE] fmha forward split-kv + combine kernels (#1338)
* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

* Add num_splits option and dummy split-kv api method

* Generate fmha_fwd_splitkv()

* Add SplitKV kernel codegen logics

* Add SplitKV combine kernel codegen logics

* Fix mismatched return type

* Clean-up code

* Replace sentinel value before storing

* Fix wrong layout of LSE/LSEacc/Oacc

* Format codes

* Fix o_acc memory error

* Fix wrong kBlockSize used in policy

* Reduce # of combine kernels

* Fix split-kv combine kernel name

* Fix wrong LDS indexing logics

* Fix wrong loop counter step logic

* Undo vector size changes

* Remove no-longer used field

* Remove in-consistent comment

* Remove debug statements in example

* Remove more debug statements

* Add constness to local variables

* Clearn up generate.py

* Fix unstable clang-format comment

* Remove unused include directive

* Use shorter template parameter name

* Enable non-split-kv blobs

* Update license date

* Print num_splits conditionally

* Undo disabling data types

* Remove unnessary tile size for fp8

* Fix wrong pipeline args for fp8

* Fix example output format

* Remove more debug code in combine pipeline

* Add stride kernel arguments for LSE/O acc workspace

* Re-order split-kv pipeline call operator arguments

* Pass LSE/O strides in kernel argument

* Re-order pipeline call operator arguments

* Use tensor_descriptor to locate LSEacc elements

* Support providing invalid element for tensor view

* Set invalid element value for LSEacc tensor view

* Remove hand-written store_tile() code

* Remove necessary value-overwrite logic

* Add transposed lds descriptor

* Support load_tile() for tile_window_with_static_lengths<>

* Undo removing necessary value-overwrite logic

* Use read descriptor to locate lds elements

* Simplify pipeline source code

* Add constraint to kMaxSplits

* Default use kMaxSplits=64 in generate.py

* Revert "Add constraint to kMaxSplits"

This reverts commit 0a2132d758.

* Revert "Default use kMaxSplits=64 in generate.py"

This reverts commit c7d9c80b77.

* Decide alignment by the padding parameter

* Remove no-longer used utility functions

* Remove not-working code

* Add comment & remove no-longer used code

* Fix computation errors

* Add heuristic to override num_splits option

* Add constraint to kMaxSplits

* Fix compilation error

* Clean up pipeline code

* Wrap pointer access as lambda function

* Rename confusing methods

* Use kLogMasSplits as template parameter

* Finish splitkv combine kernel codegen

* Update kMaxSplits limit

* Use smaller kM0 for splitkv combine kernel

* Ignore droupout flag in splitkv pipeline

* Unify flag usage

* Add back flag kStoreLSE

* Merge lambda calls in pipeline

* Fix compilation errors

* Avoid all empty splits

* Always check for empty loop in splitkv pipelines

* Re-order parameters

* Remove redundant p_drop option check

* Add traits/problem for fwd splitkv kernel

* Conditionally enable uneven split boundary checks

* Add comment for the splitkv traits field

* Change even split criteria

* Re-order statements

* Refine occupancy value for hdim=128&256

* Refine occupancy value for hdim=32&64

* Remove redundant kernel argument

* Separate fmha bwd codegen logics

* Separate fmha fwd codegen logics

* Remove redundant direction parameter in fwd&bwd codegen logics

* Support generate multiple APIs for an example

* Let 'api' an alias of 'direction' option

* Remove choices for the 'direction' option

* Use dictionary to config all the functions

* Move fmha splitkv codegen logics to other file

* Add fwd_splitkv api for tile_example_fmha_fwd

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-06-26 17:41:15 +08:00
rocking
cb13839425 layernorm2d forward (#1339)
* Add layernorm2d forward

* Refind file path

* clang format

* Exclude ck_tile op from all

* use add_executable instead

* refactor layernorm2d_fwd example

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-06-24 08:45:52 +08:00
carlushuang
fa129c1a5d WA for rocm-6.2+ s constrait for buffer resource (#1346)
* WA for rocm-6.2+ s constrait for buffer resource

* add missing memory clobber
2024-06-21 11:00:13 -05:00
Dan Yao
1da802bdf2 Fix FA bwd alibi+causal NaN errors (#1352)
* fix bwd alibi nan error

* fix datatype

---------

Co-authored-by: danyao12 <danyao12>
2024-06-20 09:50:53 -05:00
Qianfeng
e3f44659cf Fix in dropout lambda to avoid the compiling issue on some docker/compiler envs (#1350) 2024-06-20 11:36:42 +08:00
Qianfeng
1973903f49 Hacking ck_tile fmha Dropout facility (#1344)
* Add NullBlockDropout to be used when kHasDropout is false

* Change to BlockDropout::Run() for forward to reduce conditional checkings

* Re-format files

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-06-19 10:37:22 +08:00
carlushuang
17ed368f58 [CK_TILE][FA] using pk f16_f32 (#1343)
* [CK_TILE][FA] using pk f16_f32

* correct a error
2024-06-17 17:16:46 +08:00
Qianfeng
37a347e380 Fix to the using of static_for in amd_buffer_addressing.hpp (#1337)
* Add insert_dummy_dep_per_dword over-loading for length 64

* Fix insert_dummy_dep_per_dword and remove over-loading for length 64

* Remove blank lines

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-06-13 16:12:20 +08:00
Dan Yao
2cab8d39e3 CK Tile FA Training kernels (#1286)
* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-06-04 13:12:45 -05:00
carlushuang
5055b3bdcb [CK_TILE] support group from cmdline (#1295)
* support cmdline seqlen decode

* silent print

* update readme

* update kernel launch 3d

* update tile partitioner

* fix spill for bf16

* modify based on comment

* modify payload_t

* fix bug for alibi mode

* fix alibi test err

* refactor kernel launch, support select timer

* add missing file

* remove useless code

* add some comments
2024-05-28 11:13:21 +08:00
Illia Silin
06b891c5c2 aggregate device macros in ck_tile config header (#1297) 2024-05-20 08:34:45 -07:00
rocking
aaa8dfdae9 Fix compile error (#1292)
error: no viable conversion from returned value of type '__half' to function return type 'fp16_hip_t' (aka '_Float16')

Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-05-17 17:19:17 +08:00
carlushuang
dd0dd13d4e remove operator-deref (#1291) 2024-05-15 08:06:50 -07:00
carlushuang
851c3ed157 [CK_TILE] support alibi (#1269)
* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
2024-05-07 22:32:54 +08:00
rocking
43879b89e4 Small refactor (#1246)
* Remove kIsFp8

* Extract alias

* Fix K, V and corresponding acc type

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-04-22 20:28:49 +08:00
carlushuang
db376dd8a4 introducing ck_tile! (#1216)
* enable gfx940

* switch between intrinsic mfma routines on mi100/200 and mi300

* fix mfma_int8 on MI300

* disable 2 int8 examples on MI300

* Update cmake-ck-dev.sh

* restore gitignore file

* modify Jenkinsfile to the internal repo

* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* initial enablement of gfx950

* fix clang format

* disable examples 31 and 41 int8 on gfx950

* add code

* fix build wip

* fix xx

* now can build

* naming

* minor fix

* wip fix

* fix macro for exp2; fix warpgemm a/b in transposedC

* unify as tuple_array

* Update the required Python version to 3.9

* Update executable name in test scripts

* re-structure tuple/array to avoid spill

* Merge function templates

* Fix format

* Add constraint to array<> ctor

* Re-use function

* Some minor changes

* remove wrong code in store_raw()

* fix compile issue in transpose

* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'

* let more integral_constant->constant, and formating

* make sure thread_buffer can be tuple/array

* temp fix buffer_store spill

* not using custom data type by default, now we can have ISA-level same code as opt_padding

* fix compile error, fp8 not ready now

* fix fp8 duplicated move/shift/and/or problem

* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode

* fix scratch in fp8 kernel

* update some readme

* fix merge from upstream

* sync with upstream

* sync upstream again

* sync 22

* remove unused

* fix clang-format

* update README of ck_tile example

* fix several issue

* let python version to be 3.8 as minimal

* remove ck_tile example from default cmake target like all/install/check

* remove mistake

* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg

* fix some bug in group-mode masking and codegen. update README

* F8 quantization for FMHA forward (#1224)

* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline

* Add element function to fmha api

* Adjust P elementwise function

* Fix bug of elementwise op, our elementwise op is not inout

* Add some elementwise op, prepare to quantization

* Let generate.py can generate different elementwise function

* To prevent compiler issue, remove the elementwise function we have not used.

* Remove f8 pipeline, we should share the same pipeline even in f8

* Remove remove_cvref_t

* Avoid warning

* Fix wrong fp8 QK/KV block gemm setting

* Check fp8 rounding error in check_err()

* Set fp8 rounding error for check_err()

* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode

* 1. codgen the f8 api and kernel
2. f8 host code

* prevent warning in filter mode

* Remove not-in-use elementwise function kargs

* Remove more not-in-use elementwise function kargs

* Small refinements in C++ source files

* Use conditional_t<> to simplify code

* Support heterogeneous argument for binary function types

* Re-use already-existing scales<> functor template

* Fix wrong value produced by saturating

* Generalize the composes<> template

* Unify saturates<> implementation

* Fix type errors in composes<>

* Extend less_equal<>

* Reuse the existing template less_equal<> in check_err()

* Add equal<float> & equal<double>

* Rename check_err() parameter

* Rename check_err() parameter

* Add FIXME comment for adding new macro in future

* Remove unnecessary cast to void

* Eliminate duplicated code

* Avoid dividing api pool into more than 2 groups

* Use more clear variable names

* Use affirmative condition in if stmt

* Remove blank lines

* Donot perfect forwarding in composes<>

* To fix compile error, revert generate.py back to 4439cc107d

* Fix bug of p element function

* Add compute element op to host softmax

* Remove element function in api interface

* Extract user parameter

* Rename pscale and oscale variable

* rename f8 to fp8

* rename more f8 to fp8

* Add pipeline::operator() without element_functor

* 1. Remove deprecated pipeline enum
2. Refine host code parameter

* Use quantization range as input

* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3.Add init description

* Refine description

* prevent early return

* unify _squant kernel name in cpp, update README

* Adjust the default range.

* Refine error message and bias range

* Add fp8 benchmark and smoke test

* fix fp8 swizzle_factor=4 case

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
2024-04-15 19:27:12 -05:00