Commit Graph

73 Commits

Author SHA1 Message Date
rocking
0d7c30f534 Merge branch 'develop' into ck_tile/rmsnorm 2024-10-29 18:53:00 +08:00
rocking
7df4c49603 remove reduncant comment 2024-10-29 10:51:43 +00:00
rocking
9fa8b4c170 Fix bug of welford when number of m warp > 1 2024-10-29 10:51:33 +00:00
valarLip
4d7e063a0a [CK_TILE] add scatter_gather (#1609) 2024-10-29 18:19:29 +08:00
valarLip
9fbd72e97e [CK_TILE] add generic_permute (#1607) 2024-10-29 18:05:53 +08:00
rocking
8beda9d98d Move reduce2d into reduce folder 2024-10-29 10:02:09 +00:00
rocking
1654e6cd97 Merge branch 'develop' into ck_tile/rmsnorm 2024-10-29 14:42:48 +08:00
rocking
b83f8d242a Add instance library 2024-10-28 19:34:51 +00:00
rocking
9a22805e92 Fix bug of kSaveX == false 2024-10-27 11:42:58 +00:00
rocking
0f9969a894 Rename two pass to three pass 2024-10-26 20:29:55 +00:00
rocking
697558d856 Add two pass pipeline 2024-10-26 20:21:18 +00:00
carlushuang
b098b71b05 topk_softmax (#1592)
* topk_softmax

* remove some file

* fix atomix linear_offset

* address various comment, and change sfc get_index api to static(tuple)
2024-10-26 23:52:49 +08:00
Po Yen Chen
54f0e6f4bb [CK_TILE] More fmha splitkv optimizations (#1588)
* Use pre-defined constants for readability

* Use vector write for o_acc tensor

* Remove no-longer used policy method

* Deprecate no-longer used policy/pipeline

* Specify gemm0/gemm1 block warps separately in codegen

* Fix wrong ps_idx creation logic

* Add single-warp block gemm

* Supoprt single-warp gemm0

* Make MakeCBlockTile() as static method

* Use MakeCBlockTile() to get underlying tile distribution

* Use kNumGemm1Warps to compute # threads for gemm1

* Put normal case in the if clause

* Refine fmha splitkv block mapping

* Refine & fix the lse_acc/o_acc layout

* Fix wrong LDS size for K tile

* Use kK0=64 for hdim=128,256 fmha splitkv kernels

* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels

* Undo kK0/kK1 changes

* Use more reasonable GetAlignmentV() computation

* Using store_tile() in fmha splitkv kernel epilogue
2024-10-26 18:35:45 +08:00
rocking
1c1f1e35b5 Fix bug of one pass pipeline 2024-10-26 10:22:50 +00:00
rocking
27d96b4031 host verification 2024-10-26 10:22:09 +00:00
rocking
826ee18a11 Add reduce op 2024-10-25 22:51:15 +00:00
rocking
1e0c9fde51 Add add_rmsnorm2d_rdquant kernel 2024-10-25 20:50:48 +00:00
dummycoderfe
9183ce69ca hot_fix epsilon pos (#1597)
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-10-25 11:17:45 +08:00
rocking
871af334d1 Refine pipeline name 2024-10-24 20:42:40 +00:00
rocking
c89d8ca95f clang format 2024-10-24 17:05:36 +00:00
rocking
d79715ba53 Fix bug of rmsnorm 2024-10-24 11:43:45 +00:00
rocking
5b3108a62f Remove static assert to prevent compile fail 2024-10-24 06:09:23 +00:00
rocking
382a2af212 Add rmsnorm2d 2024-10-23 19:23:51 +00:00
rocking
dfb4bf9488 Fix bug of std caculation 2024-10-22 20:36:25 +00:00
rocking
26f16dd20b Prevent user use cross warp reduction 2024-10-22 19:29:46 +00:00
rocking
9e7fcc0b37 Add reduce2d new api 2024-10-22 14:52:10 +00:00
ltqin
0394f8a713 update layernorm (#1570)
* port layernorm

* change warp_welford.hpp

* Update warpshuffle

* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()

* refine welford max count calculation

* unify layernorm api

* Rename file

* Remove save mean and inv std

* Revert "refine welford max count calculation"

This reverts commit 022365802b.

* Fix order of parameter

* refine welford max count calculation again

* Remove fp32 instances

* Fix bug of padding

* refactor api

* Support bf16

* Extract common function

* Refine arg of operator()

* Add kMThreadPerBlock to template parameter

* clang format

* Refine variable name

* Refine file name

* remove redundant line

* refactor layernorm2d pipeline and add block-per-block utility

* fix name

* rename more

* add more block-per-tile instance

* remove duplicated define

* update instance for 2048, 1024 case

* support up to 2048 now

* opt loading

* add n1536

* Add two pass pipeline

* format

* Fix incorrect type

* parallel compilation

* Use smaller N

* fix 2p pass

* Support Repeat_M in distribution

* Refine nameing

* Add reduce example

---------

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-22 09:26:18 +08:00
Po Yen Chen
95e722a3b3 [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577)
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged
2024-10-21 10:52:11 +08:00
Qianfeng
14c3cfb1c6 [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573)
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96

* Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor

* Fix in bwd_piple_default_policy

* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines

* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim

* Update to hd96 tile settings

* Add smoke test scripts for fmha-bwd hd96

* Revert "Add smoke test scripts for fmha-bwd hd96"

This reverts commit 7ca7e1a93d.

* Remove hd96 tile settings in fmha_bwd codegen to save compiling

* Fix lost code line in bwd_pipeline_default_policy

* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits

* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor

* tiny adjustments

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>
2024-10-16 18:14:32 +08:00
Bartłomiej Kocot
d02a92cc0d [CK_TILE] Add block universal gemm pipeline policy (#1557)
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* fixeS
2024-10-15 13:53:41 +02:00
Po Yen Chen
9868fd0245 Apply ROCm 6.2 WA to ROCm 6.3 and later (#1563) 2024-10-15 18:02:41 +08:00
Thomas Ning
35c1777d59 decouple the calling from gemm_pipeline (#1571)
* decouple the calling from gemm_pipeline

* clang format
2024-10-14 13:59:26 +08:00
Thomas Ning
6f27bc9872 Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535)
* ake the cshuffle compilable

* modify Mhe reference on gpu and cpu. Correaccess of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-10-10 18:02:22 +08:00
Po Yen Chen
0c094daa7e [CK_TILE] Update example README files & fix script compatibility issue (#1548)
* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the argument from '-?' output

* Undo command line output changes in README

* Only keep existing argument on doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command
2024-10-08 10:45:12 +08:00
Qianfeng
74d68e3b99 [CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549)
* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-08 10:44:34 +08:00
Bartłomiej Kocot
cc8f466a7e [CK_TILE] Fix conv param multiple definition (#1550)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-07 15:21:21 +02:00
rocking
0023f01ab0 [Ck tile] Support layernorm one pass (#1512)
* Fix compile error

* Add one pass pipeline

* Extract creating tile_window to operator()

* clang format

* reduce duplicated code

* do not hardcode

* Support padding in layernorm

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-07 14:25:53 +08:00
kylasa
c24fae2346 Adding seed and offset pointer support to the philox random number generator. (#1523)
* Adding seed and offset pointer support to the philox random number generator.

* Separating seed and offset pointer checks with different condition statements.

* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.

* Correcting a typo in the readme file

* Re-format files using remod.py

* Use STL type for API parameters

* Use simpler struct design for drop_seed & drop_offset

* Undo unnecessary changes

* Sync kargs style for fmha_fwd.hpp/.cpp

* Use templated union to reduce code

* Use structured binding to make code more readable

---------

Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-05 02:48:47 +08:00
Illia Silin
8e4c3fb1bc [CK_TILE] add missing vector header (#1537)
* add missing vector header

* Re-format header using remod.py

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-10-01 07:58:20 -07:00
Po Yen Chen
a1c07e8d91 [CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527)
* Use same layout for o_acc and o tensor

* Use better param names in partitioner

* Remove redundant kargs 'max_seqlen_q'

* Use better param names in splitkv kernel

* Add comment for additional kernel arguments

* Sync empty loop early return logics between pipelines

* Pass more arguments to cmake in scripts

* Align backslashes

* Fix wrong o_acc tensor view strides

* Change o_acc layout if o_perm=0

* Handle whole row masked via attn_bias

* Use use vector width = 1 for o_acc

* Use more even split sizes
2024-10-01 22:13:52 +08:00
Bartłomiej Kocot
de3e3b6424 [CK_TILE] Image to Column kernel (#1532)
* [CK_TILE] Image to Column kernel

* Fixes

* Vector loads and stores

* Fixes

* Fixes

* change test dir name
2024-09-27 22:57:38 +02:00
Dan Yao
9d69a099a4 [CK_TILE] Fix compiler related FA bwd issues (#1530)
* add barriers

* tail bias barriers

* adjust bf16/hd256 tol

* continue adjust bf16/hd256 tol
2024-09-26 12:18:39 -07:00
Illia Silin
42e6dceacc Fix compilation errors with Clang20.0. (#1533)
* fix clang20 compilation errors for gfx90a

* fix clang20 compilation errors for gfx11 targets
2024-09-25 13:45:38 -07:00
Po Yen Chen
770d2b7725 Early return if seqlen_k=0 on group mode (#1524) 2024-09-22 20:05:58 +08:00
Thomas Ning
694c300145 Ck tile gemm padding dim (#1516)
* Support the N dimension padding

* Finished the padding feature for different dimension of K
2024-09-18 11:32:29 -07:00
Thomas Ning
844f5a1712 Ck tile GPU verification sample develop & Add the CK TILE GEMM to the CI/CD test (#1505)
* Finished the feature of gpu verification

* Add the ck_tile_gemm test in the CI CD

* add the include of tensor_layou in reference_gemm

* Comment Addressed

* split ck_tile fhma and gemm tests into separate stages

* restructure the reference gemm

* restructure a new reference_gemm api that could read the device mem

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-09-14 21:08:40 +08:00
Dan Yao
d09572e8c2 [CK_TILE] FA bwd repair (#1502)
* fix fa bwd

* revert kernelBlockSize in gemm_kernel.hpp
2024-09-10 10:45:32 -07:00
Thomas Ning
caacd38830 Ck tile gemm example (#1488)
* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout

* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.

* Fix: Clang Format, API fixed from fmha

* fix with better naming convention

* revert back the pipeline code of fmha

* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.

* clang format with the reference_gemm file

* convert the clang format with the remod.py

* Changed the format and variable name of the kernel gemm_shape and partitioner

---------

Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
2024-09-07 16:23:32 +08:00
Dan Yao
b8addae293 [CK_TILE] float -> bf16 inline asm rtn (#1482)
* asm rtn

* add asm rtn macro

* reorder macro

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-08-30 15:38:09 +08:00
Po Yen Chen
461ec98d78 Enable scratch memory workaround on ROCm 6.2 (#1486)
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-08-30 10:40:00 +08:00