Commit Graph

279 Commits

Author SHA1 Message Date
carlushuang
1c45ca35dd hot-fix (#1768) 2024-12-20 16:40:45 +08:00
Po Yen Chen
37cdbf4f0e [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed piepline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stands out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow usig different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warning developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly form block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistant kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification
F#

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated experssions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes form develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix complilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debuging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove necessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-squre warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2024-12-20 14:41:01 +08:00
aledudek
453ca37347 [CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730)
* Gemm Kernel Refactor part1

* Gemm Kernel Refactor common gemm pipeline part2

* [CK TILE] Refactor batched gemm to reuse GemmKernel

* [CK TILE] Refactor GemmKernel - review changes part1

* [CK TILE] Refactor GemmKernel - references fix

* [CK TILE] Refactor GemmKernel - naming changes, add problem

* [CK_TILE] Refactor GemmKernel - update tests

* [CK_TILE] Refactor GemmKernel - review changes

* [CK_TILE] Refactor GemmKernel - update test

* [CK_TILE] Refactor GemmKernel - constness fixes

* [CK_TILE] Refactor GemmKernel - update tests
2024-12-18 17:52:46 +01:00
Adam Osewski
d46196f291 Enhance printing functionality (#1751)
* Added object print with all template parameters

* fix clang format

---------

Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-12-17 09:19:44 +01:00
Xu, Shengnan
f57d720c67 added moe interleaving pipeline (#1712)
* added moe interleaving pipeline

* remove redundant code

* formater

---------

Co-authored-by: root <root@hjbog-srdc-14.amd.com>
2024-12-15 20:13:10 +08:00
chenjun
4e73177684 Ck tile/smoothquant out stride (#1742)
* add ck_tile/smoothquant out stride parameter

* Remove the default stride value

---------

Co-authored-by: so <a.com>
2024-12-13 11:53:52 +08:00
carlushuang
77a38e0211 [CK_TILE] naive attn (#1708)
* add reference attention fwd

* refactor addresser

* update

* paged, and i8 reflect-quant

* lets call it forward-quant

* fix error in decode variation

* update naive-attn

* fix page table

* fix build err
2024-12-12 11:54:03 +08:00
Po Yen Chen
58e7f37fc8 Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725) 2024-12-06 12:59:58 +08:00
jakpiase
feb9a2bd9b Add IsSupportedArgument() to gemm_kernel (#1698)
* add IsSupportedArgument to gemm_kernel

* add ut and do some refactoring

* switched to ck_tile's integral_constant
2024-12-05 09:02:13 +01:00
Mateusz Ozga
4cb3d7d7ea Ck tile grouped GEMM example (#1713)
* Ck-tile, impl. grouped gemm

* Workspace is allocated by user, and is passed to the function

* Prepare test to new api design

* Unify GemTransKernelArgs, removing N0 param

* Add 1 to dim3 in paritioner

* Typo: gem - > gemm

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-12-04 21:40:01 +01:00
Po Yen Chen
126ce85aa1 [CK_TILE] Use 'false' for highest dimension padding flags (#1716)
* Use 'false' for highest dimension padding flags

* Update padding flag of bias
2024-12-04 15:59:58 +08:00
Bartłomiej Kocot
cff7fab798 [CK TILE] Fix universal gemm template keywords (#1704) 2024-11-29 20:51:09 -08:00
aledudek
78f0fea08e Ck tile batched gemm example (#1615)
* [CK Tile] Batched GEMM Example

* [CK Tile] Batched GEMM Example - minor refactor

* [CK Tile] Batched GEMM Example - README update

* [CK Tile] Batched Gemm Example - review changes

- Added tensor data layours as input parameters
- Changed structure of Host and Kernel args
- Removed bug with invalid vector read on non-contiguous memory

* [CK Tile] Batched Gemm Example - remove comment

* [CK Tile] Batched Gemm Example - Add GTests part1

* [CK Tile] Batched Gemm Example - GTests part2 + review changes

* [CK TILE] Batched GEMM post merge fixes

* [CK Tile] Batched GEMM Example - fix pad views
2024-11-29 11:52:18 +01:00
Bartłomiej Kocot
f49b595dc0 [CK TILE] Add gemm compute pipeline v3 (#1661)
* [CK TILE] Add gemm compute pipeline v3

* Enable universal gemm compute pipeline.

* Rename example and add compute pipeline.

* Introduce ag bg cr pipeline impl base.

* Refactor to reuse code.

* Cleaning

* Formatting.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2024-11-28 17:51:49 +01:00
jakpiase
e7b6286441 Add interwave scheduler for gemm mem pipeline (#1647)
* add interwave scheduler for gemm mem pipeline

* Fix merge artifacts.

* Refactor unit tests.

* Switch to interwave scheduler for mem example

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2024-11-27 18:25:07 +01:00
rocking
abae2afc72 support max3 in smoothquant and add+ rmsnorm + rdquant (#1654)
* Fix cmake example build

* Support max3 in smoothquant one pass

* support max3 in two pass

* support max3 in add_rmsnorm_rdquant
2024-11-27 05:01:15 +08:00
Po Yen Chen
cf2d635ea2 [CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688)
* Allow getting batch size from splitkv tile partitioner

* Fix wrong paged-kvcache impl for group mode

* Fix wrong example code for page-kvcache

* Undo changes in fmha_fwd.cpp

* Always use 2D block table

* Add is_gappy kernel argument for paged-kvcache

The is_gappy argument is used for differentiating seqstart_k_ptr usage
in flash-attention & xformers

* Remove out-of-date comments

* Remove no-longer used method

* Fix wrong # page-block calculation

* Fix wrong comment

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
2024-11-26 20:37:54 +08:00
Adam Osewski
b6bcd76d88 CK-Tile first draft of universal block gemm with interwave & intrawave scheduler (#1676)
* Block universal gemm.

* Universal block gemm with interwave scheduler - draft.

* Refactoring

* Move a/b_warp_tiles into BlockGemmImpl
* set BlockGemmImpl as a class member

* Change tile size for more suitable to memory bound cases.

* Introduce kKPerThread to WarpGemm

* Add documentation comment.

* Fix Interwave scheduler block gemm.

* Add compute/memory friendly tile configuration.

* Clean

* New tile configurations in gemm mem example.

* Add more static checks and fix loop order in block gemm.

* Add more static checks and use warp gemm mfma dispatcher.

* Add default scheduler block gemm.

* Remove logging in example.
2024-11-26 08:45:14 +01:00
carlushuang
440e28b08f [CK_TILE] fused-moe first version (#1634)
* moe pipeline

* update code

* compile OK

* update

* update cpu reference

* update pipeline_gemm0

* compiler ok

* update pipeline

* rename to ex pipeline

* block-asm

* update

* update

* update first gemm ok

* compute correct

* update file structure

* update README

* update

* update

* update code

* update API

* return unsupport case

* add comment

* update readme

* update

* uncomment

* update

* fix build err

---------

Co-authored-by: valarLip <340077269@qq.com>
2024-11-26 11:14:56 +08:00
Po Yen Chen
645fe812f6 [CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689)
* Fix mis-matched tuple<> elem types

* Rename MakeKargs() as MakeKargsImpl()

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
2024-11-25 15:30:35 +08:00
carlushuang
36c7ce4e0e [CK_TILE]Moe update index (#1672)
* update MOCK_ID for moe-sorting

* add moe-smoothquant

* update a comment

* fix format

* hot fix

* update topk in overflow case

* update comments

* update bf16 cvt

---------

Co-authored-by: valarLip <340077269@qq.com>
2024-11-25 13:12:35 +08:00
Qianfeng
ce2bdf42a9 Change in fwd-splitkv kernel to support num_splits=1 case (#1690)
* Change in fwd-splitkv kernel to support num_splits=1 case

* Update in codegen fwd-splitkv to make num_splits > 1 cases pass

* Specify instance traits in dispatch

* Fix link error for fp8 kernels

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-11-25 12:31:38 +08:00
schung-amd
ff92222f93 [CK_TILE] MakeKargs overloads for backward compatibility (#1681)
* Add overloads for MakeKargs

Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples.

* Add overloads for MakeKargs

Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples.

* Re-format files using ck_tile remod.py

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-11-23 06:51:35 +08:00
Po Yen Chen
fb1ccfa9df [CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678)
* Generate group mode paged-attn kernel

* Enable paged-kvcache + group mode support

* Add missing header: fused_moe.hpp

* Add comment to explain kernel arg usage

* Make error message more clear

* Add comment for confusing data member names

* Add more comment for confusing variable names

* Fix typo in option description
2024-11-21 14:53:10 +08:00
feli
c1f8d53ce8 [Ck_tile] hot fix, fix rpcf param setting err (#1657)
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-11-14 14:06:36 +08:00
Bartłomiej Kocot
d20735691c [CK TILE] Update gemm universal pipeline (#1644)
* [CK TILE] Update gemm universal pipeline

* Fixes

* fix

* Rebase
2024-11-13 11:46:18 +01:00
Thomas Ning
2b6458ddf2 [CK Tile] Improve the Layout, Padding, and Alignment features of CK Tile GEMM (#1651)
* Finished the feature

* Modified the test file

* Test case update

* addresss comment

* Addressed the review comment

* Fixed the CI error
2024-11-11 18:08:25 -08:00
valarLip
8ef8a994e7 [CK_TILE] add more stride for layernorm to support un-continuous Tensor (#1650)
* [CK_TILE] add more stride for layernorm to support un-continuous Tensor

* align CK coding style

* extend strides to layernrom expample

* clang-format...
2024-11-11 16:02:28 +08:00
Po Yen Chen
13332998a4 Return nullptr when block index is invalid (#1649) 2024-11-11 09:28:32 +08:00
dummycoderfe
bec6fbc65f Ck tile/moe sorting (#1624)
* add moe_sorting & check ok

* fix comments & typo

* Run remod.py under include/ck_tile & example/ck_tile directories

* format codes

* fix output ci check bug

* fix moe sorting readme and error commit file

* use magiv div to accelerate compute

* add an loop unroll for moe lds ops

* add extblocksnel to set zeros for moebufs

* [Ck_tile] moe set zero run ok, add size check and fix ref check

* [Ck_tile]fix moe_sorting fuse set_zero remod

* [Ck_tile] change name style, fix zero buffer size err, change folder

* [Ck_tile] moe_sorting: fix name style

* [Ck_tile] moe_sorting, remove useless params in traits

* [Ck_tile] change outputtile cnt * unit_size; change output buf alloc

---------

Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-11-09 17:57:27 +08:00
dummycoderfe
686a58a912 [Ck tile] layernorm2d fwd optimize (#1637)
* optimze small N case using vec io and using rcp div

* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass

* [Ck_tile] fix blockSize compute in Generic2dBlockShape

* [Ck_tile]fix kfastfdiv template style

* [Ck_tile] layernorm, fix stype in review

---------

Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-11-08 12:28:23 +08:00
carlushuang
cb6c5d39dc [CK_TILE] layernorm have more accurate residual (#1623)
* more accurate residual

* modify comment

* Fix literal case in README.md

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-11-02 13:30:16 +08:00
rocking
fbd654545a [Ck_tile] smoothquant (#1617)
* fix compile error

* fix typo of padding

* Add smoothquant op

* Add smoothquant instance library

* refine type

* add test script

* Re-generate smoothquant.hpp

* Always use 'current year' in copyright

* use Generic2dBlockShape instead

* Add vector = 8 instance back

* Find exe path automatically

* Simplify the api condition

* Remove debugging code

* update year

* Add blank line between function declaration

* explicitly cast return value to dim3

* refine return value

* Fix default warmup and repeat value

* Add comment

* refactor sommthquant cmake

* Add README

* Fix typo

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-11-01 13:51:56 +08:00
carlushuang
550248deec [layernorm] hot fix (#1620)
* hot fix ln

* some rename
2024-11-01 11:52:50 +08:00
carlushuang
c3a4800c5f [CK_TILE] layernorm support fused-quant/fused-add (#1604)
* add prenorm/postnorm support, refactor using generate.py

* update README

* update README

* fix format

* update some description and fix format

* update format

* format

* use non-raw for loading

* format and update n4096

* dynamic-quant ready

* update readme

* support fused dynamic-quant

* update fused-quant, with smooth

* update README

* update args

* update some based on comment
2024-10-31 14:54:53 +08:00
Bartłomiej Kocot
9a8a52130d Remove virtual destructors from unary ops (#1610)
* Remove virtual destructors from unary ops

* Fixes

* Fixes

* clang format fixes
2024-10-30 17:42:50 +01:00
rocking
7d9111545f clang-format (#1612) 2024-10-30 08:13:30 -07:00
Adam Osewski
24d996aae1 [CK-Tile] Universal gemm memory bound pipeline (#1558)
* CK-Tile GEMM with memory bound pipeline.

* Memory bound gemm pipeline.

* Fix not closed namespace.

* Block gemm mem pipeline draft.

* Do not use ck_tile:: within ck_tile namespace.

* Refactoring & Move Layout info to pipeline problem.

* Get hot loop and TailNum information before lunching kernel.

* Fixes in pipeline.

* Add comment to load_tile_raw and change variable naming style.

* Few small changes & formatting.

* Do not use macro.

* Add gtests.

* Use AccDataType for Output of MFMA instruction.

* Formatting.

* Refactor gemm examples.

* Switch over to current block gemm.

* Use currently available pipeline policy.

* Refactoring and review comment.s

* Fixes after merge.

* Add missing include.

* Add load tile overload which accepts output tensor as parameter.

* This give 8% perf boost at the cost of using more registers.

* Rename example.

* Small changes.

* Fix compilation err and lower K.

* Support different layouts for A/B

* Fix vector size for different layouts.

* Rename Alignment into VectorSize

* Unblock tests.
2024-10-30 10:05:15 +01:00
rocking
3d60953477 [Ck tile] support rmsnorm and related fusion (#1605)
* Add reduce2d new api

* Prevent user use cross warp reduction

* Fix bug of std caculation

* Add rmsnorm2d

* Add rmsnorm small example

* Remove static assert to prevent compile fail

* Add script to test performance and correctness

* Add missing cmake change

* refine naming

* refine example of rmsnorm

* Fix bug of rmsnorm

* Refine naming

* Fix cmake

* clang format

* Refine pipeline name

* Add add_rmsnorm2d_rdquant kernel

* Add reduce op

* host verification

* Fix bug of one pass pipeline

* Refine tile size

* Add two pass pipeline

* Rename two pass to three pass

* Fix bug of kSaveX == false

* Add instance library

* Add test script

* Fix bug of x verification

* Add save_x to trait

* Add README

* Move reduce2d into reduce folder

* Fix bug of welford when number of m warp > 1

* remove reduncant comment

* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant

* clang format and add missing header

* Add host validation of add + layernorm2d + rsquant

* Revert "Add host validation of add + layernorm2d + rsquant"

This reverts commit 936cb45797.

* Remove deprecated flag
2024-10-30 15:22:56 +08:00
Qianfeng
8632221814 [CK_TILE] Add fmha fwd headdim96 support (#1608)
* Add ceil_to_qualified_tile_length()

* Rename kK0BlockLength to kQKHeaddim

* Add kSubQKHeaddim concept to support headdim96

* Fix in math.hpp to avoid using __half interfaces

* Add LdsBufferSequence instance for headdim96

* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing

* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time

* Reformat one file

* Fix text alignment in fmha_fwd_splitkv.py

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-30 14:03:16 +08:00
valarLip
9fbd72e97e [CK_TILE] add generic_permute (#1607) 2024-10-29 18:05:53 +08:00
carlushuang
b098b71b05 topk_softmax (#1592)
* topk_softmax

* remove some file

* fix atomix linear_offset

* address various comment, and change sfc get_index api to static(tuple)
2024-10-26 23:52:49 +08:00
Po Yen Chen
54f0e6f4bb [CK_TILE] More fmha splitkv optimizations (#1588)
* Use pre-defined constants for readability

* Use vector write for o_acc tensor

* Remove no-longer used policy method

* Deprecate no-longer used policy/pipeline

* Specify gemm0/gemm1 block warps separately in codegen

* Fix wrong ps_idx creation logic

* Add single-warp block gemm

* Supoprt single-warp gemm0

* Make MakeCBlockTile() as static method

* Use MakeCBlockTile() to get underlying tile distribution

* Use kNumGemm1Warps to compute # threads for gemm1

* Put normal case in the if clause

* Refine fmha splitkv block mapping

* Refine & fix the lse_acc/o_acc layout

* Fix wrong LDS size for K tile

* Use kK0=64 for hdim=128,256 fmha splitkv kernels

* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels

* Undo kK0/kK1 changes

* Use more reasonable GetAlignmentV() computation

* Using store_tile() in fmha splitkv kernel epilogue
2024-10-26 18:35:45 +08:00
dummycoderfe
9183ce69ca hot_fix epsilon pos (#1597)
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-10-25 11:17:45 +08:00
ltqin
0394f8a713 update layernorm (#1570)
* port layernorm

* change warp_welford.hpp

* Update warpshuffle

* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()

* refine welford max count calculation

* unify layernorm api

* Rename file

* Remove save mean and inv std

* Revert "refine welford max count calculation"

This reverts commit 022365802b.

* Fix order of parameter

* refine welford max count calculation again

* Remove fp32 instances

* Fix bug of padding

* refactor api

* Support bf16

* Extract common function

* Refine arg of operator()

* Add kMThreadPerBlock to template parameter

* clang format

* Refine variable name

* Refine file name

* remove redundant line

* refactor layernorm2d pipeline and add block-per-block utility

* fix name

* rename more

* add more block-per-tile instance

* remove duplicated define

* update instance for 2048, 1024 case

* support up to 2048 now

* opt loading

* add n1536

* Add two pass pipeline

* format

* Fix incorrect type

* parallel compilation

* Use smaller N

* fix 2p pass

* Support Repeat_M in distribution

* Refine nameing

* Add reduce example

---------

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-22 09:26:18 +08:00
Po Yen Chen
95e722a3b3 [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577)
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged
2024-10-21 10:52:11 +08:00
Qianfeng
14c3cfb1c6 [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573)
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96

* Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor

* Fix in bwd_piple_default_policy

* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines

* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim

* Update to hd96 tile settings

* Add smoke test scripts for fmha-bwd hd96

* Revert "Add smoke test scripts for fmha-bwd hd96"

This reverts commit 7ca7e1a93d.

* Remove hd96 tile settings in fmha_bwd codegen to save compiling

* Fix lost code line in bwd_pipeline_default_policy

* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits

* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor

* tiny adjustments

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>
2024-10-16 18:14:32 +08:00
Bartłomiej Kocot
d02a92cc0d [CK_TILE] Add block universal gemm pipeline policy (#1557)
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* fixeS
2024-10-15 13:53:41 +02:00
Thomas Ning
35c1777d59 decouple the calling from gemm_pipeline (#1571)
* decouple the calling from gemm_pipeline

* clang format
2024-10-14 13:59:26 +08:00
Thomas Ning
6f27bc9872 Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535)
* ake the cshuffle compilable

* modify Mhe reference on gpu and cpu. Correaccess of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-10-10 18:02:22 +08:00