Commit Graph

528 Commits

Author SHA1 Message Date
Bartłomiej Kocot
25e2e0f04a [CK TILE] Implement cschuflle algorithm (#1842)
* [CK TILE] Implement cschuflle algorithm

* Rebase

* Vector store size fixes

* fixes

* Fixes

* fixes

* fmha fix

* fixes

* fixes of fixes
2025-01-30 11:57:39 +01:00
fangche123
c5fff071e5 add batched_transpose implement (#1660)
* add batched_transpose implement

---------

Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-01-28 16:22:02 -08:00
Andriy Roshchenko
35aebe5936 Add OCP FP8 support in CK_TILE (#1829)
* Add OCP FP8 to CK_TILE

* Validate OCP FP8 in FMHA FWD under VALID=1
2025-01-27 11:59:49 -07:00
Adam Osewski
39dc25a9b8 [CK-Tile] Enable vectorized reads on all layouts & improve perf. (#1835)
* Refactor universal gemm policy.

* Adapt example to refactor changes.

* Introduce static encoding pattern

* Adding shuffled encoding patterns.

* Fix err in reverse tuple.

* Add transpose_tile2d

* Small refactoring + doc

* Enable reading on contiguous dimension in all layouts.

* Transpose A/B register tile if needed for comp v3 pipeline.

* Take contiguous dim size when calculating dram vector load size.

* A/B smem pack size taken from WarpGemm attributes

* Update B LDS layout and setup tile distribution pattern at class level.

* Fix static assert.

* Fix errors in examples.

* Formatting & fix IsTranspose

* Fix VectorSize & refactor.

* Add error loging messages.

* Fix VecLoadSize and TranspseC for mem pipeline.

* Update unit-tests & disable mem pipeline.

* Clang format

* Update include/ck_tile/core/tensor/tile_window.hpp

Co-authored-by: jakpiase <jakub.piasecki@amd.com>

* Fix compilation and reviewers comments.

* Refactor unit-test. Fallback to non-universal gemm.

Need to use GemmPipelineAGmemBGmemCRegV1 for now,
since GemmKernel is now supporting also non-K major vector reads.

---------

Co-authored-by: jakpiase <jakub.piasecki@amd.com>
2025-01-27 16:37:19 +01:00
ruanjm
64d5c4d6cb Implement fp8 quant for layernorm and rmsnorm (#1814) 2025-01-24 16:40:43 +08:00
coderfeli
6b51413b6e compile ok 2025-01-24 08:36:49 +00:00
carlushuang
5b9b083dbc [CK_TILE] not using structures under ck_tile/ops for ck_tile/host (#1834)
* not using structures under ck_tile/ops for ck_tile/host

* update as constexpr function

* Rename fn

* Update other examples.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2025-01-24 15:35:54 +08:00
carlushuang
052a72655c add fp8 as dst (#1830) 2025-01-22 17:34:27 +08:00
Mateusz Ozga
3c93d3c444 CK-Tile Grouped GEMM refactor and post PR fixes (#1756)
* Grouped gemm simple code refactor

* Offset invoker

* Invoke generic Run, and replace name of parrtitioner variable

* Tests fix type

* Removed namespaces

* Add template param to avoid implicit cast

* Remove generic function

* Constant value

* underline enum to int16_t

* Generalize partitioner function

* Remove whitespaces

* Rename function

* Using support

* Clang-format

* Clang-format

* Fn-partitioner description fn

* Typo

* Typo 2

* Better description

* Better description

* Refactor after review

* Use ctr instead of set fn

* Inovke ctr and typo

* Comments

* Remove unnecessary comment

* Review, remove modulo
2025-01-21 21:06:10 +01:00
Bartłomiej Kocot
bdddf1eace [CK_TILE] Add error threshold calculation for gemm examples (#1821) 2025-01-18 01:01:52 +01:00
Aviral Goel
54de3e55e1 Implementing Test Filters for Smoke and Regression Tests (#1819)
* smoke and regression targets working with tests

* test filters work for both examples and test

* removed uneccesary comments

* added a missing comment

* added a missing comment

* fixed typo in the comments

* updated README

* Update PULL_REQUEST_TEMPLATE.md

updating the template for future addition of test cases

* Update PULL_REQUEST_TEMPLATE.md
2025-01-16 16:40:08 -08:00
carlushuang
1ff50e78c6 [CK_TILE] Fix mock token id, support g1u1/g1u0 through same inline code block (#1808)
* fix mock token id

* prepare host for g1u1

* reformat inline-asm

* restructure uk_0

* restructure gate_up

* done

* change default to init=1

* update readme

* fix a bug in interleave pipeline

* rcp for silu
2025-01-16 17:51:10 +08:00
Bartłomiej Kocot
7790e8c3f7 Add rounding for float to bf16 conversion as default (#1812)
* Add rounding for float to bf16 conversion

* Add bhalf test

* Add inf test bhalf

* Refactor

* update cmake

* Fixes
2025-01-15 16:41:21 +01:00
ruanjm
04dd314883 [CK_TILE] Add Various Fusion Functions to RMSNorm (#1802)
* Add shortcut to RMSNorm

* Modify test for adding shortcut for RMSNorm

* Add fused parameter into tests

* 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp

* 1. Supports various stride and percisions.

* Add support of Epilogue

* Add fuse and epilogue support to rmsnorm ref

* Modify rmsnorm example

* Refactor tests/examples

* Bug fix for newly added tests/examples

* Bug fix for new tests 2

* Modify smoke test scripts

remove dbg code

* Supports non-smooth dyanmic quant

* Update Rmsnorm2dFwd::GetName()

* rename xscale and prec_sx to smoothscale and prec_sm

Bug fix after rename

Remove files

* change example_rmsnorm2d_fwd.cpp

* update performance calculator

* Fix issue in two-pass when fuse add is enabled

* Remove comment of beta

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
2025-01-15 10:23:48 +08:00
aska-0096
b755f37502 add save_x=true instance 2025-01-13 07:12:46 +00:00
Thomas Ning
5d671a5fc4 CK Tile GEMM CICD fixed & register block method refactor (#1776)
* refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm

* Finished the 2x2 warp gemm policy and the block selection mechanism

* Clang format

* address poyen's comment

* Address feedbacks

* Fixed the compilation issue

* Change the function name
2025-01-13 13:10:44 +08:00
ClementLinCF
0b8f117f1a [CK_TILE] Adjust kBlockSize of reduce example for better perf (#1779)
* Observed a 2x perf improvement with kBlockSize = 256
* Using 512 threads may lead to redundant computations
2025-01-12 20:50:32 -08:00
aska-0096
35ba08646f fp8 add_rmsnorm_dynamic_dequant 2025-01-10 11:12:16 +00:00
Thomas Ning
73a076eee1 Ck tile/gemm perf measure (#1750)
* Finished adding the performance benchmark for ck tile gemm

* Fix the executable rename problem

* fix the executable name error

* delete the unsupported layout combinations

* Update run_full_test.sh

* Update benchmark_mem_pipeline.sh

* Update benchmark_basic.sh

* change the executable of gemm_universal

* change ck_tile_gemm script permissions

* Addressed the comment

* Addressed the comment

* Fixed the comments

* Fixed Comment

* roll back the malfunctioned change

* Fix the Typo

* finalize the tile_gemm_fp16 performance monitoring

* fix the stash names for ck_tile gemm logs

* change the stashing logic

* change stashing syntax

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-01-09 17:41:49 -08:00
darren-amd
26b3829c02 Disable building DPP kernels by default (#1804)
* Disable building DPP kernels by default

* Disable building dpp instances, examples, or tests if DPP_KERNELS is not set

* Add new DPP_KERNELS flag to readme
2025-01-08 13:50:42 -05:00
aska-0096
487a05d612 refine blockgemm pipeline version as base struct. 2025-01-08 14:27:42 +00:00
AMD-dteng
d5c8a334ca enable bias feature that add bias before adding residual (for rtpllm project) (#1741)
* 1. enable bias feature that add bias before adding residual; 2. change block size from 128->64 when m<64 in fp16

* delete comment

* 1.remove fmha change 2.change buffer name from bias to xbias

* Now bias can be used independently from fadd

* change kbias to kxbias

---------

Co-authored-by: feli <felix.li@amd.com>
2025-01-08 17:51:06 +08:00
aska-0096
22fe522d0c optimize software pipeline 2025-01-08 09:28:32 +00:00
Po Yen Chen
24b12d04af [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789)
* Update license year

* Add initial code to override decode problem

* Fix splitkv traits/args overriding error

* Reshape and transpose lse for decode

* Remove debug code

* Prettify example code

* Use better function name

* Add kMergeNumHeadGroupsSeqLenQ flag

Kernel user can use this switch to turn on/off optimization for
some problem sizes

* Add missing flag declarations

* Default turn off kMergeNumHeadGroupsSeqLenQ in codegen

* Group similar statements together

* Remove assumption of seqlen_q=1

* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel

* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel

* Run kMergeNumHeadGroupsSeqLenQ=true kernels when need

* Fix group mode block skip logics

* Undo changes of normal fwd kernel

* Update in GridSize() and using GridSize() for splitkv kernel (#1799)

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
2025-01-07 18:49:24 +08:00
carlushuang
6df5fe2ad8 [CK_TILE]naive attn support FP8 KVCache quant (#1747)
* quant

* fix bug

* simple smoothquant after softmax

* update kv-quant

* update stride

* fix fp8-pertoken-kvcache

* update int8/fp8 quant support

---------

Co-authored-by: so <a.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-01-03 18:43:07 +08:00
Mingtao Gu
4f62f6e9b7 Implement the fp16xint4 scale weight only kernel for Ali (#1786)
* enable int4 scale (weight only) kernel

* format some files

* Add unit test for int4 weight only

* fixed and formatted code

* fixed

* formated

* formated

* fixed

* fixed a bug in the ckProfiler, and formatted the code

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-03 18:35:21 +08:00
feli
4bc610416a Ck tile/layernorm: implement naive reduce, opt performance (#1784)
* add no welford

* enable output raw

* raw of int8

* fix build

* fix smoke test err

* [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate

* [cktile]layernorm, fix err commit files and remove uselss

* fix quant 8192 err & change norm_reduce class and file name

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2025-01-03 14:28:59 +08:00
aska-0096
9dd74e0d58 tempsave 2025-01-03 02:53:55 +00:00
Muhammed Emin Ozturk
9e95d54cd2 BF16 GEMM Stream-K (#1541)
* initial

* Cmake file

* successfull compilation but validation failed

* Cmake

* update

* gpu validation

* gemm universal

* gemm universal sk update

* sk bf16 universal instance

* gemm_universal_streamk.hpp

* only build for gfx94

* Cmakelist

* profiler update, bf16 sk only works at gfx42

* clang

* clang

* clang all

* no need flags

* cmake script

* delete comment

* gemm universal sk fix

* clang

* profiler fix

* clang

* update

* update

* delete comment

* code formatting

* cmake

* fix instance

* clang

* argument supported

* argument supported and clang

* update

* fix

* removing unnecessary comments

* clang formatting

* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt

Co-authored-by: afagaj <john.afaganis@gmail.com>

* CopyRight Comment 2025

* clang reformatting

* copy right 2025

---------

Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
Co-authored-by: afagaj <john.afaganis@gmail.com>
2025-01-02 10:30:04 -08:00
aska-0096
0dbe537032 refine weight preshuffle format. 2025-01-02 13:59:58 +00:00
Adam Osewski
1d8e4ec2ce Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762)
* add a prototype of int4

* clean

* debug

* clean

* clean

* move packed into dynamic_buffer

* fixed coord reset

* add fast pki4 to half conversion

* fix

* fixed reference and host_tensor

* fixed tensor init

* format

* debug i4_to_f16_convert

* format

* fixed splitk

* weight permute

* add b tile permute

* clean

* weight permute with splitki

* format

* improve weight layout

* add and_or_b32

* fixed splitk crush

* add permute switch as a template

* recover v3r1

* clean

* failure with intrawave v2

* fixed

* fixed

* add ckProfiler

* add bfp16 support

* add bf16 example

* fixed int4 to bhalf_t conversion

* format

* fixed int4 to bf16 conversion

* clean

* add instances for mem

* clean

* fixed host tensor size

* fixed

* debug

* fixed

* add pk_i4_t as a struct

* fix

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* revert

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed comments

* revert

* clean

* revert

* revert

* fixed

* Update CMakeLists.txt

* Update script/cmake-ck-dev.sh

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update CMakeLists.txt

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed

* fixed

* fixed

* revert

* revert

* add comments

* format

* fixed assert

* fixed

* Fix I4 define in ckProfiler

* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-02 11:48:06 +08:00
aska-0096
72c1ddacb9 Merge branch 'add_a8w8_preshuffle_ckprofiler' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc 2024-12-31 07:23:50 +00:00
aska-0096
5bbff07d40 use bpreshuffle as independent example 2024-12-31 07:20:01 +00:00
aska-0096
bbbedc1fd7 add fp16 instances 2024-12-31 07:14:56 +00:00
aska-0096
f60f9d5917 sanity pass, most tile size enabled. TODO: NWave!=4 2024-12-30 18:22:08 +00:00
aska-0096
74ef5021b6 tempsave 2024-12-30 09:20:25 +00:00
coderfeli
3f9dbcac63 use new pipeline for b preshuffle, run ok; revert olds to fix ckprofiler 2024-12-30 06:52:10 +00:00
coderfeli
54f44e6232 fix brepeat, kloop and lds two buffer; works ok now 2024-12-30 00:25:46 +00:00
Qianfeng
4e076909b6 Remove using partitioner for all fmha kernels (#1778)
* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex
2024-12-29 14:29:56 +08:00
Bartłomiej Kocot
af66494880 [CK TILE] GEMM and Batched GEMM SplitK support (#1724)
* [CK TILE] Add split K support in GEMM

* Updates

* Fixes

* rebase

* fix

* Fix

* fixes

* support for batched gemm
2024-12-28 14:40:17 +01:00
coderfeli
c263bbe7e0 fix cmake rm compile options 2024-12-27 14:42:47 +00:00
coderfeli
2c056624af fix tail 2024-12-27 08:30:03 +00:00
coderfeli
174b46b04a add cpu shuffle 2024-12-27 07:31:14 +00:00
coderfeli
400cac2839 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8 2024-12-27 05:42:38 +00:00
coderfeli
4a1ec81595 add bypass logic and build 2024-12-26 10:05:25 +00:00
Po Yen Chen
4c2eff023a Correct the dtype checking logics (#1775) 2024-12-25 23:57:28 +08:00
carlushuang
3d15f364b3 [CK_TILE] optimize moe-sorting kernel (#1771)
* opt moe sorting

* remove commented code
2024-12-23 10:59:02 +08:00
Po Yen Chen
37cdbf4f0e [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed piepline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stands out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow usig different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warning developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly form block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistant kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification
F#

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated experssions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes form develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix complilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debuging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove necessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-squre warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2024-12-20 14:41:01 +08:00
Mateusz Ozga
e758d006a5 Apply Ck-tile argument parser for vectors [I/O] (#1758)
* Parser for a vector was added. Additionaly we valid correctnes of numbers

* Remove unnecessary comments

* Review part 1

* Review part 2

* Add const to variadic lambda

* Rename C->K
2024-12-19 17:55:35 +01:00
aledudek
453ca37347 [CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730)
* Gemm Kernel Refactor part1

* Gemm Kernel Refactor common gemm pipeline part2

* [CK TILE] Refactor batched gemm to reuse GemmKernel

* [CK TILE] Refactor GemmKernel - review changes part1

* [CK TILE] Refactor GemmKernel - references fix

* [CK TILE] Refactor GemmKernel - naming changes, add problem

* [CK_TILE] Refactor GemmKernel - update tests

* [CK_TILE] Refactor GemmKernel - review changes

* [CK_TILE] Refactor GemmKernel - update test

* [CK_TILE] Refactor GemmKernel - constness fixes

* [CK_TILE] Refactor GemmKernel - update tests
2024-12-18 17:52:46 +01:00