Commit Graph

78 Commits

Author SHA1 Message Date
root
6b60a10cc0 for gfx350 kernel debug 2025-08-29 03:39:09 +00:00
slippedJim
e4f0c4a549 [WIP] enable hd128 swa (#2137)
* enable hd128 swa
2025-05-08 21:00:37 +08:00
slippedJim
d0028193fc [WIP] update hd64 group hex & explicit hd128 in kernel name (#2118)
* remove all pssk kernels

* update: update pssk kernels

* update

* update

* fix kernel name

* fix: update initial version hex

* add shell scripts of smoke test and benchmark test for group mode

* fix typo

---------

Co-authored-by: Wen.Yang <Wen.Yang@example.com>
2025-04-24 10:16:13 +08:00
wen-des
a115ab14ed group mode attention for generic 64<hd<=128 (#2092)
* group mode attention for generic 64<hd<=128

* fix loop_idx overridden before kvBase in causal mode

* align format

---------

Co-authored-by: Wen.Yang <Wen.Yang@example.com>
Co-authored-by: danyao12 <danyao12@amd.com>
2025-04-20 11:11:11 +08:00
danyao12
7949a5adf8 hd192 temp version 2025-04-05 14:50:43 +08:00
danyao12
ab5f40cfaf enable SBHD 2025-03-24 17:58:01 +08:00
danyao12
bbb526b9d2 support for more layouts 2025-03-24 15:10:33 +08:00
danyao12
5e0ebbe8b6 fix hd64 group mode causal api 2025-03-23 12:43:25 +08:00
danyao12
917f7e55f8 Merge branch 'develop' into ck_tile/fa_bwd_v3 2025-03-22 20:38:04 +08:00
carlushuang
6c08c5c46d add mask support in hdim=192/128 (#1999) 2025-03-21 18:28:43 +08:00
wen-des
8a25aa2669 support group mode for hd=64 of fa bwd v3 (#1990)
* support group mode for hd=64 of fa bwd v3

* bug fix for causal mask kernels when using kernel balance

* tiny align

---------

Co-authored-by: Wen.Yang <Wen.Yang@example.com>
Co-authored-by: danyao12 <danyao12@amd.com>
2025-03-21 11:42:58 +08:00
carlushuang
e3c9886cdf [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-03-20 11:00:29 -07:00
rocking
b819c217e4 Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-20 00:06:45 +08:00
danyao12
e80ff1acbb tiny fix 2025-03-18 12:08:34 +08:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
carlushuang
7a93b16ff6 [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update
2025-03-11 21:07:40 +08:00
wenchenvincent
489602f9a8 Enabled bwd support for hdim_qk != hdim_v for TE integration. (#1965) 2025-03-11 11:35:27 +08:00
Max Podkorytov
9e132eb77c refactor ck-tile kernel launch (#1925) 2025-03-07 08:29:40 -08:00
Ye Wang
151e999931 with receipt id=5, enable bias for te 2025-03-05 00:24:41 -06:00
danyao12
173ed584e0 Merge branch 'develop' into ck_tile/fa_bwd_v3 2025-03-03 19:17:27 +08:00
danyao12
06ffd0b3ca add api limit for top-left causal mask 2025-03-03 14:00:34 +08:00
danyao12
60178ac970 hd64~128(x8) fp16/bf16 a32 w/cas_kb arbitrary seqlen_q/seqlen_k/strides 2025-02-28 15:17:27 +08:00
rocking
faa2235dad explicit show no feature in kernel name (#1920) 2025-02-28 14:23:30 +08:00
slippedJim
a9bcd3c98d make fmha bwd api template for v2 & v3 (#1918)
* use template fmha_bwd function

* update

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-02-27 19:26:19 +08:00
rocking
e9ee568681 Apply filter to every kernel in the codegen of FMHA (#1911)
* add receipt for fwd

* Add receipt for bwd

* Use kernel name to avoid more receipt

* apply filter to every kernel
2025-02-26 20:20:29 +08:00
Dan Yao
32760acdbe Merge pull request #1897 from ROCm/yewang12/ck_fav3_thread_local
make fmha_bwd_v3_kernel thread_local
2025-02-22 12:00:42 +08:00
danyao12
440a9d3bf7 Merge branch 'develop' into ck_tile/fa_bwd_v3 2025-02-20 16:01:26 +08:00
rocking
e4358c01d9 only output the deterministic bwd kernel for aiter (#1903)
* only output the deterministic kernel

* Add comment
2025-02-20 04:27:01 +08:00
danyao12
f1c0a36b9c Merge branch 'develop' into ck_tile/fa_bwd_v3 2025-02-19 15:22:29 +08:00
rocking
f0d49d14fc Add receipt 10~12 for codegen of aiter integration (#1877)
* Add receipt for aiter integration

* update receipt

* Add hdim 96 instances

* Revert "Add hdim 96 instances"

This reverts commit f339449f54.
2025-02-19 09:01:08 +08:00
Ye Wang
216a6c2518 make fmha_bwd_v3_kernel thread_local
In Jax TE, multiple threads in the same process are spawned to train, one for each GPU. Therefore hipModuleLoadData and hipGetFunction need to be run for each GPU in its corresponding thread.
2025-02-18 10:45:09 -06:00
Andres Lugo
8086bbe3a7 Add receipt 4 option to codegen (#1875)
* Add receipt 4 option to codegen

* Remove repeated code

* Review comments
2025-02-11 10:11:46 -08:00
danyao12
cf70a2efbb rename 2025-02-08 15:44:33 +08:00
danyao12
8a8dc7f6c3 add hd64 fp16 kernels 2025-02-05 16:24:21 +08:00
danyao12
008c91c988 add layout restrictions 2025-01-27 10:57:14 +08:00
danyao12
92494a8a70 separate hd pad/unpad kernels 2025-01-24 13:55:02 +08:00
danyao12
b16fa5f08c hd padding(hd % 8 == 0) support from 64 to 128 2025-01-23 15:43:40 +08:00
danyao12
248fd58859 remove v3 spec 2025-01-21 15:36:29 +08:00
danyao12
d61f4b8393 fix hd64 seqlen64 memory fault 2025-01-13 12:14:51 +08:00
Po Yen Chen
24b12d04af [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789)
* Update license year

* Add initial code to override decode problem

* Fix splitkv traits/args overriding error

* Reshape and transpose lse for decode

* Remove debug code

* Prettify example code

* Use better function name

* Add kMergeNumHeadGroupsSeqLenQ flag

Kernel user can use this switch to turn on/off optimization for
some problem sizes

* Add missing flag declarations

* Default turn off kMergeNumHeadGroupsSeqLenQ in codegen

* Group similar statements together

* Remove assumption of seqlen_q=1

* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel

* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel

* Run kMergeNumHeadGroupsSeqLenQ=true kernels when need

* Fix group mode block skip logics

* Undo changes of normal fwd kernel

* Update in GridSize() and using GridSize() for splitkv kernel (#1799)

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
2025-01-07 18:49:24 +08:00
danyao12
466b82a5c4 add data type config to FAv3 2025-01-07 15:39:33 +08:00
danyao12
cd4d462909 Merge branch 'develop' into ck_tile/fa_bwd_v3 2025-01-07 14:55:08 +08:00
danyao12
0c126ffc19 qdo/kv strides split 2025-01-03 16:01:30 +08:00
Qianfeng
4e076909b6 Remove using partitioner for all fmha kernels (#1778)
* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex
2024-12-29 14:29:56 +08:00
danyao12
43d8190333 add templates 2024-12-28 22:26:55 +08:00
danyao12
2defe2f6e9 enable hd64 bf16 atomic32 2024-12-28 13:09:34 +08:00
Po Yen Chen
4c2eff023a Correct the dtype checking logics (#1775) 2024-12-25 23:57:28 +08:00
danyao12
64bf2f361d enable hd64 bf16 causal 2024-12-25 13:31:05 +08:00
Po Yen Chen
37cdbf4f0e [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed pipeline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stand out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow using different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warn developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly from block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistent kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated expressions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes from develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix compilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debugging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove unnecessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-square warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix by using static_for to pass CI checks

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2024-12-20 14:41:01 +08:00