Commit Graph

1780 Commits

Author SHA1 Message Date
Clement Lin
3e61925277 Remove unused code 2025-04-09 15:08:09 +08:00
YC Lin
fd26846d61 [GEMM] Refactor block gemm, pipeline, and policy of instruction schedule opt 2025-04-09 03:19:10 +00:00
YC Lin
fe61498468 [Add] Add build option for generating assembly 2025-04-06 23:50:26 +00:00
YC Lin
aac02a92ac [GEMM] Refactor block gemm and pipeline policy of instruction schedule 2025-04-06 23:37:29 +00:00
Clement Lin
9151a1fb42 Add flash_attention_fwd toy_example 2025-04-04 00:39:02 +08:00
mhYang
f28aabb42f Update tile size and use slc 2025-04-02 12:23:23 +00:00
mhYang
04be0fb437 Make buffer coherence configurable in tensor view 2025-04-02 11:51:23 +00:00
mhYang
d5531ab9c9 Fix add flops calculation 2025-04-01 19:13:18 +00:00
ClementLinCF
04513ca683 Create README.md 2025-04-01 09:11:29 +08:00
mhYang
d1dbc69eda Use mfma 16x16x32 2025-03-31 23:18:22 +00:00
mhYang
ee28e965f2 Fix KERNEL_D config 2025-03-31 17:53:57 +00:00
YC Lin
68cd6609eb [GEMM] Add pragma message for different MFMA options 2025-03-30 20:05:35 +00:00
YC Lin
a8027a5b2f [GEMM] Fix print typos 2025-03-30 19:55:13 +00:00
Clement Lin
5af7efdec5 Fix indentation typo 2025-03-30 15:06:07 +08:00
Clement Lin
de9385ba51 [GEMM] Fix MFMA condition checks 2025-03-30 14:02:30 +08:00
Clement Lin
5dd8e4ae0c [GEMM] Add new macro options check 2025-03-30 10:07:21 +08:00
Clement Lin
7bc473835e [GEMM] Add macros for multiple optimization options 2025-03-29 22:58:51 +08:00
YC Lin
428bcdeb40 [GEMM] default MFMA config 2025-03-29 21:11:29 +00:00
YC Lin
9a0d9dfc0a git push test 2025-03-29 21:09:00 +00:00
root
a3c6ca1761 [GEMM] fix MFMA configurations 2025-03-29 21:05:21 +00:00
mhYang
16a4e1585a Adjust mfma schedule order 2025-03-28 18:32:12 +00:00
Clement Lin
5428f17ca2 [GEMM] Replace const auto with constexpr index_t 2025-03-28 17:39:49 +08:00
Clement Lin
4eb246f20c [GEMM] Update cache-aware wg schedule 2025-03-28 17:31:19 +08:00
bobofang
8f1aa6fc6f Add MFMA M16N16K16 and M16N16K32 methods
these two methods are default off
2025-03-28 16:31:53 +00:00
YC Lin
f1562d5911 [GEMM] remove a_col_major/b_row_major case 2025-03-26 16:31:24 +00:00
root
6d2b55914e [GEMM] modify if-else locations 2025-03-26 13:35:00 +00:00
mhYang
9ecde871a3 Fix AccDataType and CDataType
1. Fix AccDataType and CDataType
2. Remove indent
3. Align merge_transform for tutorial
2025-03-25 20:08:59 +00:00
mhYang
67072b3ba9 Fix build error 2025-03-25 18:55:34 +00:00
MHYang-gh
b0b5827673 Fix A/B lds transform (#2007) 2025-03-25 18:52:41 +00:00
root
7d08f99b02 [GEMM] disable/enable instruction scheduling 2025-03-25 16:53:20 +00:00
mhYang
4494c54dcd Fix missing message 2025-03-21 15:06:29 +00:00
mhYang
8f3b534d29 Fix xor transform dim. 2025-03-21 15:00:05 +00:00
Clement Lin
1f604e9b0a [GEMM] Add cache-aware WG schedule and adjust block tile
113 -> 121.7 TFLOPS
2025-03-21 09:15:17 +08:00
mhYang
93193e42ea Add LDS bank conflict solutions 2025-03-20 21:36:56 +00:00
bobofang
d635209d59 Fix add accuracy issue
2673 GB/s -> 3271 GB/s
Perf: 0.0512898 ms, 3271.06 GB/s
2025-03-19 12:26:30 +08:00
root
ff15e2da7a [GEMM] use mfma k8 warp gemm 2025-03-17 16:01:29 +00:00
root
10033c1cdc [GEMM] disable/enable prefetch 2025-03-17 14:22:49 +00:00
Clement Lin
803ecb93d8 [CK TILE] Toy example - basic gemm 2025-03-12 15:55:47 +08:00
Clement Lin
1afc32c59c Adjust block shape
2673 GB/s -> 3647 GB/s
2025-03-12 14:58:06 +08:00
Clement Lin
58bc69aa99 Utilize vectorized memory access
1998.24 GB/s -> 2673 GB/s
2025-03-12 14:44:41 +08:00
Clement Lin
399cdb6f9f Adjust the size of thread block
1968.42 GB/s -> 1998.24 GB/s
2025-03-12 14:33:36 +08:00
Clement Lin
712d96cef5 [CK TILE] Toy example - basic add 2025-03-11 16:23:12 +08:00
Mingtao Gu
0db7c8f0b2 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not check.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-10 11:16:44 +08:00
Thomas Ning
c954bd0cfa Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955)
* tempsave, not selected

* finish the feature and merge with develop

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
2025-03-07 13:44:06 -08:00
Thomas Ning
9d51d17dd0 Fix on the error (#1956) 2025-03-07 13:43:52 -08:00
Illia Silin
0e8e711ec8 add missing headers (#1959) 2025-03-07 11:11:30 -08:00
Max Podkorytov
9e132eb77c refactor ck-tile kernel launch (#1925) 2025-03-07 08:29:40 -08:00
Qianfeng
4f54fa3058 Ck tile/complete k prefetch (#1941)
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce

* Remove last block_sync_lds() in the loop

* Tiny adjustment in qr_ks_vs_async pipeline for better performance

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline

* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline

* Use un-rolled gemm for Gemm-0

* Use k0_loops small tile load/store to replace the big tile load/store for K

* Remove the commented lines in qx_ks_vs_custom_policy.hpp

* Tune the prefetching of V in qr_ks_vs_async pipeline

* Move the codes for storing the first v_lds tile some later

* Let BlockDropout reuse LDS with V

* Switch to separate code blocks according to iteration index

* Interleave code blocks for better performance

* Move clear_tile(s_acc) for better interleaving

* Move code interleaving

* Use MakeQDramTileDistribution for q_dram_window

* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop

* Let V reuse the LDS of K

* Use array of tiles to represent Q in vgprs

* Use QLoadOnce == false for qr_ks_vs_async pipeline

* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline

* Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile()

* Move the definition of v_tiles out from the loop

* Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile()

* Remove using KLoadOnce in qx_ks_vs_custom_policy

* Remove un-used get_slice_tile() call

* Move the code line of clear_tile(s_acc)

* Tune the lines of codes to make them more tidy

* Re-arrange the codes before the main-loop

* Add comments

* Unify the alignment to be 8 for Q/K/V Lds descriptors

* Tuning to K pre-loading

* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false

* Adjust the pipeline codes

* Use NumPrefetchV to separate from NumVLdsBuffers

* Tune the location of a scheduler barrier code line

* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths

* Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel

* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline

* Move the location for store_tile() of first v_tile

* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline

* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>

* Try to fix old bugs in qx_ks_vs_custom_policy

* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler

* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy

* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)

* Tiny Fix (#1888)

* Ck tile/paged attention workaround (#1894)

* Correction in GetRangeAlongX()

* Work-around to solve the failures in test_paged_attention_ck in xformers

* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline

* Remove one call of move_tile_window for q_dram_window

* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()

* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline

* Adjust the location of codes for storing the first V tile to LDS

* Tiny fix and add comments

* Change GetSmemKPackK size to improve performance

* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy

* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access

* Adjustment in intermediate iteration codes for tiny performance improvement

* Reduce the number of VLds buffers to 2 for whole_k_prefetch situation

* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue

* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()

* Remove useless AsyncopyV

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used

* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy

* Recover the qr_ks_vs_async pipeline

* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline

* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"

This reverts commit 39b82ca194.

* Tiny fix with regard to whole_k_prefetch pipeline compiling

* Update kPadSeqLenK setting in fmha_fwd_kernel

* Use q_element_func and k_element_func

* Use single q_tile rather than multiple sliced q_tiles

* Codes refine according to the comments

* Re-format one file

* Mark qr_ks_vs_whole_k_prefetch as QLoadOnce == true
2025-03-07 14:19:51 +08:00
Illia Silin
43c90b5234 RE-enable DL and DPP instances by default. (#1954)
* enable DL and DPP instances by default

* fix cmake logic
2025-03-06 21:45:31 -08:00
Max Podkorytov
7a4a5d6c08 Update CODEOWNERS (#1953)
Add @tenpercent
2025-03-06 17:38:29 -08:00