1065 Commits

Author SHA1 Message Date
Qianfeng Zhang
473fbc374b Rename the hacked block_gemm_areg_bsmem_creg_v2 2025-05-18 15:23:40 +00:00
Qianfeng Zhang
58e45ec53a Move the lambda for dividing by max_seqlen from kernel to pipeline 2025-05-18 07:58:52 +00:00
Qianfeng Zhang
0771390a28 Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop 2025-05-18 03:41:31 +00:00
Po Yen Chen
8cb0474b3d Use only qr_async pipeline for batch_prefill (#2195) 2025-05-15 11:47:29 -07:00
Qianfeng Zhang
586968785a Set example option -save_mask default to 0 2025-05-14 13:56:54 +00:00
Qianfeng Zhang
b0d3704390 Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask 2025-05-14 02:00:22 +00:00
Qianfeng Zhang
5b0a2618fd Add -save_mask option to the example to output int8 mask tensor 2025-05-14 01:54:45 +00:00
Illia Silin
58f9e9ffbc Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format
2025-05-13 10:18:14 -07:00
Qianfeng Zhang
c3761c3bd6 Update the rules of hstu masking 2025-05-13 10:37:19 +00:00
Po Yen Chen
2920604786 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to silence the compilation error

* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
Thomas Ning
b49f7de81f Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the work compiled

* Solved the example code, but still have the profiler error

* Finished the feature

* Clang format and update the CHANGELOG

* solve the preshuffle v1 & v2 problem

* Comment Addressed

* Comment Addressed
2025-05-12 09:52:58 -07:00
Thomas Ning
9d1e44e56a Vectorized Transpose for Batched Transpose CK Tile Operator (#2131)
* Shared Memory for single data point

* CKTile Transpose vectorize CP1

* CKTile Transpose vectorize CP2

* CKTile Transpose vectorize CP2.1

* fixed the compile error of the transpose tile 2d

* Have the correct result for the current test sample

* Changes to printing tensor

* fp8 support added

* Debugging for transpose

* solving the corner issue

* Changed padding flag

* Intermediate Debugging

* Intermediate Debugging

* Intermediate Debugging

* Finished debugging of the transpose op

* Code Cleanup

* Adding edge case smoke tests

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Addressing Review Comment

* Addressing Comments

* Addressing Comments

* Measuring Perf Tests

* Code Cleanup

* Changelog

* Added the running iterations

* clang format

* Fix the changelog

* Fix the compilation error

* change the printing factor

---------

Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
2025-05-12 00:41:45 -07:00
Qianfeng Zhang
3a320bcdf6 Add test cases for better functional verification 2025-05-10 16:04:06 +00:00
Qianfeng Zhang
79cd1f0653 Fix sequence dim length for o_dram descriptor in the kernel 2025-05-10 16:02:52 +00:00
Mingtao Gu
a23390163d fix moe gemm2 for gfx950 (#2164)
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-05-09 08:25:31 -07:00
BingYuan.Zhou
6a3960c1e1 Flatmm merge (#2168)
* sync with function interface of cshuffleepilogue, fix flatmm build failure

* move code from solin/flatmm which adds mfma16*16*32fp8 and optimizes flatmm

---------

Co-authored-by: solin <bingzhou@amd.com>
2025-05-08 12:59:57 +08:00
Khushbu Agarwal
c7b8e86e34 [CK_Tile] Simplified Mem pipeline (#2159)
* simplify code

* compiled the code

* Simplified example and codegen for mem pipeline

* Reverting config and universal gemm example

* clang formatted

* remove comments

* clang formatted

* Add memory operation changes for default pipeline

* fix config file

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-05-07 18:37:31 -07:00
Qianfeng Zhang
1d1dd8f1eb Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time"
This reverts commit 2972de4c88.
2025-05-07 13:37:15 +00:00
Qianfeng Zhang
d32851e15c Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack 2025-05-07 10:09:23 +00:00
Qianfeng Zhang
632fd06a7a Use kK1=16 2025-05-07 09:51:16 +00:00
Qianfeng Zhang
079f7e3a03 Use type_convert rather than static_cast in f_silu 2025-05-07 07:11:54 +00:00
kylasa
956fe8f751 Simple copy kernel, which can be a tool to experiment with CK_Tile API with minimal code. (#2156)
* Test Copy kernel code for testing tile distribution logic

* Fix the error

* Solved the problem

* Updated comments and document formatting

* Removed unused tile distribution and code cleanup

* Added README.md and formatting for CI/CD.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-05-07 00:02:59 -07:00
Qianfeng Zhang
72d55d1b40 Add max_seqlen as divider in siLu 2025-05-06 16:18:52 +00:00
carlushuang
4e9b76f88c [CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153)
* combine 2-3 as single stage

* support zeroing

* improve long tokens

* update specialization

* b16 ws

* 8bit topk optimize

* update 15 example
2025-05-06 17:32:07 +08:00
Qianfeng Zhang
374e0626e6 Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention 2025-05-06 08:24:03 +00:00
jakpiase
0bcb804ad0 [CK_TILE] Remove scratch usage from universal gemm (#2001)
* moves kbatch condition outside of kernel

* add reviewer comments

* fixes

* fix tests

* fixes after review

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-05 18:46:44 +02:00
Qianfeng Zhang
611f2ce1f9 Override and fix GetAlignmentK() 2025-05-03 16:23:54 +00:00
Andriy Roshchenko
79b0bfeb41 MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152)
* Add gemm_mx_fp8_bf8 example with row-major B

* Add more overloads of MX MFMA instructions

* Add MK_KN (RRR) tests

* Add KM_NK (CCR) tests

* Add more problem sizes to Large tests

* Add test_gemm_mx to the list of regression tests
2025-05-01 11:55:48 -06:00
Illia Silin
9a9f59ae69 Revert "Add ck tile examples to package (#1880)" (#2150) 2025-04-30 10:20:16 -07:00
Qianfeng Zhang
da89540ee0 Use kN0=32 2025-04-30 05:42:43 +00:00
Qianfeng Zhang
2972de4c88 Temporarily close the instance for hdim64 and hdim256 to save compiling time 2025-04-30 02:20:41 +00:00
Daniel Su
d9786f3363 Check max-ilp-scheduling compiler option for moe_gemm examples (#2127) 2025-04-28 13:40:22 -07:00
jakpiase
434d19f696 Add ck tile examples to package (#1880)
* add ck tile examples to package

* Update jenkinsfile

* fix for jenkinsfile

* fix for building ck tile code on non gfx9

* compile ck tile examples only for gfx94

* include ck tile examples in all target

* fix for basic gemm UseStructuredSparsity

* Update CMakeLists.txt

* Update gemm_pipeline_problem.hpp

* add targets to rocm install

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-04-28 09:53:19 -07:00
lalala-sh
83394e40d2 fix moe i4 example bug (#2139) 2025-04-28 09:49:31 -07:00
Qianfeng Zhang
d63dab90da Hack block_gemm_areg_bsmem_creg_v2 so that s_acc for gemm_0 need not be cleared first 2025-04-28 15:24:13 +00:00
Qianfeng Zhang
f1f4e249a6 Adjust the v_tile and k_tile loading location 2025-04-28 09:25:09 +00:00
Qianfeng Zhang
f53be61a74 Put the two gemm calls inside one n0loop unroll 2025-04-28 06:41:37 +00:00
Yi DING
8add2cf45d Fix fp8 convert & add option for basic example (#2129) 2025-04-27 16:26:05 -07:00
Qianfeng Zhang
1af27022ef Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausl=true but kUseLocal=false 2025-04-27 09:31:38 +00:00
Po Yen Chen
3d4d70d2fc Avoid using store_tile_raw() for fp32 tensors (#2072) 2025-04-26 23:07:41 -07:00
Qianfeng Zhang
054c397e05 Replace set_tile_if() by sweep_tile_span() to reduce branching 2025-04-27 05:00:09 +00:00
Qianfeng Zhang
95c93ba92e Update the GridSize() and GetTileIndex() in hstu kernel 2025-04-26 10:01:23 +00:00
Qianfeng Zhang
1b463e915d Add scripts for measuring jagged with/no causal cases 2025-04-25 15:59:51 +00:00
Qianfeng Zhang
9996270087 Tiny update in IsTokenPairInsideMask() 2025-04-25 15:36:58 +00:00
Qianfeng Zhang
27f7ab4f2c Use compiler builtin directly in f_silu for float type 2025-04-25 15:05:39 +00:00
Qianfeng Zhang
80677eb6e0 Code re-arrangement in pipeline 2025-04-25 14:41:37 +00:00
Qianfeng Zhang
4a49119d98 Update the seqlen_k_curr inside the first gemm loop 2025-04-25 13:59:48 +00:00
joyeamd
41541aff7a SWDEV-52596 for hdim=256, when using the splitkv pipeline, two new pipelines need to be added (#2126) 2025-04-25 16:31:09 +08:00
Qianfeng Zhang
7818cce1c3 Rename the performance measurement scripts 2025-04-25 06:09:17 +00:00
Qianfeng Zhang
05910ebe0b Add support for WarpGem-16x16x32 in QK-BlockGemm (which enables using ds_write/read_b128 for K) 2025-04-25 06:06:50 +00:00