Commit Graph

1043 Commits

Author SHA1 Message Date
Aviral Goel
c52649ad57 Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234)
* added catch statements to examples

* clang format
2025-05-27 22:32:42 -07:00
Qianfeng Zhang
10c35125d2 Add example parameter max_seqlen and max_target 2025-05-27 14:43:43 +00:00
Qianfeng Zhang
c9e19351c7 Update to the method for calculating max_seqlen in the example 2025-05-27 10:38:52 +00:00
Qianfeng Zhang
dc0977faad Use NRepetitions2DEpilogue for outputting o_acc tile 2025-05-26 14:43:30 +00:00
Zzz9990
ece38b9d7a [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit aa9847e42d.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
Qianfeng Zhang
81f7b139e0 Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cachelines 2025-05-22 09:49:11 +00:00
SamiAario-AMD
380bca2b85 Fix 11_add_rmsnorm2d_rdquant (#2207) 2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
Qianfeng Zhang
a1346aaf3e Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm 2025-05-20 07:53:25 +00:00
Qianfeng Zhang
0a8ea6bd02 Adjust the threshold values for fp16/bf16 in the example 2025-05-20 07:52:37 +00:00
Qianfeng Zhang
29cf1610f1 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt 2025-05-20 07:52:01 +00:00
Aviral Goel
c4929225f6 remove debug statements from CMakeLists (#2204) 2025-05-19 17:31:04 -07:00
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00
Qianfeng Zhang
fac03abbda Change do-while main-loop to while-do and remove early exiting check 2025-05-19 15:39:31 +00:00
Qianfeng Zhang
14ab6f154d Adjust the codes before the main-loop 2025-05-19 13:59:57 +00:00
Qianfeng Zhang
f411d676f2 Move k_tile loading and v_tile loading earlier in the loop 2025-05-19 13:59:24 +00:00
Qianfeng Zhang
902b1c645c Move k_tile loading in the loop earlier 2025-05-19 13:58:43 +00:00
Qianfeng Zhang
f582c21418 Replace s_acc and pcomp tile array by single tile object for simplification 2025-05-19 13:57:11 +00:00
Qianfeng Zhang
4e65469fe8 Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler to isolate code better 2025-05-18 16:20:47 +00:00
Qianfeng Zhang
e4e70f8b0a Set the block_per_cu to 3 for hdim-128 2025-05-18 15:58:57 +00:00
Qianfeng Zhang
ff3415d97d Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 2025-05-18 15:26:31 +00:00
Qianfeng Zhang
694295a9d3 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:43 +00:00
Qianfeng Zhang
afd7793e92 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:05 +00:00
Qianfeng Zhang
7c0ac51b4b Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 2025-05-18 15:24:17 +00:00
Qianfeng Zhang
473fbc374b Rename the hacked block_gemm_areg_bsmem_creg_v2 2025-05-18 15:23:40 +00:00
Qianfeng Zhang
58e45ec53a Move the lambda for dividing by max_seqlen from kernel to pipeline 2025-05-18 07:58:52 +00:00
Qianfeng Zhang
0771390a28 Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop 2025-05-18 03:41:31 +00:00
Po Yen Chen
8cb0474b3d Use only qr_async pipeline for batch_prefill (#2195) 2025-05-15 11:47:29 -07:00
Qianfeng Zhang
586968785a Set example option -save_mask default to 0 2025-05-14 13:56:54 +00:00
Qianfeng Zhang
b0d3704390 Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask 2025-05-14 02:00:22 +00:00
Qianfeng Zhang
5b0a2618fd Add -save_mask option to the example to output int8 mask tensor 2025-05-14 01:54:45 +00:00
Illia Silin
58f9e9ffbc Update the buffer load/store intrinsic names for clang>=20. (#2192)
* fix the buffer load/store intrinsic names

* fix clang format
2025-05-13 10:18:14 -07:00
Qianfeng Zhang
c3761c3bd6 Update the rules of hstu masking 2025-05-13 10:37:19 +00:00
Po Yen Chen
2920604786 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to silence the compilation error

* [CK_TILE] Update attention customization interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-05-13 12:19:25 +08:00
Thomas Ning
b49f7de81f Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the work compiled

* Solved the example code, but still have the profiler error

* Finished the feature

* Clang format and update the CHANGELOG

* solve the preshuffle v1 & v2 problem

* Comment Addressed

* Comment Addressed
2025-05-12 09:52:58 -07:00
Thomas Ning
9d1e44e56a Vectorized Transpose for Batched Transpose CK Tile Operator (#2131)
* Shared Memory for single data point

* CKTile Transpose vectorize CP1

* CKTile Transpose vectorize CP2

* CKTile Transpose vectorize CP2.1

* fixed the compile error of the transpose tile 2d

* Have the correct result for the current test sample

* Changes to printing tensor

* fp8 support added

* Debugging for transpose

* solving the corner issue

* Changed padding flag

* Intermediate Debugging

* Intermediate Debugging

* Intermediate Debugging

* Finished debugging of the transpose op

* Code Cleanup

* Adding edge case smoke tests

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Adding Transpose test to CI/CD

* Addressing Review Comment

* Addressing Comments

* Addressing Comments

* Measuring Perf Tests

* Code Cleanup

* Changelog

* Added the running iterations

* clang format

* Fix the changelog

* Fix the compilation error

* change the printing factor

---------

Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
2025-05-12 00:41:45 -07:00
Qianfeng Zhang
3a320bcdf6 Add test cases for better functional verification 2025-05-10 16:04:06 +00:00
Qianfeng Zhang
79cd1f0653 Fix sequence dim length for o_dram descriptor in the kernel 2025-05-10 16:02:52 +00:00
Mingtao Gu
a23390163d fix moe gemm2 for gfx950 (#2164)
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-05-09 08:25:31 -07:00
BingYuan.Zhou
6a3960c1e1 Flatmm merge (#2168)
* sync with function interface of cshuffleepilogue, fix flatmm build failure

* move code from solin/flatmm which adds mfma16*16*32fp8 and optimizes flatmm

---------

Co-authored-by: solin <bingzhou@amd.com>
2025-05-08 12:59:57 +08:00
Khushbu Agarwal
c7b8e86e34 [CK_Tile] Simplified Mem pipeline (#2159)
* simplify code

* compiled the code

* Simplified example and codegen for mem pipeline

* Reverting config and universal gemm example

* clang formatted

* remove comments

* clang formatted

* Add memory operation changes for default pipeline

* fix config file

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-05-07 18:37:31 -07:00
Qianfeng Zhang
1d1dd8f1eb Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time"
This reverts commit 2972de4c88.
2025-05-07 13:37:15 +00:00
Qianfeng Zhang
d32851e15c Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack 2025-05-07 10:09:23 +00:00
Qianfeng Zhang
632fd06a7a Use kK1=16 2025-05-07 09:51:16 +00:00
Qianfeng Zhang
079f7e3a03 Use type_convert rather than static_cast in f_silu 2025-05-07 07:11:54 +00:00
kylasa
956fe8f751 Simple copy kernel, which can be a tool to experiment with CK_Tile API with minimal code. (#2156)
* Test Copy kernel code for testing tile distribution logic

* Fix the error

* Solved the problem

* Updated comments and document formatting

* Removed unused tile distribution and code cleanup

* Added README.md and formatting for CI/CD.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-05-07 00:02:59 -07:00