Aviral Goel
aed0f5880c
Label CMakeLists message() as DEBUG or STATUS for clean build output (#2301)
...
* Elevate important build messages to log level STATUS and comment out the rest (temporarily)
* Mark all low-importance build messages as log_level=DEBUG
2025-06-10 10:46:47 -07:00
Qianfeng Zhang
08886e99d5
Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control
2025-06-10 15:44:29 +00:00
Qianfeng Zhang
4632d30cc0
Improve the VDramTileDistribution and VLds layout for better device loading and fewer bank conflicts
2025-06-08 13:25:57 +00:00
Qianfeng Zhang
84eb9adc71
Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility
2025-06-07 15:01:19 +00:00
Sami Remes
1c6f83df6c
[CK_TILE] Tileloop persistent gemm - resubmit (#2299)
...
* Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)
This reverts commit 233e274077.
* Add missing header for kentry
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-06 14:18:49 -07:00
Qianfeng Zhang
b2db644dcd
Add assert(contextual_seqlen >= 0) in example
2025-06-06 15:56:04 +00:00
Qianfeng Zhang
d7930cd541
Update IsFulleTileInsideMask() for the case where kUseLocal is true
2025-06-06 15:55:26 +00:00
Qianfeng Zhang
9e6a24010a
Move all test and bench scripts to folder scripts
2025-06-06 08:22:38 +00:00
Qianfeng Zhang
2bb59dff0a
Add two scripts
2025-06-06 08:15:25 +00:00
Illia Silin
233e274077
Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)
...
This reverts commit ffb52783d0.
2025-06-05 09:24:00 -07:00
Sami Remes
7ea1508b59
[CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222)
...
* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
2025-06-04 11:50:21 +03:00
Sami Remes
ffb52783d0
[CK_TILE] Tile loop persistent gemm kernel (#2191)
...
* Implement tile loop persistent gemm kernel
* Enable timing
* Add tests for persistent gemm
* Fix formatting
* Fix gemm_basic
* Rename True/False to Persistent/NonPersistent
* Use only one set of layouts for persistent tests
* Fix gemm example persistent template parameter
* Fix formatting
2025-06-04 11:46:28 +03:00
Khushbu Agarwal
59a85cb4bc
[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262)
...
* debugging issue
* debugging issue
* debugging
* debugging
* reverting debugging code
* clang formatted
* updating default_config.json
* fix ci failure
* clang formatted
2025-06-03 20:16:10 -07:00
Khushbu Agarwal
2e38eb4f1c
Rotating buffer PR CI fix (#2257)
...
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"
This reverts commit bbdaf79a52.
* fix regression
2025-06-02 10:25:01 -07:00
Qianfeng Zhang
9582ae2dff
Move the division by max_seqlen to the end of the Gemm1 loop in the reference hstu-attention code
2025-05-31 06:22:54 +00:00
Qianfeng Zhang
bec35abafe
Tune the settings for hdim-256
2025-05-30 08:51:49 +00:00
Qianfeng Zhang
832747c58d
Add example parameter alpha to ease the testing
2025-05-30 08:51:24 +00:00
slippedJim
57f497452a
remove restriction of group mode hd192 no lse (#2252)
...
Co-authored-by: Jim <jimguo12@amd.com>
2025-05-30 10:14:21 +08:00
Po Yen Chen
28cd0dffc9
[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251)
...
* Add constraint on traits/tile/pipeline
* Use kM0=128 if max_seqlen_q == 8192
* Re-format codegen script
* Remove redundant attr name postfix
* Fix import error: default field in dataclass
* Use kK0=64 & kK1=64 to hide latency
* Use CU utilization to decide tile size
2025-05-29 18:36:33 +09:00
Qianfeng Zhang
781cba355a
Convert P to fp16/bf16 before doing second gemm in reference hstu implementation
2025-05-29 01:07:04 +00:00
Qianfeng Zhang
36a0f2020c
Non-critical updates to the example and block_masking code
2025-05-29 01:06:32 +00:00
Qianfeng Zhang
68a5ab8ff8
Add init_qkv and dump_output example parameters for easier debugging
2025-05-29 01:05:49 +00:00
Illia Silin
bbdaf79a52
Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)
...
This reverts commit 99857e10e6.
2025-05-28 09:46:52 -06:00
Khushbu Agarwal
99857e10e6
[CK_tile] Add rotating buffer feature for universal gemm (#2200)
...
* Add rotating buffer feature for universal gemm
* adding changes in tile_engine
* Updated code to merge kernel_launch
* removing comments
* Enable rotating buffer changes to flatmm
* Created diff launch_kernel function for rotating buffer
* Simplified calculation using macros
* merge code with new changes in tile_engine
* clang formatted
* Redefine macros
2025-05-27 23:00:58 -07:00
Aviral Goel
c52649ad57
Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234)
...
* added catch statements to examples
* clang format
2025-05-27 22:32:42 -07:00
Qianfeng Zhang
10c35125d2
Add example parameter max_seqlen and max_target
2025-05-27 14:43:43 +00:00
Qianfeng Zhang
c9e19351c7
Update the method for calculating max_seqlen in the example
2025-05-27 10:38:52 +00:00
Qianfeng Zhang
dc0977faad
Use NRepetitions2DEpilogue for outputting the o_acc tile
2025-05-26 14:43:30 +00:00
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
...
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d.
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1
fix the buffer intrinsic names for clang >=20 (#2228)
2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6
Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
...
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
Qianfeng Zhang
81f7b139e0
Use LDS to indirectly load the Q-tile, enabling dwordx4 loads and avoiding wasted cache lines
2025-05-22 09:49:11 +00:00
SamiAario-AMD
380bca2b85
Fix 11_add_rmsnorm2d_rdquant (#2207)
2025-05-20 15:15:28 -07:00
Sami Remes
d1e6f0982d
[CK_TILE] Grouped GEMM tile loop (#2146)
...
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm
* Some helper functions for persistent kernel case
* Get max occupancy grid using device properties
* Implement tile loop in main entry point to grouped gemm
* Enable GridSize() on device
* Handle offset tile index using real current block index
* Add persistent kernel choice to grouped gemm example
* Use a for-loop for iterating over the group
* Reduce VGPR spills by early-exit
* Enable persistent kernel choice in grouped_gemm example
* Add persistent kernel option to grouped_gemm test
* Fix formatting with remod.py
* Remove GridUpdateBlocks as blocks are now iteratively computed
* Add comment about VGPR spilling
* Fix formatting
* Use CK_TILE_HOST instead of __host__
* Enable all Row/Col combinations in grouped gemm unit test
* Add some KBatch=2 cases to grouped gemm tests
* Fix SplitK for grouped gemm
* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm
* Add type traits
* Split examples to regular and tileloop
* Formatting
* Use hipExtStreamGetCUMask to get current active CUs for the given stream
* Align test and example kernel config, and disable validation for splitk repeats
* Remove debug options from CMakeLists.txt
* Separate the code paths for persistent/non-persistent in test
* Fix formatting
* Address review comments
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
Qianfeng Zhang
a1346aaf3e
Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm
2025-05-20 07:53:25 +00:00
Qianfeng Zhang
0a8ea6bd02
Adjust the threshold values for fp16/bf16 in the example
2025-05-20 07:52:37 +00:00
Qianfeng Zhang
29cf1610f1
Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt
2025-05-20 07:52:01 +00:00
Qianfeng Zhang
fac03abbda
Change do-while main-loop to while-do and remove early exiting check
2025-05-19 15:39:31 +00:00
Qianfeng Zhang
14ab6f154d
Adjust the code before the main loop
2025-05-19 13:59:57 +00:00
Qianfeng Zhang
f411d676f2
Move k_tile loading and v_tile loading earlier in the loop
2025-05-19 13:59:24 +00:00
Qianfeng Zhang
902b1c645c
Move k_tile loading in the loop earlier
2025-05-19 13:58:43 +00:00
Qianfeng Zhang
f582c21418
Replace s_acc and pcomp tile array by single tile object for simplification
2025-05-19 13:57:11 +00:00
Qianfeng Zhang
4e65469fe8
Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler to isolate code regions better
2025-05-18 16:20:47 +00:00
Qianfeng Zhang
e4e70f8b0a
Set the block_per_cu to 3 for hdim-128
2025-05-18 15:58:57 +00:00
Qianfeng Zhang
ff3415d97d
Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1
2025-05-18 15:26:31 +00:00
Qianfeng Zhang
694295a9d3
Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0
2025-05-18 15:25:43 +00:00
Qianfeng Zhang
afd7793e92
Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0
2025-05-18 15:25:05 +00:00
Qianfeng Zhang
7c0ac51b4b
Hack block_gemm_areg_bsmem_creg_v2 for gemm_1
2025-05-18 15:24:17 +00:00
Qianfeng Zhang
473fbc374b
Rename the hacked block_gemm_areg_bsmem_creg_v2
2025-05-18 15:23:40 +00:00
Qianfeng Zhang
58e45ec53a
Move the lambda for dividing by max_seqlen from kernel to pipeline
2025-05-18 07:58:52 +00:00