Qianfeng Zhang
b75077475b
Remove useless code in the two trload pipelines
2025-11-15 13:48:50 +00:00
Qianfeng Zhang
238b5c4f08
Separate Traits from Problem while being used for defining the pipeline
2025-11-14 16:42:06 +00:00
Qianfeng Zhang
95c1bb25e3
Remove the k_element_func and v_element_func from the pipeline since they are not used
2025-11-13 14:53:47 +00:00
Qianfeng Zhang
881ddc5741
Update the two trload pipelines to load the whole Q-tile once through LDS on mi350
2025-11-12 15:59:38 +00:00
Qianfeng Zhang
8f876f094e
Simplify the code in block_gemm_areg_bsmem_creg_v2_hack_1
2025-11-10 16:11:26 +00:00
Qianfeng Zhang
303818a851
Simplify the code in block_gemm_areg_bsmem_trload_creg
2025-11-10 15:27:34 +00:00
Qianfeng Zhang
bd0444f365
[Performance] Change the tile settings for the mi350/trload no_softmax pipeline to enable the use of mfma-16x16x32 for Gemm-1
2025-11-06 08:20:11 +00:00
Qianfeng Zhang
54cd431f16
Improve the softmax+trload pipeline by using kN0=64 and prefetching only two k tiles
2025-11-05 16:23:05 +00:00
Qianfeng Zhang
d190af2ef5
Tiny fix in trload with_softmax/no_softmax pipeline
2025-11-05 14:44:13 +00:00
Qianfeng Zhang
99993acca4
Improve both the with_softmax and no_softmax pipelines
2025-11-04 15:46:49 +00:00
Qianfeng Zhang
bc22b83b19
Add kUseTrLoad = false in non-trload pipeline
2025-11-03 12:40:16 +00:00
Qianfeng Zhang
e40ab20b9e
Clarify the use of CK_TILE_HOST and CK_TILE_HOST_DEVICE to save compile time
2025-11-03 09:40:03 +00:00
Qianfeng Zhang
e31829384d
Change how max_uih_seqlen is updated in the example
2025-11-02 03:59:46 +00:00
Qianfeng Zhang
39cb8c33d1
Use supplement_array_by_last_element(num_targets, ) in example
2025-11-02 03:34:54 +00:00
Qianfeng Zhang
80e08b6efe
Use supplement_array_by_last_element() in example to simplify the code
2025-11-02 03:17:16 +00:00
Qianfeng Zhang
10133e5d51
Update README.md
2025-11-02 03:16:48 +00:00
Qianfeng Zhang
8408ec0a02
Add scripts for testing the use of separate sequence lengths for k/v
2025-11-02 03:16:22 +00:00
Qianfeng Zhang
17e404be3b
Support separate sequence lengths for q and kv
2025-11-02 03:14:53 +00:00
Qianfeng Zhang
eaf9650fed
Use separate pipelines for the with-softmax and no-softmax cases
2025-10-30 10:01:52 +00:00
Qianfeng Zhang
207e6f10b8
Implementation of hstu attention pipeline using trload for v on mi350
2025-10-29 15:45:14 +00:00
Qianfeng Zhang
a464269bb6
Fix the comments
2025-10-27 10:47:40 +00:00
Qianfeng Zhang
4eeb5cc917
Update gemm_0's CBlockDistribution encoding so that it is compatible with gemm_1's ABlockDistribution encoding
2025-10-27 10:47:23 +00:00
Qianfeng Zhang
98a241a2eb
Use separate tile settings for the no-softmax and with-softmax hstu attention cases
2025-10-24 01:47:55 +00:00
Qianfeng Zhang
7c4012266a
Update benchmark scripts to account for using softmax
2025-10-23 10:09:37 +00:00
Qianfeng Zhang
d1505786f8
Add support for softmax in hstu attention
2025-10-20 14:26:55 +00:00
Qianfeng Zhang
a874839dc2
Add a template parameter to gemm_0 MakeCBlockTile(), needed for defining PcompBlockTileType
2025-10-20 14:26:29 +00:00
Qianfeng Zhang
1a8f2f21fb
Move scaling by attn_scale inside the main loop
2025-10-20 14:22:18 +00:00
Qianfeng Zhang
bbda3f6f1c
Let IsTokenPairInsideMask() return bool type
2025-10-20 14:21:26 +00:00
Qianfeng Zhang
fdb89d3e2f
Add instances needed for adding softmax support
2025-10-20 14:20:54 +00:00
Qianfeng Zhang
2072e53d1e
Remove K0 from tile setting since it is not used
2025-10-14 07:15:26 +00:00
Qianfeng Zhang
22a7b31865
Change the pipeline so that it is easier to add softmax support
2025-10-12 06:09:55 +00:00
Qianfeng Zhang
d308b09fae
Remove use of the IGLP method for instruction scheduling on the kUseLocal true path
2025-10-12 06:09:25 +00:00
Qianfeng Zhang
6b40ce4074
Fix in GetQKBlockGemm()
2025-09-27 14:59:32 +00:00
Qianfeng Zhang
27b96b15c4
Simplify the warp_gemm definitions in GetQKBlockGemm and GetKVBlockGemm
2025-09-25 15:38:55 +00:00
Qianfeng Zhang
bd32cc0de0
Remove useless constant statement in the kernel
2025-09-19 07:24:29 +00:00
Qianfeng Zhang
db62a9f47e
Remove unnecessary HSTU_CHECK() calls
2025-09-13 16:39:14 +00:00
Qianfeng Zhang
2427426640
Add HSTU_CHECK() and use it in example code
2025-09-13 16:38:33 +00:00
Qianfeng Zhang
a5b7360862
Small update in reference hstu attention
2025-09-13 06:53:54 +00:00
Qianfeng Zhang
798fc3cc0b
Detach HstuBlockMask from pipeline definition and construct the HstuBlockMask type in the kernel according to window_size
2025-09-12 09:11:47 +00:00
Qianfeng Zhang
7d10353fda
Unify the license statements on all the source files
2025-09-11 10:27:00 +00:00
Qianfeng Zhang
1c030e8c3c
Remove the use of MakeKargsImpl() to simplify the hstu kernel
2025-09-10 15:28:12 +00:00
Qianfeng Zhang
72eb4e95d8
Clarify the use of kSubQKHeaddim and kQKHeaddim so that less regular hdim (e.g. 96, 160) can be efficiently supported
2025-09-09 12:55:01 +00:00
Qianfeng Zhang
f8dea2bc86
Use set_slice_tile() to replace direct thread_buffer assignment
2025-09-09 12:54:32 +00:00
Qianfeng Zhang
4bf65d9fe5
Merge branch 'develop' into hstu_attention_mi350_fwd_bwd and change the use of ck_tile::make_kernel
2025-09-01 07:35:35 +00:00
Aviral Goel
fcff0043ae
chore(gemm): clang format to pass CI (#2758)
2025-08-29 00:38:46 -07:00
Vijay Krish
4208e28988
ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)
* This change introduces new pipelines with the Intrawave scheduler and block gemm primitives that load the scale tensor to registers to perform dequantization post MFMA on the C tensor in registers.
Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go as low as the K length of the underlying WarpGemm primitive.
* Solve merge conflict
* [CK TILE] Update CHANGELOG.md
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2025-08-28 23:43:02 -07:00
Cong Ma
428090f749
Support transposed C tile in Aquant (#2679)
The performance of Aquant has increased after enabling transposed C.
AQ elements no longer need to be exchanged among lanes after enabling
transposed C, as each thread only holds data from one row.
2025-08-28 13:28:09 -07:00
asleepzzz
038ea82315
Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
This reverts commit ead4447b20.
2025-08-28 22:50:42 +08:00
linqunAMD
4a49dac7c6
[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728)
* fix copy basic build error
* fix other ck tile test build error
2025-08-28 20:30:30 +08:00
Yi DING
ead4447b20
[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
* 16x192
* Use buffer_load_lds for lse/d
* Dispatch & cleanup
* Avoid zeroing dq & fix
* fix
2025-08-28 18:54:18 +08:00