composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	85bc8fd805	Add example parameter max_seqlen and max_target	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	46301a85d9	Update to the method for calculating max_seqlen in the example	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	a079b95b77	Use NRepetitions2DEpilogue for outputing o_acc tile	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	eba3242ab8	Use LDS to in-directly load Q-tile to enable dwordx4 loading and avoid cachelines wasting	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	d74b41070f	Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	4833daf43d	Adjust the threshold values for fp16/bf16 in the example	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	6e38888f46	Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	2b94d9261c	Change do-while main-loop to while-do and remove early exiting check	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	abc8335c43	Adjust the codes before the main-loop	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	946e917e2c	Move k_tile loading and v_tile loading earlier in the loop	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	45ac659ae0	Move k_tile loading in the loop earlier	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	109dcfc2f0	Replace s_acc and pcomp tile array by single tile object for simplification	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	40056b95a9	Add __builtin_amdgcn_sched_barrier(0) for instructing the compiler for better codes isolation	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	8c43b793c9	Set the block_per_cu to 3 for hdim-128	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	1bbefda240	Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	11718b0af4	Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	d01b4f27c6	Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	4545d2efc1	Hack block_gemm_areg_bsmem_creg_v2 for gemm_1	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	6e7553be77	Rename the hacked block_gemm_areg_bsmem_creg_v2	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	e5977717a8	Move the lambda for dividing by max_seqlen from kernel to pipeline	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	70237d2e5c	Move the dividing by max_seqlen out of f_silu to be handle outside the main-loop	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	27e64a682a	Set example option -save_mask default to 0	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	0ee9dff5cb	Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	4ff88b4400	Add -save_mask option to the example to output int8 mask tensor	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	124539e123	Update the rules of hstu masking	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	b2cd7757f0	Add test cases for better functional verification	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	3d83d23a55	Fix sequence dim length for o_dram descriptor in the kernel	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	010b3f48b3	Revert "Temporarily close the instance for hdim64 and hdim256 to save compiling time" This reverts commit `2972de4c88`.	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	bce38c1531	Simplification in the static iterations of block_gemm_areg_bsmem_creg_v2_hack	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	87b5aa78bd	Use kK1=16	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	bce88a9e73	Use type_convert rather than static_cast in f_silu	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	9c3e49a1d0	Add max_seqlen as divider in siLu	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	717aae7ce7	Remove using cast_tile_pk_fp16_fp32 for better accuracy for fp16 hstu attention	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	380165c3dc	Override and fix GetAlignmentK()	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	34998cfd19	Use kN0=32	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	94f8d71ee2	Temporarily close the instance for hdim64 and hdim256 to save compiling time	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	9df0fad750	Hack block_gemm_areg_bsmem_creg_v2 to let s_acc for gemm_0 not need be cleared first	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	6bf4877a20	Adjust the v_tile and k_tile loading location	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	ba037426c5	Put two gemms call inside one n0loop unroll	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	23852ef4c0	Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausl=true but kUseLocal=false	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	ddd9227453	Replace set_tile_if() by sweep_tile_span() to reduce branching	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	187f4d3f68	Update the GridSize() and GetTileIndex() in hstu kernel	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	9a08e1090e	Add scripts for measuring jagged with/no causal cases	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	c0128a9156	Tiny update in IsTokenPairInsideMask()	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	ac0e593e0d	Use compiler builtin directly in f_silu for float type	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	31c21c74d8	Code re-arrangement in pipeline	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	eb2564fe46	Update the seqlen_k_curr inside the first gemm loop	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	40683ee932	Rename the performance measurement scripts	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	79fdd564b8	Add support for WarpGemm-16x16x32 in QK-BlockGemm (which enables using ds_write/read_b128 for K)	2026-06-23 09:19:46 +00:00
Qianfeng Zhang	1986d8c578	Update in K-Lds laying-out to consider for both WarpGemm-32x32x16 and WarpGemm-16x16x16	2026-06-23 09:19:46 +00:00

1149 Commits