composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	ff807ddd1a	Tiny movement in the code lines of the pipeline	2026-06-23 09:20:58 +00:00
Qianfeng Zhang	818722a3c6	Use two work-groups per compute-unit for scheduling the kernel	2026-06-23 09:20:58 +00:00
Qianfeng Zhang	1a917993cf	Simplify the code in all host/device IsTokenPairInsideMask() to reduce branching	2026-06-23 09:20:58 +00:00
Qianfeng Zhang	57050d93a6	Fix masking for min_full_attn_seqlen > 0 situation	2026-06-23 09:20:58 +00:00
Qianfeng Zhang	508f4ac632	Update to test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py to align their testings	2026-06-23 09:20:58 +00:00
Qianfeng Zhang	49a12df133	Completely remove the dependency to include/ck_tile/ops/fmha/ops headers	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	f62c52a499	Fix use of KV LdsBuffers to avoid unexpected overwriting that causes non-deterministic results	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	9cdd64f337	Change while() do to do while() for the main loop to let the compiler generate more elegant code	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	3413400044	Use batch dim as first grid dim by default and replace env ASSUME_LEAST_VARIED_SEQLEN by ASSUME_HIGHLY_VARIED_SEQLEN	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	dcef0260ff	Align the -seqlens=xxx in the mattn0_full0 and mattn256_full256 scripts with the required benchmarks	2026-06-23 09:20:57 +00:00
root	07635af84c	Tiny fix in hstu attention IsFullTileInsideMask()	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	e033a82bd7	Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	d39ba52e07	Improve the VDramTileDistribution and VLds layout for better device loading and reduce bank-conflict	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	5f16f5db20	Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	88a0838453	Add assert(contextual_seqlen >= 0) in example	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	5944a63f11	Update IsFulleTileInsideMask() for the case where kUseLocal is true	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	cbc5485589	Move all test and bench scripts to folder scripts	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	e36446aca9	Add two scripts	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	a8738f4455	Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	3cc6f4abc8	Tune the settings for hdim-256	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	fb5aa39762	Add example parameter alpha to ease the testing	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	e59ec37ffb	Convert P to fp16/bf16 before doing second gemm in reference hstu implementation	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	c83d8587c7	not-critical updates in example and block_masking codes	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	08e381d632	Add init_qkv and dump_output example parameters for easier debugging	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	85bc8fd805	Add example parameter max_seqlen and max_target	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	46301a85d9	Update to the method for calculating max_seqlen in the example	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	a079b95b77	Use NRepetitions2DEpilogue for outputting o_acc tile	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	eba3242ab8	Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cachelines	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	d74b41070f	Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	4833daf43d	Adjust the threshold values for fp16/bf16 in the example	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	6e38888f46	Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	2b94d9261c	Change do-while main-loop to while-do and remove early exiting check	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	abc8335c43	Adjust the codes before the main-loop	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	946e917e2c	Move k_tile loading and v_tile loading earlier in the loop	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	45ac659ae0	Move k_tile loading in the loop earlier	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	109dcfc2f0	Replace s_acc and pcomp tile array by single tile object for simplification	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	40056b95a9	Add _builtin_amdgcn_sched_barrier(0) to instruct the compiler for better code isolation	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	8c43b793c9	Set the block_per_cu to 3 for hdim-128	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	1bbefda240	Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	11718b0af4	Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	d01b4f27c6	Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	4545d2efc1	Hack block_gemm_areg_bsmem_creg_v2 for gemm_1	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	6e7553be77	Rename the hacked block_gemm_areg_bsmem_creg_v2	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	e5977717a8	Move the lambda for dividing by max_seqlen from kernel to pipeline	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	70237d2e5c	Move the dividing by max_seqlen out of f_silu to be handled outside the main-loop	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	27e64a682a	Set example option -save_mask default to 0	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	0ee9dff5cb	Add scripts (test_ck_hstu_mask.sh and test_pytorch_hstu_mask.py) for checking mask	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	4ff88b4400	Add -save_mask option to the example to output int8 mask tensor	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	124539e123	Update the rules of hstu masking	2026-06-23 09:20:57 +00:00
Qianfeng Zhang	b2cd7757f0	Add test cases for better functional verification	2026-06-23 09:20:57 +00:00

1173 Commits