composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 00:58:44 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	423cc72bc4	Move the calling of mask.GetTileRangeAlongX() to the kernel	2026-03-28 14:29:17 +00:00
Qianfeng Zhang	eefe426ef7	Add consideration for max_seqlen_q <= 64 in get_hstu_attention_fwd_mtile()	2026-03-26 14:40:01 +00:00
Qianfeng Zhang	76da618c85	Enable run-time selection of MTile sizes according to the predicted CU utilization ratio	2026-03-20 05:36:44 +00:00
Qianfeng Zhang	302537c5a8	Update to support grouped mode hstu attention	2026-03-09 16:15:58 +00:00
Qianfeng Zhang	73d6e0eb67	Using in-place version of block_tile_reduce() so that using of m_local is avoided	2026-03-05 16:27:41 +00:00
Qianfeng Zhang	2be2c3cd11	Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()	2026-02-21 14:59:00 +00:00
Qianfeng Zhang	f2a555dac7	Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts	2026-02-09 15:55:13 +00:00
Qianfeng Zhang	6f8b9548b5	Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument	2026-02-09 09:02:17 +00:00
Qianfeng Zhang	bdfa0a74c2	Update to hstu masking to separate the implementation for cross-attention and self-attention	2026-02-08 16:00:47 +00:00
Qianfeng Zhang	0711f4f90a	Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention	2026-02-06 15:40:07 +00:00
Qianfeng Zhang	d169ed2194	Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950	2026-02-05 15:57:18 +00:00
Qianfeng Zhang	8af5e26717	Add softmax selection to two of the testing scripts	2026-02-05 15:27:15 +00:00
Qianfeng Zhang	0a8c5f523a	[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling	2026-02-02 15:59:38 +00:00
Qianfeng Zhang	c360e0cbc4	Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256	2026-01-30 10:02:31 +00:00
Qianfeng Zhang	749e83f2fd	Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q	2026-01-26 13:45:42 +00:00
Qianfeng Zhang	1d4d925ba3	Fix in K-LdsBuffer and V-LdsBuffer over-lap checking	2025-12-27 05:43:11 +00:00
Qianfeng Zhang	d2dadc22a7	Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop	2025-12-26 15:38:52 +00:00
Qianfeng Zhang	df902c6a06	Tiny fix in using v_tiles[] index	2025-12-25 15:37:22 +00:00
Qianfeng Zhang	2d53d67b6d	Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving	2025-12-25 14:58:09 +00:00
Qianfeng Zhang	ddf0f1c8ed	Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving	2025-12-25 14:30:57 +00:00
Qianfeng Zhang	02cae85af5	Load Q directly from global memory to registers for BlockGemm	2025-12-20 14:08:55 +00:00
Qianfeng Zhang	3d90b5f90e	Remove un-used including from default policy file	2025-12-19 10:13:41 +00:00
Qianfeng Zhang	9e47664092	Move common codes to detail namespace from Problem class scope	2025-12-17 10:37:21 +00:00
Qianfeng Zhang	89daa890d1	Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f)	2025-12-17 07:47:17 +00:00
Qianfeng Zhang	1cf868026b	Add support of loading QK tiles of hdim96 without padding to hdim128	2025-12-16 16:39:40 +00:00
Qianfeng Zhang	588f573ee1	Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile	2025-12-16 15:03:57 +00:00
Qianfeng Zhang	179f0e857e	Rename WarpTile in fwd setting	2025-12-14 16:40:52 +00:00
Qianfeng Zhang	125934a966	Simplifying the codes in defining KDram and QDram tile distribution	2025-12-14 14:23:56 +00:00
Qianfeng Zhang	1ab5e9da93	Tiny update in GetMaxVectorSize()	2025-12-14 04:43:02 +00:00
Qianfeng Zhang	f79a29ac80	Rename and add scripts for testing hdim96	2025-12-12 16:16:43 +00:00
Qianfeng Zhang	b3d54477f1	Enable hdim96 instances	2025-12-12 16:16:23 +00:00
Qianfeng Zhang	18108d0d54	Fix with regard to define stride in MakeKLdsBlockDescriptor()	2025-12-12 09:55:53 +00:00
Qianfeng Zhang	db39b44bab	Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV	2025-12-11 10:47:54 +00:00
Qianfeng Zhang	8640ffe8eb	Further correction with regard to using n0_loops and k1_loops	2025-12-08 16:03:56 +00:00
Qianfeng Zhang	641dae10e8	Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning	2025-12-08 13:07:42 +00:00
Qianfeng Zhang	3a89eb8857	Simplify the codes in block_gemm	2025-12-06 15:45:38 +00:00
Qianfeng Zhang	4731c8e519	Further clarification in using kSubQKHeaddim and kQKHeaddim	2025-12-03 09:46:44 +00:00
Qianfeng Zhang	2549bc1fee	Clarify the using of kSubQKHeaddim and kQKHeaddim	2025-12-03 08:57:57 +00:00
Qianfeng Zhang	7234b2fc1a	Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines	2025-12-01 14:58:02 +00:00
Qianfeng Zhang	c1817464be	Tiny fix in GetQKBlockGemm	2025-11-30 14:04:48 +00:00
Qianfeng Zhang	f01e0ef37d	Enable the using of WarpTile-32x32x16 and add scripts to verify	2025-11-30 04:58:28 +00:00
Qianfeng Zhang	d99493606e	Add static_assert and comments in the with_softmax pipelines	2025-11-28 15:19:33 +00:00
Qianfeng Zhang	f952d3571c	Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950	2025-11-28 14:02:16 +00:00
Qianfeng Zhang	a0e4315d4e	Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline	2025-11-27 15:30:53 +00:00
Qianfeng Zhang	69c97c06d7	Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md	2025-11-27 08:27:52 +00:00
Qianfeng Zhang	f9e8c5539f	Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window	2025-11-23 04:49:01 +00:00
Qianfeng Zhang	4f33eb5857	Merge branch 'develop' into hstu_attention_mi350_fwd_bwd	2025-11-23 04:20:53 +00:00
Emily Martins	02ab76c2cb	Fix CK Tile DP + 2 Tile Stream-K Validation Errors (#3269) When multiple workgroups contribute to a tile using atomics, there may be round-off error in cases where the accumulator type is not the same as the C type. To compute an error tolerance for test validation, the Stream-K Tile Partitioner has a function called estimate_num_wgs_per_tile to estimate the number of workgroups per tile. That said, this function only provides an estimate. In some cases for DP+2TSK, the function returns 1 rather than the more accurate value of 2. Thus, this change updates the estimate_num_wgs_per_tile function to explicitly return the value of 2 in cases for DP+2TSK to ensure that we have a better error tolerance to avoid test failures due to round-off error.	2025-11-21 20:29:47 -07:00
Illia Silin	21ae743acd	Enable daily builds on gfx1010 (#3258) * add build/test on gfx1010 * only build and run on gfx1010 once daily	2025-11-21 07:22:01 -08:00
John Shumway	ea6e4fcbbc	Fix builder errors. (#3260) There were four errors to fix: 1. The checks for defaulted direction were not implemented in the predicate concept. 2. Had to delete an obsolete and undefined operation enum. 3. A factory was passing a boolean in place of an integer. 4. Some of the factory tests are not compiling correctly when linking in the full source (with CK_EXPERIMENTAL_BUILDER=ON), so I commented them out.	2025-11-21 15:25:45 +01:00

2919 Commits