composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	76da618c85	Enable run-time selection of MTile sizes according to the predicted CU utilization ratio	2026-03-20 05:36:44 +00:00
Qianfeng Zhang	302537c5a8	Update to support grouped mode hstu attention	2026-03-09 16:15:58 +00:00
Qianfeng Zhang	73d6e0eb67	Using in-place version of block_tile_reduce() so that using of m_local is avoided	2026-03-05 16:27:41 +00:00
Qianfeng Zhang	2be2c3cd11	Pass partition_index to get_x_indices_from_distributed_indices() to reduce calls of __builtin_amdgcn_readfirstlane()	2026-02-21 14:59:00 +00:00
Qianfeng Zhang	f2a555dac7	Align the masking logic in HstuCrossAttentionBlockMask with pytorch mask_v2 scripts	2026-02-09 15:55:13 +00:00
Qianfeng Zhang	6f8b9548b5	Use kIsCrossAttention as Problem attribute to replace using is_cross_attention as kernel argument	2026-02-09 09:02:17 +00:00
Qianfeng Zhang	bdfa0a74c2	Update to hstu masking to separate the implementation for cross-attention and self-attention	2026-02-08 16:00:47 +00:00
Qianfeng Zhang	0711f4f90a	Add is_cross_attention as both host API and kernel parameter so that separate masking rules are used for self or cross attention	2026-02-06 15:40:07 +00:00
Qianfeng Zhang	d169ed2194	Change to tile setting to use mfma-32x32x16 for WithSoftmax pipeline on gfx950	2026-02-05 15:57:18 +00:00
Qianfeng Zhang	8af5e26717	Add softmax selection to two of the testing scripts	2026-02-05 15:27:15 +00:00
Qianfeng Zhang	0a8c5f523a	[Performance] Use N0Sub=16 for trload with softmax pipeline to reduce vgpr spilling	2026-02-02 15:59:38 +00:00
Qianfeng Zhang	c360e0cbc4	Add scripts for benchmark sparsity 0.9 cases with mattn256 & full256	2026-01-30 10:02:31 +00:00
Qianfeng Zhang	749e83f2fd	Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q	2026-01-26 13:45:42 +00:00
Qianfeng Zhang	1d4d925ba3	Fix in K-LdsBuffer and V-LdsBuffer over-lap checking	2025-12-27 05:43:11 +00:00
Qianfeng Zhang	d2dadc22a7	Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop	2025-12-26 15:38:52 +00:00
Qianfeng Zhang	df902c6a06	Tiny fix in using v_tiles[] index	2025-12-25 15:37:22 +00:00
Qianfeng Zhang	2d53d67b6d	Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving	2025-12-25 14:58:09 +00:00
Qianfeng Zhang	ddf0f1c8ed	Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving	2025-12-25 14:30:57 +00:00
Qianfeng Zhang	02cae85af5	Load Q directly from global memory to registers for BlockGemm	2025-12-20 14:08:55 +00:00
Qianfeng Zhang	3d90b5f90e	Remove un-used including from default policy file	2025-12-19 10:13:41 +00:00
Qianfeng Zhang	9e47664092	Move common codes to detail namespace from Problem class scope	2025-12-17 10:37:21 +00:00
Qianfeng Zhang	89daa890d1	Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f)	2025-12-17 07:47:17 +00:00
Qianfeng Zhang	1cf868026b	Add support of loading QK tiles of hdim96 without padding to hdim128	2025-12-16 16:39:40 +00:00
Qianfeng Zhang	588f573ee1	Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile	2025-12-16 15:03:57 +00:00
Qianfeng Zhang	179f0e857e	Rename WarpTile in fwd setting	2025-12-14 16:40:52 +00:00
Qianfeng Zhang	125934a966	Simplifying the codes in defining KDram and QDram tile distribution	2025-12-14 14:23:56 +00:00
Qianfeng Zhang	1ab5e9da93	Tiny update in GetMaxVectorSize()	2025-12-14 04:43:02 +00:00
Qianfeng Zhang	f79a29ac80	Rename and add scripts for testing hdim96	2025-12-12 16:16:43 +00:00
Qianfeng Zhang	b3d54477f1	Enable hdim96 instances	2025-12-12 16:16:23 +00:00
Qianfeng Zhang	18108d0d54	Fix with regard to define stride in MakeKLdsBlockDescriptor()	2025-12-12 09:55:53 +00:00
Qianfeng Zhang	db39b44bab	Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV	2025-12-11 10:47:54 +00:00
Qianfeng Zhang	8640ffe8eb	Further correction with regard to using n0_loops and k1_loops	2025-12-08 16:03:56 +00:00
Qianfeng Zhang	641dae10e8	Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning	2025-12-08 13:07:42 +00:00
Qianfeng Zhang	3a89eb8857	Simplify the codes in block_gemm	2025-12-06 15:45:38 +00:00
Qianfeng Zhang	4731c8e519	Further clarification in using kSubQKHeaddim and kQKHeaddim	2025-12-03 09:46:44 +00:00
Qianfeng Zhang	2549bc1fee	Clarify the using of kSubQKHeaddim and kQKHeaddim	2025-12-03 08:57:57 +00:00
Qianfeng Zhang	7234b2fc1a	Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines	2025-12-01 14:58:02 +00:00
Qianfeng Zhang	c1817464be	Tiny fix in GetQKBlockGemm	2025-11-30 14:04:48 +00:00
Qianfeng Zhang	f01e0ef37d	Enable the using of WarpTile-32x32x16 and add scripts to verify	2025-11-30 04:58:28 +00:00
Qianfeng Zhang	d99493606e	Add static_assert and comments in the with_softmax pipelines	2025-11-28 15:19:33 +00:00
Qianfeng Zhang	f952d3571c	Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950	2025-11-28 14:02:16 +00:00
Qianfeng Zhang	a0e4315d4e	Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline	2025-11-27 15:30:53 +00:00
Qianfeng Zhang	69c97c06d7	Add hstu_attention_api.hpp to explicitly mark the API interfaces and update README.md	2025-11-27 08:27:52 +00:00
Qianfeng Zhang	f9e8c5539f	Use explicit partition_index to ensure warp_id is allocated on vgpr when accessing LDS tile_window	2025-11-23 04:49:01 +00:00
Qianfeng Zhang	4f33eb5857	Merge branch 'develop' into hstu_attention_mi350_fwd_bwd	2025-11-23 04:20:53 +00:00
Emily Martins	2e4b8a8fc4	[CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202) * Remove old CK Tile Stream-K implementation The original CK Stream-K implementation was based on old CK's Stream-K block to C tile map. However, this implementation did not align with the original Stream-K paper. Thus, we implemented a new tile partitioner and associated Stream-K kernel, which was placed in the reboot namespace. Now that the new Stream-K implementation is ready, this change removes all artifacts of the old implementation. Specifically, the following changes were made: - Removes old Stream-K tile partitioner from CK Tile - Removes the reboot namespace such that the new implementation resides in the ck_tile namespace only. - Adds tests for bf8 and fp8 using the new implementation - Removes tests for the old implementation - Remove the v2 suffix from the new CK Tile Tile Partitioner derived classes. - Updates Stream-K Kernel ops file to use /** commenting style. * Remove v2 from tile partitioner validation function names	2025-11-20 09:32:32 -07:00
asleepzzz	5adaa201ed	Revert "Add attn sink (#2892)" (#3250) This reverts commit `9fa4e8d5ab`.	2025-11-20 07:55:15 -08:00
Linjun-AMD	9fa4e8d5ab	Add attn sink (#2892) * enable attn sink Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update attn_sink script Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix some error Signed-off-by: JL-underdog <Jun.Lin@amd.com> * clang-format Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd_kernel'mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update block_fmha_pipeline_qr_ks_vs.hpp Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix ci error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix format error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_bwd_pipeline_default_policy.hpp * Update fmha_fwd_runner.hpp * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * update splitkv_pipline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update splitkv&pagedkv pipeline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * add sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update attn_sink result log Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update smoke_test_fwd_sink.sh Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test file Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test script Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp * use constexpr kHasSink for sink in fmha pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update by pre-commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update fmha_fwd.py * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Remove causal mask setting logic from mask.hpp Removed the mask setting logic for causal masks. * fix ci error that some usage of lamada not support in c++17 Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update remod.py * add smoke sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update fmha_pagedkv_prefill.py * Update FmhaFwdPipeline parameters in fmha_fwd.py * update block_fmha_pipeline_qr_ks_vs_async_trload.hpp Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix c++17 unsupprot error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp * Fix formatting of sink_seq_end assignment * Fix indentation for sink_seq_end assignment * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-11-20 19:24:05 +08:00
Yi DING	47e2ed838e	[CK_TILE] Add Flatmm MX FP8 (#3208) * Use async for flatmm mxfp4 * Fix preshuffle * Add flatmm mxfp8 * Thanks, Copilot * Thanks Copilot again~	2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal	1eb26460aa	[ck_tile] Pooling example - Improved tile sizes (#3233) * improved tile sizes - modified tile sizes for improved example performance * Update example/ck_tile/36_pooling/pool3d.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-11-19 15:30:18 +01:00

1073 Commits