composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-15 19:44:39 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	749e83f2fd	Update to use BottomRight-Diagonal masking when seqlen_kv is bigger than seqlen_q	2026-01-26 13:45:42 +00:00
Qianfeng Zhang	1d4d925ba3	Fix in K-LdsBuffer and V-LdsBuffer over-lap checking	2025-12-27 05:43:11 +00:00
Qianfeng Zhang	d2dadc22a7	Remove un-needed constexpr checking for loading v_tiles in Gemm0 loop	2025-12-26 15:38:52 +00:00
Qianfeng Zhang	df902c6a06	Tiny fix in using v_tiles[] index	2025-12-25 15:37:22 +00:00
Qianfeng Zhang	2d53d67b6d	Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi350 to achieve better interleaving	2025-12-25 14:58:09 +00:00
Qianfeng Zhang	ddf0f1c8ed	Update the NumPrefetchK and NumPrefetchV in the softmax pipeline on mi300 to achieve better interleaving	2025-12-25 14:30:57 +00:00
Qianfeng Zhang	02cae85af5	Load Q directly from global memory to registers for BlockGemm	2025-12-20 14:08:55 +00:00
Qianfeng Zhang	3d90b5f90e	Remove un-used including from default policy file	2025-12-19 10:13:41 +00:00
Qianfeng Zhang	9e47664092	Move common codes to detail namespace from Problem class scope	2025-12-17 10:37:21 +00:00
Qianfeng Zhang	89daa890d1	Remove useless call of __builtin_amdgcn_s_waitcnt(0xc07f)	2025-12-17 07:47:17 +00:00
Qianfeng Zhang	1cf868026b	Add support of loading QK tiles of hdim96 without padding to hdim128	2025-12-16 16:39:40 +00:00
Qianfeng Zhang	588f573ee1	Change to the Q/K DramTile encoding and renaming in V/VShuffled DramTile	2025-12-16 15:03:57 +00:00
Qianfeng Zhang	179f0e857e	Rename WarpTile in fwd setting	2025-12-14 16:40:52 +00:00
Qianfeng Zhang	125934a966	Simplifying the codes in defining KDram and QDram tile distribution	2025-12-14 14:23:56 +00:00
Qianfeng Zhang	1ab5e9da93	Tiny update in GetMaxVectorSize()	2025-12-14 04:43:02 +00:00
Qianfeng Zhang	f79a29ac80	Rename and add scripts for testing hdim96	2025-12-12 16:16:43 +00:00
Qianfeng Zhang	b3d54477f1	Enable hdim96 instances	2025-12-12 16:16:23 +00:00
Qianfeng Zhang	18108d0d54	Fix with regard to define stride in MakeKLdsBlockDescriptor()	2025-12-12 09:55:53 +00:00
Qianfeng Zhang	db39b44bab	Update in the implementation of GetAlignmentQ/GetAlignmentK/GetAlignmentV	2025-12-11 10:47:54 +00:00
Qianfeng Zhang	8640ffe8eb	Further correction with regard to using n0_loops and k1_loops	2025-12-08 16:03:56 +00:00
Qianfeng Zhang	641dae10e8	Add kN0Sub to separate the n0_loop and k1_loop tile size for more flexible tuning	2025-12-08 13:07:42 +00:00
Qianfeng Zhang	3a89eb8857	Simplify the codes in block_gemm	2025-12-06 15:45:38 +00:00
Qianfeng Zhang	4731c8e519	Further clarification in using kSubQKHeaddim and kQKHeaddim	2025-12-03 09:46:44 +00:00
Qianfeng Zhang	2549bc1fee	Clarify the using of kSubQKHeaddim and kQKHeaddim	2025-12-03 08:57:57 +00:00
Qianfeng Zhang	7234b2fc1a	Simplifying the codes with regard to k_lds_wite_windows and k_lds_read_windows in the pipelines	2025-12-01 14:58:02 +00:00
Qianfeng Zhang	c1817464be	Tiny fix in GetQKBlockGemm	2025-11-30 14:04:48 +00:00
Qianfeng Zhang	f01e0ef37d	Enable the using of WarpTile-32x32x16 and add scripts to verify	2025-11-30 04:58:28 +00:00
Qianfeng Zhang	d99493606e	Add static_assert and comments in the with_softmax pipelines	2025-11-28 15:19:33 +00:00
Qianfeng Zhang	f952d3571c	Force both Gemm0 and Gemm1 to use mfma-16x16x32 on gfx950	2025-11-28 14:02:16 +00:00
Qianfeng Zhang	a0e4315d4e	Use 16x16x32 for Gemm1 on MI350 and adjust the NumPrefetchK for with_softmax trload pipeline	2025-11-27 15:30:53 +00:00
Qianfeng Zhang	69c97c06d7	Add hstu_attention_api.hpp to explicitly mark the API interfaces and update REAMD.md	2025-11-27 08:27:52 +00:00
Qianfeng Zhang	f9e8c5539f	Use explicit partition_index to ensure warp_id is allocated on vpgr when accessing LDS tile_window	2025-11-23 04:49:01 +00:00
Qianfeng Zhang	4f33eb5857	Merge branch 'develop' into hstu_attention_mi350_fwd_bwd	2025-11-23 04:20:53 +00:00
Emily Martins	2e4b8a8fc4	[CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202 ) * Remove old CK Tile Stream-K implementation The original CK Stream-K implementation was based on old CK's Stream-K block to C tile map. However, this implementation did not align with the original Stream-K paper. Thus, we implemented a new tile partitioner and associated Stream-K kernel, which was placed in the reboot namespace. Now that the new Stream-K implementation is ready, this change removes all artifacts of the old implementation. Specifically, the following changes were made: - Removes old Stream-K tile partitioner from CK Tile - Removes the reboot namespace such that the new implementation resides in the ck_tile namespace only. - Adds tests for bf8 and fp8 using the new implementation - Removes tests for the old implementation - Remove the v2 suffix from the new CK Tile Tile Partitioner derived classes. - Updates Stream-K Kernel ops file to use /** commenting style. * Remove v2 from tile partitioner validation function names	2025-11-20 09:32:32 -07:00
asleepzzz	5adaa201ed	Revert "Add attn sink (#2892 )" (#3250 ) This reverts commit `9fa4e8d5ab`.	2025-11-20 07:55:15 -08:00
Linjun-AMD	9fa4e8d5ab	Add attn sink (#2892 ) * enable attn sink Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update attn_sink script Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix some error Signed-off-by: JL-underdog <Jun.Lin@amd.com> * clang-format Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd_kernel'mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update block_fmha_pipeline_qr_ks_vs.hpp Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix ci error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix format error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_bwd_pipeline_default_policy.hpp * Update fmha_fwd_runner.hpp * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * update splitkv_pipline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update splitkv&pagedkv pipeline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * add sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update attn_sink result log Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update smoke_test_fwd_sink.sh Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test file Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test script Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp * use constexpr kHasSink for sink in fmha pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update by pre-commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update fmha_fwd.py * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Remove causal mask setting logic from mask.hpp Removed the mask setting logic for causal masks. * fix ci error that some usage of lamada not support in c++17 Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update remod.py * add smoke sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update fmha_pagedkv_prefill.py * Update FmhaFwdPipeline parameters in fmha_fwd.py * update block_fmha_pipeline_qr_ks_vs_async_trload.hpp Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix c++17 unsupprot error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp * Fix formatting of sink_seq_end assignment * Fix indentation for sink_seq_end assignment * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-11-20 19:24:05 +08:00
Yi DING	47e2ed838e	[CK_TILE] Add Flatmm MX FP8 (#3208 ) * Use async for flatmm mxfp4 * Fix preshuffle * Add flatmm mxfp8 * Thanks, Copilot * Thanks Copilot again~	2025-11-20 10:35:15 +08:00
Yashvardhan Agarwal	1eb26460aa	[ck_tile] Pooling example - Improved tile sizes (#3233 ) * improved tile sizes - modified tile sizes for improved example performance * Update example/ck_tile/36_pooling/pool3d.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-11-19 15:30:18 +01:00
John Shumway	ad57f6ef0b	[CK_BUILDER] Put global CK functions in an the CK namespace (#3232 ) * Wrap ck host utitlies in CK namespace. The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace. Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace. There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library. * Add using declarations to test code. After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace. * Add using declarations to client examples.	2025-11-19 11:23:02 +01:00
Aviral Goel	ac70206b2c	feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle… (#3225 ) * feat: add support for bf16 for grouped_gemm & grouped_gemm_preshuffle kernel(s) along with unit test * docs: Update CHANGELOG.MD	2025-11-18 09:32:27 -05:00
Yi DING	b6720531de	[CK_TILE] MX Flatmm Split kernel instances (#3207 ) * [CK_TILE] MX Flatmm Split kernel instances * Fix flatmm example compile	2025-11-18 13:46:30 +08:00
Qianfeng Zhang	b75077475b	Remove useless codes in the two trload pipelines	2025-11-15 13:48:50 +00:00
Qianfeng Zhang	238b5c4f08	Separate Traits from Problem while being used for defining the pipeline	2025-11-14 16:42:06 +00:00
BingYuan.Zhou	4d629cd2b0	fix build error (#3195 ) Co-authored-by: root <root@hjbog-srdc-39.amd.com>	2025-11-14 09:46:13 +08:00
yinglu	2a73eb3bc0	Simulate TF32 with BF16x3 (#3142 ) * tf32:bf16x3:use bf16x3 emulate tf32 gemm * change blockwiseGemm to demo bf16x3 * temp push * self review * self review * fix multi-device compile error * bug fix * code refactor * limit to gfx950 * enhance gemm gfx942 threshold * lower change from blockwise to warpwise * refact codes * refact codes * error fix * change threshold * bug fix * fix threshold error * change host reference implement to same as device * bug fix * bug fix * code refact * fix clang-format fail * code refine	2025-11-13 16:21:09 -08:00
Qianfeng Zhang	95c1bb25e3	Remove the k_element_func and v_element_func from the pipeline since they are not used	2025-11-13 14:53:47 +00:00
Khushbu Agarwal	fb41a7b73b	fixing ambiguous shuffle definitions (#3175 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-11-12 23:44:12 -08:00
Cong Ma	6fd8ddabe7	[CK TILE GEMM] Refactor block_scale_gemm examples (#3181 ) * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * [CK TILE GEMM] Refactor block_scale_gemm examples - Set quant group size to (1, 1, 64) for targets excluding gfx950, where warp tile size (16, 16, 128) is incompatible.	2025-11-12 23:43:40 -08:00
Qianfeng Zhang	881ddc5741	Update to the two trload pipeline to load whole Q-tile once through LDS on mi350	2025-11-12 15:59:38 +00:00
Yashvardhan Agarwal	299c9bca1b	[CK_Tile] Pooling example readme update (#3174 ) * pooling example readme update - The updated readme explains the transformations of the pooling kernel using a mermaid diagram * Update example/ck_tile/36_pooling/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * resolve comments --------- Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-11-12 07:30:20 -08:00

1 2 3 4 5 ...

1061 Commits