composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
joye	b406b584d7	fix clang format check	2025-04-28 08:56:00 +08:00
joye	292f89c860	fix a transpose index issue	2025-04-27 17:39:52 +08:00
joye	bd091d9a88	fix a transpose index issue	2025-04-27 17:33:32 +08:00
joye	de9407ed93	fix 16x16 related dimension transpose	2025-04-27 02:44:49 -05:00
joye	57e8e34705	exchange the iteration order	2025-04-26 21:37:59 -05:00
joye	25ab38f913	fix transpose related codes	2025-04-27 09:55:29 +08:00
joye	a823e00dc0	delete unused variables	2025-04-25 14:47:49 +08:00
joye	a6564da629	update output tensor distribution	2025-04-25 14:45:45 +08:00
joye	119a8e0e16	hack for transpose 16x16	2025-04-25 00:56:59 -05:00
joye	e2f3c95d24	miss output tile distribution mapping	2025-04-24 21:55:03 -05:00
joye	6beb585dad	update tile transpose	2025-04-24 20:36:09 -05:00
joye	efa7243ee5	transpose load enable	2025-04-24 19:19:44 -05:00
joye	df9769afba	can pass; but no logic	2025-04-24 19:05:47 -05:00
joye	90a4501869	add some fix	2025-04-24 18:29:38 +08:00
joye	34040f43b6	update some codes	2025-04-24 17:53:19 +08:00
joye	8d75983536	fix a distribution issue	2025-04-24 02:24:31 -05:00
joye	6893165818	add some fixes	2025-04-24 01:20:35 -05:00
joye	afb1cec9c4	update transpose load logic	2025-04-24 11:14:07 +08:00
joye	3918a35870	Merge branch 'mi355_transpose_load_dev' of https://github.com/ROCm/composable_kernel into mi355_transpose_load_dev	2025-04-24 08:28:14 +08:00
joye	aaab4aacc5	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into mi355_transpose_load_dev	2025-04-24 08:27:53 +08:00
joye	c862437f05	fix some issues	2025-04-23 19:27:29 -05:00
carlushuang	5487289fc4	[CK_TILE] support gfx950 matrix core in 01_fmha fwd (#2110 ) * gfx950 01_fmha fwd * fix comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-04-23 12:40:18 -07:00
joye	acce2df3bf	fix some compile errors	2025-04-23 05:27:52 -05:00
joye	e14a16359f	add transpose load; no real logic	2025-04-23 16:55:13 +08:00
Gino Lu	504f563f78	[CK-Tile] warp-gemm support for using V_MFMA_F32_16x16x32_BF16 (#2073 ) * draft v_mfma_f32_16x16x32_bf16 * fix error config and add debug code. * Solve the CShuffle Problem * draft v_mfma_f32_16x16x32_bf16 * fix error config and add debug code. * Solve the CShuffle Problem * fix error while testing new command * Finished the feature of new mfma 161632 * Addressed the comment --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-04-22 15:52:36 -07:00
Thomas Ning	a738e43445	MFMA 16x16x32fp8 (#2103 ) * add mfma_16x16x32_fp8 * clang format code * Finished the fix for gemm basic * clang foramt * rebuild CI * recover gemm.hpp * add MFMA 161632bf8 --------- Co-authored-by: solin <bingzhou@amd.com>	2025-04-21 10:21:35 -07:00
solin	c318ec0778	fix CI build fail	2025-04-21 16:00:12 +08:00
BingYuan.Zhou	eaf1f0bf3b	[flatmm] implement basic fp16 flatmm (#2089 ) * [flatmm] implement basic fp16 flatmm * fix CI build fail --------- Co-authored-by: root <root@hjbog-srdc-50.amd.com> Co-authored-by: solin <bingzhou@amd.com>	2025-04-16 16:51:17 +08:00
Thomas Ning	269f4f6af5	Solve the Static Encoding Pattern compile error when the tile size is too small (#2079 )	2025-04-13 20:09:30 -07:00
jakpiase	6c61f4d237	[CK_TILE] Add 2:4 structured sparsity support for fp16 gemm (#1957 ) * add structured sparsity fp16 support for gemm * added reviewer suggestions * update changelog * update changelog * add reviewers suggestions * Minor fix * clang fix * fix doxygen	2025-04-11 12:18:26 +02:00
slippedJim	5f885d2b7a	add fmha fwd splitkv receipt for aiter c++ api (#2068 ) * add s_randval for c++ api * Fix bug of bias in splitkv --------- Co-authored-by: rocking <ChunYu.Lai@amd.com>	2025-04-10 23:21:13 +08:00
Illia Silin	572cd820ce	Split env.hpp header from the ck.hpp header. (#2049 ) * split env.hpp out of main headers * fix namespace logic	2025-04-03 15:30:21 -07:00
Adam Osewski	e5ad48a784	Basic docs for universal gemm & ck-tile gemm. (#2014 ) * Basic docs for universal gemm & ck-tile gemm. * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Reviewers suggestions. * Align tparam names in doc with class tparams. * More reviewers fine tuning ;) --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>	2025-04-02 11:03:40 +02:00
rocking	8a20b62e91	Reduce redundant space in bias tensor (#2024 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-03-28 21:58:06 +08:00
felix	a82f338fb9	hotfix fix sorting int64 (#2025 ) * fix sorting int64 * clang format * fix example issue * update WA issue # --------- Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2025-03-28 11:31:52 +08:00
ruanjm	d49abdaa87	[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861 ) * 50ms -> 28ms * Fix bug in non fuse_add_store cases * Fine tuned setting for 2 pass pipeline * adjust workload * remove unnecessary change * add layernorm * Adding output quant and unquant results at the same time. * fix test * fix format * tune for cases 128x640 and 128x1024 * bug ifx	2025-03-25 20:09:45 +08:00
MHYang-gh	c027637a8f	Fix A/B lds transform (#2007 )	2025-03-22 23:13:50 -07:00
BingYuan.Zhou	5a0d693b86	fix ck_tile/basic_gemm build error (#1988 )	2025-03-20 22:01:14 -07:00
jakpiase	0e91d32c61	[CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919 ) * switch to universal gemm for batched and grouped gemms * added reviewer comments * fixed grouped gemm tests	2025-03-20 11:17:04 +01:00
rocking	b819c217e4	Sync the kname with instance name (#1989 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-03-20 00:06:45 +08:00
carlushuang	3e81279d26	Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" … (#1971 ) * Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969) This reverts commit `8cbcd3e0d0`. * fix codegen problem * Update config.hpp --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-03-13 11:41:39 +08:00
Illia Silin	8cbcd3e0d0	Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 ) This reverts commit `7a93b16ff6`.	2025-03-11 10:40:18 -07:00
carlushuang	7a93b16ff6	[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 ) * support hdim=192/128 pair * remove useless print * update	2025-03-11 21:07:40 +08:00
Qianfeng	4f54fa3058	Ck tile/complete k prefetch (#1941 ) * Re-implement qr_ks_vs_async pipeline by using kLoadOnce * Remove last block_sync_lds() in the loop * Tiny adjustment in qr_ks_vs_async pipeline for better performance * Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline * Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline * Use un-rolled gemm for Gemm-0 * Use k0_loops small tile load/store to replace the big tile load/store for K * Remove the commented lines in qx_ks_vs_custom_policy.hpp * Tune the prefetching of V in qr_ks_vs_async pipeline * Move the codes for storing the first v_lds tile some later * Let BlockDropout reuse LDS with V * Switch to separate code blocks according to iteration index * Interleave code blocks for better performance * Move clear_tile(s_acc) for better interleaving * Move code interleaving * Use MakeQDramTileDistribution for q_dram_window * Roll-back to load Q directly from global memory instead of using LDS as intermediary stop * Let V reuse the LDS of K * Use array of tiles to represent Q in vgprs * Use QLoadOnce == false for qr_ks_vs_async pipeline * Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline * Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile() * Move the definition of v_tiles out from the loop * Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile() * Remove using KLoadOnce in qx_ks_vs_custom_policy * Remove un-used get_slice_tile() call * Move the code line of clear_tile(s_acc) * Tune the lines of codes to make them more tidy * Re-arrange the codes before the main-loop * Add comments * Unify the alignment to be 8 for Q/K/V Lds decriptors * Tuning to K pre-loading * Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false * Adjust the pipeline codes * Use NumPrefetchV to separate from NumVLdsBuffers * Tune the location of a scheduler barrier code line * Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths * Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel * Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline * Move the location for store_tile() of first v_tile * Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline * Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<> * Try to fix old bugs in qx_ks_vs_custom_policy * Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler * Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy * Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV) * Tiny Fix (#1888) * Ck tile/paged attention workaround (#1894) * Correction in GetRangeAlongX() * Work-around to solve the failures in test_paged_attention_ck in xformers * Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline * Remove one call of move_tile_window for q_dram_window * Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers() * Tiny fix in qr_ks_vs_whole_k_prefetch pipeline * Adjust the location of codes for storing the first V tile to LDS * Tiny fix and add comments * Change GetSmemKPackK size to improve performance * Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy * Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access * Adjustment in intermediate iteration codes for tiny performance improvement * Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation * Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue * Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer() * Remove useless AsyncopyV * Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used * Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy * Recover the qr_ks_vs_async pipeline * Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline * Revert "Try to fix old bugs in qx_ks_vs_custom_policy" This reverts commit `39b82ca194`. * Tiny fix with regard to whole_k_prefetch pipeline compiling * Update kPadSeqLenK setting in fmha_fwd_kernel * Use q_element_func and k_element_func * Use single q_tile rather than multiple sliced q_tiles * Codes refine according to the comments * Re-format one file * Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true	2025-03-07 14:19:51 +08:00
kylasa	66c5f5b0b6	Addressing (Post Merge) code review comments for PR 1845 (#1883 ) * Addressing code review comments. * Addressing code review comments. * Reorganized code for better readability. * add ck_tile gemms for new types in CI * fix jenkins syntax * fix script syntax * Add the test cases back * Address the review comments * Address review comments * clang format * Solve the merging issues * Addressed the comments * clang format --------- Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-03-06 11:40:30 -08:00
Illia Silin	a88bf76ecc	Replace buffer load/store intrinsics with builtins (#1876 ) * replace buffer load/store intrinsics with builtins * fix clang format * replace buffer load/store intrinsics with built-ins in ck_tile * fix clang format * add switch between buffer intrinsics and built-ins * change the builtins threshold to clang20 * fix clang format * fix some compilation errors * revert changes in ck_tile * revert changes in ck_tile * delete all root files and folders when CI completes * try changing the username in CI * fix groovy syntax * add user and group id info to ci dockers * change ownership of all files in CI to jenkins at the end * update changelog	2025-03-05 14:33:28 -08:00
Adam Osewski	4814db3905	[CK TILE] Fix KIterPerInnerLoop for block gemm. (#1934 ) * Fix KIterPerInnerLoop * Fix Kpack and KPerInnerLoop for block universal gemm. * Fix overlooked spelling bugs.	2025-03-05 14:17:44 -08:00
Bartłomiej Kocot	1bf29478cd	[CK TILE] Fix double lds in ck tile gemm (#1924 )	2025-02-28 08:07:53 -08:00
Bartłomiej Kocot	0356ee069e	[CK TILE] Gemm pk_int4_t permute B (#1907 ) * [CK TILE] Gemm pk_int4_t permute B * Fixes	2025-02-27 11:01:14 +01:00
Bartłomiej Kocot	bf1e17007e	[CK TILE] Block universal gemm lds<->vgpr optimizations (#1906 ) * [CK TILE] Block universal gemm lds<->vgpr optimizations * Rebase * Fixes	2025-02-27 10:36:28 +01:00

1 2 3 4

168 Commits