composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-03 05:37:34 +00:00

Author	SHA1	Message	Date
Clement Lin	3e61925277	Remove unused code	2025-04-09 15:08:09 +08:00
YC Lin	fd26846d61	[GEMM] Refactor block gemm, pipeline, and policy of instruction schedule opt	2025-04-09 03:19:10 +00:00
YC Lin	fe61498468	[Add] Add build option for generating assembly	2025-04-06 23:50:26 +00:00
YC Lin	aac02a92ac	[GEMM] Refactor block gemm and pipeline policy of instruction schedule	2025-04-06 23:37:29 +00:00
Clement Lin	9151a1fb42	Add flash_attention_fwd toy_example	2025-04-04 00:39:02 +08:00
mhYang	f28aabb42f	Update tile size and use slc	2025-04-02 12:23:23 +00:00
mhYang	04be0fb437	Make buffer coherence configurable in tensor view	2025-04-02 11:51:23 +00:00
mhYang	d5531ab9c9	Fix add flops calculation	2025-04-01 19:13:18 +00:00
ClementLinCF	04513ca683	Create README.md	2025-04-01 09:11:29 +08:00
mhYang	d1dbc69eda	Use mfma 16x16x32	2025-03-31 23:18:22 +00:00
mhYang	ee28e965f2	Fix KERNEL_D config	2025-03-31 17:53:57 +00:00
YC Lin	68cd6609eb	[GEMM] Add pragma message for different MFMA options	2025-03-30 20:05:35 +00:00
YC Lin	a8027a5b2f	[GEMM] Fix print typos	2025-03-30 19:55:13 +00:00
Clement Lin	5af7efdec5	Fix indentation typo	2025-03-30 15:06:07 +08:00
Clement Lin	de9385ba51	[GEMM] Fix MFMA condition checks	2025-03-30 14:02:30 +08:00
Clement Lin	5dd8e4ae0c	[GEMM] Add new macro options check	2025-03-30 10:07:21 +08:00
Clement Lin	7bc473835e	[GEMM] Add macros for multiple optimization options	2025-03-29 22:58:51 +08:00
YC Lin	428bcdeb40	[GEMM] default MFMA config	2025-03-29 21:11:29 +00:00
YC Lin	9a0d9dfc0a	git push test	2025-03-29 21:09:00 +00:00
root	a3c6ca1761	[GEMM] fix MFMA configurations	2025-03-29 21:05:21 +00:00
mhYang	16a4e1585a	Adjust mfma schedule order	2025-03-28 18:32:12 +00:00
Clement Lin	5428f17ca2	[GEMM] Replace const auto with constexpr index_t	2025-03-28 17:39:49 +08:00
Clement Lin	4eb246f20c	[GEMM] Update cache-aware wg schedule	2025-03-28 17:31:19 +08:00
bobofang	8f1aa6fc6f	Add MFMA M16N16K16 and M16N16K32 methods; these two methods are off by default	2025-03-28 16:31:53 +00:00
YC Lin	f1562d5911	[GEMM] remove a_col_major/b_row_major case	2025-03-26 16:31:24 +00:00
root	6d2b55914e	[GEMM] modify if-else locations	2025-03-26 13:35:00 +00:00
mhYang	9ecde871a3	Fix AccDataType and CDataType: 1. Fix AccDataType and CDataType; 2. Remove indent; 3. Align merge_transform for tutorial	2025-03-25 20:08:59 +00:00
mhYang	67072b3ba9	Fix build error	2025-03-25 18:55:34 +00:00
MHYang-gh	b0b5827673	Fix A/B lds transform (#2007)	2025-03-25 18:52:41 +00:00
root	7d08f99b02	[GEMM] disable/enable instruction scheduling	2025-03-25 16:53:20 +00:00
mhYang	4494c54dcd	Fix missing message	2025-03-21 15:06:29 +00:00
mhYang	8f3b534d29	Fix xor transform dim.	2025-03-21 15:00:05 +00:00
Clement Lin	1f604e9b0a	[GEMM] Add cache-aware WG schedule and adjust block tile (113 -> 121.7 TFLOPS)	2025-03-21 09:15:17 +08:00
mhYang	93193e42ea	Add LDS bank conflict solutions	2025-03-20 21:36:56 +00:00
bobofang	d635209d59	Fix add accuracy issue (2673 GB/s -> 3271 GB/s; Perf: 0.0512898 ms, 3271.06 GB/s)	2025-03-19 12:26:30 +08:00
root	ff15e2da7a	[GEMM] use mfma k8 warp gemm	2025-03-17 16:01:29 +00:00
root	10033c1cdc	[GEMM] disable/enable prefetch	2025-03-17 14:22:49 +00:00
Clement Lin	803ecb93d8	[CK TILE] Toy example - basic gemm	2025-03-12 15:55:47 +08:00
Clement Lin	1afc32c59c	Adjust block shape (2673 GB/s -> 3647 GB/s)	2025-03-12 14:58:06 +08:00
Clement Lin	58bc69aa99	Utilize vectorized memory access (1998.24 GB/s -> 2673 GB/s)	2025-03-12 14:44:41 +08:00
Clement Lin	399cdb6f9f	Adjust the size of thread block (1968.42 GB/s -> 1998.24 GB/s)	2025-03-12 14:33:36 +08:00
Clement Lin	712d96cef5	[CK TILE] Toy example - basic add	2025-03-11 16:23:12 +08:00
Mingtao Gu	0db7c8f0b2	Ck int4 moe develop (#1949) * Add Gemm fp8xint4 example and kernel, function pass. * Init Gemm_fp8xint4 Bpreshuffle * Added gemm_fp8xint4_Bpreshuffle files, function not checked yet * General fix. * fp8xint4 bpreshuffle function pass * fix. * init b preshuffle dequant in VGPR. * fix bug, function pass. * move b thread dequant copy to blockwise. * fix bug, function now passes. * modified the tile size to 256, 128x128x128. * fixed a bug. * Initial int4 moe, compile pass, function not check. * fix bug in moe_gemm1.cpp, now function pass. * test expert = 8 and function pass. * Added moe_pk_i4_gemm2, function pass. * Added b preshuffle pipeline v3 support. * fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass. * Split the blockwise pipeline for fp8xint4. * commit missing files * opt gemm2 to 2x2 wave * fix swizzle = false * update int4 moe with latest input changes. * update tile size. * enable pipeline v3. * fix nswizzle = true * commit a version for compiler debug. * Updated transfer_v3r1_gather to support pk_i4_t type. * for int4 moe2 for type_convert support. * remove some values between mfma instructions. * fix int4 moe * Updated transfer_v3r1_gather to support pk_i4_t type. * i4 support lds multiple shuffle * fixed int4 moe tflops calculation. * Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle * updated gemm2. * change int4 moe example names * fix and format code. * format. * format codes. * update fp8xint4 example tile size. * add <unordered_map> header * fixed. * format. * Added conditional compilation for int4 -> fp8 conversion kernels --------- Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: coderfeli <coderfeli@163.com>	2025-03-10 11:16:44 +08:00
Thomas Ning	c954bd0cfa	Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955) * tempsave, not selected * finish the feature and merge with develop --------- Co-authored-by: aska-0096 <haocwang@amd.com>	2025-03-07 13:44:06 -08:00
Thomas Ning	9d51d17dd0	Fix on the error (#1956)	2025-03-07 13:43:52 -08:00
Illia Silin	0e8e711ec8	add missing headers (#1959)	2025-03-07 11:11:30 -08:00
Max Podkorytov	9e132eb77c	refactor ck-tile kernel launch (#1925)	2025-03-07 08:29:40 -08:00
Qianfeng	4f54fa3058	Ck tile/complete k prefetch (#1941) * Re-implement qr_ks_vs_async pipeline by using kLoadOnce * Remove last block_sync_lds() in the loop * Tiny adjustment in qr_ks_vs_async pipeline for better performance * Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline * Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline * Use un-rolled gemm for Gemm-0 * Use k0_loops small tile load/store to replace the big tile load/store for K * Remove the commented lines in qx_ks_vs_custom_policy.hpp * Tune the prefetching of V in qr_ks_vs_async pipeline * Move the codes for storing the first v_lds tile some later * Let BlockDropout reuse LDS with V * Switch to separate code blocks according to iteration index * Interleave code blocks for better performance * Move clear_tile(s_acc) for better interleaving * Move code interleaving * Use MakeQDramTileDistribution for q_dram_window * Roll-back to load Q directly from global memory instead of using LDS as intermediary stop * Let V reuse the LDS of K * Use array of tiles to represent Q in vgprs * Use QLoadOnce == false for qr_ks_vs_async pipeline * Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline * Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile() * Move the definition of v_tiles out from the loop * Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile() * Remove using KLoadOnce in qx_ks_vs_custom_policy * Remove un-used get_slice_tile() call * Move the code line of clear_tile(s_acc) * Tune the lines of codes to make them more tidy * Re-arrange the codes before the main-loop * Add comments * Unify the alignment to be 8 for Q/K/V Lds descriptors * Tuning to K pre-loading * Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false * Adjust the pipeline codes * Use NumPrefetchV to separate from NumVLdsBuffers * Tune the location of a scheduler barrier code line * Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths * Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel * Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline * Move the location for store_tile() of first v_tile * Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline * Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<> * Try to fix old bugs in qx_ks_vs_custom_policy * Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler * Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy * Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV) * Tiny Fix (#1888) * Ck tile/paged attention workaround (#1894) * Correction in GetRangeAlongX() * Work-around to solve the failures in test_paged_attention_ck in xformers * Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline * Remove one call of move_tile_window for q_dram_window * Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers() * Tiny fix in qr_ks_vs_whole_k_prefetch pipeline * Adjust the location of codes for storing the first V tile to LDS * Tiny fix and add comments * Change GetSmemKPackK size to improve performance * Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy * Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access * Adjustment in intermediate iteration codes for tiny performance improvement * Reduce the number of VLds buffers to 2 for whole_k_prefetch situation * Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue * Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer() * Remove useless AsyncopyV * Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used * Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy * Recover the qr_ks_vs_async pipeline * Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline * Revert "Try to fix old bugs in qx_ks_vs_custom_policy" This reverts commit `39b82ca194`. * Tiny fix with regard to whole_k_prefetch pipeline compiling * Update kPadSeqLenK setting in fmha_fwd_kernel * Use q_element_func and k_element_func * Use single q_tile rather than multiple sliced q_tiles * Codes refine according to the comments * Re-format one file * Mark qr_ks_vs_whole_k_prefetch as QLoadOnce == true	2025-03-07 14:19:51 +08:00
Illia Silin	43c90b5234	RE-enable DL and DPP instances by default. (#1954) * enable DL and DPP instances by default * fix cmake logic	2025-03-06 21:45:31 -08:00
Max Podkorytov	7a4a5d6c08	Update CODEOWNERS (#1953) Add @tenpercent	2025-03-06 17:38:29 -08:00

1780 Commits