composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-24 14:54:47 +00:00

Author	SHA1	Message	Date
Aviral Goel	aed0f5880c	Label CMakeLists message() as DEBUG or STATUS for clean build output (#2301) * - elevate important build messages to log level STATUS - comment out the rest (temporarily) * - marked all low importance build messages as log_level=DEBUG	2025-06-10 10:46:47 -07:00
Qianfeng Zhang	08886e99d5	Enable BATCH_AS_FIRST_GRID_DIM grid-scheduling and use ASSUME_LEAST_VARIED_SEQLEN for building control	2025-06-10 15:44:29 +00:00
Qianfeng Zhang	4632d30cc0	Improve the VDramTileDistribution and VLds layout for better device loading and reduce bank-conflict	2025-06-08 13:25:57 +00:00
Qianfeng Zhang	84eb9adc71	Move GetKPackV() and GetAlignmentV() out of ck_tile fmha to hstu pipeline default policy for better visibility	2025-06-07 15:01:19 +00:00
Sami Remes	1c6f83df6c	[CK_TILE] Tileloop persistent gemm - resubmit (#2299) * Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit `233e274077`. * Add missing header for kentry --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-06 14:18:49 -07:00
Qianfeng Zhang	b2db644dcd	Add assert(contextual_seqlen >= 0) in example	2025-06-06 15:56:04 +00:00
Qianfeng Zhang	d7930cd541	Update IsFulleTileInsideMask() for the case where kUseLocal is true	2025-06-06 15:55:26 +00:00
Qianfeng Zhang	9e6a24010a	Move all test and bench scripts to folder scripts	2025-06-06 08:22:38 +00:00
Qianfeng Zhang	2bb59dff0a	Add two scripts	2025-06-06 08:15:25 +00:00
Illia Silin	233e274077	Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit `ffb52783d0`.	2025-06-05 09:24:00 -07:00
Sami Remes	7ea1508b59	[CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222) * Add TailHandler for V3, V4 and Mem pipelines * Adapt examples and tests to use TailHandler * move tail-handling logic to pipeline in persistent grouped gemm * Fix Mem pipeline dispatching, add CompV4 dispatching * Use a macro for handling the many tails of Mem pipeline * Fix formatting again * Use const-ref RunFunction, remove unnecessary try_run	2025-06-04 11:50:21 +03:00
Sami Remes	ffb52783d0	[CK_TILE] Tile loop persistent gemm kernel (#2191) * Implement tile loop persistent gemm kernel * Enable timing * Add tests for persistent gemm * Fix formatting * Fix gemm_basic * Rename True/False to Persistent/NonPersistent * Use only one set of layouts for persistent tests * Fix gemm example persistent template parameter * Fix formatting	2025-06-04 11:46:28 +03:00
Khushbu Agarwal	59a85cb4bc	[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262) * debugging issue * debugging issue * debugging * debugging * reverting debugging code * clang formatted * updating default_config.json * fix ci failure * clang formatted	2025-06-03 20:16:10 -07:00
Khushbu Agarwal	2e38eb4f1c	Rotating buffer PR CI fix (#2257) * Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)" This reverts commit `bbdaf79a52`. * fix regression	2025-06-02 10:25:01 -07:00
Qianfeng Zhang	9582ae2dff	Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes	2025-05-31 06:22:54 +00:00
Qianfeng Zhang	bec35abafe	Tune the settings for hdim-256	2025-05-30 08:51:49 +00:00
Qianfeng Zhang	832747c58d	Add example parameter alpha to ease the testing	2025-05-30 08:51:24 +00:00
slippedJim	57f497452a	remove restriction of group mode hd192 no lse (#2252) Co-authored-by: Jim <jimguo12@amd.com>	2025-05-30 10:14:21 +08:00
Po Yen Chen	28cd0dffc9	[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251) * Add constraint on traits/tile/pipeline * Use kM0=128 if max_seqlen_q == 8192 * Re-format codegen script * Remove redundant attr name postfix * Fix import error: default field in dataclass * Use kK0=64 & kK1=64 to hide latency * Use CU utilization to decide tile size	2025-05-29 18:36:33 +09:00
Qianfeng Zhang	781cba355a	Convert P to fp16/bf16 before doing second gemm in reference hstu implementation	2025-05-29 01:07:04 +00:00
Qianfeng Zhang	36a0f2020c	Non-critical updates in example and block_masking code	2025-05-29 01:06:32 +00:00
Qianfeng Zhang	68a5ab8ff8	Add init_qkv and dump_output example parameters for easier debugging	2025-05-29 01:05:49 +00:00
Illia Silin	bbdaf79a52	Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256) This reverts commit `99857e10e6`.	2025-05-28 09:46:52 -06:00
Khushbu Agarwal	99857e10e6	[CK_tile] Add rotating buffer feature for universal gemm (#2200) * Add rotating buffer feature for universal gemm * adding changes in tile_engine * Updated code to merge kernel_launch * removing comments * Enable rotating buffer changes to flatmm * Created diff launch_kernel function for rotating buffer * Simplified calculation using macros * merge code with new changes in tile_engine * clang formatted * Redefine macros	2025-05-27 23:00:58 -07:00
Aviral Goel	c52649ad57	Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234) * added catch statements to examples * clang format	2025-05-27 22:32:42 -07:00
Qianfeng Zhang	10c35125d2	Add example parameter max_seqlen and max_target	2025-05-27 14:43:43 +00:00
Qianfeng Zhang	c9e19351c7	Update to the method for calculating max_seqlen in the example	2025-05-27 10:38:52 +00:00
Qianfeng Zhang	dc0977faad	Use NRepetitions2DEpilogue for outputting o_acc tile	2025-05-26 14:43:30 +00:00
Zzz9990	ece38b9d7a	[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221) * fix splitkv compiler issue since lse is used to select kernel instances * bypass seqlen == 1 * add chunked prefill into mha varlen This reverts commit `aa9847e42d`. * skip compile when receipt 2-4 and add comments * fix --------- Co-authored-by: fsx950223 <fsx950223@outlook.com>	2025-05-26 19:17:18 +08:00
Illia Silin	8146e471f1	fix the buffer intrinsic names for clang >=20 (#2228)	2025-05-23 14:58:25 -07:00
Illia Silin	1b846143c6	Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227) This reverts commit `58f9e9ffbc`.	2025-05-22 15:41:17 -07:00
Qianfeng Zhang	81f7b139e0	Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cache lines	2025-05-22 09:49:11 +00:00
SamiAario-AMD	380bca2b85	Fix 11_add_rmsnorm2d_rdquant (#2207)	2025-05-20 15:15:28 -07:00
Sami Remes	d1e6f0982d	[CK_TILE] Grouped GEMM tile loop (#2146) * Add trait to use a persistent kernel and split the entrypoints in grouped gemm * Some helper functions for persistent kernel case * Get max occupancy grid using device properties * Implement tile loop in main entry point to grouped gemm * Enable GridSize() on device * Handle offset tile index using real current block index * Add persistent kernel choice to grouped gemm example * Use a for-loop for iterating over the group * Reduce VGPR spills by early-exit * Enable persistent kernel choice in grouped_gemm example * Add persistent kernel option to grouped_gemm test * Fix formatting with remod.py * Remove GridUpdateBlocks as blocks are now iteratively computed * Add comment about VGPR spilling * Fix formatting * Use CK_TILE_HOST instead of __host__ * Enable all Row/Col combinations in grouped gemm unit test * Add some KBatch=2 cases to grouped gemm tests * Fix SplitK for grouped gemm * Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm * Add type traits * Split examples to regular and tileloop * Formatting * Use hipExtStreamGetCUMask to get current active CUs for the given stream * Align test and example kernel config, and disable validation for splitk repeats * Remove debug options from CMakeLists.txt * Separate the code paths for persistent/non-persistent in test * Fix formatting * Address review comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-05-20 17:18:57 +03:00
Qianfeng Zhang	a1346aaf3e	Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm	2025-05-20 07:53:25 +00:00
Qianfeng Zhang	0a8ea6bd02	Adjust the threshold values for fp16/bf16 in the example	2025-05-20 07:52:37 +00:00
Qianfeng Zhang	29cf1610f1	Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt	2025-05-20 07:52:01 +00:00
Qianfeng Zhang	fac03abbda	Change do-while main-loop to while-do and remove early exiting check	2025-05-19 15:39:31 +00:00
Qianfeng Zhang	14ab6f154d	Adjust the codes before the main-loop	2025-05-19 13:59:57 +00:00
Qianfeng Zhang	f411d676f2	Move k_tile loading and v_tile loading earlier in the loop	2025-05-19 13:59:24 +00:00
Qianfeng Zhang	902b1c645c	Move k_tile loading in the loop earlier	2025-05-19 13:58:43 +00:00
Qianfeng Zhang	f582c21418	Replace s_acc and pcomp tile array by single tile object for simplification	2025-05-19 13:57:11 +00:00
Qianfeng Zhang	4e65469fe8	Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler to isolate code better	2025-05-18 16:20:47 +00:00
Qianfeng Zhang	e4e70f8b0a	Set the block_per_cu to 3 for hdim-128	2025-05-18 15:58:57 +00:00
Qianfeng Zhang	ff3415d97d	Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1	2025-05-18 15:26:31 +00:00
Qianfeng Zhang	694295a9d3	Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0	2025-05-18 15:25:43 +00:00
Qianfeng Zhang	afd7793e92	Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0	2025-05-18 15:25:05 +00:00
Qianfeng Zhang	7c0ac51b4b	Hack block_gemm_areg_bsmem_creg_v2 for gemm_1	2025-05-18 15:24:17 +00:00
Qianfeng Zhang	473fbc374b	Rename the hacked block_gemm_areg_bsmem_creg_v2	2025-05-18 15:23:40 +00:00
Qianfeng Zhang	58e45ec53a	Move the lambda for dividing by max_seqlen from kernel to pipeline	2025-05-18 07:58:52 +00:00

415 Commits