composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-21 13:29:20 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	d7930cd541	Update IsFulleTileInsideMask() for the case where kUseLocal is true	2025-06-06 15:55:26 +00:00
Qianfeng Zhang	9e6a24010a	Move all test and bench scripts to folder scripts	2025-06-06 08:22:38 +00:00
Qianfeng Zhang	2bb59dff0a	Add two scripts	2025-06-06 08:15:25 +00:00
Andriy Roshchenko	00247e3c29	Optimized GEMMs for MX FP4/8 (#2294) Adds V3 GEMM pipeline for MX FP4 and MX FP8 Adds V3 GEMM pipeline for MX FP4 with preshuffling Adds MXFP4 GEMM tests (#2275) Adds MXFP4 GEMM examples Adds MXFP4 GEMMs to ckProfiler Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: OscarXu <huaiguxu@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: Lin, Qun <qlin@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>	2025-06-05 13:54:15 -06:00
Illia Silin	233e274077	Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit `ffb52783d0`.	2025-06-05 09:24:00 -07:00
Sami Remes	7ea1508b59	[CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222) * Add TailHandler for V3, V4 and Mem pipelines * Adapt examples and tests to use TailHandler * move tail-handling logic to pipeline in persistent grouped gemm * Fix Mem pipeline dispatching, add CompV4 dispatching * Use a macro for handling the many tails of Mem pipeline * Fix formatting again * Use const-ref RunFunction, remove unnecessary try_run	2025-06-04 11:50:21 +03:00
Sami Remes	ffb52783d0	[CK_TILE] Tile loop persistent gemm kernel (#2191) * Implement tile loop persistent gemm kernel * Enable timing * Add tests for persistent gemm * Fix formatting * Fix gemm_basic * Rename True/False to Persistent/NonPersistent * Use only one set of layouts for persistent tests * Fix gemm example persistent template parameter * Fix formatting	2025-06-04 11:46:28 +03:00
Anton Gorenko	52b4860a30	WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230) * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Updates to support mixed precision * Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip * Added support for F8xF16xF16 to gemm_wmma_universal * Added support for F16xF8xF16 to gemm_wmma_universal * Added support for BF16xI4xBF16 to gemm_wmma_universal * Added support for F16xI4xF16 to gemm_wmma_universal * Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType * Added missing test class for FP16_KM_NK * Pre-commit hooks fixes * Added padding instances for f16xf16xf16 * Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8" * Fixed cmake build errors related to test_fp8 * Amending changes for adding support for padding instances for f16xf16xf16 * Fixes for padding instances for f16xf16xf16 * Added padding instances for bf16xbf16, f8xf8 * Added packed instances for bf16xi4xbf16 * Added padding instances for f8xf16xf16 * Added padding instances for f16xf8xf16, f16xi4xf16 * Fixed typos for bf16xbf16xbf16 padding instances * Fixed typos for padded instances * Added tests for fp16, KM_KN and KM_NK * Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances. * Fixed typos * Updated the set of tests for FP16 * Updated the set of tests for FP16 * Fix typo * Moved f16xi4 test under the correct data layout group * example for gemm_universal_bf16 * Adding examples for gemm_wmma instances * Added the missing parameters * Fixed review comments and added executable to cmakeLists * Fixing clang format * Fixing build errors * Fixed compilation failure. * Modified some code as per gemm_universal_examples * Fixed the gemm specialization error * Fixed the build errors. * Fix strides of a/b_thread_desc The descriptors are larger than needed (even though the compiler doesn't alloc registers for unused values). * Load in M/NRepeat dims with thread copy's slice instead of a loop * Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation * Implement Intrawave and Interwave variants of pipeline v1 * Add instances for Interwave and Intrawave v1 * Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0 * Remove instances that are too slow (mostly because of register spilling) * Add a workaround for fp8/bf8->f32 packed conversion issue * Add instances for Interwave and Intrawave v1 * Enable profiling of mixed precision with f8 and int4 on WMMA * Fix segfault in profiler when B is pk_i4_t b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds. * Remove instances that are too slow (mostly because of register spilling) * Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations * Add test case for bf16_i4 * Add missing Regular tests * Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS They take more than 30 seconds * Fix a bug that fp16_i4 validation passes only with PermuteB A permutation required by conversion from pk_i4_t to half_t does not depend on PermuteB, they can be used independently. * Use PermuteB with f16_i4 in most instances (as xdl) Some instances use PermuteB = false for checking correctness. See also the previous commit. * Fix cache flushing for pk_i4 * Add mixed precision examples * Disable all tests and instances with f8 on gfx11 Even though f8_f16 and f16_f8 don't require f8 WMMA instructions, gfx11 still lacks hardware instructions for fast f8->f32 conversion. * Add FP16 KM_NK and KM_KN test suites for XDL These tests were added to common .inc for better testing of WMMA instances * Fix int8 DTYPES check for gemm_bilinear --------- Co-authored-by: Anca Hamuraru <anca@streamhpc.com> Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>	2025-06-04 12:22:33 +06:00
Khushbu Agarwal	59a85cb4bc	[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262) * debugging issue * debugging issue * debugging * debugging * reverting debugging code * clang formatted * updating default_config.json * fix ci failure * clang formatted	2025-06-03 20:16:10 -07:00
Aviral Goel	11f6c14e03	Add 0 as an acceptable argument for strides in CK GEMM example (Issue 2037) (#2268) * add 0 as valid default argument for strides * add 0 as valid default argument for strides # Conflicts: # example/01_gemm/common.hpp	2025-06-03 07:26:58 -07:00
Illia Silin	b76fdbe47f	Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274) * upgrade to rocm6.4.1 and use gfx1x-generic targets * add rocm version parsing * fix the gfx10-3-generic syntax in cmake	2025-06-03 07:17:35 -07:00
Khushbu Agarwal	2e38eb4f1c	Rotating buffer PR CI fix (#2257) * Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)" This reverts commit `bbdaf79a52`. * fix regression	2025-06-02 10:25:01 -07:00
Qianfeng Zhang	9582ae2dff	Move dividing by max_seqlen to end of Gemm1 loop in the reference hstu-attention codes	2025-05-31 06:22:54 +00:00
Qianfeng Zhang	bec35abafe	Tune the settings for hdim-256	2025-05-30 08:51:49 +00:00
Qianfeng Zhang	832747c58d	Add example parameter alpha to ease the testing	2025-05-30 08:51:24 +00:00
slippedJim	57f497452a	remove restriction of group mode hd192 no lse (#2252) Co-authored-by: Jim <jimguo12@amd.com>	2025-05-30 10:14:21 +08:00
Po Yen Chen	28cd0dffc9	[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251) * Add constraint on traits/tile/pipeline * Use kM0=128 if max_seqlen_q == 8192 * Re-format codegen script * Remove redundant attr name postfix * Fix import error: default field in dataclass * Use kK0=64 & kK1=64 to hide latency * Use CU utilization to decide tile size	2025-05-29 18:36:33 +09:00
Qianfeng Zhang	781cba355a	Convert P to fp16/bf16 before doing second gemm in reference hstu implementation	2025-05-29 01:07:04 +00:00
Qianfeng Zhang	36a0f2020c	Non-critical updates in example and block_masking code	2025-05-29 01:06:32 +00:00
Qianfeng Zhang	68a5ab8ff8	Add init_qkv and dump_output example parameters for easier debugging	2025-05-29 01:05:49 +00:00
Illia Silin	bbdaf79a52	Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256) This reverts commit `99857e10e6`.	2025-05-28 09:46:52 -06:00
Khushbu Agarwal	99857e10e6	[CK_tile] Add rotating buffer feature for universal gemm (#2200) * Add rotating buffer feature for universal gemm * adding changes in tile_engine * Updated code to merge kernel_launch * removing comments * Enable rotating buffer changes to flatmm * Created diff launch_kernel function for rotating buffer * Simplified calculation using macros * merge code with new changes in tile_engine * clang formatted * Redefine macros	2025-05-27 23:00:58 -07:00
Aviral Goel	c52649ad57	Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234) * added catch statements to examples * clang format	2025-05-27 22:32:42 -07:00
Qianfeng Zhang	10c35125d2	Add example parameter max_seqlen and max_target	2025-05-27 14:43:43 +00:00
Qianfeng Zhang	c9e19351c7	Update to the method for calculating max_seqlen in the example	2025-05-27 10:38:52 +00:00
Qianfeng Zhang	dc0977faad	Use NRepetitions2DEpilogue for outputting o_acc tile	2025-05-26 14:43:30 +00:00
Zzz9990	ece38b9d7a	[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221) * fix splitkv compiler issue since lse is used to select kernel instances * bypass seqlen == 1 * add chunked prefill into mha varlen This reverts commit `aa9847e42d`. * skip compile when receipt 2-4 and add comments * fix --------- Co-authored-by: fsx950223 <fsx950223@outlook.com>	2025-05-26 19:17:18 +08:00
Illia Silin	8146e471f1	fix the buffer intrinsic names for clang >=20 (#2228)	2025-05-23 14:58:25 -07:00
Illia Silin	1b846143c6	Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227) This reverts commit `58f9e9ffbc`.	2025-05-22 15:41:17 -07:00
Qianfeng Zhang	81f7b139e0	Use LDS to indirectly load Q-tile to enable dwordx4 loading and avoid wasting cache lines	2025-05-22 09:49:11 +00:00
SamiAario-AMD	380bca2b85	Fix 11_add_rmsnorm2d_rdquant (#2207)	2025-05-20 15:15:28 -07:00
Thomas Ning	1386924749	Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212) * Add small instance, add the bug fix, & improve the example CMake * clang format	2025-05-20 15:05:08 -07:00
Sami Remes	d1e6f0982d	[CK_TILE] Grouped GEMM tile loop (#2146) * Add trait to use a persistent kernel and split the entrypoints in grouped gemm * Some helper functions for persistent kernel case * Get max occupancy grid using device properties * Implement tile loop in main entry point to grouped gemm * Enable GridSize() on device * Handle offset tile index using real current block index * Add persistent kernel choice to grouped gemm example * Use a for-loop for iterating over the group * Reduce VGPR spills by early-exit * Enable persistent kernel choice in grouped_gemm example * Add persistent kernel option to grouped_gemm test * Fix formatting with remod.py * Remove GridUpdateBlocks as blocks are now iteratively computed * Add comment about VGPR spilling * Fix formatting * Use CK_TILE_HOST instead of __host__ * Enable all Row/Col combinations in grouped gemm unit test * Add some KBatch=2 cases to grouped gemm tests * Fix SplitK for grouped gemm * Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm * Add type traits * Split examples to regular and tileloop * Formatting * Use hipExtStreamGetCUMask to get current active CUs for the given stream * Align test and example kernel config, and disable validation for splitk repeats * Remove debug options from CMakeLists.txt * Separate the code paths for persistent/non-persistent in test * Fix formatting * Address review comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-05-20 17:18:57 +03:00
Qianfeng Zhang	a1346aaf3e	Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm	2025-05-20 07:53:25 +00:00
Qianfeng Zhang	0a8ea6bd02	Adjust the threshold values for fp16/bf16 in the example	2025-05-20 07:52:37 +00:00
Qianfeng Zhang	29cf1610f1	Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt	2025-05-20 07:52:01 +00:00
Aviral Goel	c4929225f6	remove debug statements from CMakeLists (#2204)	2025-05-19 17:31:04 -07:00
jefyang1	f18170064d	Use new mfma instructions for FP8 on gfx950 (#2202) * Add logic to use new mfma instructions for fp8 bf8 * Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format * Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com> * Fix intrin_mfma f8 calls due to merge mistake --------- Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>	2025-05-19 17:29:51 -07:00
jefyang1	b8b12bb81e	Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203) * Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 * Run clang format	2025-05-19 14:25:50 -07:00
Qianfeng Zhang	fac03abbda	Change do-while main-loop to while-do and remove early exiting check	2025-05-19 15:39:31 +00:00
Qianfeng Zhang	14ab6f154d	Adjust the codes before the main-loop	2025-05-19 13:59:57 +00:00
Qianfeng Zhang	f411d676f2	Move k_tile loading and v_tile loading earlier in the loop	2025-05-19 13:59:24 +00:00
Qianfeng Zhang	902b1c645c	Move k_tile loading in the loop earlier	2025-05-19 13:58:43 +00:00
Qianfeng Zhang	f582c21418	Replace s_acc and pcomp tile array by single tile object for simplification	2025-05-19 13:57:11 +00:00
Qianfeng Zhang	4e65469fe8	Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler for better code isolation	2025-05-18 16:20:47 +00:00
Qianfeng Zhang	e4e70f8b0a	Set the block_per_cu to 3 for hdim-128	2025-05-18 15:58:57 +00:00
Qianfeng Zhang	ff3415d97d	Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1	2025-05-18 15:26:31 +00:00
Qianfeng Zhang	694295a9d3	Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0	2025-05-18 15:25:43 +00:00
Qianfeng Zhang	afd7793e92	Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0	2025-05-18 15:25:05 +00:00
Qianfeng Zhang	7c0ac51b4b	Hack block_gemm_areg_bsmem_creg_v2 for gemm_1	2025-05-18 15:24:17 +00:00


1065 Commits