composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-02 04:31:25 +00:00

Author	SHA1	Message	Date
Haocong WANG	a5fdc663c8	fix async copytest bug (#2509 ) * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * change bit width check * refactor macros into constexpr functions which still get inlined * wrap s_waitcnt api * parameterize test * cleanup * cleanup fp8 stub * add fp8 test cases; todo which input parameters are valid? * replace n for fp8 in test cases * add large shapes; fp8 fails again * change input init * test sync/async * time the test * clang-format test * use float instead of bfloat to cover a 4-byte type * fix logic - arg sections should be 'or'd * make block_sync_lds_direct_load interface similar to old ck * fix a few comment typos * name common shapes * revert the example to original logic of not waiting lds * clang-format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-07-23 00:14:02 -07:00
Cong Ma	e62710e461	ck_tile kernel for gemm with groupwise quantized A tensor (#2473 ) * ck_tile kernel for gemm with groupwise quantized A or B tensor. This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. 1. fp8, fp8 -> f32 2. bf8, bf8 -> f32 3. i4, fp8 -> f32 4. i4, bf8 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers. * [Block Scale Gemm] Only run gemm quant examples on __gfx94__ - Only run gemm quant examples on __gfx94__ for usage of `v_cvt_pk_fp8_f32` - Format the code * [Block Scale Gemm] Remove Bquant Gemm BlockScale This cleanup is in preparation for future development of bquant. By isolating Aquant-related code, we can streamline the codebase and make it easier to add and maintain bquant functionality in subsequent updates. * [Block Scale Gemm] Format code with clang-format-12 The latest clang-format (v19) in ROCm 7.0 generate different result than clang-format-12 which is used in CK CI. Format code with clang-format-12 for consistency. * [Block Scale Gemm] Split the k direction loop - Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp to make the logic clearer. - Disable C transposition. * [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm * [Block Scale Gemm] Update copyright * test * Add TailHandler * Move TileDistributionEncodingPatternAQ * Refactor * refactor * fix bug * fix bug * help solve the PR comment * Format the code * [Block Scale Gemm] Add unit tests * [Block Scale Gemm] Add support to 16x16x32 MFMA - Add support to 16x16x32 MFMA - Fix a bug when exchange data crossing lanes --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com> Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-07-23 00:10:16 -07:00
Linjun-AMD	095393276a	h_dim256 fmha use async_qr pipeline (#2510 )	2025-07-18 09:59:38 +08:00
slippedJim	05b65d0c7c	update (#2519 )	2025-07-17 15:24:19 +08:00
Yi DING	f1d8ad2818	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>	2025-07-16 23:56:22 -07:00
Khushbu Agarwal	579bd73435	Fixing numerical error, and interchange preshuffle configs to match with flatmm (#2515 )	2025-07-16 22:33:03 -07:00
MHYangAMD	3499fe67ff	[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409 ) * Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass * Update rmsnorm2d_fwd_pipeline_model_sensitive_pass 1. Add BlockReduce2dTreeCrossWarpSync * Add Rmsnorm2dFusedModelSensitiveEnum * Update patch 1. Reverse generate.py 2. Remove comment in generate.py 3. Update tree cross warp reduce * Refactor RMSNorm model enum and introduce T5-like option * Update the n stage for cross warp reduce * Add new cmdline option in RMSNorm for new pipeline testing --------- Co-authored-by: Clement Lin <clement.lin@amd.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>	2025-07-16 14:05:26 +08:00
rahjain-amd	6b09f0823e	add missing condition for bf16 (#2502 ) Without this DataType = unknown - ``` sh Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s, ``` after this change ```sh Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s, ```	2025-07-15 21:25:56 +05:30
carlushuang	cfe211cc60	[CK_TILE] moe sorting optimize local_token (#2469 ) * fix bug in loops that need use local tokens to compute * support extra chain local_token * update * update * refine some main * update * support dispatch_policy * fix 15 example	2025-07-15 09:42:18 +08:00
Khushbu Agarwal	d239b91fd5	Merge flatmm Operator with universal gemm (#2434 ) * Initial commit * Adding new tile partitioner to flatmm * intermediate changes * debugging kernels * Updating flatmm example to universal gemm example * updated flatmm kernel to run via gemmKernel * update universal gemm to incorporate flatmm * debug * Fix flatmm call * Fixing other kernels and tests for API changes * clang formatted * fixing gemm tests * added test for flatmm and simplify kernel arguments * adding flatmm test * fix test for flatmm * simplify gemm kernel with flatmm * remove flatmm related files * addressing review comments and code clean up * resolving empty file * resolving empty file * clang formatted * addressing review comments * enable persistent kernel for flatmm * reverted the removed files for flatmm * reverted the removed files for flatmm * changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example * some more renames * clang formatted	2025-07-11 08:27:55 -07:00
Andres Lugo	aadeffde18	Update FMHA recipe for Pytorch SDPA integration (#2480 ) * Add receipts in splitk and appendk * remove grouped * Remove logits --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-07-10 09:00:23 -07:00
Po Yen Chen	ad9863fe05	[CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402 ) * Wrap tile size mapping as class method * Warp pipeline generating as class method * Add constraint as kernel dispatching criteria * Support mutltiple tile size for a (hdim, hdim_v) combination * Use smaller tile size if CU utilization is low * Use integar as the key of the tile size map * Fix type error * Simply override parent class method return value * Add attribute to eliminate warnging * Allow using environment variables to turn on/off custom factory * Unify param naming style * Add missing HIP runtime include directive * Fix os.environ.get() usage	2025-07-09 22:01:33 +08:00
Haocong WANG	5557eadce6	[CK TILE] Fix FA build filter (#2369 ) * Fix for fwd/bwd kernel build filter * fix bwd code * cmake depends & bwd filter order fix * revert unexpected reformat * Avoid change fmha bwd filter order for downstream compatibility * Revert unexpected changes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-07-08 10:42:07 +08:00
Thomas Ning	f240ae3248	Enable Async Copy for MI355 (#2425 ) * add for async load builtin * add async load api * fix some compiling errors * fix a compiling error * fix some compiling errors * add a pipeline which copies from v4 * add a new pipeline for async load * fix some compiling errors * add async load tests * fix some issues in async load * fix * fix async inline assembly * fix async inline assembly * add ignore header file * comment some not gfx950 codes * comment some not gfx950 codes * fix a error * update async load apis * fix lds descriptor * fix a compiling error * fix some compiling errors * fix a descriptor issue * update lds descriptor * change async pipeline's tile distribution pattern from thread to warp * fix clang format * update async policy * fix a CRTP issue * fix a typo error * change lds layout * fix some sync issues * improve codes * delete the async test * fix a commented format issue * avoid compiling device functions when compile host * make gemm run * add the copy kernel support * finish the feature * Address comment * add the support for buffer_builtin * solved the merging problem * Comment Addressed --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: joyeamd <John.Ye@amd.com>	2025-07-07 10:08:49 -07:00
rahjain-amd	ad593c286f	Fixing Debug build (#2404 ) Failed to build `tile_example_fmha_bwd` due to below error ``` /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare] 358 \| assert(slopes.size() == nhead); \| ~~~~~~~~~~~~~ ^ ~~~~~ /usr/include/assert.h:103:27: note: expanded from macro 'assert' 103 \| (static_cast <bool> (expr) \ \| ^~~~ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:989:16: note: in instantiation of function template specialization 'run<FmhaBwdFp16>' requested here 989 \| return run<FmhaBwdFp16>(arg_parser) ? 0 : -2; \| ^ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare] 358 \| assert(slopes.size() == nhead); \| ~~~~~~~~~~~~~ ^ ~~~~~ /usr/include/assert.h:103:27: note: expanded from macro 'assert' 103 \| (static_cast <bool> (expr) \ \| ^~~~ /home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:993:16: note: in instantiation of function template specialization 'run<FmhaBwdBf16>' requested here 993 \| return run<FmhaBwdBf16>(arg_parser) ? 0 : -2; \| ^ 2 errors generated when compiling for gfx942. ``` Fixed with proper cast	2025-07-07 14:46:22 +05:30
ltqin	9f4c5d7372	ck tile pagedkv prefill (#2405 ) * add prefetching physical block id for pagedkv * start add pagedkv prefill * rename pipeline * add kernel for pagedkv * add an init version pagedkv prefill * fix redefine issue * add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args * generate dispatch code * add body generating code * comipling pass * remove dropout from pagedkv * set lse to false in generating code * start changing qr kernel to pagedkv * init version of kernerl with pagedkv * change names of file that are generated * chang host validation for pagedkv prefill * using iglp to change blockgemm * add kernel files to op head file * show parameters * rewrite print parameter fun * add fwd * remove default parameter of GridSize * format * fix nhead issue and add seqlen_k_ptr to batch mode * format code * remove no-longer used code * format * fix some comments --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-07-07 16:16:54 +08:00
Mateusz Ozga	394e5be10d	[CK-TILE DOC] Ck-tile grouped GEMM documentation (#1939 ) * Ck-tile readme * After review * Review: part1 * Review part 3	2025-07-04 02:56:42 -07:00
Thomas Ning	e03293ebce	[CK Tile] Int8 Support on CK Tile GEMM (#2267 ) * updates to support int8 in 03_gemm example * added comments, using aliases, helper functions * test(gemm_universal): add test cases for int8 gemm pipeline * fix(test_gemm): fix for failing test unit test for int8 * test(ck_tile): add int8 unit test for gemm universal * refactor(gemm_universal): GPU reference verification for GEMM code improved * style(gemm_universal): removed extra comments and did clang format * merging recent changes to universal gemm to tile_engine * ck tile engine integration work * feat(tile_engine): add int8 support to tile engine ops/gemm * feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8 * style: Format code with clang-format-12 * refactor(tile_engine): address review comments * style: removed unhelpful comments & unused variables. * build: tile engine uses default config * feat: add int8 support for CK_TILE GEMM * style: added trailing commas to codegen_utils.py * refactor: tile engine * refactor: formatting and code review * refactor: code formatting for python files * fix: suppress build warning * add support for gfx950 * refactor:KWarpTile size in gemms util * Fix the branch and wrap up the k warp tile * Add bf8 integration * refactor: clang format and rebase --------- Co-authored-by: zjli2013 <leezhengjiang@gmail.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>	2025-06-25 08:20:35 -07:00
linqunAMD	37e1a27537	[CK_TILE] Refine fp8 support in flatmm (#2239 ) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp defintion to a separate struct * fix clang format * fix clang format * keep default bhavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-25 01:07:45 -07:00
Po Yen Chen	50fad03524	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403 ) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs()	2025-06-25 15:19:21 +08:00
Anton Gorenko	77123600ee	Improve fmha_bwd tests performance (#2376 ) * Avoid passing indices (std::vector) by value to host tensor's operator() Each access requires 2 allocations and copies of the vector. * Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification * Compute ds_hp_host_ref in parallel This sequntial ForEach is the slowest part of validation and it benefits from parallel computation. * Do not use ForEach for simple copy and conversion of large tensors These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices.	2025-06-24 07:45:24 -07:00
Yi DING	b8212864cf	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-06-24 01:33:31 +08:00
Linjun-AMD	61eb622e85	update the way to compute fmha fwd tflop, include mask type (#2386 ) * update the way to compute fwd tflop, include mask type Signed-off-by: JL-underdog <Jun.Lin@amd.com> * remove unneccessary comment * add necessary comment * remove some comment --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Co-authored-by: root <root@GT-SC-DI16-08.dh144.dcgpu>	2025-06-23 15:53:58 +08:00
Po Yen Chen	7d669440a6	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors	2025-06-23 12:29:15 +08:00
Bartłomiej Kocot	cebdee4d9e	[CK TILE] Grouped Convolution Forward Kernel (#2188 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes	2025-06-20 15:44:36 -07:00
Thomas Ning	df6023e305	fix the mi350 error (#2378 )	2025-06-20 12:50:13 -07:00
joyeamd	a2f01141aa	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-06-18 01:28:34 -07:00
linqunAMD	7aeec9a901	[CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd (#2243 ) * [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd * fix error with the latest develop code.	2025-06-17 21:37:59 -07:00
carlushuang	a4e1248dba	[CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335 ) * support local_token for hipgraph * update README * fix comment * fix fmoe example	2025-06-18 10:49:43 +08:00
linqunAMD	0eb8974502	[CK_TILE] Support multi-config in tile_example_gemm_universal (#2240 ) * [CK_TILE] Support multi-config in tile_example_gemm_universal Add GemmConfig in run_gemm_example to support multiple tile config. - It is useful when use you need compare gemm perf with different tile/pipeline config - we also can use it simplify the code for wmma support in the furture. * [CK_TILE] Support multi-config in tile_example_gemm_universal Address review comments * rebase code and fix clang format. * fix clang format * support pipeline v5. * fix merge conflict * address review comment * add missing file * address review comment v2 * fix build error	2025-06-17 17:27:46 -07:00
John Afaganis	df54667102	Add missing copyright headers (#2359 ) * Add missing copyright headers * empty commit	2025-06-17 14:29:45 -07:00
Satyanvesh Dittakavi	4c57157d50	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2025-06-17 11:54:30 -07:00
Thomas Ning	3c4cdfac4f	Fix the CK Tile related operators (#2356 ) * fix the flatmm * Fix the pipeline * address the comment	2025-06-16 17:38:52 -07:00
Illia Silin	5523df4b2d	Revert "fix the flatmm (#2349 )" (#2352 ) This reverts commit `d996bc78be`.	2025-06-16 07:54:55 -07:00
Thomas Ning	d996bc78be	fix the flatmm (#2349 )	2025-06-16 02:17:53 -07:00
Mateusz Ozga	bd96ac9742	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a refrence function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails	2025-06-13 19:39:11 +02:00
kylasa	5f1ad09b61	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build.	2025-06-12 18:24:02 -07:00
Aviral Goel	e5ece14467	fix(gemm_universal): Update gemm_utils.hpp so it builds successfully for memory pipeline (#2336 )	2025-06-12 15:27:14 -07:00
Khushbu Agarwal	bd270fe4bc	fix flatmm kernel for bigger size for fp16 datatype (#2302 )	2025-06-10 11:13:40 -07:00
Aviral Goel	aed0f5880c	Label CMakeLists message() as DEBUG or STATUS for clean build output (#2301 ) * - elevate important build messages to log level STATUS - comment out the rest (temporarily) * - marked all low importance build messages as log_level=DEBUG	2025-06-10 10:46:47 -07:00
Sami Remes	1c6f83df6c	[CK_TILE] Tileloop persistent gemm - resubmit (#2299 ) * Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293) This reverts commit `233e274077`. * Add missing header for kentry --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-06-06 14:18:49 -07:00
Illia Silin	233e274077	Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191 )" (#2293 ) This reverts commit `ffb52783d0`.	2025-06-05 09:24:00 -07:00
Sami Remes	7ea1508b59	[CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222 ) * Add TailHandler for V3, V4 and Mem pipelines * Adapt examples and tests to use TailHandler * move tail-handling logic to pipeline in persistent grouped gemm * Fix Mem pipeline dispatching, add CompV4 dispatching * Use a macro for handling the many tails of Mem pipeline * Fix formatting again * Use const-ref RunFunction, remove unnecessary try_run	2025-06-04 11:50:21 +03:00
Sami Remes	ffb52783d0	[CK_TILE] Tile loop persistent gemm kernel (#2191 ) * Implement tile loop persistent gemm kernel * Enable timing * Add tests for persistent gemm * Fix formatting * Fix gemm_basic * Rename True/False to Persistent/NonPersistent * Use only one set of layouts for persistent tests * Fix gemm example persistent template parameter * Fix formatting	2025-06-04 11:46:28 +03:00
Khushbu Agarwal	59a85cb4bc	[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262 ) * debugging issue * debugging issue * debugging * debugging * reverting debugging code * clang formatted * updating default_config.json * fix ci failure * clang formatted	2025-06-03 20:16:10 -07:00
Khushbu Agarwal	2e38eb4f1c	Rotating buffer PR CI fix (#2257 ) * Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)" This reverts commit `bbdaf79a52`. * fix regression	2025-06-02 10:25:01 -07:00
slippedJim	57f497452a	remove restriction of group mode hd192 no lse (#2252 ) Co-authored-by: Jim <jimguo12@amd.com>	2025-05-30 10:14:21 +08:00
Po Yen Chen	28cd0dffc9	[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251 ) * Add constraint on traits/tile/pipeline * Use kM0=128 if max_seqlen_q == 8192 * Re-format codegen script * Remove redundant attr name postix * Fix import error: default field in dataclass * Use kK0=64 & kK1=64 to hide latency * Use CU utilization to decide tile size	2025-05-29 18:36:33 +09:00
Illia Silin	bbdaf79a52	Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200 )" (#2256 ) This reverts commit `99857e10e6`.	2025-05-28 09:46:52 -06:00
Khushbu Agarwal	99857e10e6	[CK_tile] Add rotating buffer feature for universal gemm (#2200 ) * Add rotating buffer feature for universal gemm * adding changes in tile_engine * Updated code to merge kernel_launch * removing comments * Enable rotating buffer changes to flatmm * Created diff launch_kernel function for rotating buffer * Simplfied calculation using macros * merge code with new changes in tile_engine * clang formatted * Redefine macros	2025-05-27 23:00:58 -07:00

1 2 3 4 5 ...

300 Commits