composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-15 02:27:57 +00:00

Author	SHA1	Message	Date
Haocong WANG	4ed2dda658	fix async copytest bug (#2509 ) * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * change bit width check * refactor macros into constexpr functions which still get inlined * wrap s_waitcnt api * parameterize test * cleanup * cleanup fp8 stub * add fp8 test cases; todo which input parameters are valid? * replace n for fp8 in test cases * add large shapes; fp8 fails again * change input init * test sync/async * time the test * clang-format test * use float instead of bfloat to cover a 4-byte type * fix logic - arg sections should be 'or'd * make block_sync_lds_direct_load interface similar to old ck * fix a few comment typos * name common shapes * revert the example to original logic of not waiting lds * clang-format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `a5fdc663c8`]	2025-07-23 00:14:02 -07:00
Cong Ma	baf244000e	ck_tile kernel for gemm with groupwise quantized A tensor (#2473 ) * ck_tile kernel for gemm with groupwise quantized A or B tensor. This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers. Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats. 1. fp8, fp8 -> f32 2. bf8, bf8 -> f32 3. i4, fp8 -> f32 4. i4, bf8 -> f32 Group size can go down to as low as K length of underlying WarpGemm primitive. For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers. * [Block Scale Gemm] Only run gemm quant examples on __gfx94__ - Only run gemm quant examples on __gfx94__ for usage of `v_cvt_pk_fp8_f32` - Format the code * [Block Scale Gemm] Remove Bquant Gemm BlockScale This cleanup is in preparation for future development of bquant. By isolating Aquant-related code, we can streamline the codebase and make it easier to add and maintain bquant functionality in subsequent updates. * [Block Scale Gemm] Format code with clang-format-12 The latest clang-format (v19) in ROCm 7.0 generate different result than clang-format-12 which is used in CK CI. Format code with clang-format-12 for consistency. * [Block Scale Gemm] Split the k direction loop - Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp to make the logic clearer. - Disable C transposition. 
* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm * [Block Scale Gemm] Update copyright * test * Add TailHandler * Move TileDistributionEncodingPatternAQ * Refactor * refactor * fix bug * fix bug * help solve the PR comment * Format the code * [Block Scale Gemm] Add unit tests * [Block Scale Gemm] Add support to 16x16x32 MFMA - Add support to 16x16x32 MFMA - Fix a bug when exchange data crossing lanes --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com> Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `e62710e461`]	2025-07-23 00:10:16 -07:00
Yi DING	bb1a05cc87	[CK_TILE] Fix tile_example_moe_sorting broke in #2436 (#2525 ) [ROCm/composable_kernel commit: `f0a8c18017`]	2025-07-17 22:50:58 -07:00
Mateusz Ozga	96e2d5f86a	Fix CI clang-format (#2521 ) [ROCm/composable_kernel commit: `7fc000d7b3`]	2025-07-17 14:41:29 +02:00
Haocong WANG	894750fd59	fix mfma32x32 dispatch (#2490 ) [ROCm/composable_kernel commit: `28072adc3a`]	2025-07-17 15:24:12 +08:00
Yi DING	cee5776046	[CK_TILE] Use read_tr in universal gemm (#2436 ) * Use read_tr in universal gemm * Enable all instances back * Revert example37 changes * Resolve comments * resolve comments 2 * Fix assertion msg * fix the gemm basic * change index_t to bool for preshuffle variable * Solve the comment --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> [ROCm/composable_kernel commit: `f1d8ad2818`]	2025-07-16 23:56:22 -07:00
Po Yen Chen	e75bbb6a13	Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389 )" (#2514 ) This reverts commit c5014a86765c598215bf2b58f89c3a7c70f92bac. [ROCm/composable_kernel commit: `722c22fb15`]	2025-07-17 10:09:01 +08:00
linqunAMD	348dec0d0c	Fix build errors on windows (#2456 ) * Fix build errors on windows * correct clang format --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> [ROCm/composable_kernel commit: `6e76b82059`]	2025-07-16 07:58:23 -07:00
MHYangAMD	ff8d3d0d13	[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409 ) * Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass * Update rmsnorm2d_fwd_pipeline_model_sensitive_pass 1. Add BlockReduce2dTreeCrossWarpSync * Add Rmsnorm2dFusedModelSensitiveEnum * Update patch 1. Reverse generate.py 2. Remove comment in generate.py 3. Update tree cross warp reduce * Refactor RMSNorm model enum and introduce T5-like option * Update the n stage for cross warp reduce * Add new cmdline option in RMSNorm for new pipeline testing --------- Co-authored-by: Clement Lin <clement.lin@amd.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com> [ROCm/composable_kernel commit: `3499fe67ff`]	2025-07-16 14:05:26 +08:00
carlushuang	34ca5f6a68	[CK_TILE] moe sorting optimize local_token (#2469 ) * fix bug in loops that need use local tokens to compute * support extra chain local_token * update * update * refine some main * update * support dispatch_policy * fix 15 example [ROCm/composable_kernel commit: `cfe211cc60`]	2025-07-15 09:42:18 +08:00
Gino Lu	b5517fb522	[CK_TILE] Add pk_fp4 data type (#2422 ) * [draft] Add pk_fp4 and test * Add hw conversion for fp4 * Refine test code and pk_fp4 constructor. * fix test indent * modify according to comment. * fix clang-format * modify according comments. --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `141bf2d54d`]	2025-07-14 20:35:06 +08:00
Khushbu Agarwal	f3120e7526	Merge flatmm Operator with universal gemm (#2434 ) * Initial commit * Adding new tile partitioner to flatmm * intermediate changes * debugging kernels * Updating flatmm example to universal gemm example * updated flatmm kernel to run via gemmKernel * update universal gemm to incorporate flatmm * debug * Fix flatmm call * Fixing other kernels and tests for API changes * clang formatted * fixing gemm tests * added test for flatmm and simplify kernel arguments * adding flatmm test * fix test for flatmm * simplify gemm kernel with flatmm * remove flatmm related files * addressing review comments and code clean up * resolving empty file * resolving empty file * clang formatted * addressing review comments * enable persistent kernel for flatmm * reverted the removed files for flatmm * reverted the removed files for flatmm * changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example * some more renames * clang formatted [ROCm/composable_kernel commit: `d239b91fd5`]	2025-07-11 08:27:55 -07:00
Qianfeng	337126469c	Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487 ) * Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline * i_nhead_ conversion type to prevent overflow --------- Co-authored-by: ltqin <letaoqin@amd.com> [ROCm/composable_kernel commit: `45904b8fd7`]	2025-07-11 18:14:47 +08:00
shay-li77	4f08a02dae	support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338 ) * mask support ratio for y axis * format code * add notes for param y_ratio * fix comments error * support template and mdiv for ratio mask * refactor y-ratio mask constructor * optimize coordinate calculation * add SimplifiedRatioAttentionMask [ROCm/composable_kernel commit: `d814fefe18`]	2025-07-09 23:18:55 +08:00
Yi DING	9f5cf4f49d	[CK_TILE] Avoid compile kernel in host pass (#2475 ) [ROCm/composable_kernel commit: `032ca60015`]	2025-07-09 22:27:54 +08:00
Haocong WANG	7c04d93083	[CK TILE] Fix FA build filter (#2369 ) * Fix for fwd/bwd kernel build filter * fix bwd code * cmake depends & bwd filter order fix * revert unexpected reformat * Avoid change fmha bwd filter order for downstream compatibility * Revert unexpected changes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com> [ROCm/composable_kernel commit: `5557eadce6`]	2025-07-08 10:42:07 +08:00
Illia Silin	99cf9b9cae	fix compilation errors with clang20 (#2464 ) [ROCm/composable_kernel commit: `e033a1b4bf`]	2025-07-07 19:40:30 -07:00
Po Yen Chen	a71dc1245f	Eliminate warning caused by failed to meet occupancy requirement (#2389 ) Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `b2dea90116`]	2025-07-08 09:17:25 +08:00
Thomas Ning	1129c9dc4e	Enable Async Copy for MI355 (#2425 ) * add for async load builtin * add async load api * fix some compiling errors * fix a compiling error * fix some compiling errors * add a pipeline which copies from v4 * add a new pipeline for async load * fix some compiling errors * add async load tests * fix some issues in async load * fix * fix async inline assembly * fix async inline assembly * add ignore header file * comment some not gfx950 codes * comment some not gfx950 codes * fix a error * update async load apis * fix lds descriptor * fix a compiling error * fix some compiling errors * fix a descriptor issue * update lds descriptor * change async pipeline's tile distribution pattern from thread to warp * fix clang format * update async policy * fix a CRTP issue * fix a typo error * change lds layout * fix some sync issues * improve codes * delete the async test * fix a commented format issue * avoid compiling device functions when compile host * make gemm run * add the copy kernel support * finish the feature * Address comment * add the support for buffer_builtin * solved the merging problem * Comment Addressed --------- Co-authored-by: joye <joye@amd.com> Co-authored-by: joyeamd <John.Ye@amd.com> [ROCm/composable_kernel commit: `f240ae3248`]	2025-07-07 10:08:49 -07:00
ltqin	ba133fe9b7	ck tile pagedkv prefill (#2405 ) * add prefetching physical block id for pagedkv * start add pagedkv prefill * rename pipeline * add kernel for pagedkv * add an init version pagedkv prefill * fix redefine issue * add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args * generate dispatch code * add body generating code * comipling pass * remove dropout from pagedkv * set lse to false in generating code * start changing qr kernel to pagedkv * init version of kernerl with pagedkv * change names of file that are generated * chang host validation for pagedkv prefill * using iglp to change blockgemm * add kernel files to op head file * show parameters * rewrite print parameter fun * add fwd * remove default parameter of GridSize * format * fix nhead issue and add seqlen_k_ptr to batch mode * format code * remove no-longer used code * format * fix some comments --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `9f4c5d7372`]	2025-07-07 16:16:54 +08:00
carlushuang	8e15d99ddc	default skip y point to r (#2457 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `0aecb5ab68`]	2025-07-06 23:54:34 -07:00
carlushuang	4ed061c05d	[CK_TILE][CORE] enhance slice_tile api (#2430 ) * support slice cross p * fix some bug in y_len * more case * fix a bug when R exist * support -1 to hint end of current length * format * change commit [ROCm/composable_kernel commit: `a8742f7e31`]	2025-07-06 20:13:12 -07:00
Max Podkorytov	70f959ba12	[CK-TILE] File-level documentation for static encoding pattern (#2433 ) * add file-level comment * Finished the write-up --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `158ddeb8ce`]	2025-07-04 02:26:18 -07:00
Thomas Ning	90add28587	[CK Tile] Int8 Support on CK Tile GEMM (#2267 ) * updates to support int8 in 03_gemm example * added comments, using aliases, helper functions * test(gemm_universal): add test cases for int8 gemm pipeline * fix(test_gemm): fix for failing test unit test for int8 * test(ck_tile): add int8 unit test for gemm universal * refactor(gemm_universal): GPU reference verification for GEMM code improved * style(gemm_universal): removed extra comments and did clang format * merging recent changes to universal gemm to tile_engine * ck tile engine integration work * feat(tile_engine): add int8 support to tile engine ops/gemm * feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8 * style: Format code with clang-format-12 * refactor(tile_engine): address review comments * style: removed unhelpful comments & unused variables. * build: tile engine uses default config * feat: add int8 support for CK_TILE GEMM * style: added trailing commas to codegen_utils.py * refactor: tile engine * refactor: formatting and code review * refactor: code formatting for python files * fix: suppress build warning * add support for gfx950 * refactor:KWarpTile size in gemms util * Fix the branch and wrap up the k warp tile * Add bf8 integration * refactor: clang format and rebase --------- Co-authored-by: zjli2013 <leezhengjiang@gmail.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: Khushbu Agarwal <khuagarw@amd.com> [ROCm/composable_kernel commit: `e03293ebce`]	2025-06-25 08:20:35 -07:00
linqunAMD	d2ec53a74e	[CK_TILE] Refine fp8 support in flatmm (#2239 ) * [CK_TILE] Refine fp8 in flatmm 1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr 2. Add an additional const check to avoid build error in HotLoopScheduler 3. Refine shuffleb to support both tile 32x32 and 16x16 4. Support command option -init 5. Move Gemm warp defintion to a separate struct * fix clang format * fix clang format * keep default bhavior unchanged (warp tile = 16x16) * fix tile engine build error * fix a typo in codegen_utils.py * address review comments * address review comments --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `37e1a27537`]	2025-06-25 01:07:45 -07:00
Po Yen Chen	b62e551ccb	[CK_TILE] Add missing parameter 'min_seqlen_q' to the FMHA fwd kernel MakeKargs() interface (#2403 ) * Rename batch_prerfill interface * Add min_seqlen_q parameter in MakeKargs() [ROCm/composable_kernel commit: `50fad03524`]	2025-06-25 15:19:21 +08:00
Yi DING	c4ba466332	Fix unmatched K size of WarpGemmMfmaBf16Bf16F32M16N16K32TransposedCDistribution on gfx950 (#2393 ) [ROCm/composable_kernel commit: `c5d9181e1b`]	2025-06-24 16:35:54 -07:00
Anton Gorenko	193c84af34	Improve fmha_bwd tests performance (#2376 ) * Avoid passing indices (std::vector) by value to host tensor's operator() Each access requires 2 allocations and copies of the vector. * Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification * Compute ds_hp_host_ref in parallel This sequntial ForEach is the slowest part of validation and it benefits from parallel computation. * Do not use ForEach for simple copy and conversion of large tensors These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and can be copied/converted without complex computations of linear indices. [ROCm/composable_kernel commit: `77123600ee`]	2025-06-24 07:45:24 -07:00
Yi DING	cba904aeff	[CK_TILE] FMHA Support hdim_v to as a Multiple of 32 (#2114 ) * 160+192 * Add splitkv d160 * cleanup * fix * Add change log * Fix CHANGELOG * Use static_cast * Update ignored instance --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `b8212864cf`]	2025-06-24 01:33:31 +08:00
Po Yen Chen	4456552543	[CK_TILE] Fix compilation errors introduced in #2320 , #2219 and #2214 (#2388 ) * Fix compilation errors * Fix more ck_tile example compilation errors [ROCm/composable_kernel commit: `7d669440a6`]	2025-06-23 12:29:15 +08:00
Max Podkorytov	eb96164495	Update for xformers (#2372 ) * update api * update kernel api * clang-format [ROCm/composable_kernel commit: `0366fb2abc`]	2025-06-22 00:28:30 -07:00
Bartłomiej Kocot	2567f5e538	[CK TILE] Grouped Convolution Forward Kernel (#2188 ) * [CK TILE] Grouped Convolution Forward Kernel * custom vector size * fixes * refactor * rebase fixes * fixes * fixes [ROCm/composable_kernel commit: `cebdee4d9e`]	2025-06-20 15:44:36 -07:00
Thomas Ning	7e4994ac35	Transpose builtin macro defense (#2374 ) * add the macro defense * add the static assert check [ROCm/composable_kernel commit: `107e3623c7`]	2025-06-20 11:24:54 -07:00
Max Podkorytov	1beaead90c	Reland fix default epilogue (#2367 ) * Revert "Revert "Fix default epilogue (#2358)" (#2364)" This reverts commit c0a58748ade1dc4d5405894c41ca33f923e480c6. * add operator() with old signature [ROCm/composable_kernel commit: `11eb9f1c77`]	2025-06-19 10:39:30 -07:00
joyeamd	fdfcee3b98	transpose load api development (#2177 ) * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * add transpose load; no real logic * fix some compile errors * fix some issues * update transpose load logic * add some fixes * fix a distribution issue * update some codes * add some fix * can pass; but no logic * transpose load enable * update tile transpose * miss output tile distribution mapping * hack for transpose 16x16 * update output tensor distribution * delete unused variables * fix transpose related codes * update transpose load example * exchange the iteration order * fix 16x16 related dimension transpose * fix a transpose index issue * fix a transpose index issue * fix clang format check * update load tile transpose related codes * fix compile errors and pass 16x16 tests * fix a typo * update logic * check other data types * add transpose load api * 
update transpose load api * fix clang format check * change file name * refactor codes * update code name * delete some unused codes * delete the unused oob flag for transpose load * update tensor view api for transpose load * update for testing * fix a typo error * move transpose ops to example directory * update transpose api * update include file * fix for pr review * fix compile errors * change directory name * delete the duplicated directory * update cmakelists file * delete the unused codes * update function names * update transpose policy * update code after remod.py * update codes * add some comment * Polish the instr infrastructure * build up the fixed instr * redesign the transpose api, currently it has numerical error * add the bf16 transpose * fix some issues * add some comments * update document * Finished the refactor of API and pass through the verification * fix the merging issue --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `a2f01141aa`]	2025-06-18 01:28:34 -07:00
Thomas Ning	ffafdec4d8	Revert "Fix default epilogue (#2358 )" (#2364 ) This reverts commit f1e842455a66c955a3df38a9fc383141c4bcc62f. [ROCm/composable_kernel commit: `64a2fda713`]	2025-06-17 22:43:05 -07:00
carlushuang	8660f6ef22	[CK_TILE] moe_sorting support "local_tokens" feature for EP case (#2335 ) * support local_token for hipgraph * update README * fix comment * fix fmoe example [ROCm/composable_kernel commit: `a4e1248dba`]	2025-06-18 10:49:43 +08:00
Max Podkorytov	9a342df600	Fix default epilogue (#2358 ) * [ck-tile] fix default epilogue in gemm universal * argument validation needs vector size D * operator() needs to specify dram windows * copy/paste from cshuffle epilogue * clang-format * mark unused argument --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `cd606f72c1`]	2025-06-17 17:30:21 -07:00
linqunAMD	ca0c6fb660	[CK_TILE] Support multi-config in tile_example_gemm_universal (#2240 ) * [CK_TILE] Support multi-config in tile_example_gemm_universal Add GemmConfig in run_gemm_example to support multiple tile config. - It is useful when use you need compare gemm perf with different tile/pipeline config - we also can use it simplify the code for wmma support in the furture. * [CK_TILE] Support multi-config in tile_example_gemm_universal Address review comments * rebase code and fix clang format. * fix clang format * support pipeline v5. * fix merge conflict * address review comment * add missing file * address review comment v2 * fix build error [ROCm/composable_kernel commit: `0eb8974502`]	2025-06-17 17:27:46 -07:00
Satyanvesh Dittakavi	a4517b0a9d	Do not use warpSize as compile time constant as it is removed (#2320 ) * Do not use warpSize as compile time constant as it is removed * Update tile_image_to_column_shape.hpp update warpSize usage. * clean-up all use of warpSize, make sure code builds * fix --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `4c57157d50`]	2025-06-17 11:54:30 -07:00
Thomas Ning	cb631cd5b1	Fix the CK Tile related operators (#2356 ) * fix the flatmm * Fix the pipeline * address the comment [ROCm/composable_kernel commit: `3c4cdfac4f`]	2025-06-16 17:38:52 -07:00
Illia Silin	fba2ac98ae	Revert "fix the flatmm (#2349 )" (#2352 ) This reverts commit 37f25427f2ec6c907bf12030338f69945691f54c. [ROCm/composable_kernel commit: `5523df4b2d`]	2025-06-16 07:54:55 -07:00
Thomas Ning	6f158242b6	fix the flatmm (#2349 ) [ROCm/composable_kernel commit: `d996bc78be`]	2025-06-16 02:17:53 -07:00
ruanjm	1f77d58ae9	Add support for specifying valid flag when fetching elements for tile_scatter_gather (#2332 ) * Add support for specifying valid flag when fetching elements for tile_scatter_gather Add constexpr for operator[] of TrueGenerator * Use different path when valid is enabled [ROCm/composable_kernel commit: `b34c234f51`]	2025-06-16 17:17:03 +08:00
carlushuang	d68fdea428	hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem (#2348 ) [ROCm/composable_kernel commit: `fb97f75099`]	2025-06-15 22:49:04 -07:00
Mateusz Ozga	044a8560f7	[CK_TILE] Multiple-D GEMM example (#2219 ) * Multiple d, initial commit * Check Ds Layout * Readme and clang format * Update branch & conflicts * Multiple D - fix clang-formatter * Rename elemetwise_op * Fix CI * Code review part1 * Remove printf * Remove unnecessary comment * Add new tests with Col layout * Review part 2 * Added support for Multiple D GEMM * Update comment * Remove maybe_unused * Clang-format * Review part 3 * Add comment to function * Add comment to function: another * Take number of params for a refrence function * Remove additional d param for 0 tensor * Change name of function * Fix CI fails [ROCm/composable_kernel commit: `bd96ac9742`]	2025-06-13 19:39:11 +02:00
kylasa	10498656ef	Code drop for 2 warp ping pong scheduler along K dimension. (#2276 ) * Code drop for 2 warp ping pong scheduler along K dimension. * Addressing code review comments. * Addressing Clang formatting issues. * Addressing build issues. * Addressing build issues of other GEMM pipelines with ping pong scheduler code drop. * Fix for LDS memory size for GEMM pipelines. * Addressing code review feedback comments. * Change log update. * Addressing code review comments and build issues. * Added new policy for pipeline specific logic about LDS needs. * Clang Fix during build. [ROCm/composable_kernel commit: `5f1ad09b61`]	2025-06-12 18:24:02 -07:00
Thomas Ning	592ef9daaf	OCP FP8 Macro restructure (#2331 ) * solved the problem [ROCm/composable_kernel commit: `f59b8c7d3d`]	2025-06-12 09:46:33 -07:00
carlushuang	5d7302c240	[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel (#2327 ) * refactor subtoken logic to let more kernel pickup mp kernel * typo [ROCm/composable_kernel commit: `8aff45a8af`]	2025-06-12 11:44:22 +08:00
Thomas Ning	2350191009	Epilogue cshuffle Improvement (#2312 ) * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * add cshuffle's mxdlperwavepershuffle support, not finished * add epilogue functions * update cshuffle logic * update cshuffle_logics * add some change within review * update some codes following the code review * update epilogue logic * remove from problem * update codes following review. * fix some issues * solve the previous PR error, refine the code * Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Comment addressed * handling tile_engine failing case * handling tile_engine failing case --------- Co-authored-by: joyeamd <John.Ye@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: khushbu agarwal <khuagarw@amd.com> [ROCm/composable_kernel commit: `06e0b8436c`]	2025-06-10 22:44:50 -07:00

289 Commits