composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-07 00:04:37 +00:00

Author	SHA1	Message	Date
aska-0096	7bdf6a7eef	merge develop and solve conflicts	2025-08-22 03:15:51 +00:00
aska-0096	f21e916a8c	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue	2025-08-22 03:15:21 +00:00
Po Yen Chen	0db21053e6	[CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values (#2669 ) * Allow return VGPR get_warp_id() value * Avoid using SALU in async_load_raw()	2025-08-22 10:17:05 +08:00
Po Yen Chen	4a7ecce096	[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549 ) * Support async load dwordx4 * Enlarge load size on gfx950	2025-08-22 10:13:47 +08:00
Yi DING	4cfa2c7158	[CK_TILE] FMHA BWD Fix Compilation with Bias (#2682 ) * [CK_TILE] FMHA BWD Fix Compilation with Bias * Fix appendkv kApplyRoPE	2025-08-22 10:01:10 +08:00
Bartłomiej Kocot	4212bbc170	[CK Tile] Grouped convolution backward data (#2652 ) * base working version for single groupped conv bwd data * Fix 2d descriptor * fix groups * Add 3d support * fixes * fixes * fixes --------- Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>	2025-08-20 05:29:57 -07:00
Emily Martins	071165919f	[CK Tile] Stream K GEMM Kernel HostArgs and Kernel Classes (#2681 ) * CK Tile Stream K Device Ops Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The StreamKKernel class injects Universal Gemm and includes functions to facilitate kernel preparation for the GPU. * Stream K Device Ops Fixes - Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to ensure we get size needed for accumulation buffers and semaphores. - Pass in num_sk_blocks into TilePartitioner constructor - Update documentation * Add WarpTile dimensions to GetName function in StreamKKernel class * Fix typos in StreamKHostArgs class description. Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Apply clang format on updated comment for StreamKHostArgs * Explicitly specify type for StreamKReductionStrategy enum * Remove unecessary scopes * Unify the commenting style to inline comments * Add explicit casts for occupancy and num_cu in MakeKernelArgs function Both the static functions Occupancy and NumCU in the StreamKKernel class use functions from the HIP API that result in the returned occupancy and num_cu types being type int. The TilePartitioner interface for stream K will have occupancy and num_cu being type ck_tile::index_t which is int32_t. Thus, to be safe, this change ensures that both occupancy and num_cu are cast to int32_t. * Fix use of kentry due to interface update PR #2594 updated the interface for the kentry function in include/ck_tile/host/kernel_launch.hpp. As a result, the static function Occupancy was updated to work correctly with the new interface. PR #2594 also changed UniversalGemmKernel's KernelBlockSize static variable to kBlockSize, so the StreamKKernel class was updated to reflect this change. * Switch type of num_sk_blocks from uint32_t to int32_t This change switches the type of num_sk_blocks to type ck_tile::index_t which is int32_t. This was done because parallel work for the CK Tile StreamK TilePartitioner's constructor will have num_sk_blocks as ck_tile::index_t. Thus, this change will help unify the interfaces to avoid any type conversion errors. --------- Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2025-08-19 15:08:52 -06:00
joyeamd	a1589a9667	fix grouped gemm example when wave32 enabled (#2707 ) 1, delete some unused variables 2, fix BlockSize when wave32 enabled	2025-08-19 16:20:43 +08:00
Ding, Yi	a7dd2a7d13	update bwd	2025-08-19 02:17:43 +00:00
Kevin Choi	380aa8f311	add restrict to applicable functions	2025-08-18 19:37:40 +00:00
linqunAMD	9fcc1ee9fd	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-18 10:08:31 -07:00
Tianyuan Wu	7310830d14	Fix CI build error (#2695 ) Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>	2025-08-18 01:45:40 -07:00
Kevin Choi	75c4b9372f	remove noinline attr as it causes a lot more s_waitcnt's	2025-08-16 08:15:23 +00:00
Tianyuan Wu	68134b60e4	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit `07a79e797d`. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit `ceee918007`. * Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit `7a7241085e`. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-08-15 16:22:27 -07:00
Kevin Choi	598e3fec41	remove innerloop, move restrict parameters to mainloop and add noinline attribute.	2025-08-14 12:11:17 +00:00
Yashvardhan Agarwal	7f14772406	CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) (#2632 ) * CK_TILE: Implement two-stage split-K GEMM with reduction - Added split-K GEMM with reduction example * comment resolutions	2025-08-14 10:18:52 +02:00
Gino Lu	e5623d3825	fix wrong nan producion. (#2640 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-08-14 15:12:31 +08:00
Kevin Choi	3340408537	Create inner lambda with restrict parameters, add restrict to some parameters	2025-08-14 07:06:51 +00:00
aska-0096	3bc45ecbc7	save for debug	2025-08-14 03:43:54 +00:00
Aviral Goel	8a698c7445	Minor Improvements in CK TILE memory copy EXAMPLE (#2678 ) * Rename vector to ThreadTile * more notes on tile encoding * remove number<> from tuple of make_tile_window * add script to stress test the copy example	2025-08-13 15:24:16 -07:00
joyeamd	bcc38deff7	[CK_TILE]fix elementwise example in gfx11/12 (#2676 ) * fix elementwise examples * improve the robust * fix ck_tile's elementwise test * update elementwise test	2025-08-13 15:21:46 -07:00
SamiAario-AMD	28a97865f5	Cleanups (#2631 ) * Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp * Simplify two templated operator calls by having the templated types deduced automatically * Simplify two GemmPipeline calls * Fix GemmPipelineAgBgCrCompV4::GetName * Refactor use of ArgParser in CK tile GEMM examples * Update args in README.md to match the implementation in create_args * Remove some unnecessary include statements * Rename two variables * Factor out common code * Factor out do_verify * Add and use type aliases for memory operation integral constants * In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-08-13 10:12:08 +02:00
Haocong WANG	05a6e92705	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-08-13 14:57:43 +08:00
Cong Ma	452791a3ba	Preshuffle AQ matrix in block scale gemm (#2624 ) * Preshuffle AQ matrix in block scale gemm * turns the output to fp16. Increase the repetition time. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-08-12 21:32:51 -07:00
aska-0096	108abf00e0	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-13 02:14:26 +00:00
Thrupti Raj Lakshmana Gowda	3f57ec3d2d	GEMM Multi D for CK Tile Engine (#2660 ) * Readme for GEMM Multi D * GEMM Multi D partial Progress * GEMM Multi D partial Progress! * CK Tile Engine GEMM Multi D : All Python files generated * Partial Progress * Partial Progress * Partial Progress * Partial Progress : Incorrect Result * Partial Progress : Debugging * Partial Progress : Correct Results * Partial Progress - Incorrect Results * Partial Progress - Commenting Passthrough bypass logic * Changing Passthrough to MultiplyMultiply * Correct Results! * Fix and debug the pass through feature * Sample commit * Correct Results : MultiplyMultiply * Code Cleanup * Removing Failed Instances * Working code before Unary element support * Custom Elementwise Function support and working implementation for Mul and Add * Updating README * Working for Passthrough * Review Comments : Minor Fixes * Review Comments : Minor Fixes * Readme Updated * Partial Changes after Rebase * Working Code : Changes after Rebase * Updating Jenkins file * Removing default value changed while testing * Configuration changes in config files * Tile Handler changes in GEMM Multi D Tile Engine * Tile Handler changes in GEMM Multi D Example * Change log for Gemm Multi D in CK Tile Engine * Configuration changes in config files --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-12 16:05:05 -07:00
joyeamd	0856b3f4a2	[CK_TILE]fix ck_tile's moe_sorting example in gfx11 (#2667 ) * fix ck_tile's moe_sorting example in gfx11 * fix clang format --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-08-12 12:33:56 -07:00
aska-0096	0810799e25	refactor blockgemm change, isolate to v2;	2025-08-12 14:25:50 +00:00
asleepzzz	5b39de4bb6	Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 ) This reverts commit `b7322a521a`.	2025-08-12 20:27:10 +08:00
Haocong WANG	b7322a521a	Optimize fmha fwd decode & prefill for gfx950 (#2641 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-08-12 19:43:14 +08:00
aska-0096	75f6f6bac4	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-12 09:04:41 +00:00
Yi DING	8e1eb0c1ee	[CK_TILE] FMHA BWD Decode Pipeline (#2643 ) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2	2025-08-12 17:02:52 +08:00
aska-0096	bcc05eee62	Fix the bug	2025-08-12 08:46:06 +00:00
aska-0096	96d24497f5	fix conflict. disable all v-col instance for fmha fwd	2025-08-12 04:02:41 +00:00
aska-0096	1716171be4	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-12 03:52:34 +00:00
Yi DING	4fde1646e5	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628 ) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp	2025-08-12 11:11:55 +08:00
aska-0096	1c98007901	clang format	2025-08-12 01:53:31 +00:00
aska-0096	3868ddd708	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-11 15:59:40 +00:00
aska-0096	b86f7786e2	tempsave, update the blocksync functions	2025-08-11 14:21:09 +00:00
Yashvardhan Agarwal	191c62967b	Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 ) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format	2025-08-11 15:01:33 +02:00
aska-0096	7b8052d7ca	fix bug in pki4	2025-08-10 06:00:51 +00:00
aska-0096	76cbbb84a2	fix bugs in gemm	2025-08-09 03:25:12 +00:00
aska-0096	efb8549279	fix bug	2025-08-08 17:53:19 +00:00
aska-0096	729e8785fb	fix bugs	2025-08-08 15:42:15 +00:00
aska-0096	250dc13c75	fix clangformat with 18.1.3	2025-08-08 09:31:01 +00:00
aska-0096	78edd7303b	bug fix, clang format;	2025-08-08 09:04:02 +00:00
aska-0096	3b9fb6af38	Remove unnecessary changes	2025-08-08 08:08:03 +00:00
aska-0096	6bb57c2c57	Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa	2025-08-08 07:50:12 +00:00
aska-0096	1ecee378d5	remove unnecessary files; rename some files	2025-08-08 06:19:31 +00:00
aska-0096	b4640a9de6	merge fa_decode pipeline into fmha_fwd api	2025-08-08 05:46:18 +00:00

1 2 3 4 5 ...

395 Commits