composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 04:07:56 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	cfe5e448db	Fix splitk autodeduce for grouped conv bwd weight (#2742 )	2025-08-27 12:35:42 +02:00
Cong Ma	245467f359	[CK TILE] Fix bugs in AQuant preshuffle (#2700 ) * [CK TILE] Fix bugs in AQuant preshuffle - Make Aquant works with block Mx64x256. `M` could be 16, 32, 64 - Make Aquant works with warp 16x16x32 and 32x32x16. * [CK TILE] Rename Preshuffle to PreshuffleQuant The new name, PreshuffleQuant, explicitly states the function's purpose: to preshuffle the quantization matrix. * [CK TILE Block Scale] Use GemmConfig to save tile properties - Remove specialization of GemmQuantTypeConfig - Pass GemmConfig around which contains tile properties. Stop using hard coded tile properties in `gemm_calc_aquant()` * [CK TILE Block Scale] Rename GemmConfig used in block scale - Remove unused GemmConfig - Rename GemmConfig used in block scale --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-08-27 00:05:54 -07:00
linqunAMD	95e4a4efcb	Fix merge mfma_wmma (part 1) regression (#2749 ) root cause: a typo in GetGfx11InputBlkIdx, const ia added by mistake.	2025-08-26 22:49:34 -07:00
John Afaganis	508e7912f9	Revert "[CK-TILE] Default epilogue, adding support for D (#2629 )" (#2746 ) This reverts commit `d43228fbca`.	2025-08-26 09:48:49 -07:00
Mateusz Ozga	d43228fbca	[CK-TILE] Default epilogue, adding support for D (#2629 ) * Extend 2d-epilogue, D support * Added tests & update * Remove unused attribute * Extend tests --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-25 19:29:35 -07:00
Yi DING	de61e55493	[CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715 ) * FMHA avoid unnecessary vmcnt0 Squashed commit of the following: commit `7bdf6a7eef` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:51 2025 +0000 merge develop and solve conflicts commit `f21e916a8c` Merge: `a7dd2a7d1` `0db21053e` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 22 03:15:21 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into vmcnt0issue commit `a7dd2a7d13` Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 02:17:43 2025 +0000 update bwd commit `380aa8f311` Author: Kevin Choi <kevin.choi@amd.com> Date: Mon Aug 18 19:36:38 2025 +0000 add restrict to applicable functions commit `b85daba2a3` Author: Ding, Yi <yi.ding@amd.com> Date: Mon Aug 18 02:07:03 2025 +0000 bwd filter commit `75c4b9372f` Author: Kevin Choi <kevin.choi@amd.com> Date: Sat Aug 16 08:15:23 2025 +0000 remove noinline attr as it causes a lot more s_waitcnt's commit `598e3fec41` Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 12:11:17 2025 +0000 remove innerloop, move restrict parameters to mainloop and add noinline attribute. commit `3340408537` Author: Kevin Choi <kevin.choi@amd.com> Date: Thu Aug 14 07:06:51 2025 +0000 Create inner lambda with restrict parameters, add restrict to some parameters commit `3bc45ecbc7` Author: aska-0096 <haocwang@amd.com> Date: Thu Aug 14 03:43:54 2025 +0000 save for debug commit `de4db6c4c5` Merge: `108abf00e` `68694cb78` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:15:22 2025 +0000 Merge branch 'wip-async-tr-fa' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `108abf00e0` Merge: `0810799e2` `0f42a92fc` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 13 02:14:26 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `68694cb781` Merge: `0810799e2` `20288caa2` Author: asleepzzz <hanwen.chang@amd.com> Date: Wed Aug 13 00:34:11 2025 +0800 Merge branch 'develop' into wip-async-tr-fa commit `0810799e25` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 14:25:50 2025 +0000 refactor blockgemm change, isolate to v2; commit `fd1eb323af` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:26:13 2025 +0000 clang format commit `75f6f6bac4` Merge: `bcc05eee6` `8e1eb0c1e` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 09:04:41 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `bcc05eee62` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 08:46:06 2025 +0000 Fix the bug commit `96d24497f5` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 04:02:41 2025 +0000 fix conflict. disable all v-col instance for fmha fwd commit `1716171be4` Merge: `1c9800790` `4fde1646e` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 03:52:34 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `1c98007901` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:53:31 2025 +0000 clang format commit `f43e903b1d` Merge: `3868ddd70` `a7badc6ec` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 12 01:52:52 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `3868ddd708` Merge: `498d234ab` `191c62967` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:59:40 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `498d234ab8` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 15:37:37 2025 +0000 change the warp setting for hdim32 fmha fwd commit `b86f7786e2` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 11 14:21:09 2025 +0000 tempsave, update the blocksync functions commit `7b8052d7ca` Author: aska-0096 <haocwang@amd.com> Date: Sun Aug 10 06:00:51 2025 +0000 fix bug in pki4 commit `76cbbb84a2` Author: aska-0096 <haocwang@amd.com> Date: Sat Aug 9 03:25:12 2025 +0000 fix bugs in gemm commit `8c101ccb88` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 18:35:53 2025 +0000 fix bug on non-gfx950 commit `efb8549279` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 17:53:19 2025 +0000 fix bug commit `729e8785fb` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 15:42:15 2025 +0000 fix bugs commit `250dc13c75` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:31:01 2025 +0000 fix clangformat with 18.1.3 commit `106edeecd9` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:07:40 2025 +0000 remove non-necessary change commit `78edd7303b` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 09:04:02 2025 +0000 bug fix, clang format; commit `3b9fb6af38` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 08:08:03 2025 +0000 Remove unnecessary changes commit `6bb57c2c57` Merge: `1ecee378d` `ab2602683` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 07:50:12 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `1ecee378d5` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 06:19:31 2025 +0000 remove unnecessary files; rename some files commit `b4640a9de6` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 8 05:46:18 2025 +0000 merge fa_decode pipeline into fmha_fwd api commit `fe63a646a4` Author: aska-0096 <haocwang@amd.com> Date: Wed Aug 6 05:58:43 2025 +0000 add __restrict__ to tr load commit `414cad667b` Author: aska-0096 <haocwang@amd.com> Date: Tue Aug 5 07:23:51 2025 +0000 Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug commit `0d12fc944f` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:27:42 2025 +0000 Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA commit `4f31847de1` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 10:02:17 2025 +0000 add vmcnt guard before load ktile commit `746f4ccb99` Author: aska-0096 <haocwang@amd.com> Date: Mon Aug 4 06:49:01 2025 +0000 Load Q through lds, implement xor; commit `2d4e73d2b4` Author: aska-0096 <haocwang@amd.com> Date: Fri Aug 1 10:44:54 2025 +0000 small refactor commit `a28b6e67fe` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 10:25:37 2025 +0000 upgrade prefill pipeline; simple iglp; consistent data produce and consume order commit `75cba48682` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 31 05:13:27 2025 +0000 enable larger tile size; upgrade xor pattern commit `69890afc98` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 12:25:33 2025 +0000 remove all lds bankconflict with xor layouts commit `8dacc35c4c` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 30 03:51:06 2025 +0000 enable prefill overload operator(). commit `13bcc913de` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 25 07:10:01 2025 +0000 fix the lds alignment caused performance regression commit `af28123cec` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 23 09:05:57 2025 +0000 remove unnecessary features commit `14e0ab70c6` Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 22 08:04:05 2025 +0000 tempsave. asynccopy+trload sanity checked commit `1b468bac0b` Author: aska-0096 <haocwang@amd.com> Date: Mon Jul 21 05:55:55 2025 +0000 tempsave, trload+asyncload done commit `afd96d8180` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 10:04:34 2025 +0000 compile pass commit `5616551115` Merge: `ae39c84f5` `095393276` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:17:27 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa commit `ae39c84f55` Author: aska-0096 <haocwang@amd.com> Date: Fri Jul 18 05:16:39 2025 +0000 tempsave commit `94b6430489` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 10:06:09 2025 +0000 temp save commit `7e330553dc` Merge: `18669925c` `804f77dce` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 07:24:32 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline commit `804f77dce5` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 03:10:46 2025 +0000 move test_copy into test commit `21627d7ca7` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:41:31 2025 +0000 remove unnecessary output commit `287792c44a` Merge: `a4221db30` `21fd7e953` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:13 2025 +0000 Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix commit `a4221db304` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 17 02:26:10 2025 +0000 add input validation and bug fix commit `21fd7e9538` Merge: `d6df7bf85` `6e76b8205` Author: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Date: Wed Jul 16 11:23:57 2025 -0700 Merge branch 'develop' into test_copy_fix commit `d6df7bf851` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:55:50 2025 +0000 fix vmcnt shift commit `40e039e4e4` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 08:37:07 2025 +0000 Improve s_waitcnt_imm calculation commit `c30f8b709b` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 05:39:50 2025 +0000 fix the s_waitcnt_imm calculation commit `ec0a45b29f` Merge: `e5cc4af80` `6b09f0823` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:57:57 2025 +0000 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix commit `e5cc4af808` Author: aska-0096 <haocwang@amd.com> Date: Wed Jul 16 03:54:33 2025 +0000 Add block_sync_lds_direct_load utility commit `eea58629cf` Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 15 09:39:03 2025 +0000 fix async copytest bug commit `18669925cc` Author: aska-0096 <haocwang@amd.com> Date: Thu Jul 10 04:29:33 2025 +0000 temp save, change all instance to 1wave commit `18686cfe5b` Author: aska-0096 <haocwang@amd.com> Date: Tue Jul 8 08:37:20 2025 +0000 tempsave, fmha_decode commit `47565f21a5` Author: aska-0096 <haocwang@amd.com> Date: Sat Jun 21 15:02:57 2025 +0000 temp save, waiting for debug commit `e0a634ef97` Author: aska-0096 <haocwang@amd.com> Date: Thu Jun 19 05:11:52 2025 +0000 save an example for __bf16 type commit `4bd5fd4a3c` Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 07:27:24 2025 +0000 fix bwd code commit `69809d9513` Author: aska-0096 <haocwang@amd.com> Date: Wed Jun 18 06:37:16 2025 +0000 Fix for fwd/bwd kernel build filter commit d5ec3d0e5768aafed7f77151b2a835e87b9f95ba Author: Ding, Yi <yi.ding@amd.com> Date: Tue Aug 19 08:13:18 2025 +0000 Add restrict to avoid unnecessary vmcnt --------- Co-authored-by: aska-0096 <haocwang@amd.com> * Add comments for c-stype cast * Better comments --------- Co-authored-by: aska-0096 <haocwang@amd.com>	2025-08-25 20:55:12 +08:00
John Shumway	c71d7ddd74	Remove unsupported use of c++20 concept. (#2719 ) Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.	2025-08-24 21:29:23 -07:00
linqunAMD	d6e49c5fde	Extend XDL kernel to Support RDNA3/4 - Part 1 (#2606 )	2025-08-22 17:46:30 -04:00
Po Yen Chen	0db21053e6	[CK_TILE] Allow switching between SGPR/VGPR get_warp_id() return values (#2669 ) * Allow return VGPR get_warp_id() value * Avoid using SALU in async_load_raw()	2025-08-22 10:17:05 +08:00
Po Yen Chen	4a7ecce096	[CK_TILE][FMHA] Enable dwordx4 loading in async_load_tile_raw() (#2549 ) * Support async load dwordx4 * Enlarge load size on gfx950	2025-08-22 10:13:47 +08:00
Yi DING	4cfa2c7158	[CK_TILE] FMHA BWD Fix Compilation with Bias (#2682 ) * [CK_TILE] FMHA BWD Fix Compilation with Bias * Fix appendkv kApplyRoPE	2025-08-22 10:01:10 +08:00
Bartłomiej Kocot	4212bbc170	[CK Tile] Grouped convolution backward data (#2652 ) * base working version for single groupped conv bwd data * Fix 2d descriptor * fix groups * Add 3d support * fixes * fixes * fixes --------- Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>	2025-08-20 05:29:57 -07:00
Emily Martins	071165919f	[CK Tile] Stream K GEMM Kernel HostArgs and Kernel Classes (#2681 ) * CK Tile Stream K Device Ops Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The StreamKKernel class injects Universal Gemm and includes functions to facilitate kernel preparation for the GPU. * Stream K Device Ops Fixes - Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to ensure we get size needed for accumulation buffers and semaphores. - Pass in num_sk_blocks into TilePartitioner constructor - Update documentation * Add WarpTile dimensions to GetName function in StreamKKernel class * Fix typos in StreamKHostArgs class description. Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Apply clang format on updated comment for StreamKHostArgs * Explicitly specify type for StreamKReductionStrategy enum * Remove unecessary scopes * Unify the commenting style to inline comments * Add explicit casts for occupancy and num_cu in MakeKernelArgs function Both the static functions Occupancy and NumCU in the StreamKKernel class use functions from the HIP API that result in the returned occupancy and num_cu types being type int. The TilePartitioner interface for stream K will have occupancy and num_cu being type ck_tile::index_t which is int32_t. Thus, to be safe, this change ensures that both occupancy and num_cu are cast to int32_t. * Fix use of kentry due to interface update PR #2594 updated the interface for the kentry function in include/ck_tile/host/kernel_launch.hpp. As a result, the static function Occupancy was updated to work correctly with the new interface. PR #2594 also changed UniversalGemmKernel's KernelBlockSize static variable to kBlockSize, so the StreamKKernel class was updated to reflect this change. * Switch type of num_sk_blocks from uint32_t to int32_t This change switches the type of num_sk_blocks to type ck_tile::index_t which is int32_t. This was done because parallel work for the CK Tile StreamK TilePartitioner's constructor will have num_sk_blocks as ck_tile::index_t. Thus, this change will help unify the interfaces to avoid any type conversion errors. --------- Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2025-08-19 15:08:52 -06:00
jefyang1	6ba9289b26	Fix pk i4 v3 example test regression on gfx942 (#2706 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-08-19 09:58:28 -07:00
joyeamd	a1589a9667	fix grouped gemm example when wave32 enabled (#2707 ) 1, delete some unused variables 2, fix BlockSize when wave32 enabled	2025-08-19 16:20:43 +08:00
mirchen-amd	60320e90c1	Mirchen/gemm blockscale wp segfault fix (#2638 ) * Add stride validation to prevent segfault in blockscale GEMM * run clang-format * Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * added stride length checking to more gemm examples in ckprofiler * ran clang format * added validation header and implement in core gemm operations * remove ck_tile transpose and gemm stages from CI (#2646) * update CK build instruction step 4 (#2563) Co-authored-by: Aviral Goel <aviral.goel@amd.com> * Fixes to "General 2D Reduction Kernel" (#2535) (#2656) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format * enable aiter test_mha in daily CI (#2659) * feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * [CK_TILE] FMHA BWD Optimization For GFX950 (#2628) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp * Fix num_byte calculations to use nhead_k for K & V size (#2653) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ``` * [CK_TILE] FMHA BWD Decode Pipeline (#2643) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 * fix (#2668) * Optimize fmha fwd decode & prefill for gfx950 (#2641) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> * Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670) This reverts commit `b7322a521a`. * added batch stride checking to batched gemm ops in profiler * removed batch stride validation * removed batched stride validation again * Update include/ck/library/utility/profiler_validation_common.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * refactor function names * added gemm stride checking to more profiler gemm operations * run clang format * add stride checkign to 01 gemm example * rename from profiler to validation common, used for examples and profiler * build of ckProfiler success * update file headers --------- Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> Co-authored-by: Cameron Shinn <camerontshinn@gmail.com> Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com> Co-authored-by: Haocong WANG <haocwang@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-08-19 01:19:17 -07:00
linqunAMD	9fcc1ee9fd	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-18 10:08:31 -07:00
Sami Remes	26d3300930	Add other layouts for FP8 block scaled gemm (#2665 ) * Start adding other layouts for gemm_ab_scale * Add some instances * Create tensor descriptors for A/B scales depending on A/B layout * Fix formatting * Revert some comments * Revert commented instances in CMakeLists.txt * Add some more instances for col-row gemm * enable more row,row instances * Use occupancy=1 for col,row layout to avoid spills	2025-08-18 01:46:10 -07:00
Tianyuan Wu	7310830d14	Fix CI build error (#2695 ) Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>	2025-08-18 01:45:40 -07:00
Tianyuan Wu	68134b60e4	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit `07a79e797d`. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit `ceee918007`. * Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit `7a7241085e`. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-08-15 16:22:27 -07:00
jefyang1	d7c95dd491	Add gemm universal f8 f8 bf16 instances on gfx950 (#2662 )	2025-08-14 13:25:24 -07:00
Yashvardhan Agarwal	7f14772406	CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) (#2632 ) * CK_TILE: Implement two-stage split-K GEMM with reduction - Added split-K GEMM with reduction example * comment resolutions	2025-08-14 10:18:52 +02:00
Gino Lu	e5623d3825	fix wrong nan producion. (#2640 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-08-14 15:12:31 +08:00
Aviral Goel	8a698c7445	Minor Improvements in CK TILE memory copy EXAMPLE (#2678 ) * Rename vector to ThreadTile * more notes on tile encoding * remove number<> from tuple of make_tile_window * add script to stress test the copy example	2025-08-13 15:24:16 -07:00
joyeamd	bcc38deff7	[CK_TILE]fix elementwise example in gfx11/12 (#2676 ) * fix elementwise examples * improve the robust * fix ck_tile's elementwise test * update elementwise test	2025-08-13 15:21:46 -07:00
Enrico Degregori	a6f4029276	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2675 ) Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-14 00:21:09 +02:00
SamiAario-AMD	28a97865f5	Cleanups (#2631 ) * Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp * Simplify two templated operator calls by having the templated types deduced automatically * Simplify two GemmPipeline calls * Fix GemmPipelineAgBgCrCompV4::GetName * Refactor use of ArgParser in CK tile GEMM examples * Update args in README.md to match the implementation in create_args * Remove some unnecessary include statements * Rename two variables * Factor out common code * Factor out do_verify * Add and use type aliases for memory operation integral constants * In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-08-13 10:12:08 +02:00
Haocong WANG	05a6e92705	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-08-13 14:57:43 +08:00
Cong Ma	452791a3ba	Preshuffle AQ matrix in block scale gemm (#2624 ) * Preshuffle AQ matrix in block scale gemm * turns the output to fp16. Increase the repetition time. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-08-12 21:32:51 -07:00
Thrupti Raj Lakshmana Gowda	3f57ec3d2d	GEMM Multi D for CK Tile Engine (#2660 ) * Readme for GEMM Multi D * GEMM Multi D partial Progress * GEMM Multi D partial Progress! * CK Tile Engine GEMM Multi D : All Python files generated * Partial Progress * Partial Progress * Partial Progress * Partial Progress : Incorrect Result * Partial Progress : Debugging * Partial Progress : Correct Results * Partial Progress - Incorrect Results * Partial Progress - Commenting Passthrough bypass logic * Changing Passthrough to MultiplyMultiply * Correct Results! * Fix and debug the pass through feature * Sample commit * Correct Results : MultiplyMultiply * Code Cleanup * Removing Failed Instances * Working code before Unary element support * Custom Elementwise Function support and working implementation for Mul and Add * Updating README * Working for Passthrough * Review Comments : Minor Fixes * Review Comments : Minor Fixes * Readme Updated * Partial Changes after Rebase * Working Code : Changes after Rebase * Updating Jenkins file * Removing default value changed while testing * Configuration changes in config files * Tile Handler changes in GEMM Multi D Tile Engine * Tile Handler changes in GEMM Multi D Example * Change log for Gemm Multi D in CK Tile Engine * Configuration changes in config files --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-12 16:05:05 -07:00
joyeamd	0856b3f4a2	[CK_TILE]fix ck_tile's moe_sorting example in gfx11 (#2667 ) * fix ck_tile's moe_sorting example in gfx11 * fix clang format --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2025-08-12 12:33:56 -07:00
asleepzzz	5b39de4bb6	Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 ) This reverts commit `b7322a521a`.	2025-08-12 20:27:10 +08:00
Haocong WANG	b7322a521a	Optimize fmha fwd decode & prefill for gfx950 (#2641 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2025-08-12 19:43:14 +08:00
Yi DING	8e1eb0c1ee	[CK_TILE] FMHA BWD Decode Pipeline (#2643 ) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2	2025-08-12 17:02:52 +08:00
Yi DING	4fde1646e5	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628 ) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp	2025-08-12 11:11:55 +08:00
Yashvardhan Agarwal	191c62967b	Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 ) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format	2025-08-11 15:01:33 +02:00
Max Podkorytov	ab26026835	[CK-tile] add more tests for batched transpose testing the rectangular block tile sizes (#2634 ) * add failing tests * swap out and reference * add constraint assert to transpose input distribution * test both pipelines with rectangular block tile * print mismatched indices * add a smaller failing test for old pipeline * print grid and block * fill output before operating on it * swap m/n tile sizes and make one test pass * add device syncs * add one more flipped test case * flip block tile at host arg init * fix tiles for lds pipeline * clang-format * rename tests * roll back error check * remove device syncs * reduce large test case's size	2025-08-07 16:51:53 -07:00
Gino Lu	5d6d236b25	Add e8m0 scaled convert into CK_TILE (#2617 ) * first commit * remove redundent code * modify according to comments. * fix type_convert error with scaled_type_convert	2025-08-07 21:37:28 +08:00
Yi DING	b0a97498b0	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad	2025-08-07 21:24:43 +08:00
Sami Remes	ffdee5e774	[CK_TILE] Enable printing more structures in CK-Tile (#2443 ) * Add more printing to core cktile * Revert other changes in static encoding pattern * Refactor to using a free print() function * Remove loops and print just the containers * Print tuple with better formatting, fix sequence compilation * Add some tests for print utility * Add print utility header * Print for static_encoding_pattern * add buffer_view printing * Align vector_traits * Fix formatting * Lower-case enum strings Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Remove empty comment lines * Fix test with lower-case too * Reduce repeated code in print tests, move helper function closer to type definition, test X&Y * Add test_print_common.hpp * add print.hpp in core.hpp --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-08-07 15:45:27 +03:00
Enrico Degregori	21e9983913	Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 )" (#2637 ) This reverts commit `2203b0ddfe`. Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-07 12:30:08 +02:00
Bartłomiej Kocot	5328b232b2	Grouped Convolution Forward Infer Bias Bnorm Activ (#2621 ) * Grouped Convolution Forward Infer Bias Bnorm Activ * 3d	2025-08-07 08:36:47 +02:00
Yashvardhan Agarwal	4750b293fe	General 2D Reduction Kernel (#2535 ) * General 2D Reduction Kernel * Move the reduction kernel from the example * Split the code and add the necessary policy, problem, shape files as per ck_tile convention * Add/modify the headers * Modified the example to work with the 'new' kernel * Added tests for the kernel * N-D refernce reduce * Added support for N-D input with transform to 2D * Added padding to support various input sized tensors * Bug fix in the thread buffer constructor * Some comments to explain the reduce2d block kernel * comments resolution * clang-format * comments resolution * clang-format * clang-format * comments resolution * clang-format	2025-08-06 15:36:59 +02:00
Adam Osewski	2622ff06cb	Remove unused lds direct load instruction. (#2573 ) This functionality is replaced by amd_async_buffer_load Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2025-08-06 15:16:12 +02:00
Enrico Degregori	2203b0ddfe	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 ) * Add padding 1x1Stride1Pad0 conv specialization * Add gridwise checks for conv cshufflev3 * Merge padding with previous transforms * Apply transform changes for padding to default specialization as well --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2025-08-05 15:23:19 +02:00
Thomas Ning	cbfecf8d7a	Persistent grouped gemm CompV4 Enablement & Polish (#2605 ) * enable the persistent kernel for CompV4 * polish the example and clang format * fix the non-persistent kernel error --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-04 23:43:01 -07:00
Jinchao Xu	15eb493152	Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611 ) Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances by splitting debug information into separate .dwo files. Add explicit cast to avoid signed/unsigned comparison warning.	2025-08-04 11:26:08 -07:00
Illia Silin	788e8a878e	update the switch condition for buffer built-ins (#2602 )	2025-08-01 14:30:07 -07:00
Thomas Ning	7c44a763fa	Fix the GFX 950 Universal GEMM (#2597 ) * solve the gfx950 error * clang format * fix a typo error --------- Co-authored-by: ThomasNing <thomasning@amd.com>	2025-08-01 09:32:24 -07:00
lalala-sh	bb5c478295	fix weight index out of range (#2414 )	2025-08-01 17:50:02 +08:00

... 8 9 10 11 12 ...

1436 Commits