composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 18:17:44 +00:00

Author	SHA1	Message	Date
Cong Ma	1ed315638e	[CK TILE GEMM] Fix building issues (#2772 ) - Add `WarpGemmMfma_f32_16x16x128_[fp8|bf8]_[fp8|bf8]_CTransposed` - Replace `__gfx950__` with `CK_GFX950_SUPPORT` [ROCm/composable_kernel commit: `e1ab460d2d`]	2025-09-02 22:40:18 -07:00
Po Yen Chen	884ecea071	[CK_TILE] Add FAv3 fwd pipeline (#2731 ) * Add FAv3 fwd pipeline * Unpack v_pk_mul to hide v_mov * Avoid compiler moving l compute across phase * Sync sched_group_barrier() setting for masking cases [ROCm/composable_kernel commit: `d876e87fe4`]	2025-09-01 09:16:45 +08:00
Aviral Goel	f71b16afc0	chore(gemm): clang format to pass CI (#2758 ) [ROCm/composable_kernel commit: `fcff0043ae`]	2025-08-29 00:38:46 -07:00
Vijay Krish	bad7262507	ck_tile kernel for gemm with groupwise quantized B tensor. (#2663 ) * This change introduces new pipelines with the Intrawave scheduler and block gemm primitives that load the scale tensor into registers to perform dequantization post MFMA on the C tensor in registers. Scale tensor data, BQ, is spliced across threads in registers and not stored in LDS. Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats: fp8, fp8 -> f32; bf8, bf8 -> f32; fp8, i4 -> f32; bf8, i4 -> f32. Group size can be as small as the K length of the underlying WarpGemm primitive. * Solve merge conflict * [CK TILE] Update CHANGELOG.md --------- Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `4208e28988`]	2025-08-28 23:43:02 -07:00
Cong Ma	16e29b0fe3	Support transposed C tile in Aquant (#2679 ) Enabling transposed C improves Aquant performance: each thread then holds data from only one row, so AQ elements no longer need to be exchanged among lanes. [ROCm/composable_kernel commit: `428090f749`]	2025-08-28 13:28:09 -07:00
asleepzzz	154b02423f	Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 )" (#2757 ) This reverts commit 4eb18aab71ddb0cf7b63fb161424034615e2bdf5. [ROCm/composable_kernel commit: `038ea82315`]	2025-08-28 22:50:42 +08:00
linqunAMD	cd6d731322	[Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728 ) * fix copy basic build error * fix other ck tile test build error [ROCm/composable_kernel commit: `4a49dac7c6`]	2025-08-28 20:30:30 +08:00
Yi DING	8b537fb883	[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741 ) * 16x192 * Use buffer_load_lds for lse/d * Dispatch & cleanup * Avoid zeroing dq & fix * fix [ROCm/composable_kernel commit: `ead4447b20`]	2025-08-28 18:54:18 +08:00
Aviral Goel	5256e754cc	feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691 ) * feat(check_err): add a variable to adjust number of incorrect values to print * feat(host_tensor): add printing capability for fp8 bf8 int8 int4 * fix(gemm_utils): update acceptable data type * fix(host_tensor): print both 4 bit ints in pk_int4_t * refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp * feat(host_tensor): add print first n elements functions [ROCm/composable_kernel commit: `f5f795c4d6`]	2025-08-27 18:17:24 -07:00
Cong Ma	73657622b1	[CK TILE GEMM] Fix a merge conflict (#2753 ) * Fixed a merge conflict in 43e7d549 * Format the code [ROCm/composable_kernel commit: `cd53e2e57e`]	2025-08-27 11:08:09 -07:00
Cong Ma	a6d612d20c	[CK TILE] Fix bugs in AQuant preshuffle (#2700 ) * [CK TILE] Fix bugs in AQuant preshuffle - Make Aquant work with block Mx64x256. `M` could be 16, 32, 64 - Make Aquant work with warp 16x16x32 and 32x32x16. * [CK TILE] Rename Preshuffle to PreshuffleQuant The new name, PreshuffleQuant, explicitly states the function's purpose: to preshuffle the quantization matrix. * [CK TILE Block Scale] Use GemmConfig to save tile properties - Remove specialization of GemmQuantTypeConfig - Pass GemmConfig around which contains tile properties. Stop using hard coded tile properties in `gemm_calc_aquant()` * [CK TILE Block Scale] Rename GemmConfig used in block scale - Remove unused GemmConfig - Rename GemmConfig used in block scale --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `245467f359`]	2025-08-27 00:05:54 -07:00
Tianyuan Wu	802a5e7373	[CK_TILE] Fix the Wrong Output Generated by Gemm Examples on GFX11/12 (#2713 ) * Introduce macro CK_TILE_USE_WMMA Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Make CK_TILE_USE_WMMA global for all examples Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Remove CK_TILE_USE_WMMA from config.hpp Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> [ROCm/composable_kernel commit: `e9605ed36d`]	2025-08-25 12:55:35 -07:00
Yi DING	d732e1e379	[CK_TILE] FMHA BWD Fix Compilation with Bias (#2682 ) * [CK_TILE] FMHA BWD Fix Compilation with Bias * Fix appendkv kApplyRoPE [ROCm/composable_kernel commit: `4cfa2c7158`]	2025-08-22 10:01:10 +08:00
Bartłomiej Kocot	3e8a6dfb9c	[CK Tile] Grouped convolution backward data (#2652 ) * base working version for single grouped conv bwd data * Fix 2d descriptor * fix groups * Add 3d support * fixes * fixes * fixes --------- Co-authored-by: Jakub Piasecki <jakpia21@gmail.com> [ROCm/composable_kernel commit: `4212bbc170`]	2025-08-20 05:29:57 -07:00
Haocong WANG	2f6735906e	[CK_TILE] Update the fmhafwd dispatch logic (#2698 ) * update the fmhafwd dispatch logic * Fix fmha test scripts * Fix bash --------- Co-authored-by: Ding, Yi <yi.ding@amd.com> [ROCm/composable_kernel commit: `81b265cf91`]	2025-08-20 16:24:43 +08:00
joyeamd	188e5dc70b	fix grouped gemm example when wave32 enabled (#2707 ) 1. delete some unused variables 2. fix BlockSize when wave32 is enabled [ROCm/composable_kernel commit: `a1589a9667`]	2025-08-19 16:20:43 +08:00
mirchen-amd	b31811538f	Mirchen/gemm blockscale wp segfault fix (#2638 ) * Add stride validation to prevent segfault in blockscale GEMM * run clang-format * Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * added stride length checking to more gemm examples in ckprofiler * ran clang format * added validation header and implement in core gemm operations * remove ck_tile transpose and gemm stages from CI (#2646) * update CK build instruction step 4 (#2563) Co-authored-by: Aviral Goel <aviral.goel@amd.com> * Fixes to "General 2D Reduction Kernel" (#2535) (#2656) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format * enable aiter test_mha in daily CI (#2659) * feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> 
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * [CK_TILE] FMHA BWD Optimization For GFX950 (#2628) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp * Fix num_byte calculations to use nhead_k for K & V size (#2653) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ``` * [CK_TILE] FMHA BWD Decode Pipeline (#2643) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 * fix (#2668) * Optimize fmha fwd decode & prefill for gfx950 (#2641) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all 
instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> * Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670) This reverts commit 327bf408dd05b4e4bfb7b72f63f8710f35efa9a4. 
* added batch stride checking to batched gemm ops in profiler * removed batch stride validation * removed batched stride validation again * Update include/ck/library/utility/profiler_validation_common.hpp Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> * refactor function names * added gemm stride checking to more profiler gemm operations * run clang format * add stride checking to 01 gemm example * rename from profiler to validation common, used for examples and profiler * build of ckProfiler success * update file headers --------- Co-authored-by: rahjain-amd <Rahul.Jain@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com> Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Co-authored-by: Yi DING <yi.ding@amd.com> Co-authored-by: Cameron Shinn <camerontshinn@gmail.com> Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com> Co-authored-by: Haocong WANG <haocwang@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `60320e90c1`]	2025-08-19 01:19:17 -07:00
Max Podkorytov	123f9b5077	invoke script directly (#2687 ) [ROCm/composable_kernel commit: `f38751fc2a`]	2025-08-19 00:23:07 -07:00
linqunAMD	807f7510b5	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `9fcc1ee9fd`]	2025-08-18 10:08:31 -07:00
Tianyuan Wu	ec7ee5b7b7	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5. 
* Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `68134b60e4`]	2025-08-15 16:22:27 -07:00
Thomas Ning	42d775e488	Preshuffle Decode Prefill config fix (#2693 ) * feat(gemm_wp): add two new configs for wp * delete the unnecessary files * fix the config error * update the config --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> [ROCm/composable_kernel commit: `5ada85ec04`]	2025-08-15 15:49:07 -07:00
Aviral Goel	d04f80c7ea	feat(gemm_wp): add two new configs for gemm weight preshuffle in gemm_utils.h (#2690 ) * feat(gemm_wp): add two new configs for wp * delete the unnecessary files --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `c06e8b4a66`]	2025-08-15 15:00:25 -07:00
Yashvardhan Agarwal	64ef69f7de	CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) (#2632 ) * CK_TILE: Implement two-stage split-K GEMM with reduction - Added split-K GEMM with reduction example * comment resolutions [ROCm/composable_kernel commit: `7f14772406`]	2025-08-14 10:18:52 +02:00
Jimniu	7a59980e9f	Jimniu/tile_example_flatmm_basic fix (#2680 ) * Add stride_b validation * run clang-format [ROCm/composable_kernel commit: `753b6227c5`]	2025-08-13 16:06:08 -07:00
Aviral Goel	00b044e4a0	Minor Improvements in CK TILE memory copy EXAMPLE (#2678 ) * Rename vector to ThreadTile * more notes on tile encoding * remove number<> from tuple of make_tile_window * add script to stress test the copy example [ROCm/composable_kernel commit: `8a698c7445`]	2025-08-13 15:24:16 -07:00
joyeamd	fbefd916f0	[CK_TILE] fix elementwise example in gfx11/12 (#2676 ) * fix elementwise examples * improve robustness * fix ck_tile's elementwise test * update elementwise test [ROCm/composable_kernel commit: `bcc38deff7`]	2025-08-13 15:21:46 -07:00
Haocong WANG	1d13107873	fix for aiter consume (#2677 ) [ROCm/composable_kernel commit: `3142562c22`]	2025-08-13 19:06:22 +08:00
SamiAario-AMD	a2e850acbc	Cleanups (#2631 ) * Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp * Simplify two templated operator calls by having the templated types deduced automatically * Simplify two GemmPipeline calls * Fix GemmPipelineAgBgCrCompV4::GetName * Refactor use of ArgParser in CK tile GEMM examples * Update args in README.md to match the implementation in create_args * Remove some unnecessary include statements * Rename two variables * Factor out common code * Factor out do_verify * Add and use type aliases for memory operation integral constants * In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `28a97865f5`]	2025-08-13 10:12:08 +02:00
Haocong WANG	cfd8e1b303	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `05a6e92705`]	2025-08-13 14:57:43 +08:00
Cong Ma	7a49dd7d7c	Preshuffle AQ matrix in block scale gemm (#2624 ) * Preshuffle AQ matrix in block scale gemm * turns the output to fp16. Increase the repetition time. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `452791a3ba`]	2025-08-12 21:32:51 -07:00
Thomas Ning	651512cbc2	Finish the grouped gemm restructure with fp8 data type (#2655 ) * Finish the grouped gemm restructure with data type * restore gemm_utils.hpp * Update example/ck_tile/17_grouped_gemm/run_grouped_gemm_example.inc Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Comment Addressed --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0f42a92fc1`]	2025-08-12 18:23:34 -07:00
Thrupti Raj Lakshmana Gowda	1d1d6717d2	GEMM Multi D for CK Tile Engine (#2660 ) * Readme for GEMM Multi D * GEMM Multi D partial Progress * GEMM Multi D partial Progress! * CK Tile Engine GEMM Multi D : All Python files generated * Partial Progress * Partial Progress * Partial Progress * Partial Progress : Incorrect Result * Partial Progress : Debugging * Partial Progress : Correct Results * Partial Progress - Incorrect Results * Partial Progress - Commenting Passthrough bypass logic * Changing Passthrough to MultiplyMultiply * Correct Results! * Fix and debug the pass through feature * Sample commit * Correct Results : MultiplyMultiply * Code Cleanup * Removing Failed Instances * Working code before Unary element support * Custom Elementwise Function support and working implementation for Mul and Add * Updating README * Working for Passthrough * Review Comments : Minor Fixes * Review Comments : Minor Fixes * Readme Updated * Partial Changes after Rebase * Working Code : Changes after Rebase * Updating Jenkins file * Removing default value changed while testing * Configuration changes in config files * Tile Handler changes in GEMM Multi D Tile Engine * Tile Handler changes in GEMM Multi D Example * Change log for Gemm Multi D in CK Tile Engine * Configuration changes in config files --------- Co-authored-by: ThomasNing <thomasning@amd.com> [ROCm/composable_kernel commit: `3f57ec3d2d`]	2025-08-12 16:05:05 -07:00
slippedJim	8a12cdd385	remove bad pipeline codegen (#2673 ) [ROCm/composable_kernel commit: `20288caa2f`]	2025-08-13 00:23:40 +08:00
asleepzzz	9161cb590d	Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 ) This reverts commit 327bf408dd05b4e4bfb7b72f63f8710f35efa9a4. [ROCm/composable_kernel commit: `5b39de4bb6`]	2025-08-12 20:27:10 +08:00
Haocong WANG	8a20b06f54	Optimize fmha fwd decode & prefill for gfx950 (#2641 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `b7322a521a`]	2025-08-12 19:43:14 +08:00
Mateusz Ozga	eab0dc96f6	fix (#2668 ) [ROCm/composable_kernel commit: `c0c2ded566`]	2025-08-12 13:02:10 +02:00
Yi DING	0afd7af89c	[CK_TILE] FMHA BWD Decode Pipeline (#2643 ) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 [ROCm/composable_kernel commit: `8e1eb0c1ee`]	2025-08-12 17:02:52 +08:00
Cameron Shinn	b48fe39bf5	Fix num_byte calculations to use nhead_k for K & V size (#2653 ) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ``` [ROCm/composable_kernel commit: `352f87e684`]	2025-08-12 13:44:01 +08:00
Yi DING	ef32abc4cc	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628 ) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp [ROCm/composable_kernel commit: `4fde1646e5`]	2025-08-12 11:11:55 +08:00
Aviral Goel	4d263f468e	feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582 ) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `a7badc6ec5`]	2025-08-11 10:54:37 -07:00
Yi DING	84a17e2f94	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad [ROCm/composable_kernel commit: `b0a97498b0`]	2025-08-07 21:24:43 +08:00
Yashvardhan Agarwal	b6f0e98da6	General 2D Reduction Kernel (#2535 ) * General 2D Reduction Kernel * Move the reduction kernel from the example * Split the code and add the necessary policy, problem, shape files as per ck_tile convention * Add/modify the headers * Modified the example to work with the 'new' kernel * Added tests for the kernel * N-D reference reduce * Added support for N-D input with transform to 2D * Added padding to support various input sized tensors * Bug fix in the thread buffer constructor * Some comments to explain the reduce2d block kernel * comments resolution * clang-format * comments resolution * clang-format * clang-format * comments resolution * clang-format [ROCm/composable_kernel commit: `4750b293fe`]	2025-08-06 15:36:59 +02:00
Yi DING	65e18fa783	[CK_TILE] Fix FMHA qr_async causing errors in FA (#2627 ) [ROCm/composable_kernel commit: `15e8b6ccf7`]	2025-08-06 20:04:23 +08:00
Thomas Ning	f466e04554	delete all slp compilation flag in CK Tile (#2625 ) [ROCm/composable_kernel commit: `07469142cb`]	2025-08-06 00:34:39 -07:00
Thomas Ning	7c9d0d0435	Persistent grouped gemm CompV4 Enablement & Polish (#2605 ) * enable the persistent kernel for CompV4 * polish the example and clang format * fix the non-persistent kernel error --------- Co-authored-by: ThomasNing <thomasning@amd.com> [ROCm/composable_kernel commit: `cbfecf8d7a`]	2025-08-04 23:43:01 -07:00
Aviral Goel	4f9a78fbf0	Integration of a new pipeline for weight preshuffle into gemm examples (#2516 ) * something khushbu can help with * v1 v2 works with flatmm develop * v0 v1 v2 numerical error gone * Fixing numerical error, and interchange preshuffle configs to match with flatmm * Refactor GEMM pipeline configurations and integrate preshuffle support - Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3). - Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations. - Removed obsolete code and comments * clang format * fix vectorloadsize bug * add the Preshuffle3 * update kwarp calculation in gemm utils * update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch * fix: add CK_GFX950_SUPPORT macro for gfx950 detection * default disable rotating buffer * docs(CHANGELOG): update changelog for rocm 7.0 * Revert "docs(CHANGELOG): update changelog for rocm 7.0" This reverts commit `2bc16fff84`. * Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files. 
* revert example/ck_tile/flatmm to its original state * remove comment added by second author * switch to xor ALDSDescriptor * modify the MakeALdsDescriptor() * temporary profiling script * getting rid of line marker compiler error * UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy * add a minor fix for the config * typo fix * Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2 * revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp * revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp * reenable the GemmSpatiallyLocalTilePartitioner * make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline * remove hardcoded true for preshuffle bool template argument * rename script * remove gemm_profilie.sh script * merge conflict resolve * clang formatted * typo fix * Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp * Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy * Fix missing newline at end of file in run_gemm_example.inc * Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1 * addressing review comments * removing debug code * addressing review comments * Revert "addressing review comments" This reverts commit `29c45192ba`. * updating tile_engine code * addressing review comments --------- Co-authored-by: amd-khushbu <khuagarw@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `1441a0a7ee`]	2025-08-01 00:04:54 -07:00
Khushbu Agarwal	464c6f459e	[CK_Tile] Updating gpu timer when doing flush cache (#2593 ) * Missed updating function names in example * updating timer * code cleanup * addressing review comments * updating tile_engine code * addressing review comments [ROCm/composable_kernel commit: `88d72178d6`]	2025-07-31 16:43:33 -07:00
Khushbu Agarwal	7af5e043b1	Update to gpu_timer for rotating_buffer (#2524 ) * update gpu_timer for rotating buffer as hipblasLt's implementation * timing fix * Updating gpu timer for old ck as well * Revert "Updating gpu timer for old ck as well" This reverts commit `958cd1bc99`. * code clean up with runtime argument; function rename * code cleanup * general timer fixes * bug fix * clang formatted * addressing review comments * clang formatted * Addressing review comments * CI fix --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `61e21f5567`]	2025-07-29 15:21:05 -07:00
rocking	0fb487ad63	set default optdim (#2580 ) [ROCm/composable_kernel commit: `01642ca8b1`]	2025-07-29 13:44:10 +08:00
Yi DING	a592107cb9	[CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130 ) * Fix shuffle_tile * Add fmha bwd d160 * CHANGELOG * Use static_cast * Update --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `1926cd0cb8`]	2025-07-29 09:31:14 +08:00
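
Note on commit b48fe39bf5 (#2653): the message says that sizing K and V with `nhead_k` instead of the query head count moved the reported bandwidth from 6202.95 GB/s to 1644.53 GB/s. The arithmetic can be roughly reproduced with the sketch below. It is an illustration only, not the code from the example. The function names, the traffic model (one read each of Q, K, V and one write of O), 2 bytes per bf16 element, and 1 GB = 1e9 bytes are all assumptions. The shapes come from the quoted command line. Because the log prints elapsed time rounded to 0.001 ms, the results match the logged figures only approximately.

```python
def fmha_fwd_num_bytes(batch, nhead_q, nhead_kv, seqlen_q, seqlen_k, hdim, elem_bytes=2):
    """Bytes moved under an assumed model: read Q, K, V once and write O once."""
    q = batch * nhead_q * seqlen_q * hdim
    k = batch * nhead_kv * seqlen_k * hdim  # K is sized by the number of K/V heads
    v = batch * nhead_kv * seqlen_k * hdim  # V is sized by the number of K/V heads
    o = batch * nhead_q * seqlen_q * hdim
    return (q + k + v + o) * elem_bytes


def gb_per_s(num_bytes, elapsed_ms):
    """Convert a byte count and a time in milliseconds to GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (elapsed_ms * 1e-3) / 1e9


# Shapes from the quoted command line: -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128
shape = dict(batch=2, seqlen_q=1, seqlen_k=32768, hdim=128)

# "Before": K and V counted with the query head count (32), overstating traffic.
bytes_before = fmha_fwd_num_bytes(nhead_q=32, nhead_kv=32, **shape)
# "After": K and V counted with nhead_k (8).
bytes_after = fmha_fwd_num_bytes(nhead_q=32, nhead_kv=8, **shape)

print(f"before: {gb_per_s(bytes_before, 0.173):.1f} GB/s")
print(f"after:  {gb_per_s(bytes_after, 0.163):.1f} GB/s")
```

With 32 query heads and 8 K/V heads, K and V are each four times smaller than the old count assumed, which accounts for the roughly fourfold drop in the reported figure.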

679 Commits
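
Note on commit bad7262507 (#2663): the message describes dequantizing on the C tensor after the MFMA, using per-group scales of B held in registers. A host-side sketch of that idea is shown below. It is a plain-Python illustration under stated assumptions, not the ck_tile kernel. The function names, the `[K / group_size][N]` scale layout, and the row-major list-of-lists tensors are hypothetical. The point it shows is that scaling each K-group's partial accumulator gives the same result as dequantizing B before the multiply.

```python
def gemm_groupwise_bquant(a, b_q, b_scale, group_size):
    """C = A @ dequant(B), applying each group's scale to its partial sum.

    a:       M x K matrix of floats
    b_q:     K x N matrix of quantized values
    b_scale: (K / group_size) x N matrix of scales (assumed layout)
    """
    m, k_total, n = len(a), len(b_q), len(b_q[0])
    assert k_total % group_size == 0, "K must be a multiple of the group size"
    c = [[0.0] * n for _ in range(m)]
    for g in range(k_total // group_size):
        k0 = g * group_size
        for i in range(m):
            for j in range(n):
                # Accumulate over one K group on the still-quantized B values ...
                partial = sum(a[i][k] * b_q[k][j] for k in range(k0, k0 + group_size))
                # ... then apply that group's scale to the accumulator.
                c[i][j] += partial * b_scale[g][j]
    return c


def gemm_dequant_first(a, b_q, b_scale, group_size):
    """Reference: dequantize B element by element, then do a plain matmul."""
    k_total, n = len(b_q), len(b_q[0])
    b = [[b_q[k][j] * b_scale[k // group_size][j] for j in range(n)] for k in range(k_total)]
    return [[sum(row[k] * b[k][j] for k in range(k_total)) for j in range(n)] for row in a]


a = [[1.0, 2.0, 3.0, 4.0]]
b_q = [[1, 2], [3, 4], [5, 6], [7, 8]]
b_scale = [[0.5, 2.0], [1.0, 0.25]]  # two K groups of size 2

c_post = gemm_groupwise_bquant(a, b_q, b_scale, group_size=2)
c_ref = gemm_dequant_first(a, b_q, b_scale, group_size=2)
print(c_post, c_ref)
```

Since the scale is applied once per accumulated group, a group cannot be smaller than one accumulation step. That is consistent with the message's remark that the group size can go as low as the K length of the underlying WarpGemm primitive.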