composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-19 02:01:01 +00:00

Author	SHA1	Message	Date
Geo Min	9503dc6af0	[TheRock CI] Adding presubmit check for CK (#2688 ) * Adding presubmit check for CK * Adding exclusion * Enable forks [ROCm/composable_kernel commit: `b4f3487d84`]	2025-08-18 14:16:31 -07:00
Illia Silin	687c09f3ab	Build ckProfiler package for all architectures. (#2701 ) * stash ckprofiler package built for all targets * build the lib for all instances in newer docker * make sure packages get posted [ROCm/composable_kernel commit: `8b55afcd93`]	2025-08-18 11:16:25 -07:00
linqunAMD	807f7510b5	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `9fcc1ee9fd`]	2025-08-18 10:08:31 -07:00
Sami Remes	22f5a997fb	Add other layouts for FP8 block scaled gemm (#2665 ) * Start adding other layouts for gemm_ab_scale * Add some instances * Create tensor descriptors for A/B scales depending on A/B layout * Fix formatting * Revert some comments * Revert commented instances in CMakeLists.txt * Add some more instances for col-row gemm * enable more row,row instances * Use occupancy=1 for col,row layout to avoid spills [ROCm/composable_kernel commit: `26d3300930`]	2025-08-18 01:46:10 -07:00
Tianyuan Wu	609fdf3322	Fix CI build error (#2695 ) Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> [ROCm/composable_kernel commit: `7310830d14`]	2025-08-18 01:45:40 -07:00
Tianyuan Wu	ec7ee5b7b7	[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466 ) * WMMA GEMM F16 Implementation Signed-off-by: root <tianyuwu@amd.com> * Self-review Signed-off-by: root <tianyuwu@amd.com> * ASIC check minor tweak Signed-off-by: root <tianyuwu@amd.com> * add missing include file * Set GPU_TARGETS to gfx11/12 generic Signed-off-by: root <tianyuwu@amd.com> * INT8 GFX12 Signed-off-by: root <tianyuwu@amd.com> * add int8x16 branch * Fix CI script Signed-off-by: root <tianyuwu@amd.com> * Fix typo Signed-off-by: root <tianyuwu@amd.com> * Add CK_Tile WMMA example Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Fix CI Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * fix clang format * Set M/N_Warp Back to Constant Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> * Use GemmConfigComputeV3 by default Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Remove CK_Tile wmma gemm examples from the CI list Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add atomic add fallback method for gfx11 Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Omit copyright year Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Support non-square cases Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Add get_device_ip() Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Add atomic add fallback method for gfx11" This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746. Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12" This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5. * Revise method name and typos Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Try fix CI Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Revert "Try fix CI" This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902. * clang-format Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> * Fix typo caused by merge Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> * Fix typo caused by merging Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> --------- Signed-off-by: root <tianyuwu@amd.com> Signed-off-by: Tianyuan Wu <tianyuwu@amd.com> Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com> Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com> Co-authored-by: joye <joye@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `68134b60e4`]	2025-08-15 16:22:27 -07:00
Thomas Ning	42d775e488	Preshuffle Decode Prefill config fix (#2693 ) * feat(gemm_wp): add two new configs for wp * delete the unnecessary files * fix the config error * update the config --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> [ROCm/composable_kernel commit: `5ada85ec04`]	2025-08-15 15:49:07 -07:00
Aviral Goel	d04f80c7ea	feat(gemm_wp): add two new configs for gemm weight preshuffle in gemm_utils.h (#2690 ) * feat(gemm_wp): add two new configs for wp * delete the unnecessary files --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `c06e8b4a66`]	2025-08-15 15:00:25 -07:00
Thrupti Raj Lakshmana Gowda	b21e1f74ee	Variable name correction in Jenkins file (#2686 ) [ROCm/composable_kernel commit: `1c2078066b`]	2025-08-14 13:35:55 -07:00
jefyang1	1809580db3	Add gemm universal f8 f8 bf16 instances on gfx950 (#2662 ) [ROCm/composable_kernel commit: `d7c95dd491`]	2025-08-14 13:25:24 -07:00
Emily Martins	a8947e1be6	[CK_Tile] Refactor Permute and MOE Smoothquant ctests to gtests (#2622 ) * Refactor CK tile permute ctests to gtests * Refactor CK tile MOE smoothquant ctests to gtests * fix typo in comment Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update invalid case in else clause for get_precision_string * Refactor permute gtests to use templated versions of matrix_core_swizzle and permute functions --------- Co-authored-by: root <root@splinter-126-wr-c2.aus.dcgpu> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `10395fc895`]	2025-08-14 11:01:54 -07:00
Emily Martins	70f25296e6	[CK_Tile] Refactor MOE Sorting and Smoothquant ctests to gtests (#2596 ) * refactor moe_sorting ctests to use gtest framework * Refactor ctests for smoothquant to gtests * fix clang format to use version 18 * Print local_eid in MOE sorting gtests * Remove extra space in smoothquant output [ROCm/composable_kernel commit: `70dce4e0c6`]	2025-08-14 10:54:57 -07:00
Yashvardhan Agarwal	64ef69f7de	CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) (#2632 ) * CK_TILE: Implement two-stage split-K GEMM with reduction - Added split-K GEMM with reduction example * comment resolutions [ROCm/composable_kernel commit: `7f14772406`]	2025-08-14 10:18:52 +02:00
Gino Lu	4794dd8372	fix wrong nan producion. (#2640 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `e5623d3825`]	2025-08-14 15:12:31 +08:00
Jimniu	7a59980e9f	Jimniu/tile_example_flatmm_basic fix (#2680 ) * Add stride_b validation * run clang-format [ROCm/composable_kernel commit: `753b6227c5`]	2025-08-13 16:06:08 -07:00
Aviral Goel	00b044e4a0	Minor Improvements in CK TILE memory copy EXAMPLE (#2678 ) * Rename vector to ThreadTile * more notes on tile encoding * remove number<> from tuple of make_tile_window * add script to stress test the copy example [ROCm/composable_kernel commit: `8a698c7445`]	2025-08-13 15:24:16 -07:00
joyeamd	fbefd916f0	[CK_TILE]fix elementwise example in gfx11/12 (#2676 ) * fix elementwise examples * improve the robust * fix ck_tile's elementwise test * update elementwise test [ROCm/composable_kernel commit: `bcc38deff7`]	2025-08-13 15:21:46 -07:00
Enrico Degregori	aa7950b72d	Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2675 ) Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `a6f4029276`]	2025-08-14 00:21:09 +02:00
JH-Leon-KIM-AMD	4bc6c568bd	CSV-driven convolution test pipeline (#2581 ) * Add CSV-driven convolution test pipeline - Add test_grouped_convnd_fwd_dataset_xdl.cpp with CSV reader functionality - Add complete dataset generation toolchain in test_data/ - Add Jenkins integration with RUN_CONV_COMPREHENSIVE_DATASET parameter - Ready for comprehensive convolution testing with scalable datasets * Update convolution test dataset generation pipeline * add 2d, 3d dataset csv files * Remove CSV test dataset files from repository * Update generate_test_dataset.sh * Fix channel division for MIOpen to CK conversion * Remove unnecessary test files * Fix clang-format-18 formatting issues --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `b963478759`]	2025-08-13 16:24:34 +02:00
Haocong WANG	1d13107873	fix for aiter consume (#2677 ) [ROCm/composable_kernel commit: `3142562c22`]	2025-08-13 19:06:22 +08:00
SamiAario-AMD	a2e850acbc	Cleanups (#2631 ) * Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp * Simplify two templated operator calls by having the templated types deduced automatically * Simplify two GemmPipeline calls * Fix GemmPipelineAgBgCrCompV4::GetName * Refactor use of ArgParser in CK tile GEMM examples * Update args in README.md to match the implementation in create_args * Remove some unnecessary include statements * Rename two variables * Factor out common code * Factor out do_verify * Add and use type aliases for memory operation integral constants * In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `28a97865f5`]	2025-08-13 10:12:08 +02:00
Haocong WANG	cfd8e1b303	Re-enable optimization for gfx950 fmha fwd (#2671 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format * refactor blockgemm change, isolate to v2; --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `05a6e92705`]	2025-08-13 14:57:43 +08:00
Cong Ma	7a49dd7d7c	Preshuffle AQ matrix in block scale gemm (#2624 ) * Preshuffle AQ matrix in block scale gemm * turns the output to fp16. Increase the repetition time. --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `452791a3ba`]	2025-08-12 21:32:51 -07:00
Thomas Ning	651512cbc2	Finish the grouped gemm restructure with fp8 data type (#2655 ) * Finish the grouped gemm restructure with data type * restore gemm_utils.hpp * Update example/ck_tile/17_grouped_gemm/run_grouped_gemm_example.inc Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Comment Addressed --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0f42a92fc1`]	2025-08-12 18:23:34 -07:00
Thrupti Raj Lakshmana Gowda	1d1d6717d2	GEMM Multi D for CK Tile Engine (#2660 ) * Readme for GEMM Multi D * GEMM Multi D partial Progress * GEMM Multi D partial Progress! * CK Tile Engine GEMM Multi D : All Python files generated * Partial Progress * Partial Progress * Partial Progress * Partial Progress : Incorrect Result * Partial Progress : Debugging * Partial Progress : Correct Results * Partial Progress - Incorrect Results * Partial Progress - Commenting Passthrough bypass logic * Changing Passthrough to MultiplyMultiply * Correct Results! * Fix and debug the pass through feature * Sample commit * Correct Results : MultiplyMultiply * Code Cleanup * Removing Failed Instances * Working code before Unary element support * Custom Elementwise Function support and working implementation for Mul and Add * Updating README * Working for Passthrough * Review Comments : Minor Fixes * Review Comments : Minor Fixes * Readme Updated * Partial Changes after Rebase * Working Code : Changes after Rebase * Updating Jenkins file * Removing default value changed while testing * Configuration changes in config files * Tile Handler changes in GEMM Multi D Tile Engine * Tile Handler changes in GEMM Multi D Example * Change log for Gemm Multi D in CK Tile Engine * Configuration changes in config files --------- Co-authored-by: ThomasNing <thomasning@amd.com> [ROCm/composable_kernel commit: `3f57ec3d2d`]	2025-08-12 16:05:05 -07:00
Geo Min	0d5324f88d	[TheRock CI] Adding TheRock CI gate check (#2648 ) * Adding initial TheRock CI * Adding composable kernel link * Adding correct repo for rocm-libraries * Adding entire rocm-libraries checkout * Adding correct flag * Adding correct flag for fetch sources * Fixing git health * Removing patch * Removing patching * Removing manual check * PR comments * testing without dist * Removing test branch * PR comments * PR comments * PR comment * Adding test_runs_on [ROCm/composable_kernel commit: `30dafe8281`]	2025-08-12 14:13:01 -07:00
joyeamd	5b318ecee5	[CK_TILE]fix ck_tile's moe_sorting example in gfx11 (#2667 ) * fix ck_tile's moe_sorting example in gfx11 * fix clang format --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `0856b3f4a2`]	2025-08-12 12:33:56 -07:00
Illia Silin	5b5ae5f81d	fix builds with mainline/staging compilers (#2674 ) [ROCm/composable_kernel commit: `bbf41b27f2`]	2025-08-12 10:23:08 -07:00
slippedJim	8a12cdd385	remove bad pipeline codegen (#2673 ) [ROCm/composable_kernel commit: `20288caa2f`]	2025-08-13 00:23:40 +08:00
asleepzzz	9161cb590d	Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 ) This reverts commit 327bf408dd05b4e4bfb7b72f63f8710f35efa9a4. [ROCm/composable_kernel commit: `5b39de4bb6`]	2025-08-12 20:27:10 +08:00
Haocong WANG	8a20b06f54	Optimize fmha fwd decode & prefill for gfx950 (#2641 ) * Fix for fwd/bwd kernel build filter * fix bwd code * save an example for __bf16 type * temp save, waiting for debug * tempsave, fmha_decode * temp save, change all instance to 1wave * fix async copytest bug * Add block_sync_lds_direct_load utility * fix the s_waitcnt_imm calculation * Improve s_waitcnt_imm calculation * fix vmcnt shift * add input validation and bug fix * remove unnecessary output * move test_copy into test * temp save * tempsave * compile pass * tempsave, trload+asyncload done * tempsave. asynccopy+trload sanity checked * remove unnecessary features * fix the lds alignment caused performance regression * enable prefill overload operator(). * remove all lds bankconflict with xor layouts * enable larger tile size; upgrade xor pattern * upgrade prefill pipeline; simple iglp; consistent data produce and consume order * small refactor * Load Q through lds, implement xor; * add vmcnt guard before load ktile * Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA * Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug * add __restrict__ to tr load * merge fa_decode pipeline into fmha_fwd api * remove unnecessary files; rename some files * Remove unnecessary changes * bug fix, clang format; * remove non-necessary change * fix clangformat with 18.1.3 * fix bugs * fix bug * fix bug on non-gfx950 * fix bugs in gemm * fix bug in pki4 * tempsave, update the blocksync functions * change the warp setting for hdim32 fmha fwd * clang format * fix conflict. disable all v-col instance for fmha fwd * Fix the bug * clang format --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `b7322a521a`]	2025-08-12 19:43:14 +08:00
Mateusz Ozga	eab0dc96f6	fix (#2668 ) [ROCm/composable_kernel commit: `c0c2ded566`]	2025-08-12 13:02:10 +02:00
Yi DING	0afd7af89c	[CK_TILE] FMHA BWD Decode Pipeline (#2643 ) * Fix distr * Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr * decode 16x16 o2 [ROCm/composable_kernel commit: `8e1eb0c1ee`]	2025-08-12 17:02:52 +08:00
Cameron Shinn	b48fe39bf5	Fix num_byte calculations to use nhead_k for K & V size (#2653 ) Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300. Before: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s ``` After: ``` ./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1 [bf16\|batch\|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s ``` [ROCm/composable_kernel commit: `352f87e684`]	2025-08-12 13:44:01 +08:00
Yi DING	ef32abc4cc	[CK_TILE] FMHA BWD Optimization For GFX950 (#2628 ) * simplify fmha_bwd_kernel MakeKargs & dq_dram_window * simply duplicate * trload pipeline * Try two-stage * add prefetch * optimize & iglp [ROCm/composable_kernel commit: `4fde1646e5`]	2025-08-12 11:11:55 +08:00
Aviral Goel	4d263f468e	feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582 ) * feat(copy_kernel): add basic copy kernel example with documentation * docs(CHANGELOG): Updated changelog * chore: performed clang format * Update example/ck_tile/39_copy/copy_basic.cpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * Update example/ck_tile/39_copy/README.md Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> * fix(terminology): follow amd terms * extract elementwise copy to a new kernel * fix(copy_kernel): bug in verification * add comments about vgpr usage * lint and nits * add notes and comments * print hostTensor via stream * print hostTensor via stream --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> [ROCm/composable_kernel commit: `a7badc6ec5`]	2025-08-11 10:54:37 -07:00
Illia Silin	3c0626e2c1	enable aiter test_mha in daily CI (#2659 ) [ROCm/composable_kernel commit: `6bfef63414`]	2025-08-11 09:50:33 -07:00
Yashvardhan Agarwal	78662c8ad5	Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 ) * fix reduce2d - revret the combine_partial_results() chnages - remove auto from function def * clang-format [ROCm/composable_kernel commit: `191c62967b`]	2025-08-11 15:01:33 +02:00
geozhai	ffaa6ba3b3	update CK build instruction step 4 (#2563 ) Co-authored-by: Aviral Goel <aviral.goel@amd.com> [ROCm/composable_kernel commit: `1e1ee758fa`]	2025-08-11 00:26:13 -04:00
Illia Silin	f5b69dbdc2	remove ck_tile transpose and gemm stages from CI (#2646 ) [ROCm/composable_kernel commit: `8613aa1e40`]	2025-08-08 10:48:44 -07:00
Illia Silin	45758022ee	Add daily AITER tests on gfx942. (#2639 ) * add option to select aiter branch, add tests on gfx942 [ROCm/composable_kernel commit: `7ac850ac72`]	2025-08-08 09:30:46 -07:00
Max Podkorytov	c8ff442803	[CK-tile] add more tests for batched transpose testing the rectangular block tile sizes (#2634 ) * add failing tests * swap out and reference * add constraint assert to transpose input distribution * test both pipelines with rectangular block tile * print mismatched indices * add a smaller failing test for old pipeline * print grid and block * fill output before operating on it * swap m/n tile sizes and make one test pass * add device syncs * add one more flipped test case * flip block tile at host arg init * fix tiles for lds pipeline * clang-format * rename tests * roll back error check * remove device syncs * reduce large test case's size [ROCm/composable_kernel commit: `ab26026835`]	2025-08-07 16:51:53 -07:00
Sami Remes	a180ebf3e7	[CK_TILE] Enable persistent kernel and tail handler in tile_engine (#2300 ) * Enable persistent kernel in tile_engine and use tail handler * Fix formatting * Add persistent to default_config.json * Remove extra newlines and add persistent also to user config * Reduce instances from default_config.json * add persistent to benchmark.json and custom_ci_config.json * changed the config file to have few instances --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: ThomasNing <thomasning@amd.com> [ROCm/composable_kernel commit: `3c9400471d`]	2025-08-07 16:03:49 -07:00
Gino Lu	f21e432ea1	Add e8m0 scaled convert into CK_TILE (#2617 ) * first commit * remove redundent code * modify according to comments. * fix type_convert error with scaled_type_convert [ROCm/composable_kernel commit: `5d6d236b25`]	2025-08-07 21:37:28 +08:00
Yi DING	84a17e2f94	[CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550 ) * Remove unnecessary pssk * Add BlockFmhaBwdDQDKDVPipeline wrapper * Resolve copilot comments & Remove kpad & fix * Remove spad [ROCm/composable_kernel commit: `b0a97498b0`]	2025-08-07 21:24:43 +08:00
Sami Remes	6474014ee7	[CK_TILE] Enable printing more structures in CK-Tile (#2443 ) * Add more printing to core cktile * Revert other changes in static encoding pattern * Refactor to using a free print() function * Remove loops and print just the containers * Print tuple with better formatting, fix sequence compilation * Add some tests for print utility * Add print utility header * Print for static_encoding_pattern * add buffer_view printing * Align vector_traits * Fix formatting * Lower-case enum strings Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> * Remove empty comment lines * Fix test with lower-case too * Reduce repeated code in print tests, move helper function closer to type definition, test X&Y * Add test_print_common.hpp * add print.hpp in core.hpp --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `ffdee5e774`]	2025-08-07 15:45:27 +03:00
Enrico Degregori	9e5f065191	Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610 )" (#2637 ) This reverts commit 5c58e5593464484b0573ec1b77aa06d36be04d54. Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `21e9983913`]	2025-08-07 12:30:08 +02:00
Bartłomiej Kocot	7c838b7fbe	Fix clang format after conv changes (#2636 ) [ROCm/composable_kernel commit: `54c7e08a2f`]	2025-08-07 10:00:09 +02:00
Bartłomiej Kocot	b12eb7db4f	Grouped Convolution Forward Infer Bias Bnorm Activ (#2621 ) * Grouped Convolution Forward Infer Bias Bnorm Activ * 3d [ROCm/composable_kernel commit: `5328b232b2`]	2025-08-07 08:36:47 +02:00
Max Podkorytov	40a8c5c855	modernize scripts for running cmake and clang-format (#2503 ) Co-authored-by: Aviral Goel <aviral.goel@amd.com> [ROCm/composable_kernel commit: `1824d65758`]	2025-08-06 10:15:44 -07:00

1 2 3 4 5 ...

2213 Commits