composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
Yi DING	8e3d84aba3	[CK_TILE] ABQuant New Preshuffle (#3638 ) * Refactor * Gemm quant improvement * Change preshuffle * Fix * Fix grouped gemm ut * Fix --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-01-27 23:46:49 -08:00
Bartłomiej Kocot	3d67e6c492	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624 ) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt	2026-01-27 11:04:11 +02:00
ltqin	67f0b74ec6	Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 )" (#3635 ) This reverts commit `de5a1d730d`. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-01-23 09:03:22 -08:00
Po Yen Chen	de5a1d730d	Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 ) This reverts commit `dd0b4294af`.	2026-01-22 21:21:19 -08:00
ltqin	dd0b4294af	Fp8 block scale quantization for fmha fwd (#3330 ) * add block scale parameters to kernel * add block scale to kernel * add smoke test * format * Revert "format" This reverts commit `356c3c9706`. * only format my code * format py * fix auto not allowd in function prototype * change instance tttt to ttff * fix structured binding issue * change s_acc elementwise op * async pipeline add block scale * add quantation P using shift exp2 * precompute (m - shift) once per row * change blk scale seqstrt ptr name * fix some name * fix for deduction guide * fix some comments * add P scale to qr_ksvs_pipeline * add comment to idx_identity * change the method of calculating descale block index * unify naming style: use block_scale_ as name prefix * unify naming style * update the CHANGELOG.md * Add FP8 block scale quantization support for FMHA forward kernel --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-01-21 20:58:26 -08:00
Cong Ma	0517d43d31	[CK TILE] remove dependency on std chrono (#3599 ) * [CK TILE] remove dependency on std chrono * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-19 15:31:02 -08:00
Thomas Ning	00c46785a8	Shuffle fix for gfx950 (#3491 ) * solve compiler issue * solve the gfx950 mfma shuffle regression * refactor jenkinsfile to handle arch name better * [CK TILE] set divisor to count of thread along k dimension * fix the compiler error * solve degradation * Finish the multiplies fix * fix the scales * solve compilation error * solve the composes * solve the error of tile sweeper * fix the test and example * fix for gfx950 --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Cong Ma <congma13@amd.com>	2026-01-13 09:21:29 -08:00
damien-lejeune	4216d43da8	Dlejeune/ck tile 2d multiple reductions (#3147 ) * WIP * Add Unit tests for the Multi Reduction Kernel * clang format * Rename multiblock to threadwise * Multiblock WIP * Fix multi reduce multi block unit tests * Multi Reduce Tile Engine: WIP * refactoring + try addressing precision error * Fix multiops examples * Cleanup * Clean up tile engine's reduce op * Update changelog * Fix remod/clang * Fix dates * Fix documentation & missing file * Fix comments * Use the update_tile api in the multi-block kernel * Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests * Add TileParitioner * Cleanup * Add warning when no data to process, in the example * Refactoring Reduce kernel Tile Partioner + cleanup * Move the tile partioner to its own file * Add missing includes * Fix copyright header with update_amd_copyright_headers.py * Fix change of interface in Reduce2dProblem --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2026-01-09 11:16:37 +01:00
John Afaganis	ec23be0b9d	Update unsigned long literals and format specifiers to work correctly in Windows (#3483 ) Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows. The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit. This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.	2026-01-02 22:16:41 -07:00
yadaish	dae85ead64	[CK_TILE] support split-k a16w4 gemm1 (#3389 ) * initial version to support moe gemm1 split-k * add missing args * fix build warning * update reference * for split-k disable bias and weight * remove debug log * fix format * fix div by zero errors * fix cmake config * update * resolve conflicts * remove useless changes * reformat * fix * remove useless changes * fix ci --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>	2025-12-29 23:05:35 +08:00
kensclin	7f68f3c4fa	Enable padding blockscale for abquant (#3453 ) * Enable padding blockscale for abquant * run clang-format * Reduce unnecessary testing * remove cout	2025-12-24 09:12:40 -08:00
kensclin	0500fcc017	Support A/B Quantization in Blockscale GEMM (#3343 ) * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Support A/B Quantization in Blockscale GEMM * Implement review suggested changes * Implement review suggested changes * Sync with develop * fix pre-commit error * Add unit tests for blockscale AB-Quantization * fix pre-commit error * fix pre-commit error * fix compile error * fix compile error * fix clang-format * fix clang-format * fix enumeration values not handled in switch * rebase file * Add missing enums to data_type_sizeof (#3430) Fixes broken build on gfx942. This was some test code that got merged at the same time. * [CK_BUILDER] CK Tile header installation for builder, algorithm concept improvements (#3419) * Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen * Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts --------- Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com> * Add build trace diagnostics to CI. (#3432) * generate and visualize build traces for all archs * generate build traces in all cases * fix jenkins logic * fix typo * use more threads for parsing dependency map * add script to parse ninja traces and issue warnings * fix python script syntax and header * fix python syntax one more time * fix python syntax * Support A/B Quantization in Blockscale GEMM * Implement review suggested changes * Sync with develop * Add unit tests for blockscale AB-Quantization * fix enumeration values not handled in switch * rebase file * rebase file --------- Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: DarylHawkinsAMD <Daryl.Hawkins@amd.com> Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-17 07:13:47 -08:00
linqunAMD	6d7299ff78	[ck_tile] remove duplicate functions in ck_tile (#3311 ) * [ck_tile] remove duplicated shuffle_b and shuffle_b_permuteN * [ck_tile] move get_k_warp to gemm_shape * resolve code rebase error	2025-12-15 07:13:00 -08:00
Linjun-AMD	f5573f56d9	Add attention sink support for FMHA FWD (#3368 ) * Revert "Revert "Add attn sink (#2892)" (#3250)" This reverts commit `5adaa201ed`. * fix conflict Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Add F_sink parameter to FmhaFwdPipeline * Update tile_fmha_traits.hpp * Refactor pipeline creation in fmha_fwd.py Updated the pipeline creation logic to include 'sink' parameter in product combinations and adjusted the FmhaFwdPipeline calls accordingly. * Update fmha_fwd.py * Update fmha_fwd.py * Update example/ck_tile/01_fmha/script/correct_test_fwd_sink.sh Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update CHANGELOG.md Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update CHANGELOG with new features and support * Update fmha_fwd.hpp * Update CHANGELOG.md * Update smoke_test_fwd_sink.sh * Update correct_test_fwd_sink.sh * Update smoke_test_fwd_sink.sh --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-12-15 12:21:59 +08:00
Enrico Degregori	21f06aa47d	CK Tile: Enable padding blockscale example (#3417 ) * Fix host code padding * restructure the ref code * clean up * Fix compilation error --------- Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-12-14 10:25:47 -08:00
eliotwang	715671e419	Bf16fp4 gemm (#2801 ) support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * support bf16mxfp4 gemm rebase bf16fp4 example to develop branch Clean up commented debug code in GEMM kernel * rename example folder * rebase to new develop * fix clang format * update code according to reviewer's comment * Update README.md * update code according to reviewer's comment * update code according to reviewer's comment * Update CMakeLists.txt * Update README.md * Update CMakeLists.txt * Delete files * Delete files * Add unit tests * Update test_gemm_quant_base.hpp * merge bf16fp4 example to develop branch fix clang format * fix clang format * Update CMakeLists.txt * fix ci test * fix clang format * resolve conflicts --------- Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu> Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-11 07:20:29 -08:00
Ville Pietilä	fc22320d78	[CK_TILE] Split-K autodeduction (#3351 ) * First version of split-K autodeduction. * Fix circular dependency and kernel construction. * Fix tolerance calculation for bwd weight example. * Simplify kernel construction. * Fix kernel launching bug for split-K autodeduce. * Add split-K autodeduction support for the two stage example. * Fix a corner case. * Fix clang-format. * Fix clang-format for inc files. * Add missing header. * Prevent too large split-K values. * Fix formatting. * Add unit tests for IsSupportedArgument in grouped bwd conv. * clang-format. * Fix merge conflicts. * Address feedback from code review. * clang-format * Fix new tests after merge. --------- Co-authored-by: Ville Pietilä <>	2025-12-10 09:30:30 +02:00
Yi DING	c1c2e41a03	[CK_TILE] Generate random tensor values with multiple threads (#3324 )	2025-12-09 11:02:33 +08:00
Khushbu Agarwal	6b1bceca7b	[CK_Tile] Enable PreshuffleB for 2d block scale Gemm (#3298 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * fixing bq tensor calculation --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-12-05 09:57:52 -08:00
msaffari-amd	2d3020e5b0	[CK Tile] batched contraction kernel generalizing (#3126 ) * Add help for example * Refactore the compute reference batched contraction to manage stride-aware calculation and some code cleanings * Add stride-aware reference for batched contraction with independent D tensor layouts * Add -num_d argument for runtime D tensor count selection in batched contraction * Add stride vector arguments in example code for testing non-contiguous batched contraction inputs * Add descriptor-based architecture for batched contraction multi-dimensional stride support * Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0 * Add complete multi-dimensional stride support via descriptors * Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm * Clean up batched contraction: remove old UniversalGemmKernel path * Clean up batched contraction: remove legacy paths and finalize docs * Optimize batched contraction example: pass dimension sizes not vectors * correct the reference calculation, unsigned int to int * Fix batched_contraction C++17 build errors for gfx90a CI	2025-12-02 13:30:27 +01:00
Gino Lu	ba6af9fe7c	[CK_TILE] Add unit test for fp4 warp gemm (#2817 ) This update includes a unit test for warp GEMM	2025-12-01 13:56:48 +08:00
Matthias Gehre	678298d4c7	Add support for gfx1153 (#3306 )	2025-11-27 08:48:00 +01:00
Aviral Goel	de6466481f	chore(copyright): update copyright header for include directory (#3293 )	2025-11-26 11:00:05 -07:00
Khushbu Agarwal	8111572785	[CK_Tile] Support for preshuffle weight(B) quant tensor for block scale gemm (#3165 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * addressing review comments * fixing CI issue * addressing reveiw comments * formatting * formatting * fixing aquant operator overlaoding * formatting --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-11-24 07:48:42 -08:00
rocking	5948dbffe4	Support fp8 dynamic quantization for fmha (#3206 ) * Support qscale for dynamic quant, remove static quant * Support hdim=256 * Remove bias test case for fp8 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-11-24 16:28:25 +08:00
asleepzzz	5adaa201ed	Revert "Add attn sink (#2892 )" (#3250 ) This reverts commit `9fa4e8d5ab`.	2025-11-20 07:55:15 -08:00
Linjun-AMD	9fa4e8d5ab	Add attn sink (#2892 ) * enable attn sink Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update attn_sink script Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix some error Signed-off-by: JL-underdog <Jun.Lin@amd.com> * clang-format Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd_kernel'mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update block_fmha_pipeline_qr_ks_vs.hpp Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix ci error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix format error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_bwd_pipeline_default_policy.hpp * Update fmha_fwd_runner.hpp * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * update splitkv_pipline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update splitkv&pagedkv pipeline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * add sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update attn_sink result log Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update smoke_test_fwd_sink.sh Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test file Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test script Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp * use constexpr kHasSink for sink in fmha pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update by pre-commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update fmha_fwd.py * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Remove causal mask setting logic from mask.hpp Removed the mask setting logic for causal masks. * fix ci error that some usage of lamada not support in c++17 Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update remod.py * add smoke sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update fmha_pagedkv_prefill.py * Update FmhaFwdPipeline parameters in fmha_fwd.py * update block_fmha_pipeline_qr_ks_vs_async_trload.hpp Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix c++17 unsupprot error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp * Fix formatting of sink_seq_end assignment * Fix indentation for sink_seq_end assignment * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-11-20 19:24:05 +08:00
Yi DING	47e2ed838e	[CK_TILE] Add Flatmm MX FP8 (#3208 ) * Use async for flatmm mxfp4 * Fix preshuffle * Add flatmm mxfp8 * Thanks, Copilot * Thanks Copilot again~	2025-11-20 10:35:15 +08:00
linqunAMD	1b1c46e508	[CK_TILE] Fix gemm_quant (#3186 )	2025-11-11 08:23:57 -08:00
Cong Ma	5abe4109e0	Introduces the new partitioner to implement the reduction StreamK kernel. (#3107 ) * Introduces the new partitioner to implement the reduction StreamK kernel * Add more doc text to functions * Add persistent-dp option to streamk example * Update example/ck_tile/40_streamk_gemm/README.md	2025-11-04 10:32:17 -07:00
Sami Remes	16e85cf179	[CK_TILE] B matrix 2D block scale gemm (#3074 ) * Refactor quant group size to be configurable for M/N/K, not just K * add some asserts for configurations not implemented * start setting of group size for N dimension * enable 2d for reference quant gemm * WIP: trying to figure out tile dstr and/or indexing for scale matrix * WIP * Fix handling of n dim blocks in tile windows etc * remove commented code and enable all tests again * fix formatting * Add more specialized tile distributions * Enable NWarps replication for bquant tile dstr * fix formatting * fix format * Fix some issues from the merge * fix formatting * one more fix to tile dstr, and revert debug initialization * Remove commented code Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * simplify conditions that are needed for tile distributions * only enable the working group sizes in tests * fix formatting * Update tile distribution for 2D bquant * add some documentation and 2d block scale example * fix formatting * Add in Changlog and restructure the quant 2d example * fix CMake * support the change for blockscale 2d * fix the test file --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-11-02 16:49:20 -08:00
Enrico Degregori	4ebc48a3cd	WMMA gemm_add_relu_add_layernorm (#2989 ) * Summary: - Refactor epilogue (with CShuffle) to support fused operations: - EpilogueCShuffleBase holds common parts - EpilogueCShuffle: runs CShuffle and write out - EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out - Extend thread transfer v7r3: - Support for intermediate data type different from src and dst type - New functionality to write to dst buffer and keep data (to be able to use them for additional operations) * Adress review comments	2025-10-31 11:19:26 -07:00
Yi DING	e135dd518d	[CK_TILE] Add mxfp4 flatmm (#3080 ) * Squashed commit of the following: commit 3e1a851dad834776efbe4fe365ac82c4ed312010 Author: Ding, Yi <yi.ding@amd.com> Date: Thu Oct 23 06:10:54 2025 +0000 Fix & clean after rebase commit 1edf485092f44411da9a1796a4a6b72d5cdb67c6 Author: Ding, Yi <yi.ding@amd.com> Date: Wed Oct 22 10:46:13 2025 +0000 Squashed commit of the following: commit `0b6b9dbd1b` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 22 02:04:27 2025 -0500 fix bandwidth calculation commit `9aebf53bb7` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 22 00:58:59 2025 -0500 updates commit `62607de56c` Author: mtgu0705 <mtgu@amd.com> Date: Fri Sep 19 00:39:46 2025 -0500 fix a bug, set the A DS_read preload size to 4 for MXFP4 commit `92ad6fcc0a` Author: mtgu0705 <mtgu@amd.com> Date: Thu Sep 18 01:19:03 2025 -0500 fix a_wrap preload issue for large MPerBlock. commit `f2db44710f` Author: mtgu0705 <mtgu@amd.com> Date: Wed Sep 17 21:34:03 2025 -0500 optimized the VGPR repack issue for MXFP4 commit `346a400027` Author: Gino Lu <gino.lu@amd.com> Date: Wed Sep 17 04:19:44 2025 -0500 fix time error commit `80c1743034` Author: mtgu0705 <mtgu@amd.com> Date: Wed Sep 17 03:58:00 2025 -0500 updated, function passed. commit `ce26d9071e` Author: mtgu0705 <mtgu@amd.com> Date: Tue Sep 16 22:21:39 2025 -0500 fix, function partially passed commit `0a89ed13a5` Author: mtgu0705 <mtgu@amd.com> Date: Tue Sep 16 03:01:12 2025 -0500 fix, reference function passed, next check kernel function commit `ec9bcef591` Author: Gino Lu <gino.lu@amd.com> Date: Tue Sep 16 02:29:01 2025 -0500 let pack/unpack return pk_fp4_t commit `a333206929` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 15 20:50:26 2025 -0500 fix commit `3893c06540` Author: Gino Lu <gino.lu@amd.com> Date: Mon Sep 15 05:51:06 2025 -0500 fix bug commit `8052bea019` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 15 04:02:05 2025 -0500 fix core dump issue, function is not correct. commit `9ceb3fd508` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 15 03:03:02 2025 -0500 updates, build pass commit `cc94eb6045` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 15 00:05:18 2025 -0500 updates commit `22586c3135` Author: Gino Lu <gino.lu@amd.com> Date: Sun Sep 14 23:40:28 2025 -0500 fix bug commit `e92e67b8dd` Author: Gino Lu <gino.lu@amd.com> Date: Fri Sep 12 03:28:50 2025 -0500 fix interface commit `8b1dd60c08` Author: Gino Lu <gino.lu@amd.com> Date: Fri Sep 12 02:53:50 2025 -0500 add interface in warp_gemm_impl commit `c6135f6abe` Author: mtgu0705 <mtgu@amd.com> Date: Wed Sep 10 05:03:08 2025 -0500 updates some fixes. commit `b0d71b8d19` Author: mtgu0705 <mtgu@amd.com> Date: Tue Sep 9 04:37:42 2025 -0500 fix after merge ginolu/add_wgmfma_dispatcher commit `f119c30317` Merge: `c5030e602` `72c8ef856` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 8 22:09:15 2025 -0500 Merge remote-tracking branch 'origin/ginolu/add_wgmfma_dispatcher' into mtgu/cktile_mxfp4_flatmm_dev commit `c5030e602e` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 8 21:42:47 2025 -0500 update mx flatmm tail pipeline commit `72c8ef8567` Merge: `9661bb400` `e4a772890` Author: Gino Lu <gino.lu@amd.com> Date: Mon Sep 8 19:10:23 2025 -0500 Merge branch 'develop' into ginolu/add_wgmfma_dispatcher commit `9661bb400b` Author: Gino Lu <gino.lu@amd.com> Date: Mon Sep 8 19:09:55 2025 -0500 fix type error commit `0509597f55` Author: mtgu0705 <mtgu@amd.com> Date: Mon Sep 8 04:01:40 2025 -0500 update hotloop pipeline commit `754ae0461b` Merge: `15d44406e` `83f607e2a` Author: Gino Lu <gino.lu@amd.com> Date: Fri Sep 5 04:22:26 2025 -0500 Merge branch 'develop' into ginolu/add_wgmfma_dispatcher commit `15d44406e5` Author: Gino Lu <gino.lu@amd.com> Date: Fri Sep 5 04:21:26 2025 -0500 fix clang format commit `146963d62a` Author: mtgu0705 <mtgu@amd.com> Date: Wed Sep 3 10:00:54 2025 -0500 some updates commit `12526b626a` Merge: `47cee0471` `00fd72b2d` Author: asleepzzz <hanwen.chang@amd.com> Date: Wed Sep 3 13:22:03 2025 +0800 Merge branch 'develop' into ginolu/add_wgmfma_dispatcher commit `47cee04712` Author: Gino Lu <gino.lu@amd.com> Date: Mon Sep 1 02:11:02 2025 -0500 fix vec size error commit `d2892925e5` Author: Gino Lu <gino.lu@amd.com> Date: Mon Sep 1 01:23:39 2025 -0500 fix format error commit `16993acd1d` Author: mtgu0705 <mtgu@amd.com> Date: Sat Aug 30 03:19:07 2025 -0500 update codes commit `9c37e55d13` Author: mtgu0705 <mtgu@amd.com> Date: Fri Aug 29 11:27:33 2025 -0500 init ck_tile mxfp4 flatmm commit `5c484a5672` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Thu Aug 28 08:02:50 2025 +0000 Add bias for f16xf4 moe_flatmm commit `dd6539f366` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 27 13:39:47 2025 +0000 update case construction commit `65b702454c` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Tue Aug 26 12:32:29 2025 +0000 support swiglu activaion and use rcpf to accelerate silu commit `b422e41e08` Author: Gino Lu <gino.lu@amd.com> Date: Tue Aug 26 02:33:55 2025 -0500 first commit commit `d05eed931d` Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Date: Fri Aug 22 04:01:59 2025 -0500 add line to last commit `d69cab7f0c` Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Date: Fri Aug 22 03:20:46 2025 -0500 adjust A_LDS descriptor to avoid bankconflict commit `65989e940c` Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu> Date: Thu Aug 21 09:46:52 2025 -0500 enable hotloop commit `c378e9bdf8` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Thu Aug 21 09:12:21 2025 +0000 support atomic_pk_add_bf16 on gfx950 commit `85976b0b87` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Thu Aug 21 06:58:55 2025 +0000 use int64_t as expert stride to avoid overflow commit `9fbcc8f8a4` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 20 13:53:32 2025 +0000 use v4i32 as the storage type for B to avoid repack operation commit `81899bd920` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 20 06:40:03 2025 +0000 add pk_fp4_t and e8m0_t support for amd_buffer_load_impl commit `c27eb0771a` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 20 04:39:14 2025 +0000 optimize cvt_pkf4_to_f16 implementation commit `3ca0bd500a` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Tue Aug 19 14:56:46 2025 +0000 optimize A_LDS descriptor to avoid bankconflict commit `f7f0306eea` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 18 18:43:37 2025 +0000 fix gate-up when GU_NRepeat > 1 commit `be55c0f9cb` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 18 17:28:11 2025 +0000 add fp16xf4 moe commit `599e1f5b32` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Sun Aug 17 17:51:18 2025 +0000 rename example commit `7899fb4a8d` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Fri Aug 15 06:20:46 2025 +0000 remove additional check when e8m0->float commit `714b341797` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Thu Aug 14 09:34:12 2025 +0000 eliminate repeat dequant commit `53e8c0c533` Merge: `5de620895` `cc9c7b9e5` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 13 16:51:49 2025 +0000 Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm commit `5de6208952` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 13 16:16:48 2025 +0000 update f16xMXF4 commit `732ebdee8b` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 13 10:48:53 2025 +0000 update scale-preshuffle for MXF4 commit `edb58d0680` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 11 11:24:34 2025 +0000 update commit `cc9c7b9e58` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 11 08:38:23 2025 +0000 optimize gemm2 atomic_add pattern commit `200a11afc8` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 11 07:59:47 2025 +0000 update scale for mxfp4 commit `87aed564dc` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 11 07:56:14 2025 +0000 update case construction commit `8b85fa6cf2` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 11 06:03:06 2025 +0000 update granularity control commit `1b8c7097b8` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 11 03:42:46 2025 +0000 fix TileConfig commit `8ba1c708dc` Author: Gino Lu <gino.lu@amd.com> Date: Thu Aug 7 21:37:28 2025 +0800 Add e8m0 scaled convert into CK_TILE (#2617) * first commit * remove redundent code * modify according to comments. * fix type_convert error with scaled_type_convert commit `f788d3d629` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Fri Aug 8 20:19:16 2025 +0000 add mixed_prec fp16xfp4 commit `3dea10a277` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Thu Aug 7 09:22:04 2025 +0000 debug mixed_prec flatmm commit `0ba513b148` Merge: `90e910f3a` `c0cb4d036` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Wed Aug 6 16:49:47 2025 +0800 Merge pull request #2626 from ROCm/felix/flatmm_fix_splitk fix split k commit `6d3cbc7c0e` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Aug 6 08:33:33 2025 +0000 add moe_flatmm commit `c0cb4d036d` Author: coderfeli <coderfeli@163.com> Date: Wed Aug 6 02:45:31 2025 +0000 fix split k commit `90e910f3a7` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Aug 4 07:16:36 2025 +0000 fix flatmm with scaling when WarpTileM == 32 commit `aa5e008fa5` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Fri Aug 1 11:01:23 2025 +0000 optimize scaling epilogue commit `ac5908c0bb` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Fri Aug 1 07:28:38 2025 +0000 fix wrong config for fp8 scaling commit `3f43b841d4` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Jul 30 06:20:30 2025 +0000 prune debug message commit `2e5d4c74cd` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Jul 30 04:52:08 2025 +0000 fix compile error commit `c117a1986a` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Tue Jul 29 15:42:58 2025 +0000 Add persistent option on flatmm for tuning commit `a587701117` Author: AMD-dteng <dteng@amd.com> Date: Tue Jul 29 22:48:00 2025 +0800 update pipeline v1: add atomic IGLP schedule commit `f9e48148d2` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Thu Jul 24 09:09:27 2025 +0000 fix error log throwing commit `1b6d7cf407` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Mon Jul 28 08:24:51 2025 +0000 crz idea commit `5473f06461` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Sun Jul 27 11:57:38 2025 +0000 Add permuteN optimzization when NRepeat % 2 == 0 on flatmm commit `bfb9f4002f` Author: sjfeng <j514681085@icloud.com> Date: Sun Jul 27 17:24:08 2025 +0800 try to remove c_shuffle_lds commit `1264f4d2ab` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Fri Jul 25 07:41:48 2025 +0000 fix loop-dim mismatch and improve c_shuffle alu parallelism commit `1239d8a546` Merge: `406645448` `b908f5e80` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Thu Jul 24 08:46:51 2025 +0000 merge flatmm -scale commit `4066454483` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Thu Jul 24 16:19:58 2025 +0800 revert delete of inc file commit `68390988c9` Author: solin <bingzhou@amd.com> Date: Thu Jul 24 04:38:16 2025 +0000 reorg flatmm code commit `b908f5e803` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Jul 23 19:12:31 2025 +0000 fix flatmm syntax error on gfx950 commit `5a1183ebbd` Author: Feng Shijie <Shijie.Feng@amd.com> Date: Wed Jul 23 19:04:22 2025 +0000 support flatmm scaling commit `89fa639207` Author: valarLip <340077269@qq.com> Date: Wed Jul 23 08:44:12 2025 +0000 merge flatmm pipe v0 from dteng_flatmm_opt commit `3f7d848dd3` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Wed Jul 23 15:38:12 2025 +0800 build pass commit `6dacf833da` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Wed Jul 23 07:20:26 2025 +0000 fix bug commit `7e1bd4b839` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Wed Jul 23 15:01:53 2025 +0800 sync commit `46a538e39e` Author: valarLip <340077269@qq.com> Date: Tue Jul 22 08:09:35 2025 +0000 adaptive scheduler instead of Macro definition commit `9aa3396a79` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Thu Jul 17 08:40:35 2025 +0000 fix tail handler bug commit `fb76450e63` Author: lalala-sh <Jiaxing.Wen@amd.com> Date: Wed Jul 16 10:12:19 2025 +0000 merge from dteng_flatmm_opt --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: AMD-dteng <dteng@amd.com> Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: sjfeng <j514681085@icloud.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: Gino Lu <gino.lu@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com> * Fix crash on small M * Apply suggestion from @Copilot --------- Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: AMD-dteng <dteng@amd.com> Co-authored-by: solin <bingzhou@amd.com> Co-authored-by: sjfeng <j514681085@icloud.com> Co-authored-by: valarLip <340077269@qq.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> Co-authored-by: Feng Shijie <Shijie.Feng@amd.com> Co-authored-by: coderfeli <coderfeli@163.com> Co-authored-by: Gino Lu <gino.lu@amd.com> Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-10-31 11:29:05 +08:00
Anton Gorenko	1e77695fe8	[CK_TILE] Support WMMA (gfx12) in FMHA (#2528 ) * Pass hdim to tile_example_fmha_fwd in fp8 tests * Add WMMA support to fwd FMHA pipelines * Tune tile sizes a bit for less spilling fp16 256 is still quite slow * Fix Q grad tile distribution for warp size = 32 and hdim >= 256 With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile. * Use code based on BlockDropout in BlockDropoutBwd * Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal * Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9 because they both equal to 8 while on gfx12 they are 8 and 4; * Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1], it worked on gfx9 because 4 * kM == kN1 == 32; * Removing these hidden dependecies allows to support: * any number of warps (power-of-2), not only 4; * kN1 = 16, not only 32; * any number of splits; * Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now * Replace hard-coded kN1 in dispatch code with the requested tile size * Add gfx12-specific tile sizes for split KV * Pass GPU architecture to kernel generation scripts This is still a temporary solution. * Build and run FMHA CI tests for gfx12 * Fix issue after merging * Fix bwd tile sizes The current pipelines always read only one tile K and V tile, this requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and kK2 == kVHeaddim). * Use hardware f32->f8 on gfx12, remove v_perm __builtin_amdgcn_perm is not needed because __builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of 32-bit dword) is used to store results (two f8 values). * Update changelog * Add WMMA support to pagedkv * Fix scripts after rebasing * Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout Add comments with dropout implementation details Fix performance regression of fwd+dropout * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox; * "scalarize" seed and offset, they may come either from kernel args or from device memory (presumably loaded with vector loads). These changes help the compiler to procude more optimal code and reduce register spilling. Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding Use code based on BlockDropout in BlockDropoutBwd Refactor BlockDropout (fwd) Implement BlockDropout (fwd) for WMMA Originally BlockDropout only supported 32x32 tiles (IsWG32 = true), this version supports 16x16 tiles. If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly to BlockDropoutBwd. Implement BlockDropoutBwd for WMMA Remove MakeRandValLds* functions unused in BlockDropoutBwd Remove unused Run overload from BlockDropoutBwd * Fix regression with philox seed and offset when they exceed 32-bit int __builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset are 64-bit so they get truncated. * Fix names after cherry-picking * Fix selection of a fallback tile based on bm0 The assumption that the largest bm0 == 128 is not always true for current fp32 tiles. * Do not use filters related to qr_async_trload They disable tiles/pipelines which are valid for gfx12. * Use different dstr encoding when C is transposed * Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code Some WarpGemmDispatcher instantiations are defined only for specific archs and undefined on host. Calculations related to sched barriers are moved from Pipeline's public fields into pipeline's operator(). * Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used in codegen scripts are 32, 32, 32. * Generalize usages of WarpGemmDispatcher for MFMA and WMMA WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still used explicitly becaus of swizzle factor = 4. * Mark has_load_tr as maybe_unused There are no transpose loading for RDNA. * Remove CK_TILE_USE_MFMA/WMMA from fmha-related code * Detect BlockSize on host based on warp size of the current device If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with kBlockSize / 2 because on host get_warp_size() == 64 always. * Fix calculation of grid size for combine kernel with warp size = 32 * Add missing includes and header * Support multiple archs in one binary for fwd * Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill * Support multiple archs in one binary for bwd * trload kernels are compiled only for gfx950; * instances with padding are checked after instances without padding so they can be used as fallbacks (similarly to fwd); * Extract common code from register_traits * Revert "Fix regression with philox seed and offset when they exceed 32-bit int" To simplify merging , the proper fix is in develop already. * Support new numerical d paddings in trait ordering checks * Build fp32 tests only on gfx9 * Do not use hardcoded M0 = 64 for dot bwd kernel * Use textwrap.indent from standard library * Make fp8 pipelines on gfx12 consistent with gfx9 * Update tests for current pipelines * Make ninja check more responsive in CI ninja buffers output so this job looks hanging. * Support fp8fp32 by limiting O vector size The fp32 output type requires storing 8 * sizeof(float) = 32 bytes, which is not implemented (here 8 is the number of C values per lane for v_wmma_f32_16x16x16...). * Remove unused cmake options * Unify including amd_buffer_addressing.hpp/_builtins.hpp * Temporarily use amd_buffer_addressing.hpp on >=gfx10 amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores which is not compatible with >=gfx10: * 1 scalar for exec masks instead of 2, * gfx12 uses different instruction names etc. * Update asm in bf16 conversions to work with warp 32 * Do not generate splitkv/appendkv with vlayout=col for consistency with fwd * Add arch tags to kernels/host funcs, compile for each arch separately * Add kM0 to fmha_bwd_dot_do_o kernel name to match filename * Add workaround for miscompilation of bwd with padded hdim SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent branches used to store padded tensors (when some lanes are inactive due to padding). Inline asm with dummy dependencies on VGPRs of the tensors prevents the compiler doing this. * Fix add_gtest_executable for absolute paths Some tests (like gemm_tile_engine) pass absolute paths to source files. In CI the branch name is a part of the root dir, and if the branch name contains "wmma", "xdl" etc., files can be incorrectly excluded. * Run only hdim 128 smoke tests for fp8fp32 There are no instances for hdim 64 and 256. * Format py with ruff to simplify merging develop * Fix incorrect var name * Codegen for gfx9,gfx950 when --targets is not specified Aiter and Pytorch require changes for passing their targets to the codegen scripts. With this temporary solution the files are generated but not all of them have to be really built (depending on the used --offload-arch=). * Combine arch-related values into ArchTrait This more centralized approach removes duplication of various formatting templates. * Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large" Some code is extracted into a function.	2025-10-29 13:31:08 -07:00
Ville Pietilä	121bf0e1f3	[CK_Tile] Merge multiple convolution groups into a single GEMM batch (#2986 ) * Fix compilation of the grouped conv examples. * Fix grouped conv bwd weight example output in CK Tile. * Add number of groups to merge to ck tile grouped gemm example. * Initial set of tests for TransformConvBwdWeightToGemm. * Added unit tests for TransformConvBwdWeightToGemm conv groups are merged. * WIP: Tensor transformations. * Add unit tests for coordinate transforms. * Fully working conv group merging for TransformConvBwdWeightToGemm. * WIP: Merged conv groups offset calculation. * Adde unit tests for tensor view. * WIP: Merged conv groups epilogue. * Enable running multiple conv groups per batch. * Add tests for tile_distribution_encoding. * Change example to match optimally depthwise convolution with merged groups. * Add more tests for tensor view. * Integration test for reading diagonal blocks from grouped distributed tensor. * Improved integration test. * Improve test for accessing diagonal blocks. * Added integration test for cshuffle epilogue LDS tile distribution. * Add more logging. * Increase the max number of reported errors. * WIP: merged conv groups GEMM epilogue changes. * LDS to global memory copy. * Fix tile window size for c block. * Integration test for CShuffle epilogue. * Improved CShuffle test. * WIP: Separate epilogue for merged conv groups. * Tile example parameters changes to match depthwise conv. * Offset fixes. * Epilogue fixes. * Working baseline for depthwise covolution with merged conv groups. * Fix build. * Initial unit tests for tensor descriptor. * Add one more unit test for tensor view. * WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding. * Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding. * Add more comments, disable debug code. * Remove debug and other dead code. * Code clean-up for bwd tensor transformations. * Enable running multiple GEMM batches of merged conv groups. * Add compile check for assumed row-mjor layout. * Fix strides in 1D conv to gemm transformation. * WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases. * Fix case k > 1 and c=1. * Remove debug code. * Make MPerGroup and NPerGroup template parameters. * Add additional check for non-supported c > 1 case. * WIP: Put back the generic tensor descriptors for convolutions. * Fix tensor descriptors. * Remove the obsolete template parameters. * Add more instances. * Fix bugs in merged conv groups tensor descriptors. * Fix tensor descriptors for merged conv groups when K > 1. * Remove debug output. * Remove dead code. * Fix merge conflicts. * Code clean-up. * Remove unused code. * Run clang-formatting. * Remove debug prints and obsolete tests. * Check that number of convolution groups is multiple of merged groups. * Fix build after removing obsolete functionality. * Remove obsolete enumeration. * Fix new unit projects. * Remove unnecessary includes. * Fix passing the number of merged groups. * Remove unrelated tests. * Fix IsSupportedArgument for bwd weight conv kernel. * Fix clang formatting. * Fix the bwd weight conv to gemm mapping for num merged groups > 1. * GEMM config for conv group merging. * Fix clang-formatting. * Remove obsolete comment. * Fix typos in comment strings. * Increase the max number of reported errors when testing against reference implementation. * Rename gemm_config to conv_config. * Rename GemmConfig to ConvConfig and move NumGroupsToMerge into ConvConfig. * Change num_groups_to_merge to a boolean flag in the ck tile grouped conv example. * Run clang-format. * Add number of merged groups into kernel name string. * Remove group merging flag from CK Tile grouped conv example.	2025-10-29 16:49:28 +02:00
Yashvardhan Agarwal	3052d7c9e6	[CK_TILE] Add indexing to pooling operator (Lwpck 3892) (#3013 ) * Add indexing support to pooling operator - Add IndexDataType template parameter to pooling problem and kernel definitions - Enable pooling kernel to output indices of selected elements during max/absmax pooling - Add overloaded operators for Max and AbsMax that track when values change using bool changed parameter - Support optional index buffer allocation and management in device memory - Modify BlockReduce2d classes to handle index tensors alongside value tensors - Add separate shared memory allocation for index data in cross-warp reductions - Create validate_pool_indices function to verify index correctness - Modify pool3d.cpp example to demonstrate index output functionality - Add tests for index output * fixes * Refactor BlockReduce2D functions to get rid auxiliary private types. * comment resolutions and some changes to block_reduce2d - index reference implementation improved - reduce_operator.hpp cleanedup - updated the block_reduce2d.hpp to have index calculation for BlockReduce2dLinearCrossWarpSync as well * conditionally used variable declaration improvement - the conditionally used vairbales are used only when indexing is enabled. To inform the compiler that they may be unused and declare them with least size possible. This may allow it to be optimized compared to the previous declarations * comment resolutions * lexical ordering of the indicies - introduced accumulate methods that handle the intermediate steps if needed to order the indexes * add reduce_operator_accumulate.hpp to core.hpp --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2025-10-29 09:58:04 +02:00
Khushbu Agarwal	b11f53a484	Fix quant scale matrix layout for block scale gemm (#3079 ) * Adding support for TiledPermuteN * Adding test * moving shuffle functions to common place * resolving commit hook * fix formatting	2025-10-27 13:56:07 -07:00
Johannes Graner	5c1974065e	[CK_TILE] Add conv fwd + bias + clamp example (#3012 ) * Implement argument passing to element-wise functions for fwd convolution * Add files for fwd + bias + clamp example * Implement Bias * Implement Clamp * Elementwise function composition * Composition unit test * Implement fwd + bias + clamp example * Simplify argument passing and composition * elfunc -> bias_and_clamp * Rename function to specify example * Move element-wise function instantiation to kernel * Make bias a runtime tensor * No ugly namespace aliasing * Initialize element-wise function on host * Remove function initialization helper, simplify Compose initialization * Remove unintended LSP compatibility patch * Clean up includes and unused code * Switch names in cshuffle epilogue * Move CDElementwise to conv traits * Re-add required include * Initialize bias in same way as other tensors * Better type specification for ds pointer * Disable 1D convolution * Add warning for non-group-constant bias	2025-10-27 18:43:09 +01:00
Khushbu Agarwal	0584399571	[CK_TILE] Adding support for TiledPermuteN on preshuffle Block Scale Gemm (#3019 ) * Adding support for TiledPermuteN * Adding test * resolving remod.py --------- Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu>	2025-10-24 11:06:51 -07:00
Max Podkorytov	fdcc1f75c3	limit the rotating count to prevent oom (#3087 )	2025-10-24 08:55:34 -07:00
lalala-sh	211d64e18a	[CK_TILE] Update flatmm related kernels (#3022 ) --------- Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: felix <felix.li@amd.com>	2025-10-22 22:36:11 +08:00
Gino Lu	fb1d090f3c	[CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044 ) * Patch for pk_fp4_raw_t buffer load and ref check	2025-10-20 14:47:04 +08:00
AviralGoelAMD	b03764ca5a	docs: add inline comments about flush_cache and rotating buffer	2025-10-17 12:56:47 -04:00
Johannes Graner	d40b50b9d5	Update pre-commit to fixed versions, run remod for ck_tile (#2895 ) * Fix ruff linter errors * Fix remod dos2unix command * Clang format * Ignore utility in remod * Run remod * Specify clang-format version in pre-commit * Specify ruff version * Include PoolKernelArgs in reference_pool * Add calculate_total_elements to reference batched contraction * Fix calculate_total_elements declaration * Refactor remod pre-commit hook * Fix Aquant tests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-10-16 15:29:17 -07:00
ClementLinCF	e1b0bdfbfa	[CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540 ) * [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm * Update rmsnorm host reference * Update tree reduction of rmsnorm for reference host * Fix cross warp for m > 1 cases * Add RMSNorm model selectable option for host reference * Fix save_unquant cases * Update reference rmsnorm forward function to use enum for model sensitivity * Update reference rmsnorm calculation for model sensitivity * Fix m warp for layernorm * Adjust parameter of reference for twoPass * Fix clang format * Run clang-format-overwrite.sh to fix formating issue * fix clang format --------- Co-authored-by: MHYang <mengyang@amd.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>	2025-10-13 11:52:37 -07:00
msaffari-amd	e9f0cc83a8	[CK Tile] contraction multi d - kernel & example (#2901 ) * Initial commit. create batched_contraction_kernel file * initial problem definition * implement initial example to launch kernel * add universal gemm to contraction. initial phase * complete implementation for special case all Dims are 1 and no Ds * clean code * initial changes to support multi dimensional G * more progress in implementing multiple G * tmp commit * manage dynamic NumDimG in kernel * improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit * implement the example for general Multi dimension G M N K and test different reference calculation algorithms * 2 functions for reference using multi dimensional and flat indexing * clean the code for muti dimentional G, M, N, K contraction and add some logs * Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E * some cleaning on kernel * clean the code for calculating the offsets from flatten batch number * Start adding MultiD support to kernel and example * more changes to manage multi D in kernel and example * manage passing multi d to kernel and testing. * complete multi D support in kernel. modify example code to support it * Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning * Minor fix * Generalize example code for variable NumD tensors and apply cleanup based on review feedback * Refactored code and addressed review feedback * refactoring, cleaning, add documents, in kernel side and example codes * Optimize batch offset calculation in kernel * Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-10-13 12:30:28 +02:00
Yashvardhan Agarwal	7b6451b68e	[CK_TILE] Pooling FWD (Lwpck 3683) (#2956 ) * Pooling 2D/3D with refernce * Tests & cleanup - added test for ppoling - cleanup - removed 2d example * Comment resolution - README added - example target name rectified - appropriate arg description and comments added * clang-format * appropriate blocksize calc * modifications for future indexing addition - instead of transforming views we now transform the descriptors, so that the same descriptor can be re-used for index tensor in the future * some basic fixes * comment resolutions * comment resolutions --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-10-09 16:13:26 +02:00
Anton Gorenko	1edd250115	[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836 ) * Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout Add comments with dropout implementation details Fix performance regression of fwd+dropout * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox; * "scalarize" seed and offset, they may come either from kernel args or from device memory (presumably loaded with vector loads). These changes help the compiler to procude more optimal code and reduce register spilling. Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding Use code based on BlockDropout in BlockDropoutBwd Refactor BlockDropout (fwd) Implement BlockDropout (fwd) for WMMA Originally BlockDropout only supported 32x32 tiles (IsWG32 = true), this version supports 16x16 tiles. If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly to BlockDropoutBwd. Implement BlockDropoutBwd for WMMA Remove MakeRandValLds* functions unused in BlockDropoutBwd Remove unused Run overload from BlockDropoutBwd * Fix regression with philox seed and offset when they exceed 32-bit int __builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset are 64-bit so they get truncated. * Add F32 MFMA warp gemms * Support f32 in fwd FMHA * Implement transpose_vectors for 4-byte types (float) * Fix unexpected implicit f32->uint32 cast in buffer_store<4> __builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly casted to uint). mbuf_t types in other buffer_store<> are changed for consistency. * Support F32 in bwd FMHA hdim = 256 is disabled for now because it uses too much memory on gfx90a * Support Headdim = 48 (divisible by 16) in fwd * Add fp32-specific receipts (800 and 801) * Tune fwd tiles * Tune bwd tiles * Use small tiles only for small seqlen_q * Fix after rebasing * Fix selection of a fallback tile based on bm0 The assumption that the largest bm0 == 128 is not always true for current fp32 tiles. * Remove constraints and adjust filtering for fp32 Custom constraints are no longer needed because now the smallest tile is selected automtically based on seqlen_q. Filters related to qr_async_trload disabled valid fp32 tiles. * Add fp32 tests * Make splitkv and appendkv compile for fp32 only There are no instances yet, but API still must compile when only fp32 is requested. * Remove unimportant f32 instances * Add test_ck_tile_fmha__fp32 to REGRESSION_TESTS Replace magic numbers with a constant, improve comments for dropout * Update changelog * Fix condition that dq_acc must be set to zero when mask is used The change was introduced in #2799 * Replace warp_uniform with recently added amd_wave_read_first_lane * Add hdim = 96 and 192 to fwd	2025-09-27 18:03:48 +05:00
Sami Remes	4363a82bd6	[CK_TILE] Tensor-wise scaled quant gemm kernel (#2846 ) * rename gemm_group_quant to gemm_quant * Add TensorWise quant mode * Cshuffle epilogue tests with tensor scaling * Add tensor quant to example * Don't use readfirstlane for reading scales - doesn't work for some reason * Add to changelog * revert include - from a merge problem? * revert common.hpp include * revert host.hpp include * remove unused utility function * rename quant pipeline problem * refactor quant tests * remove aquant utils * use TEST_F * fix all tests by changing gemm config * Use typed tests * fix copyright	2025-09-19 16:52:35 -07:00
Mateusz Ozga	30ab1d6a71	[CK_TILE] Multiple-ABD GEMM example (#2788 ) * Multi ABD - initial commit * Clang-foramt fix * block gemm, unify the name of CDataType * Apply chnages to mem-pipeline * Rollback prefix for DType and Layout * Gemm Kernel Basic, rename * WMMA config * Grouped GEMM * Clang-format * Dropout, name * Review v2 * Move element_wise fn to unnary, remov old ones fn * clang-format * Fix issue review * WP operator adjust to universal gemm * v2 prepare * Remove unused comment * Remove vectorsize * Rollback * Adjust pipeline for abd * Shuffle argument * CI-fail fix quant * Fix ag_br pipeline * Failing tests * Typo * Single argument support	2025-09-19 01:14:11 +02:00

1 2 3

126 Commits