Commit Graph

1065 Commits

Author SHA1 Message Date
Qianfeng Zhang
d7930cd541 Update IsFulleTileInsideMask() for the case where kUseLocal is true 2025-06-06 15:55:26 +00:00
Qianfeng Zhang
9e6a24010a Move all test and bench scripts to folder scripts 2025-06-06 08:22:38 +00:00
Qianfeng Zhang
2bb59dff0a Add two scripts 2025-06-06 08:15:25 +00:00
Andriy Roshchenko
00247e3c29 Optimized GEMMs for MX FP4/8 (#2294)
Adds V3 GEMM pipeline for MX FP4 and MX FP8 
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275)
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-06-05 13:54:15 -06:00
Illia Silin
233e274077 Revert "[CK_TILE] Tile loop persistent gemm kernel (#2191)" (#2293)
This reverts commit ffb52783d0.
2025-06-05 09:24:00 -07:00
Sami Remes
7ea1508b59 [CK_TILE] Move GEMM pipeline tail handling logic to pipelines (#2222)
* Add TailHandler for V3, V4 and Mem pipelines

* Adapt examples and tests to use TailHandler

* move tail-handling logic to pipeline in persistent grouped gemm

* Fix Mem pipeline dispatching, add CompV4 dispatching

* Use a macro for handling the many tails of Mem pipeline

* Fix formatting again

* Use const-ref RunFunction, remove unnecessary try_run
2025-06-04 11:50:21 +03:00
Sami Remes
ffb52783d0 [CK_TILE] Tile loop persistent gemm kernel (#2191)
* Implement tile loop persistent gemm kernel

* Enable timing

* Add tests for persistent gemm

* Fix formatting

* Fix gemm_basic

* Rename True/False to Persistent/NonPersistent

* Use only one set of layouts for persistent tests

* Fix gemm example persistent template parameter

* Fix formatting
2025-06-04 11:46:28 +03:00
Anton Gorenko
52b4860a30 WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230)
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Updates to support mixed precision

* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip

* Added support for F8xF16xF16 to gemm_wmma_universal

* Added support for F16xF8xF16 to gemm_wmma_universal

* Added support for BF16xI4xBF16 to gemm_wmma_universal

* Added support for F16xI4xF16 to gemm_wmma_universal

* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType

* Added missing test class for FP16_KM_NK

* Pre-commit hooks fixes

* Added padding instances for f16xf16xf16

* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Amending changes for adding support for padding instances for f16xf16xf16

* Fixes for padding instances for f16xf16xf16

* Added padding instances for bf16xbf16, f8xf8

* Added packed instances for bf16xi4xbf16

* Added padding instances for f8xf16xf16

* Added padding instances for f16xf8xf16, f16xi4xf16

* Fixed typos for bf16xbf16xbf16 padding instances

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding is not supported when BDataType is pk_i4_t. Added a fix for the correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build errors

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler doesn't allocate registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug where fp16_i4 validation passes only with PermuteB

The permutation required by the conversion from pk_i4_t to half_t does not
depend on PermuteB; the two can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Fix int8 DTYPES check for gemm_bilinear

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
2025-06-04 12:22:33 +06:00
Khushbu Agarwal
59a85cb4bc [CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes (#2262)
* debugging issue

* debugging issue

* debugging

* debugging

* reverting debugging code

* clang formatted

* updating default_config.json

* fix ci failure

* clang formatted
2025-06-03 20:16:10 -07:00
Aviral Goel
11f6c14e03 Add 0 as an acceptable argument for strides in CK GEMM example (Issue 2037) (#2268)
* add 0 as valid default argument for strides

* add 0 as valid default argument for strides

# Conflicts:
#	example/01_gemm/common.hpp
2025-06-03 07:26:58 -07:00
Illia Silin
b76fdbe47f Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274)
* upgrade to rocm6.4.1 and use gfx1x-generic targets

* add rocm version parsing

* fix the gfx10-3-generic syntax in cmake
2025-06-03 07:17:35 -07:00
Khushbu Agarwal
2e38eb4f1c Rotating buffer PR CI fix (#2257)
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"

This reverts commit bbdaf79a52.

* fix regression
2025-06-02 10:25:01 -07:00
Qianfeng Zhang
9582ae2dff Move the division by max_seqlen to the end of the Gemm1 loop in the reference hstu-attention code 2025-05-31 06:22:54 +00:00
Qianfeng Zhang
bec35abafe Tune the settings for hdim-256 2025-05-30 08:51:49 +00:00
Qianfeng Zhang
832747c58d Add example parameter alpha to ease the testing 2025-05-30 08:51:24 +00:00
slippedJim
57f497452a remove restriction of group mode hd192 no lse (#2252)
Co-authored-by: Jim <jimguo12@amd.com>
2025-05-30 10:14:21 +08:00
Po Yen Chen
28cd0dffc9 [CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251)
* Add constraint on traits/tile/pipeline

* Use kM0=128 if max_seqlen_q == 8192

* Re-format codegen script

* Remove redundant attr name postfix

* Fix import error: default field in dataclass

* Use kK0=64 & kK1=64 to hide latency

* Use CU utilization to decide tile size
2025-05-29 18:36:33 +09:00
Qianfeng Zhang
781cba355a Convert P to fp16/bf16 before doing second gemm in reference hstu implementation 2025-05-29 01:07:04 +00:00
Qianfeng Zhang
36a0f2020c Non-critical updates in example and block_masking code 2025-05-29 01:06:32 +00:00
Qianfeng Zhang
68a5ab8ff8 Add init_qkv and dump_output example parameters for easier debugging 2025-05-29 01:05:49 +00:00
Illia Silin
bbdaf79a52 Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)
This reverts commit 99857e10e6.
2025-05-28 09:46:52 -06:00
Khushbu Agarwal
99857e10e6 [CK_tile] Add rotating buffer feature for universal gemm (#2200)
* Add rotating buffer feature for universal gemm

* adding changes in tile_engine

* Updated code to merge kernel_launch

* removing comments

* Enable rotating buffer changes to flatmm

* Created diff launch_kernel function for rotating buffer

* Simplified calculation using macros

* merge code with new changes in tile_engine

* clang formatted

* Redefine macros
2025-05-27 23:00:58 -07:00
Aviral Goel
c52649ad57 Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234)
* added catch statements to examples

* clang format
2025-05-27 22:32:42 -07:00
Qianfeng Zhang
10c35125d2 Add example parameter max_seqlen and max_target 2025-05-27 14:43:43 +00:00
Qianfeng Zhang
c9e19351c7 Update the method for calculating max_seqlen in the example 2025-05-27 10:38:52 +00:00
Qianfeng Zhang
dc0977faad Use NRepetitions2DEpilogue for outputting o_acc tile 2025-05-26 14:43:30 +00:00
Zzz9990
ece38b9d7a [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit aa9847e42d.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
Qianfeng Zhang
81f7b139e0 Use LDS to indirectly load the Q-tile, enabling dwordx4 loading and avoiding wasted cache lines 2025-05-22 09:49:11 +00:00
SamiAario-AMD
380bca2b85 Fix 11_add_rmsnorm2d_rdquant (#2207) 2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
Qianfeng Zhang
a1346aaf3e Update the reference hstu to not do fp32 to fp16/bf16 conversion before P@V gemm 2025-05-20 07:53:25 +00:00
Qianfeng Zhang
0a8ea6bd02 Adjust the threshold values for fp16/bf16 in the example 2025-05-20 07:52:37 +00:00
Qianfeng Zhang
29cf1610f1 Enable RTN fp32 to bf16 conversion by adding compiler option in CMakeLists.txt 2025-05-20 07:52:01 +00:00
Aviral Goel
c4929225f6 remove debug statements from CMakeLists (#2204) 2025-05-19 17:31:04 -07:00
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00
Qianfeng Zhang
fac03abbda Change the do-while main loop to while-do and remove the early-exit check 2025-05-19 15:39:31 +00:00
Qianfeng Zhang
14ab6f154d Adjust the code before the main loop 2025-05-19 13:59:57 +00:00
Qianfeng Zhang
f411d676f2 Move k_tile loading and v_tile loading earlier in the loop 2025-05-19 13:59:24 +00:00
Qianfeng Zhang
902b1c645c Move k_tile loading in the loop earlier 2025-05-19 13:58:43 +00:00
Qianfeng Zhang
f582c21418 Replace s_acc and pcomp tile array by single tile object for simplification 2025-05-19 13:57:11 +00:00
Qianfeng Zhang
4e65469fe8 Add __builtin_amdgcn_sched_barrier(0) to instruct the compiler to isolate code sections better 2025-05-18 16:20:47 +00:00
Qianfeng Zhang
e4e70f8b0a Set the block_per_cu to 3 for hdim-128 2025-05-18 15:58:57 +00:00
Qianfeng Zhang
ff3415d97d Prefetch b_warp_tensor for next nIter and move b_warp_windows construction into n-iteration in block_gemm_areg_bsmem_creg for gemm-1 2025-05-18 15:26:31 +00:00
Qianfeng Zhang
694295a9d3 Move b_warp_windows construction into k-iteration in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:43 +00:00
Qianfeng Zhang
afd7793e92 Prefetch K for next iteration from LDS in block_gemm_areg_bsmem_creg for gemm-0 2025-05-18 15:25:05 +00:00
Qianfeng Zhang
7c0ac51b4b Hack block_gemm_areg_bsmem_creg_v2 for gemm_1 2025-05-18 15:24:17 +00:00