Commit Graph

456 Commits

Author SHA1 Message Date
Bartłomiej Kocot
050cad09b5 Grouped Convolution Backward Weight Explicit GEMM (#2282)
* Grouped conv bwd weight explicit gemm

* 3d

* cmake fixes

* fix test

* fix
2025-06-06 10:30:08 +02:00
Andriy Roshchenko
00247e3c29 Optimized GEMMs for MX FP4/8 (#2294)
Adds V3 GEMM pipeline for MX FP4 and MX FP8 
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275)
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler




Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-06-05 13:54:15 -06:00
Anton Gorenko
52b4860a30 WMMA GEMM universal pipeline v1, mixed precision and paddings, examples (#2230)
* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Updates to support mixed precision

* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip

* Added support for F8xF16xF16 to gemm_wmma_universal

* Added support for F16xF8xF16 to gemm_wmma_universal

* Added support for BF16xI4xBF16 to gemm_wmma_universal

* Added support for F16xI4xF16 to gemm_wmma_universal

* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType

* Added missing test class for FP16_KM_NK

* Pre-commit hooks fixes

* Added padding instances for f16xf16xf16

* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Ammending changes for adding support for padding instances for f16xf16xf16

* Fixes for padding instances for f16xf16xf16

* Added padding instances for bf16xbf16, f8xf8

* Added packed instances for bf16xi4xbf16

* Added padding instances for f8xf16xf16

* Added padding instances for f16xf8xf16, f16xi4xf16

* Fixed typos for bf16xbf16xbf16 padding instances

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Fix int8 DTYPES check for gemm_bilinear

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com>
2025-06-04 12:22:33 +06:00
Bartłomiej Kocot
6e5acee0f9 Add Clamp/Relu bf16/fp16 cast fixes (#2279)
* Add Clamp/Relu bf16/fp16 fixes

* fix
2025-06-03 18:31:46 +02:00
Illia Silin
b76fdbe47f Upgrade to ROCm6.4.1 and use generic targets for gfx1x. (#2274)
* upgrade to rocm6.4.1 and use gfx1x-generic targets

* add rocm version parsing

* fix the gfx10-3-generic syntax in cmake
2025-06-03 07:17:35 -07:00
Bartłomiej Kocot
e7906dd644 Change relu to clamp for grouped conv fwd instances (#2249) 2025-05-29 00:51:25 +02:00
Adam Dickin
6df1c56ad6 Changes to allow MIOpen to build CK as part of its build. (#2247)
* tweaks to the miopen specific build.  add way to skip clang-tidy checks and a way to skip some custom build targets MIOpen also has.

* move the tidy if statment

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-05-28 13:51:15 -07:00
BrianHarrisonAMD
e91be7d96a Add option to disable offload compress for CK builds (#2250)
* Add option to disable offload compress for CK builds

* Remove gemm exe offload compress flag conditional
2025-05-28 13:47:56 -07:00
Bartłomiej Kocot
b1ed92b131 Revert "Remove not needed bwd wei merged groups instances (#2218)" (#2235)
This reverts commit 4583aeffad.
2025-05-26 23:26:04 +02:00
Bartłomiej Kocot
4583aeffad Remove not needed bwd wei merged groups instances (#2218)
* Grouped conv bwd wei add two stage instances for larger filter and Merge Groups

* Fix

* fix

* Revert "Restore oddc instances (#2201)"

This reverts commit 6342f6b5e8.

* fix

---------

Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-05-26 22:46:18 +02:00
Illia Silin
bc2551ac3b disable building device_mha_operations by default (#2225) 2025-05-22 14:03:04 -07:00
Adam Dickin
417a6b65b6 Add MIOPEN_REQ_LIBS_ONLY option for cmake to build only the libs MIOpen requires (#2224)
* cut out anything we dont need for MIOpen to test

* refactor exclusion code to be more streamlined.
2025-05-22 11:14:33 -07:00
Bartłomiej Kocot
ebc5a6ef87 Grouped conv bwd wei add for larger filter and Merge Groupes optimization (#2197)
* Grouped conv bwd wei add two stage instances for larger filter and Merge Groups

* Fix

* fix

* Restore removed instances

---------

Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-05-21 22:47:34 +02:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
Andriy Roshchenko
57e0f5df29 MX GEMM - Expand MX MFMA Testing to BF8, FP6, and BF6 Data Types (#2199)
* Unify test interface for different layouts.

* WIP: Introducing FP4/FP6/FP8 abstractions

* WIP: Introducing packed storage abstraction

* WIP: Introducing packed storage abstraction

* WIP: Improved support for FP6 data type

* Refactor packed storage for f6_t

* WIP: FP6 MFMA test

* Test if we correctly represent all FP6/FP4 numbers

* Additional output for failed FP4 test.

* More failing conversion tests

* Even more failing conversion tests

* Working FP6 MFMA tests

* Expand MX MFMA testing to BF8/6

* Update and verify MX MFMA test for packed types

* Fix fp4 and fp6 conversions on host

* Working MX MFMA tests for FP8/6/4

* Cleanup

* Add missing type

* Cleanup

* Final cleanup

* Restrict FP6/4 values output to CK_LOGGING=1

* Use CHAR_BIT instead of number 8

* Fix typo

* Remove FP6 and FP4 from the list of native types

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
2025-05-19 16:52:51 -05:00
Bartłomiej Kocot
6342f6b5e8 Restore oddc instances (#2201) 2025-05-16 18:42:02 -07:00
Illia Silin
40668c9a99 Build and store CK library deb package for all targets daily. (#2196)
* generate and store library package for all targets

* use ninja to build packages for all targets

* make sure to use ftime-trace when using ninja

* make sure build trace only runs on gfx9

* archive lib package and stash only library package
2025-05-16 07:40:53 -07:00
Bartłomiej Kocot
7c0e29cc0f Extend 64x64 with 4 waves instances for grouped conv bwd wei (#2187)
* Extend 64x64 with 4 waves instnaces for grouped conv bwd wei

* Fix

* fix

* fix
2025-05-15 16:21:34 +02:00
Thomas Ning
b49f7de81f Improve the general performance of the Preshuffled GEMM V3 & delete the unnecessary instances (#2166)
* make the work compiled

* Solved the example code, but still have the profiler error

* Finished the feature

* Clang format and update the CHANGELOG

* solve the preshuffle v1 & v2 problem

* Comment Addressed

* Comment Addressed
2025-05-12 09:52:58 -07:00
Bartłomiej Kocot
6fddb5708c Add grouped conv fwd bias relu instances (#2179)
* Add grouped conv fwd bias relu instances

* fixes

* fix
2025-05-09 22:52:34 +02:00
Andriy Roshchenko
cb27e7c77f Ensure MX GEMM Instances can be Cross-Compiled for Multiple Architectures (#2171)
* Re-enable MX GEMM instances

* Fix compilation error when building MX GEMM for multiple architectures
2025-05-08 13:26:03 -06:00
Bartłomiej Kocot
397b9080a2 Move 16x16 grouped conv fwd instances from comp header (#2165)
* Move 16x16 grouped conv fwd instances from comp header

* Improvements
2025-05-07 08:04:31 -07:00
Rostyslav Geyyer
8a0d659f92 Add FP4 MX MFMA tests (#2151)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add ops

* Add tests

* Add missing utils

* Update reference mx gemm

* Add f4x2 init mode

* Update host tensor utils

* Update chunk size for f4x2

* Add non scaled ops

* Add a type utility

* Update non scaled reference kernel

* Add non scaled tests

* Debug mfma arguments

* Add more debug info

* Update chunk size

* Update data layout

* Add more debugging

* Fix B stride

* Fix reference gemm

* Fix build

* One more reference fix

* Add more debug info

* Disable some tests

* Enable tests

* Add fp4 dimensions

* Update reference kernels

* Temp edits

* Remove leftovers

* Fix conflicts

* Clean up

* More clean up

* Revert "More clean up"

This reverts commit d8d35a0846.

* Add layouts to tests

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-06 09:24:00 -05:00
Andriy Roshchenko
79beaacdd1 Restrict MX GEMM instantiation to GFX950 arch (#2157) 2025-05-05 08:18:22 -07:00
Andriy Roshchenko
79b0bfeb41 MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152)
* Add gemm_mx_fp8_bf8 example with row-major B

* Add more overloads of MX MFMA instructions

* Add MK_KN (RRR) tests

* Add KM_NK (CCR) tests

* Add more problem sizes to Large tests

* Add test_gemm_mx to the list of regression tests
2025-05-01 11:55:48 -06:00
Bartłomiej Kocot
23de234dbe Add grouped conv fwd 16x16 mfma instruction instances (#2140)
* Add grouped conv fwd 16x16 mfma instruction instances

* fix

* remove oddc

* fix

* fix
2025-04-30 09:49:37 +02:00
Bartłomiej Kocot
4094ad158a Integrate universal gemm with conv bwd data and add SplitK (#1315)
* Integrate universal gemm with conv bwd data

* Fix multi d kernel

* Add splitK support

* instances refactor

* instances refactor

* refactor

* fixeS

* fixes

* 16x16 instnaces

* Fixes

* Fix

* Fix

* Fix

* Fix

* Fix

* Fixes

* fix

* fix
2025-04-28 23:54:49 +02:00
Anton Gorenko
edd92fc546 DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096)
* Prepare files for DeviceGemm_Wmma_CShuffleV3

* Implement main part of CShuffleV3 with block pipeline v3 for WMMA

* Remove unused functions and template params for A/B descriptors

* Support both gfx11 and gfx12

* Enable SplitK for gfx12 and disable for gfx11

* Added RowColRow layout for DeviceGemmV2 fp16

* Added more instances for Row, Col, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout

* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout

* Fix formatting

* Add documentation

Based on e5ad48a784

* Enable gemm_universal profiling for gfx11/12

* Add WMMA intrinsics for F8/BF8

* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances

* Add BF16 instances and tests

* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
2025-04-28 10:14:21 +05:00
lalala-sh
39ba03f25d Moe gemm activation (#2026)
* fix useless code and remove usless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* impl int64 but result not correct

* int64 index ok now

* input output all ok

* fix uint32

* revert v1 test

* use uint32

* mork to support 13w tokens

* moe sorting fix moebuf

* fix merge

* update moe api fix aiter build

* fix buid

* fuse silu

* silu ok

* acale ok

* add silu

* change code

* gemm2 ok

* gufusion compatible ok, fix warnings

* gu fusion for m32 m64 ok

* support bf16 cshuffle

* i4 gemm2 ok

* i4 gemm2 ok and i4 gemm1 build

* 16x16 run ok

* change flops; change cshuffle dtype

* fuse gelu silu act in moe gemm1

* fp8 with act ready

* int4 act ready

* remove useless changes

* remove useless code change

* fix clang format

* add the arch limit of int4 moe gemm

* fuse moe activation

* fix fp8 16x16

* fix no quant case

* fix bugs

* fix fp8 gufusion bug

* remove useless comments

* refine activation code & complete moe example

* fix int8 bugs

* merge tkw1

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: root <root@hjbog-srdc-51.amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-04-23 10:35:34 +08:00
lalala-sh
bcf5bb41be enable do top k weights in moe stage1 gemm (#2094)
* add switch for mul topk weights

* fix bf16/f16 bugs

* complete
2025-04-18 10:45:49 +08:00
Andriy Roshchenko
213b203a3c MX GEMM - Parameterized Test Template (#2088)
* Tests for MX FP8 GEMM

* Improve documentation
2025-04-16 19:56:00 -06:00
aledudek
7c32652e03 Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16 (#2069)
* Part1

* Add grouped conv fwd 3d GKCYX instances for f32, f16, bf16

* Add missing coma

* Add missing cpp instance files

* Fix 3d layout

* Add missing closing bracket

* Add missing comp x2 and part2 instances

* Fix typo in instance name

* fix

* Fix

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-04-16 11:00:55 +02:00
Illia Silin
3e6d21adeb enable gfx115x support (#2065) 2025-04-09 10:06:42 -07:00
Khushbu Agarwal
263ff689e0 New instances for gemm_multiply_multiply_weightpreshuffle operator (#2061)
* Add new instances for weight_preshuffle for f8->bf16

* Add new instances for weight_preshuffle for f8->f16

* clang formatted

---------

Co-authored-by: Khushbu Agarwal <khuagar@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-04-08 15:14:53 -07:00
Bartłomiej Kocot
2ccf914888 Add support for GKCYX grouped conv weight (#2023)
* Grouped conv bwd weight GKCYX support

* fix and changelog

* fix

* fix

* fixes

* comments

* fix
2025-04-02 23:59:49 +02:00
Bartłomiej Kocot
8c0ab61ece Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
6355ee7ca5 Improve compilation time for grouped conv fwd (#2039)
* Improve compilation time for grouped conv fwd

* Fix
2025-04-01 07:11:42 -07:00
Muhammed Emin Ozturk
dd4c12b155 f8/bf16 GEMM Stream-K (#1879) 2025-03-31 20:30:17 -06:00
jefyang1
16b15e336a Fix gemm universal and grouped_conv_fwd test failures on gfx950 (#2031) 2025-03-31 09:20:52 -07:00
Bartłomiej Kocot
54c81a1fcf Add support for GKCYX grouped conv fwd (#2015)
* Add support for GKCYX grouped conv fwd

* fixes

* fix

* changelog

* Fixes
2025-03-26 21:13:38 +01:00
Bartłomiej Kocot
5b0873c31a Fix split N for large images in groupd conv fwd (#2004)
* Fix split N for large images in groupd conv fwd

* Fix comments
2025-03-22 23:19:49 +01:00
Bartłomiej Kocot
fdaff5603e Add grouped conv bwd wei merged grouped instance for larger filter (#1984)
* Add grouped conv bwd wei merged grouped instance for larger filter

* Update readme
2025-03-18 16:16:24 +01:00
Bartłomiej Kocot
c2e4898b4b Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example
2025-03-17 13:32:00 +01:00
Haocong WANG
cbd74c2d12 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipelie v1 and v2. v1 is work for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce teh number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-11 10:11:21 -07:00
Thomas Ning
c954bd0cfa Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955)
* tempsave, not selected

* finish the feature and merge with develop

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
2025-03-07 13:44:06 -08:00
Illia Silin
0e8e711ec8 add missing headers (#1959) 2025-03-07 11:11:30 -08:00
Illia Silin
9b51c08bf7 remove support for gfx940 and gfx941 targets (#1944)
* remove support for gfx940 and gfx941 targets

* update changelog
2025-03-05 11:07:33 -08:00
feli
3786e16375 ck moe gemm implement (#1936)
* port all moe changes from ck_moe_gemm branch

* refine codes in the pr

* fix tail odd

* fix clang format

* fix clang format2

* make hot loop scheduler compatible with 16x16 and 32x32

* clang format

* fix per token quant

* rename moe example

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-05 15:56:55 +08:00
jefyang1
c95bda93ba Remove CK_USE_AMD_MFMA_GFX950 (#1935)
* Add runtime check in example_gemm_xdl_streamk for gfx950

* Add runtime check in grouped conv fwd examples for gfx950

* Disable CK_USE_AMD_MFMA_GFX950

* Add new instances for gfx950

* Fix test_gemm_universal on gfx950
2025-03-04 10:32:25 -08:00
asleepzzz
ef16010273 Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913)" (#1933)
This reverts commit 020148d0f7.
2025-03-03 07:17:39 -08:00