769 Commits

Author SHA1 Message Date
Aviral Goel
769336b640 [CK_TILE] Add type traits to detect tile window types at compile time (#2158)
* added WindowType enum to tile_window_structs and static assert checks in computev4 pipeline

* added type traits instead of enum to tile_window() and tile_window_linear() with debug comments

* removed comments, added documentation and clang format
2025-05-07 00:00:39 -07:00
Rostyslav Geyyer
8a0d659f92 Add FP4 MX MFMA tests (#2151)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add ops

* Add tests

* Add missing utils

* Update reference mx gemm

* Add f4x2 init mode

* Update host tensor utils

* Update chunk size for f4x2

* Add non scaled ops

* Add a type utility

* Update non scaled reference kernel

* Add non scaled tests

* Debug mfma arguments

* Add more debug info

* Update chunk size

* Update data layout

* Add more debugging

* Fix B stride

* Fix reference gemm

* Fix build

* One more reference fix

* Add more debug info

* Disable some tests

* Enable tests

* Add fp4 dimensions

* Update reference kernels

* Temp edits

* Remove leftovers

* Fix conflicts

* Clean up

* More clean up

* Revert "More clean up"

This reverts commit d8d35a0846.

* Add layouts to tests

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-06 09:24:00 -05:00
carlushuang
4e9b76f88c [CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153)
* combine 2-3 as single stage

* support zeroing

* improve long tokens

* update specialization

* b16 ws

* 8bit topk optimize

* update 15 example
2025-05-06 17:32:07 +08:00
jakpiase
0bcb804ad0 [CK_TILE] Remove scratch usage from universal gemm (#2001)
* moves kbatch condition outside of kernel

* add reviewer comments

* fixes

* fix tests

* fixes after review

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-05 18:46:44 +02:00
Khushbu Agarwal
d58f2b8bd0 mfma_32x32x64_fp8/bf8 (#2148)
* support for mfma_32x32x64_fp8

* clang-formatted

* Fixing sparsity in codegen
2025-05-01 13:36:24 -07:00
Andriy Roshchenko
79b0bfeb41 MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152)
* Add gemm_mx_fp8_bf8 example with row-major B

* Add more overloads of MX MFMA instructions

* Add MK_KN (RRR) tests

* Add KM_NK (CCR) tests

* Add more problem sizes to Large tests

* Add test_gemm_mx to the list of regression tests
2025-05-01 11:55:48 -06:00
Aviral Goel
1d8ef40760 Add documentation for ck_tile::array<T,N> (#2078)
* added documentation for ck_tile::array<T,N>

* clang format fix

* spelling errros

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* spelling errros

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Revert "spelling errros"

This reverts commit 4179e7d193.

* Revert "spelling errros"

This reverts commit 3f90733dbe.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
2025-04-30 16:43:36 -07:00
Illia Silin
9a9f59ae69 Revert "Add ck tile examples to package (#1880)" (#2150) 2025-04-30 10:20:16 -07:00
Aviral Goel
65f182d617 Add Matrix A and Matrix B Swizzle for LDS in Computev4 policy (#2136)
* fixed computev4 policy bug for lds swizzle

* added swizzle for input matrix B

* Improved ComputeV4 policy and pipeline by swizzling A and B

* consolidated LDS descriptor functions in parent struct
2025-04-28 18:20:47 -07:00
Khushbu Agarwal
d107f3c3a5 Support for MFMA_16x16x128 for fp8/bf8 (#2125)
* Adding 16x16x128 support for gfx950

* Support for fp8 and bf8

* fix input arguments for MFMA scale instruction

* clang-formatted

* Fixes for lwpck-3145 (#2138)

* Fix lds tile & cmake dep & default epilogue

* Fallback BTypeToUse to ADataType in WOQ cases

* reverting instance json file

* reverting instance json file

---------

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-04-28 18:19:50 -07:00
Bartłomiej Kocot
4094ad158a Integrate universal gemm with conv bwd data and add SplitK (#1315)
* Integrate universal gemm with conv bwd data

* Fix multi d kernel

* Add splitK support

* instances refactor

* instances refactor

* refactor

* fixes

* fixes

* 16x16 instances

* Fixes

* Fix

* Fix

* Fix

* Fix

* Fix

* Fixes

* fix

* fix
2025-04-28 23:54:49 +02:00
jakpiase
434d19f696 Add ck tile examples to package (#1880)
* add ck tile examples to package

* Update jenkinsfile

* fix for jenkinsfile

* fix for building ck tile code on non gfx9

* compile ck tile examples only for gfx94

* include ck tile examples in all target

* fix for basic gemm UseStructuredSparsity

* Update CMakeLists.txt

* Update gemm_pipeline_problem.hpp

* add targets to rocm install

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-04-28 09:53:19 -07:00
Anton Gorenko
edd92fc546 DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096)
* Prepare files for DeviceGemm_Wmma_CShuffleV3

* Implement main part of CShuffleV3 with block pipeline v3 for WMMA

* Remove unused functions and template params for A/B descriptors

* Support both gfx11 and gfx12

* Enable SplitK for gfx12 and disable for gfx11

* Added RowColRow layout for DeviceGemmV2 fp16

* Added more instances for Row, Col, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout

* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout

* Fix formatting

* Add documentation

Based on e5ad48a784

* Enable gemm_universal profiling for gfx11/12

* Add WMMA intrinsics for F8/BF8

* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances

* Add BF16 instances and tests

* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
2025-04-28 10:14:21 +05:00
Yi DING
8add2cf45d Fix fp8 convert & add option for basic example (#2129) 2025-04-27 16:26:05 -07:00
Khushbu Agarwal
a2ed34a112 MFMA_32x32x16 for gfx950 (#2121)
* Enable MFMA_32x32x16 for fp16/BF16 for gfx950

* clang formatted
2025-04-24 10:20:22 -07:00
Illia Silin
01cb8379cd make code compliant with std=c++20 (#2123) 2025-04-24 10:14:52 -07:00
carlushuang
5487289fc4 [CK_TILE] support gfx950 matrix core in 01_fmha fwd (#2110)
* gfx950 01_fmha fwd

* fix comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-04-23 12:40:18 -07:00
lalala-sh
39ba03f25d Moe gemm activation (#2026)
* fix useless code and remove useless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* impl int64 but result not correct

* int64 index ok now

* input output all ok

* fix uint32

* revert v1 test

* use uint32

* mork to support 13w tokens

* moe sorting fix moebuf

* fix merge

* update moe api fix aiter build

* fix build

* fuse silu

* silu ok

* scale ok

* add silu

* change code

* gemm2 ok

* gufusion compatible ok, fix warnings

* gu fusion for m32 m64 ok

* support bf16 cshuffle

* i4 gemm2 ok

* i4 gemm2 ok and i4 gemm1 build

* 16x16 run ok

* change flops; change cshuffle dtype

* fuse gelu silu act in moe gemm1

* fp8 with act ready

* int4 act ready

* remove useless changes

* remove useless code change

* fix clang format

* add the arch limit of int4 moe gemm

* fuse moe activation

* fix fp8 16x16

* fix no quant case

* fix bugs

* fix fp8 gufusion bug

* remove useless comments

* refine activation code & complete moe example

* fix int8 bugs

* merge tkw1

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: root <root@hjbog-srdc-51.amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-04-23 10:35:34 +08:00
Gino Lu
504f563f78 [CK-Tile] warp-gemm support for using V_MFMA_F32_16x16x32_BF16 (#2073)
* draft v_mfma_f32_16x16x32_bf16

* fix error config and add debug code.

* Solve the CShuffle Problem

* draft v_mfma_f32_16x16x32_bf16

* fix error config and add debug code.

* Solve the CShuffle Problem

* fix error while testing new command

* Finished the feature of new mfma 16*16*32

* Addressed the comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-04-22 15:52:36 -07:00
Rostyslav Geyyer
416e851584 Temporarily disable MX FP4 device tests (#2112) 2025-04-22 16:08:48 -05:00
Thomas Ning
0cca8fa28f GEMM Multiply Multiply Fix (#2102)
* fix the type convert and increase the BF16 conversion + the profile comment

* fix the CI
2025-04-22 01:13:22 -07:00
Muhammed Emin Ozturk
b092c18da7 MI308 fix for streamk 1-Tile floating point exception (#2101) 2025-04-21 11:44:07 -07:00
Thomas Ning
a738e43445 MFMA 16x16x32fp8 (#2103)
* add mfma_16x16x32_fp8

* clang format code

* Finished the fix for gemm basic

* clang format

* rebuild CI

* recover gemm.hpp

* add MFMA 16*16*32bf8

---------

Co-authored-by: solin <bingzhou@amd.com>
2025-04-21 10:21:35 -07:00
solin
c318ec0778 fix CI build fail 2025-04-21 16:00:12 +08:00
lalala-sh
bcf5bb41be enable do top k weights in moe stage1 gemm (#2094)
* add switch for mul topk weights

* fix bf16/f16 bugs

* complete
2025-04-18 10:45:49 +08:00
Andriy Roshchenko
213b203a3c MX GEMM - Parameterized Test Template (#2088)
* Tests for MX FP8 GEMM

* Improve documentation
2025-04-16 19:56:00 -06:00
Andriy Roshchenko
da54464cce MX GEMM - Add MX BF8 example (#2071)
* Add MX GEMM example for MX BF8

* Verified MX FP8 with 16x16x128 scale builtin

* Verify MX BF8 GEMM with BF16 output
2025-04-16 15:25:02 -06:00
BingYuan.Zhou
eaf1f0bf3b [flatmm] implement basic fp16 flatmm (#2089)
* [flatmm] implement basic fp16 flatmm

* fix CI build fail

---------

Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>
2025-04-16 16:51:17 +08:00
Andriy Roshchenko
7106976a72 MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods
2025-04-15 17:17:07 -06:00
Mingtao Gu
56378f810f CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* fix pk_i4_v3 test failures in Ubuntu env.

* fix pk_i4_t test failures on Ubuntu.

* some fixes.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-04-14 16:58:57 +08:00
Thomas Ning
269f4f6af5 Solve the Static Encoding Pattern compile error when the tile size is too small (#2079) 2025-04-13 20:09:30 -07:00
Illia Silin
0d4f145078 Fix build issues for multiple targets. (#2077)
* build for multiple targets on gfx942

* add missing ignore statements
2025-04-11 12:12:53 -07:00
jakpiase
6c61f4d237 [CK_TILE] Add 2:4 structured sparsity support for fp16 gemm (#1957)
* add structured sparsity fp16 support for gemm

* added reviewer suggestions

* update changelog

* update changelog

* add reviewers suggestions

* Minor fix

* clang fix

* fix doxygen
2025-04-11 12:18:26 +02:00
slippedJim
5f885d2b7a add fmha fwd splitkv receipt for aiter c++ api (#2068)
* add s_randval for c++ api

* Fix bug of bias in splitkv

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
2025-04-10 23:21:13 +08:00
Juan Manuel Martinez Caamaño
f14e648e7c Replace inline assembly with builtins in FMHA (#2067)
* Replace inline assembly with builtins in FMHA

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-04-10 09:48:37 +02:00
Illia Silin
3e6d21adeb enable gfx115x support (#2065) 2025-04-09 10:06:42 -07:00
MHYang-gh
03ce8729fd Make buffer coherence configurable in tensor view (#2041)
* Make buffer coherence configurable in tensor view

* Fix clang-format for tensor_view.hpp
2025-04-08 15:34:11 -07:00
valarLip
2c563fecf7 add passthrough for int32->float32 (#2062) 2025-04-08 15:16:30 -07:00
Max Podkorytov
6ce0797dad simplify generate_tuple (#2043) 2025-04-08 09:00:51 -07:00
aledudek
80aae6119b [CK_TILE] Fix GEMM Memory Pipeline (#2034)
* [CK_TILE] Fix GEMM Memory Pipeline

* Fix transpose tile

* Add comments
2025-04-08 12:40:04 +02:00
Illia Silin
1793228422 fix codegen issues (#2052) 2025-04-07 07:08:39 -07:00
Illia Silin
572cd820ce Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic
2025-04-03 15:30:21 -07:00
Rostyslav Geyyer
265af71a71 Add FP16/BF16<->FP8/BF8 conversions (#2035)
* Move conversion functions and add missing conversions

* Add tests

* Add missing conversions

* Add missing conversions

* Add bf8 tests

* Update clipping for vectors

* Add missing conversions

* Add bf16 fp8 tests

* Add bf16 bf8 tests

* Fix device conversion

* Fix conversions

* Fix vector use

* Minor fix

* Add a workaround flag

* Add a workaround flag for bf16 conversion

* Add another workaround

* Add a workaround for fp16 to bf8 conversion

* Update type alias

* Add docstrings and missing wrappers

* Fix if defined macros

* Fix more if defined macros

* Add comments

* Remove __host__ specifier

* Add a gfx950 guard

* Update function naming
2025-04-03 12:42:03 -05:00
aledudek
9329432f6c Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments
2025-04-03 13:35:43 +02:00
Bartłomiej Kocot
2ccf914888 Add support for GKCYX grouped conv weight (#2023)
* Grouped conv bwd weight GKCYX support

* fix and changelog

* fix

* fix

* fixes

* comments

* fix
2025-04-02 23:59:49 +02:00
Adam Osewski
e5ad48a784 Basic docs for universal gemm & ck-tile gemm. (#2014)
* Basic docs for universal gemm & ck-tile gemm.

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Reviewers suggestions.

* Align tparam names in doc with class tparams.

* More reviewers fine tuning ;)

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-04-02 11:03:40 +02:00
Bartłomiej Kocot
8c0ab61ece Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
ec742908bd Grouped conv fwd v3 fix for SplitN and G > 1 (#2038)
* Grouped conv fwd v3 fix for SplitN and G > 1

* Remove int8 large test

* Restore int8 test
2025-04-01 13:19:35 -07:00
Seunghoon Lee
df32020f93 Fix Windows build. (#2012)
* Remove duplicate using uint64_t.

* Cast before shift.
2025-04-01 12:22:10 -07:00
Max Podkorytov
c59a8bb206 add a fast compilation path for static for (0..N) (#2005)
* add a fast compilation path for static for (0..N)

* Update functional2.hpp

add comment and put range applier into detail namespace

* Update functional.hpp

ditto for ck-tile

* prettify

* prettify more

* add comment

* clang-format
2025-04-01 12:06:25 -07:00