Commit Graph

685 Commits

Author SHA1 Message Date
Khushbu Agarwal
c7b8e86e34 [CK_Tile] Simplified Mem pipeline (#2159)
* simplify code

* compiled the code

* Simplified example and codegen for mem pipeline

* Reverting config and universal gemm example

* clang formatted

* remove comments

* clang formatted

* Add memory operation changes for default pipeline

* fix config file

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-05-07 18:37:31 -07:00
kylasa
956fe8f751 Simple copy kernel, which can be a tool to experiment with CK_Tile API with minimal code. (#2156)
* Test Copy kernel code for testing tile distribution logic

* Fix the error

* Solved the problem

* Updated comments and document formatting

* Removed unused tile distribution and code cleanup

* Added README.md and formatting for CI/CD.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-05-07 00:02:59 -07:00
carlushuang
4e9b76f88c [CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153)
* combine 2-3 as single stage

* support zeroing

* improve long tokens

* update specialization

* b16 ws

* 8bit topk optimize

* update 15 example
2025-05-06 17:32:07 +08:00
jakpiase
0bcb804ad0 [CK_TILE] Remove scratch usage from universal gemm (#2001)
* moves kbatch condition outside of kernel

* add reviewer comments

* fixes

* fix tests

* fixes after review

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-05 18:46:44 +02:00
Andriy Roshchenko
79b0bfeb41 MX GEMM - Add FP8 GEMM Tests for Different Layouts (#2152)
* Add gemm_mx_fp8_bf8 example with row-major B

* Add more overloads of MX MFMA instructions

* Add MK_KN (RRR) tests

* Add KM_NK (CCR) tests

* Add more problem sizes to Large tests

* Add test_gemm_mx to the list of regression tests
2025-05-01 11:55:48 -06:00
Illia Silin
9a9f59ae69 Revert "Add ck tile examples to package (#1880)" (#2150) 2025-04-30 10:20:16 -07:00
Daniel Su
d9786f3363 Check max-ilp-scheduling compiler option for moe_gemm examples (#2127) 2025-04-28 13:40:22 -07:00
jakpiase
434d19f696 Add ck tile examples to package (#1880)
* add ck tile examples to package

* Update jenkinsfile

* fix for jenkinsfile

* fix for building ck tile code on non gfx9

* compile ck tile examples only for gfx94

* include ck tile examples in all target

* fix for basic gemm UseStructuredSparsity

* Update CMakeLists.txt

* Update gemm_pipeline_problem.hpp

* add targets to rocm install

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-04-28 09:53:19 -07:00
lalala-sh
83394e40d2 fix moe i4 example bug (#2139) 2025-04-28 09:49:31 -07:00
Yi DING
8add2cf45d Fix fp8 convert & add option for basic example (#2129) 2025-04-27 16:26:05 -07:00
Po Yen Chen
3d4d70d2fc Avoid using store_tile_raw() for fp32 tensors (#2072) 2025-04-26 23:07:41 -07:00
joyeamd
41541aff7a SWDEV-52596 for hdim=256, when using the splitkv pipeline, two new pipelines need to be added (#2126) 2025-04-25 16:31:09 +08:00
rocking
02ce6d39ea Only generate specific hdim (#2120) 2025-04-24 18:52:58 +08:00
lalala-sh
39ba03f25d Moe gemm activation (#2026)
* fix useless code and remove useless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* impl int64 but result not correct

* int64 index ok now

* input output all ok

* fix uint32

* revert v1 test

* use uint32

* mork to support 13w tokens

* moe sorting fix moebuf

* fix merge

* update moe api fix aiter build

* fix build

* fuse silu

* silu ok

* scale ok

* add silu

* change code

* gemm2 ok

* gufusion compatible ok, fix warnings

* gu fusion for m32 m64 ok

* support bf16 cshuffle

* i4 gemm2 ok

* i4 gemm2 ok and i4 gemm1 build

* 16x16 run ok

* change flops; change cshuffle dtype

* fuse gelu silu act in moe gemm1

* fp8 with act ready

* int4 act ready

* remove useless changes

* remove useless code change

* fix clang format

* add the arch limit of int4 moe gemm

* fuse moe activation

* fix fp8 16x16

* fix no quant case

* fix bugs

* fix fp8 gufusion bug

* remove useless comments

* refine activation code & complete moe example

* fix int8 bugs

* merge tkw1

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: root <root@hjbog-srdc-51.amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-04-23 10:35:34 +08:00
Gino Lu
504f563f78 [CK-Tile] warp-gemm support for using V_MFMA_F32_16x16x32_BF16 (#2073)
* draft v_mfma_f32_16x16x32_bf16

* fix error config and add debug code.

* Solve the CShuffle Problem

* draft v_mfma_f32_16x16x32_bf16

* fix error config and add debug code.

* Solve the CShuffle Problem

* fix error while testing new command

* Finished the feature of new mfma 16*16*32

* Addressed the comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-04-22 15:52:36 -07:00
lalala-sh
bcf5bb41be enable do top k weights in moe stage1 gemm (#2094)
* add switch for mul topk weights

* fix bf16/f16 bugs

* complete
2025-04-18 10:45:49 +08:00
Andriy Roshchenko
da54464cce MX GEMM - Add MX BF8 example (#2071)
* Add MX GEMM example for MX BF8

* Verified MX FP8 with 16x16x128 scale builtin

* Verify MX BF8 GEMM with BF16 output
2025-04-16 15:25:02 -06:00
BingYuan.Zhou
eaf1f0bf3b [flatmm] implement basic fp16 flatmm (#2089)
* [flatmm] implement basic fp16 flatmm

* fix CI build fail

---------

Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>
2025-04-16 16:51:17 +08:00
felix
c5975529bb add preshuffle gemm fp16 (#2036)
* add preshuffle gemm fp16

* clang format and test ok

* Update gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp

remove useless comments in example

* Update gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp

remove 2

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-04-16 10:53:21 +08:00
joyeamd
94d47b1680 fmha hdim256 vectorize improve (#2086)
For hdim 256, vectorized buffer loads were not used when seqlen % 256 != 0 and hdim % 256 == 0; this commit addresses that case.
2025-04-16 09:21:04 +08:00
Andriy Roshchenko
7106976a72 MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods
2025-04-15 17:17:07 -06:00
Mingtao Gu
56378f810f CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* fix pk_i4_v3 test failures in Ubuntu env.

* fix pk_i4_t test failures on Ubuntu.

* some fixes.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-04-14 16:58:57 +08:00
jakpiase
6c61f4d237 [CK_TILE] Add 2:4 structured sparsity support for fp16 gemm (#1957)
* add structured sparsity fp16 support for gemm

* added reviewer suggestions

* update changelog

* update changelog

* add reviewers suggestions

* Minor fix

* clang fix

* fix doxygen
2025-04-11 12:18:26 +02:00
slippedJim
5f885d2b7a add fmha fwd splitkv receipt for aiter c++ api (#2068)
* add s_randval for c++ api

* Fix bug of bias in splitkv

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
2025-04-10 23:21:13 +08:00
Illia Silin
3e6d21adeb enable gfx115x support (#2065) 2025-04-09 10:06:42 -07:00
slippedJim
5a22b61de5 Add new receipt (#2055) 2025-04-07 14:18:01 +08:00
Thomas Ning
50d1f8ff90 Add the MI355 support for CK TILE GEMM (#2046)
* Get the root cause of the ck tile gemm failing on mi355

* Fix the ck tile gemm on MI355

* delete the debug info
2025-04-03 11:48:54 -07:00
aledudek
9329432f6c Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments
2025-04-03 13:35:43 +02:00
Muhammed Emin Ozturk
dd4c12b155 f8/bf16 GEMM Stream-K (#1879) 2025-03-31 20:30:17 -06:00
rocking
8a20b62e91 Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-28 21:58:06 +08:00
felix
a82f338fb9 hotfix fix sorting int64 (#2025)
* fix sorting int64

* clang format

* fix example issue

* update WA issue #

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2025-03-28 11:31:52 +08:00
felix
36d50de50e ckmoe: change cmake; use smaller shape for i4 (#2027)
* change cmake; use smaller shape for i4

* fix pki4 run

* fix typo

* fix runtime arch logic for moe_gemm2 example

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-03-27 09:04:31 -07:00
Illia Silin
23a949706c Disable all pk_i4 tests for all targets except gfx942/950. (#2022)
* only build gemm_fp8_pk_i4 examples for gfx942/950

* fix cmake logic

* moved the architecture check to IsSupported function

* Revert "moved the architecture check to IsSupported function"

This reverts commit 056d2a08b3.

* disable all pk_i4 tests for targets other than gfx942/950

* fix cmake logic
2025-03-26 15:15:57 -07:00
Illia Silin
99b2bbc1d6 Make sure gemm_fp8_pk_i4 examples only build and run on gfx942/950. (#2010)
* only build gemm_fp8_pk_i4 examples for gfx942/950

* fix cmake logic

* moved the architecture check to IsSupported function

* Revert "moved the architecture check to IsSupported function"

This reverts commit 056d2a08b3.
2025-03-25 14:43:38 -07:00
Andriy Roshchenko
72d888821c MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent

* Specialize `nnvb_data_t_selector` for E8M0 exponent

* Remove partial specializations for `scalar_type` of `non_native_vector_base` template

* Reword command line helper string

* Create MX GEMM examples for different scales
2025-03-25 15:33:03 -06:00
ruanjm
d49abdaa87 [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug fix
2025-03-25 20:09:45 +08:00
Andriy Roshchenko
6660dc6b8e Introduce MX GEMM for FP8 data type (#2000) 2025-03-24 15:41:07 -06:00
carlushuang
6c08c5c46d add mask support in hdim=192/128 (#1999) 2025-03-21 18:28:43 +08:00
BingYuan.Zhou
5a0d693b86 fix ck_tile/basic_gemm build error (#1988) 2025-03-20 22:01:14 -07:00
felix
902dbe89ad change cmake (#2006)
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-20 19:25:11 -07:00
carlushuang
e3c9886cdf [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-03-20 11:00:29 -07:00
jakpiase
0e91d32c61 [CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms

* added reviewer comments

* fixed grouped gemm tests
2025-03-20 11:17:04 +01:00
rocking
b819c217e4 Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-20 00:06:45 +08:00
Illia Silin
1342ecf7fb Add a daily CI build on gfx908. (#1987)
* add one daily ci build on gfx908

* add redis invocation tag for gfx908

* make ci build for gfx908 conditional

* fix groovy logic

* add option to run perf tests for gfx908

* disable a few tests on mi100
2025-03-17 18:08:53 -07:00
aledudek
5095906975 Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remove maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remove comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue
2025-03-17 16:42:43 +01:00
valarLip
52b1cd7780 hotfix fmoe build issue (#1976) 2025-03-13 15:11:59 +08:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
Illia Silin
d4a6d69643 disable tests that take too long to build for gfx90a (#1975) 2025-03-12 17:54:03 -07:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
Haocong WANG
cbd74c2d12 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipeline v1 and v2. v1 works for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce the number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-11 10:11:21 -07:00