Commit Graph

741 Commits

Author SHA1 Message Date
lalala-sh
d703675aa2 Merge remote-tracking branch 'origin/develop' into moe_gemm_fuse_activation 2025-03-26 18:12:43 +08:00
coderfeli
6a0cc4aad1 gu fusion for m32 m64 ok 2025-03-26 05:58:22 +00:00
coderfeli
74d8ac608f gufusion compatible ok, fix warnings 2025-03-26 02:20:30 +00:00
Andriy Roshchenko
72d888821c MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent

* Specialize `nnvb_data_t_selector` for E8M0 exponent

* Remove partial specializations for `scalar_type` of `non_native_vector_base` template

* Reword command line helper string

* Create MX GEMM examples for different scales
2025-03-25 15:33:03 -06:00
Max Podkorytov
1a58522f01 use fast path for sequence generation in old CK (#1993) 2025-03-25 11:28:44 -07:00
coderfeli
6ca5892256 gemm2 ok 2025-03-25 15:01:10 +00:00
ruanjm
d49abdaa87 [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug fix
2025-03-25 20:09:45 +08:00
coderfeli
234b8d415c change code 2025-03-25 09:44:32 +00:00
coderfeli
0d266bfd65 add silu 2025-03-25 03:01:27 +00:00
coderfeli
2b15b67b3f acale ok 2025-03-25 02:52:04 +00:00
Illia Silin
d2eab23958 Split up data_type header. (#1996)
* split fp64 vector data type

* add missing header

* move e8m0 structs

* split off numeric_utils header

* fix typo

* split off numeric limits header

* update data_type header

* fix clang format

* split off vector type header

* fix clang format

* fix typo for binary_inf
2025-03-24 15:08:54 -07:00
Andriy Roshchenko
6660dc6b8e Introduce MX GEMM for FP8 data type (#2000) 2025-03-24 15:41:07 -06:00
MHYang-gh
c027637a8f Fix A/B lds transform (#2007) 2025-03-22 23:13:50 -07:00
Bartłomiej Kocot
5b0873c31a Fix split N for large images in groupd conv fwd (#2004)
* Fix split N for large images in groupd conv fwd

* Fix comments
2025-03-22 23:19:49 +01:00
coderfeli
b865e2cf83 silu ok 2025-03-22 14:03:45 +00:00
coderfeli
d69c1c9590 fuse silu 2025-03-21 07:31:49 +00:00
BingYuan.Zhou
5a0d693b86 fix ck_tile/basic_gemm build error (#1988) 2025-03-20 22:01:14 -07:00
Attila T. Áfra
c79bf11148 Fix compile errors on Windows and Linux (#2002)
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)

* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')
2025-03-20 12:37:25 -07:00
carlushuang
e3c9886cdf [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-03-20 11:00:29 -07:00
jakpiase
0e91d32c61 [CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms

* added reviewer comments

* fixed grouped gemm tests
2025-03-20 11:17:04 +01:00
rocking
b819c217e4 Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-20 00:06:45 +08:00
felix
7eaedeb36c Ck moe hot fix (#1979)
* fix useless code and remove usless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* clang format

* rm useless comments

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-03-19 22:58:27 +08:00
coderfeli
98cee8d02b fix merge 2025-03-18 05:45:04 +00:00
coderfeli
5f49b91237 merge develop 2025-03-18 04:49:40 +00:00
aledudek
5095906975 Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remove maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remove comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
c2e4898b4b Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example
2025-03-17 13:32:00 +01:00
coderfeli
7dbdff9f9f moe sorting fix moebuf 2025-03-17 06:20:57 +00:00
coderfeli
5eaa36be18 mork to support 13w tokens 2025-03-17 01:45:34 +00:00
coderfeli
ef8c1333b9 use uint32 2025-03-17 01:45:09 +00:00
coderfeli
6c0e021235 revert v1 test 2025-03-17 01:39:57 +00:00
coderfeli
bccc5192cf fix uint32 2025-03-17 01:18:32 +00:00
coderfeli
da2659d502 input output all ok 2025-03-15 14:26:30 +00:00
coderfeli
d1e999c05c int64 index ok now 2025-03-15 13:28:49 +00:00
coderfeli
f911cf7396 impl int64 but result not correct 2025-03-14 13:01:07 +00:00
coderfeli
d4925e1637 fix output oob 2025-03-14 03:19:26 +00:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
illsilin
f8464d2087 fix clang format 2025-03-12 20:21:14 -07:00
coderfeli
d85c034977 fix2 2025-03-13 02:30:07 +00:00
coderfeli
8b05fa935d fix coredump in e2e test 2025-03-13 02:12:18 +00:00
feli
251afab3b7 ck_moe: fix useless code and remove usless oob (#1972)
* fix useless code and remove usless oob

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-12 09:22:42 -07:00
Illia Silin
4c97cc511e use old instrinsics with staging compiler (#1970) 2025-03-12 07:29:09 -07:00
feli
2585c78940 Merge branch 'develop' into ck_moe_rm_oob 2025-03-12 16:05:59 +08:00
coderfeli
40542296de clang format 2025-03-12 08:05:12 +00:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
Haocong WANG
cbd74c2d12 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipeline v1 and v2. v1 works for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce the number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-11 10:11:21 -07:00
carlushuang
7a93b16ff6 [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update
2025-03-11 21:07:40 +08:00
coderfeli
1508cd514a fix useless code and remove usless oob 2025-03-11 07:05:05 +00:00
Mingtao Gu
0db7c8f0b2 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not check.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-10 11:16:44 +08:00
Thomas Ning
9d51d17dd0 Fix on the error (#1956) 2025-03-07 13:43:52 -08:00
Max Podkorytov
9e132eb77c refactor ck-tile kernel launch (#1925) 2025-03-07 08:29:40 -08:00