Commit Graph

572 Commits

Author SHA1 Message Date
slippedJim
2b5a211f01 Fix swa condition in bwd_api (#2541) 2025-07-23 22:03:38 +08:00
slippedJim
ad509b26bb Refactor group mode paras & Update hd192 group mode kernels (#2527)
* refactor group mode paras and update hd192 kernels

* update smoke test script

---------

Co-authored-by: danyao12 <danyao12@amd.com>
2025-07-18 18:37:16 +08:00
slippedJim
fa1357bc21 FA bwd asm oob nan issue fix (#2511)
* update hex & smoke_test

* remove expired script

* typo
2025-07-17 14:57:16 +08:00
slippedJim
2639a6bc62 remove restriction of group mode hd192 no lse (#2252)
Co-authored-by: Jim <jimguo12@amd.com>
2025-06-30 16:20:39 +00:00
amd-ruitang3
efdf31e26f revert incorrect operations in bwd generation 2025-06-10 08:25:07 +00:00
Wen.Yang
160788cdf4 update benchmark script 2025-05-29 15:17:41 +08:00
Wen.Yang
efaa64e492 update bwd v3 kernels for timing error fixed 2025-05-29 15:11:16 +08:00
slippedJim
e4f0c4a549 [WIP] enable hd128 swa (#2137)
* enable hd128 swa
2025-05-08 21:00:37 +08:00
slippedJim
d0028193fc [WIP] update hd64 group hex & explicit hd128 in kernel name (#2118)
* remove all pssk kernels

* update: update pssk kernels

* update

* update

* fix kernel name

* fix: update initial version hex

* add shell scripts of smoke test and benchmark test for group mode

* fix typo

---------

Co-authored-by: Wen.Yang <Wen.Yang@example.com>
2025-04-24 10:16:13 +08:00
slippedJim
2710b61b78 enable causal/no_causal case (#2111) 2025-04-22 09:11:27 +08:00
wen-des
a115ab14ed group mode attention for generic 64<hd<=128 (#2092)
* group mode attention for generic 64<hd<=128

* fix loop_idx overridden before kvBase in causal mode

* align format

---------

Co-authored-by: Wen.Yang <Wen.Yang@example.com>
Co-authored-by: danyao12 <danyao12@amd.com>
2025-04-20 11:11:11 +08:00
danyao12
20a250440a atomic_add exec 2025-04-14 17:01:08 +08:00
danyao12
cf964e9982 fix group seqlen_k >> seqlen_q/causal bug 2025-04-12 12:10:21 +08:00
danyao12
9b7c18bec5 fix seqlen_k >> seqlen_q/causal bug 2025-04-12 10:31:04 +08:00
danyao12
03b75b3f97 hd192 causal temp version 2025-04-11 15:22:32 +08:00
danyao12
4c07813b46 hd192 bf16 rtne&rtz 2025-04-08 15:21:13 +08:00
danyao12
7949a5adf8 hd192 temp version 2025-04-05 14:50:43 +08:00
danyao12
ab5f40cfaf enable SBHD 2025-03-24 17:58:01 +08:00
danyao12
bbb526b9d2 support for more layouts 2025-03-24 15:10:33 +08:00
danyao12
5e0ebbe8b6 fix hd64 group mode causal api 2025-03-23 12:43:25 +08:00
danyao12
917f7e55f8 Merge branch 'develop' into ck_tile/fa_bwd_v3 2025-03-22 20:38:04 +08:00
danyao12
6758b77802 fix random memory access fault 2025-03-22 20:03:28 +08:00
carlushuang
6c08c5c46d add mask support in hdim=192/128 (#1999) 2025-03-21 18:28:43 +08:00
BingYuan.Zhou
5a0d693b86 fix ck_tile/basic_gemm build error (#1988) 2025-03-20 22:01:14 -07:00
wen-des
8a25aa2669 support group mode for hd=64 of fa bwd v3 (#1990)
* support group mode for hd=64 of fa bwd v3

* bugfix for causal mask kernels when using kernel balance

* tiny align

---------

Co-authored-by: Wen.Yang <Wen.Yang@example.com>
Co-authored-by: danyao12 <danyao12@amd.com>
2025-03-21 11:42:58 +08:00
felix
902dbe89ad change cmake (#2006)
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-20 19:25:11 -07:00
carlushuang
e3c9886cdf [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-03-20 11:00:29 -07:00
jakpiase
0e91d32c61 [CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms

* added reviewer comments

* fixed grouped gemm tests
2025-03-20 11:17:04 +01:00
rocking
b819c217e4 Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-20 00:06:45 +08:00
danyao12
e80ff1acbb tiny fix 2025-03-18 12:08:34 +08:00
Illia Silin
1342ecf7fb Add a daily CI build on gfx908. (#1987)
* add one daily ci build on gfx908

* add redis invocation tag for gfx908

* make ci build for gfx908 conditional

* fix groovy logic

* add option to run perf tests for gfx908

* disable a few tests on mi100
2025-03-17 18:08:53 -07:00
aledudek
5095906975 Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remove maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remove comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue
2025-03-17 16:42:43 +01:00
valarLip
52b1cd7780 hotfix fmoe build issue (#1976) 2025-03-13 15:11:59 +08:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
Illia Silin
d4a6d69643 disable tests that take too long to build for gfx90a (#1975) 2025-03-12 17:54:03 -07:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
Haocong WANG
cbd74c2d12 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipeline v1 and v2. v1 works for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce the number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-11 10:11:21 -07:00
Illia Silin
aa42c3db06 disable example_moe_gemm2_xdl_pk_i4 on gfx950 (#1968) 2025-03-11 08:34:47 -07:00
carlushuang
7a93b16ff6 [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update
2025-03-11 21:07:40 +08:00
wenchenvincent
489602f9a8 Enabled bwd support for hdim_qk != hdim_v for TE integration. (#1965) 2025-03-11 11:35:27 +08:00
Mingtao Gu
0db7c8f0b2 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not checked.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-10 11:16:44 +08:00
Thomas Ning
c954bd0cfa Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955)
* tempsave, not selected

* finish the feature and merge with develop

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
2025-03-07 13:44:06 -08:00
Max Podkorytov
9e132eb77c refactor ck-tile kernel launch (#1925) 2025-03-07 08:29:40 -08:00
kylasa
66c5f5b0b6 Addressing (Post Merge) code review comments for PR 1845 (#1883)
* Addressing code review comments.

* Addressing code review comments.

* Reorganized code for better readability.

* add ck_tile gemms for new types in CI

* fix jenkins syntax

* fix script syntax

* Add the test cases back

* Address the review comments

* Address review comments

* clang format

* Solve the merging issues

* Addressed the comments

* clang format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-03-06 11:40:30 -08:00
Illia Silin
9b51c08bf7 remove support for gfx940 and gfx941 targets (#1944)
* remove support for gfx940 and gfx941 targets

* update changelog
2025-03-05 11:07:33 -08:00
feli
3786e16375 ck moe gemm implement (#1936)
* port all moe changes from ck_moe_gemm branch

* refine codes in the pr

* fix tail odd

* fix clang format

* fix clang format2

* make hot loop scheduler compatible with 16x16 and 32x32

* clang format

* fix per token quant

* rename moe example

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-05 15:56:55 +08:00
Ye Wang
151e999931 with receipt id=5, enable bias for te 2025-03-05 00:24:41 -06:00
jefyang1
c95bda93ba Remove CK_USE_AMD_MFMA_GFX950 (#1935)
* Add runtime check in example_gemm_xdl_streamk for gfx950

* Add runtime check in grouped conv fwd examples for gfx950

* Disable CK_USE_AMD_MFMA_GFX950

* Add new instances for gfx950

* Fix test_gemm_universal on gfx950
2025-03-04 10:32:25 -08:00
asleepzzz
ef16010273 Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913)" (#1933)
This reverts commit 020148d0f7.
2025-03-03 07:17:39 -08:00
danyao12
173ed584e0 Merge branch 'develop' into ck_tile/fa_bwd_v3 2025-03-03 19:17:27 +08:00