Commit Graph

251 Commits

Author SHA1 Message Date
Feng Shijie
85976b0b87 use int64_t as expert stride to avoid overflow 2025-08-21 06:58:55 +00:00
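The overflow this commit guards against is the classic 32-bit index wrap: a per-expert weight stride multiplied by an expert index can exceed `INT32_MAX`. A minimal host-side sketch with hypothetical sizes (64 experts, each owning a 7168×8192 weight block — not the kernel's actual layout):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical MoE layout: each expert owns a [7168 x 8192] weight block,
// so the per-expert element stride is 7168 * 8192 = 58'720'256 elements.
// Multiplying that by an expert index in 32-bit arithmetic wraps past
// INT32_MAX; promoting the stride to int64_t forces a 64-bit multiply.

int64_t expert_offset_i64(int expert_id, int64_t stride)
{
    return expert_id * stride; // promoted to 64-bit: no overflow
}

int32_t expert_offset_i32(int expert_id, int32_t stride)
{
    // Well-defined unsigned wrap, then reinterpreted as signed:
    // models the 32-bit-offset bug without invoking signed-overflow UB.
    return static_cast<int32_t>(static_cast<uint32_t>(expert_id) *
                                static_cast<uint32_t>(stride));
}
```

For expert 63 the 64-bit offset is 3 699 376 128 elements, which no longer fits in a signed 32-bit integer.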
Feng Shijie
9fbcc8f8a4 use v4i32 as the storage type for B to avoid repack operation 2025-08-20 13:53:32 +00:00
Feng Shijie
c27eb0771a optimize cvt_pkf4_to_f16 implementation 2025-08-20 04:39:14 +00:00
Feng Shijie
3ca0bd500a optimize A_LDS descriptor to avoid bank conflicts 2025-08-19 14:56:46 +00:00
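For context: LDS on AMD GPUs is organized as 32 banks of 4 bytes, and lanes whose addresses map to the same bank serialize. The usual descriptor-level fix is to pad the row pitch so strided accesses spread across banks. A hypothetical host-side illustration of the bank arithmetic (not the actual descriptor code):

```cpp
#include <cassert>
#include <set>

// AMD LDS: 32 banks, each 4 bytes wide; bank(addr) = (addr / 4) % 32.
// Count the distinct banks touched when 32 lanes each read one float
// from consecutive rows of a row-major tile with the given pitch
// (pitch expressed in floats).
int banks_touched(int pitch_floats)
{
    std::set<int> banks;
    for (int lane = 0; lane < 32; ++lane) {
        int addr = lane * pitch_floats * 4; // byte address of the row start
        banks.insert((addr / 4) % 32);
    }
    return static_cast<int>(banks.size());
}
```

With a pitch of 32 floats every lane hits bank 0 (fully serialized); padding the pitch to 33 spreads the 32 lanes across all 32 banks.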
Feng Shijie
f7f0306eea fix gate-up when GU_NRepeat > 1 2025-08-18 18:43:37 +00:00
Feng Shijie
be55c0f9cb add fp16xf4 moe 2025-08-18 17:28:11 +00:00
Feng Shijie
599e1f5b32 rename example 2025-08-17 17:51:18 +00:00
Feng Shijie
7899fb4a8d remove additional check when e8m0->float 2025-08-15 06:20:46 +00:00
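Background for this one: e8m0 is the OCP MX scale format, a bare 8-bit biased exponent encoding 2^(x−127), with 0xFF reserved for NaN. A sketch of the checked conversion next to the branch-free bit-shift variant that dropping the check enables (my reading of the commit, not the kernel's actual code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// e8m0 stores only a biased exponent: value = 2^(x - 127), 0xFF = NaN.
float e8m0_to_float_checked(uint8_t x)
{
    if (x == 0xFF)
        return NAN;                      // reserved encoding
    return std::ldexp(1.0f, int(x) - 127);
}

// Branch-free variant: place the byte directly into the float exponent
// field. Valid for x in [1, 254]; skipping the NaN check is safe when
// the scales are known to be finite.
float e8m0_to_float_shift(uint8_t x)
{
    uint32_t bits = uint32_t(x) << 23;   // exponent field of binary32
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```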
Feng Shijie
714b341797 eliminate repeat dequant 2025-08-14 09:34:12 +00:00
Feng Shijie
53e8c0c533 Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm 2025-08-13 16:51:49 +00:00
Feng Shijie
5de6208952 update f16xMXF4 2025-08-13 16:16:48 +00:00
Feng Shijie
732ebdee8b update scale-preshuffle for MXF4 2025-08-13 10:48:53 +00:00
Feng Shijie
edb58d0680 update 2025-08-11 11:24:34 +00:00
Feng Shijie
cc9c7b9e58 optimize gemm2 atomic_add pattern 2025-08-11 08:38:23 +00:00
Feng Shijie
200a11afc8 update scale for mxfp4 2025-08-11 07:59:47 +00:00
Feng Shijie
f788d3d629 add mixed_prec fp16xfp4 2025-08-08 20:19:16 +00:00
Feng Shijie
3dea10a277 debug mixed_prec flatmm 2025-08-07 09:22:04 +00:00
Feng Shijie
6d3cbc7c0e add moe_flatmm 2025-08-06 08:33:33 +00:00
coderfeli
c0cb4d036d fix split k 2025-08-06 02:45:31 +00:00
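Split-k partitions the K dimension of a GEMM across workgroups, each reducing its own K-slice into a partial result that is then combined (on a GPU, via atomic adds or a second reduction kernel). A host-side sketch of the partitioning arithmetic on a dot product — illustrative only, with a hypothetical ceil-divide slicing:

```cpp
#include <cassert>
#include <vector>

// Split-k on a host-side dot product: each of `split_k` "workgroups"
// reduces its own K-slice, and the partials are summed afterwards.
float dot_split_k(const std::vector<float>& a, const std::vector<float>& b,
                  int split_k)
{
    const int K           = static_cast<int>(a.size());
    const int k_per_split = (K + split_k - 1) / split_k; // ceil-divide
    float total = 0.0f;
    for (int s = 0; s < split_k; ++s) {
        const int k_begin = s * k_per_split;
        const int k_end   = k_begin + k_per_split < K ? k_begin + k_per_split : K;
        float partial = 0.0f;
        for (int k = k_begin; k < k_end; ++k)
            partial += a[k] * b[k];
        total += partial; // the cross-workgroup reduction step
    }
    return total;
}
```

The tail handling (the clamp to `K`) is exactly the kind of boundary a split-k fix typically touches: the last slice may be shorter, or empty when `split_k` exceeds `K`.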
Feng Shijie
90e910f3a7 fix flatmm with scaling when WarpTileM == 32 2025-08-04 07:16:36 +00:00
Feng Shijie
aa5e008fa5 optimize scaling epilogue 2025-08-01 11:01:23 +00:00
Feng Shijie
3f43b841d4 prune debug message 2025-07-30 06:37:26 +00:00
Feng Shijie
2e5d4c74cd fix compile error 2025-07-30 04:52:08 +00:00
Feng Shijie
c117a1986a Add persistent option on flatmm for tuning 2025-07-29 15:42:58 +00:00
AMD-dteng
a587701117 update pipeline v1: add atomic IGLP schedule 2025-07-29 14:59:32 +00:00
Feng Shijie
1b6d7cf407 crz idea 2025-07-28 08:24:51 +00:00
Feng Shijie
5473f06461 Add permuteN optimization when NRepeat % 2 == 0 on flatmm 2025-07-27 11:57:38 +00:00
sjfeng
bfb9f4002f try to remove c_shuffle_lds 2025-07-27 17:24:08 +08:00
Feng Shijie
1264f4d2ab fix loop-dim mismatch and improve c_shuffle alu parallelism 2025-07-25 07:41:48 +00:00
lalala-sh
1239d8a546 merge flatmm -scale 2025-07-24 08:46:51 +00:00
Feng Shijie
b908f5e803 fix flatmm syntax error on gfx950 2025-07-23 19:12:31 +00:00
Feng Shijie
5a1183ebbd support flatmm scaling 2025-07-23 19:04:22 +00:00
valarLip
89fa639207 merge flatmm pipe v0 from dteng_flatmm_opt 2025-07-23 09:50:33 +00:00
lalala-sh
3f7d848dd3 build pass 2025-07-23 15:38:12 +08:00
lalala-sh
6dacf833da fix bug 2025-07-23 07:20:26 +00:00
lalala-sh
7e1bd4b839 sync 2025-07-23 15:01:53 +08:00
valarLip
46a538e39e adaptive scheduler instead of Macro definition 2025-07-23 13:40:43 +08:00
lalala-sh
9aa3396a79 fix tail handler bug 2025-07-17 08:40:35 +00:00
lalala-sh
fb76450e63 merge from dteng_flatmm_opt 2025-07-16 10:12:19 +00:00
MHYangAMD
3499fe67ff [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1.  Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>
2025-07-16 14:05:26 +08:00
carlushuang
cfe211cc60 [CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need use local tokens to compute

* support extra chain local_token

* update

* update

* refine some main

* update

* support dispatch_policy

* fix 15 example
2025-07-15 09:42:18 +08:00
Khushbu Agarwal
d239b91fd5 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example

* some more renames

* clang formatted
2025-07-11 08:27:55 -07:00
Qianfeng
45904b8fd7 Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* convert i_nhead_ type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>
2025-07-11 18:14:47 +08:00
shay-li77
d814fefe18 support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis

* format code

* add notes for param y_ratio

* fix comments error

* support template and mdiv for ratio mask

* refactor y-ratio mask constructor

* optimize coordinate calculation

* add SimplifiedRatioAttentionMask
2025-07-09 23:18:55 +08:00
Haocong WANG
5557eadce6 [CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* cmake depends & bwd filter order fix

* revert unexpected reformat

* Avoid change fmha bwd filter order for downstream compatibility

* Revert unexpected changes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
Po Yen Chen
b2dea90116 Eliminate warning caused by failing to meet the occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>
2025-07-08 09:17:25 +08:00
Thomas Ning
f240ae3248 Enable Async Copy for MI355 (#2425)
* add for async load builtin

* add async load api

* fix some compiling errors

* fix a compiling error

* fix some compiling errors

* add a pipeline which copies from v4

* add a new pipeline for async load

* fix some compiling errors

* add async load tests

* fix some issues in async load

* fix

* fix async inline assembly

* fix async inline assembly

* add ignore header file

* comment out non-gfx950 code

* comment out non-gfx950 code

* fix an error

* update async load apis

* fix lds descriptor

* fix a compiling error

* fix some compiling errors

* fix a descriptor issue

* update lds descriptor

* change async pipeline's tile distribution pattern from thread to warp

* fix clang format

* update async policy

* fix a CRTP issue

* fix a typo

* change lds layout

* fix some sync issues

* improve codes

* delete the async test

* fix a commented format issue

* avoid compiling device functions when compiling for host

* make gemm run

* add the copy kernel support

* finish the feature

* Address comment

* add the support for buffer_builtin

* solved the merging problem

* Comment Addressed

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>
2025-07-07 10:08:49 -07:00
ltqin
9f4c5d7372 ck tile pagedkv prefill (#2405)
* add prefetching physical block id for pagedkv

* start add pagedkv prefill

* rename pipeline

* add kernel for pagedkv

* add an init version pagedkv prefill

* fix redefine issue

* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args

* generate dispatch code

* add body generating code

* compiling pass

* remove dropout from pagedkv

* set lse to false in generating code

* start changing qr kernel to pagedkv

* init version of kernel with pagedkv

* change names of files that are generated

* change host validation for pagedkv prefill

* using iglp to change blockgemm

* add kernel files to op head file

* show parameters

* rewrite print parameter fun

* add fwd

* remove default parameter of GridSize

* format

* fix nhead issue and add seqlen_k_ptr to batch mode

* format code

* remove no-longer used code

* format

* fix some comments

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-07-07 16:16:54 +08:00
Thomas Ning
e03293ebce [CK Tile] Int8 Support on CK Tile GEMM (#2267)
* updates to support int8 in 03_gemm example

* added comments, using aliases, helper functions

* test(gemm_universal): add test cases for int8 gemm pipeline

* fix(test_gemm): fix for failing test unit test for int8

* test(ck_tile): add int8 unit test for gemm universal

* refactor(gemm_universal): GPU reference verification for GEMM code improved

* style(gemm_universal): removed extra comments and did clang format

* merging recent changes to universal gemm to tile_engine

* ck tile engine integration work

* feat(tile_engine): add int8 support to tile engine ops/gemm

* feat(tile_engine): added 32x32x16 mfma instances to tile engine for int8

* style: Format code with clang-format-12

* refactor(tile_engine): address review comments

* style: removed unhelpful comments & unused variables.

* build: tile engine uses default config

* feat: add int8 support for CK_TILE GEMM

* style: added trailing commas to codegen_utils.py

* refactor: tile engine

* refactor: formatting and code review

* refactor: code formatting for python files

* fix: suppress build warning

* add support for gfx950

* refactor:KWarpTile size in gemms util

* Fix the branch and wrap up the k warp tile

* Add bf8 integration

* refactor: clang format and rebase

---------

Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
2025-06-25 08:20:35 -07:00
linqunAMD
37e1a27537 [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp definition to a separate struct

* fix clang format

* fix clang format

* keep default behavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-25 01:07:45 -07:00