Commit Graph

66 Commits

Author SHA1 Message Date
AMD-dteng
31ec8bd33a use builtin function to wait A load's data 2025-09-17 06:54:32 -05:00
Feng Shijie
186c69960e Change the preshuffle pattern of MFMA_16x16x128_F8 2025-09-15 06:07:34 +00:00
Feng Shijie
fdbb22d2b6 Add async_load for tile_scatter_gather 2025-09-12 05:32:45 +00:00
Feng Shijie
a44d35172d Add async_load for tile_scatter_gather 2025-09-11 07:39:41 +00:00
Feng Shijie
2e36b3c3ff Add buffer_load_lds synchronization 2025-09-11 07:06:52 +00:00
Feng Shijie
1c05d3be89 Disable hot scheduler when enabling buffer_load_lds 2025-09-11 06:40:11 +00:00
Feng Shijie
f4fdaedf4c Add macro option to enable BUFFER_LOAD_LDS 2025-09-11 05:42:27 +00:00
Feng Shijie
e7c1c77120 Enable async_load in the F16xF4_flatmm_pipeline on gfx950 2025-09-10 11:41:54 +00:00
Feng Shijie
b50c8ee350 enable async_load on gfx950 for a16w4 flatmm pipeline 2025-09-09 08:44:20 +00:00
Feng Shijie
9efde19691 Refactor scale and bias function, encapsulate scale/bias pointer to a tensor view and load data by tilewise operation 2025-09-09 03:52:03 +00:00
coderfeli
946be930dd update 2025-09-07 08:22:07 +00:00
Feng Shijie
65b702454c support swiglu activation and use rcpf to accelerate silu 2025-08-26 12:32:29 +00:00
root
d05eed931d add line to last 2025-08-22 04:01:59 -05:00
root
d69cab7f0c adjust A_LDS descriptor to avoid bankconflict 2025-08-22 03:20:46 -05:00
root
65989e940c enable hotloop 2025-08-21 09:46:52 -05:00
Feng Shijie
9fbcc8f8a4 use v4i32 as the storage type for B to avoid repack operation 2025-08-20 13:53:32 +00:00
Feng Shijie
c27eb0771a optimize cvt_pkf4_to_f16 implementation 2025-08-20 04:39:14 +00:00
Feng Shijie
3ca0bd500a optimize A_LDS descriptor to avoid bankconflict 2025-08-19 14:56:46 +00:00
Feng Shijie
be55c0f9cb add fp16xf4 moe 2025-08-18 17:28:11 +00:00
Feng Shijie
599e1f5b32 rename example 2025-08-17 17:51:18 +00:00
Feng Shijie
7899fb4a8d remove additional check when e8m0->float 2025-08-15 06:20:46 +00:00
Feng Shijie
714b341797 eliminate repeat dequant 2025-08-14 09:34:12 +00:00
Feng Shijie
53e8c0c533 Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm 2025-08-13 16:51:49 +00:00
Feng Shijie
5de6208952 update f16xMXF4 2025-08-13 16:16:48 +00:00
Feng Shijie
732ebdee8b update scale-preshuffle for MXF4 2025-08-13 10:48:53 +00:00
Feng Shijie
edb58d0680 update 2025-08-11 11:24:34 +00:00
Feng Shijie
200a11afc8 update scale for mxfp4 2025-08-11 07:59:47 +00:00
Feng Shijie
f788d3d629 add mixed_prec fp16xfp4 2025-08-08 20:19:16 +00:00
Feng Shijie
3dea10a277 debug mixed_prec flatmm 2025-08-07 09:22:04 +00:00
Feng Shijie
6d3cbc7c0e add moe_flatmm 2025-08-06 08:33:33 +00:00
coderfeli
c0cb4d036d fix split k 2025-08-06 02:45:31 +00:00
Feng Shijie
3f43b841d4 prune debug message 2025-07-30 06:37:26 +00:00
Feng Shijie
2e5d4c74cd fix compile error 2025-07-30 04:52:08 +00:00
Feng Shijie
c117a1986a Add persistent option on flatmm for tuning 2025-07-29 15:42:58 +00:00
AMD-dteng
a587701117 update pipeline v1: add atomic IGLP schedule 2025-07-29 14:59:32 +00:00
Feng Shijie
1b6d7cf407 crz idea 2025-07-28 08:24:51 +00:00
Feng Shijie
5473f06461 Add permuteN optimization when NRepeat % 2 == 0 on flatmm 2025-07-27 11:57:38 +00:00
lalala-sh
1239d8a546 merge flatmm -scale 2025-07-24 08:46:51 +00:00
Feng Shijie
b908f5e803 fix flatmm syntax error on gfx950 2025-07-23 19:12:31 +00:00
Feng Shijie
5a1183ebbd support flatmm scaling 2025-07-23 19:04:22 +00:00
valarLip
89fa639207 merge flatmm pipe v0 from dteng_flatmm_opt 2025-07-23 09:50:33 +00:00
lalala-sh
3f7d848dd3 build pass 2025-07-23 15:38:12 +08:00
lalala-sh
6dacf833da fix bug 2025-07-23 07:20:26 +00:00
lalala-sh
7e1bd4b839 sync 2025-07-23 15:01:53 +08:00
valarLip
46a538e39e adaptive scheduler instead of Macro definition 2025-07-23 13:40:43 +08:00
lalala-sh
9aa3396a79 fix tail handler bug 2025-07-17 08:40:35 +00:00
lalala-sh
fb76450e63 merge from dteng_flatmm_opt 2025-07-16 10:12:19 +00:00
Khushbu Agarwal
d239b91fd5 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example

* some more renames

* clang formatted
2025-07-11 08:27:55 -07:00
linqunAMD
37e1a27537 [CK_TILE] Refine fp8 support in flatmm (#2239)
* [CK_TILE] Refine fp8 in flatmm

1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp definition to a separate struct

* fix clang format

* fix clang format

* keep default behavior unchanged (warp tile = 16x16)

* fix tile engine build error

* fix a typo in codegen_utils.py

* address review comments

* address review comments

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-06-25 01:07:45 -07:00
Satyanvesh Dittakavi
4c57157d50 Do not use warpSize as compile time constant as it is removed (#2320)
* Do not use warpSize as compile time constant as it is removed

* Update tile_image_to_column_shape.hpp

update warpSize usage.

* clean-up all use of warpSize, make sure code builds

* fix

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-06-17 11:54:30 -07:00