AMD-dteng
|
31ec8bd33a
|
use builtin function to wait A load's data
|
2025-09-17 06:54:32 -05:00 |
|
Feng Shijie
|
186c69960e
|
Change the preshuffle pattern of MFMA_16x16x128_F8
|
2025-09-15 06:07:34 +00:00 |
|
Feng Shijie
|
fdbb22d2b6
|
Add async_load for tile_scatter_gather
|
2025-09-12 05:32:45 +00:00 |
|
Feng Shijie
|
a44d35172d
|
Add async_load for tile_scatter_gather
|
2025-09-11 07:39:41 +00:00 |
|
Feng Shijie
|
2e36b3c3ff
|
Add buffer_load_lds synchronization
|
2025-09-11 07:06:52 +00:00 |
|
Feng Shijie
|
1c05d3be89
|
Disable hot scheduler when enabling buffer_load_lds
|
2025-09-11 06:40:11 +00:00 |
|
Feng Shijie
|
f4fdaedf4c
|
Add macro option to enable BUFFER_LOAD_LDS
|
2025-09-11 05:42:27 +00:00 |
|
Feng Shijie
|
e7c1c77120
|
Enable async_load in the F16xF4_flatmm_pipeline on gfx950
|
2025-09-10 11:41:54 +00:00 |
|
Feng Shijie
|
b50c8ee350
|
enable async_load on gfx950 for a16w4 flatmm pipeline
|
2025-09-09 08:44:20 +00:00 |
|
Feng Shijie
|
4b49839c63
|
fix the tensor_view shape of C in gemm1-gate-up
|
2025-09-09 07:41:32 +00:00 |
|
Feng Shijie
|
08bae35abd
|
fix backward compatibility
|
2025-09-09 04:27:33 +00:00 |
|
Feng Shijie
|
b09b6cdce9
|
fix scale_m gather load for a8w8 moe
|
2025-09-09 04:18:48 +00:00 |
|
Feng Shijie
|
027f5311c6
|
support persist-mode for moe-gemm
|
2025-09-09 03:52:03 +00:00 |
|
Feng Shijie
|
a725bc08a7
|
fix gate-up mismatch when OutputNRepeat > 1
|
2025-09-09 03:52:03 +00:00 |
|
Feng Shijie
|
9efde19691
|
Refactor scale and bias function, encapsulate scale/bias pointer to a tensor view and load data by tilewise operation
|
2025-09-09 03:52:03 +00:00 |
|
coderfeli
|
946be930dd
|
update
|
2025-09-07 08:22:07 +00:00 |
|
lalala-sh
|
adc8db73e0
|
only padding N in epilogue
|
2025-09-05 02:50:47 +00:00 |
|
Feng Shijie
|
5c484a5672
|
Add bias for f16xf4 moe_flatmm
|
2025-08-28 08:02:50 +00:00 |
|
Feng Shijie
|
65b702454c
|
support swiglu activation and use rcpf to accelerate silu
|
2025-08-26 12:32:29 +00:00 |
|
root
|
d05eed931d
|
add line to last
|
2025-08-22 04:01:59 -05:00 |
|
root
|
d69cab7f0c
|
adjust A_LDS descriptor to avoid bank conflict
|
2025-08-22 03:20:46 -05:00 |
|
root
|
65989e940c
|
enable hotloop
|
2025-08-21 09:46:52 -05:00 |
|
Feng Shijie
|
c378e9bdf8
|
support atomic_pk_add_bf16 on gfx950
|
2025-08-21 09:12:21 +00:00 |
|
Feng Shijie
|
85976b0b87
|
use int64_t as expert stride to avoid overflow
|
2025-08-21 06:58:55 +00:00 |
|
Feng Shijie
|
9fbcc8f8a4
|
use v4i32 as the storage type for B to avoid repack operation
|
2025-08-20 13:53:32 +00:00 |
|
Feng Shijie
|
81899bd920
|
add pk_fp4_t and e8m0_t support for amd_buffer_load_impl
|
2025-08-20 06:40:03 +00:00 |
|
Feng Shijie
|
c27eb0771a
|
optimize cvt_pkf4_to_f16 implementation
|
2025-08-20 04:39:14 +00:00 |
|
Feng Shijie
|
3ca0bd500a
|
optimize A_LDS descriptor to avoid bank conflict
|
2025-08-19 14:56:46 +00:00 |
|
Feng Shijie
|
f7f0306eea
|
fix gate-up when GU_NRepeat > 1
|
2025-08-18 18:43:37 +00:00 |
|
Feng Shijie
|
be55c0f9cb
|
add fp16xf4 moe
|
2025-08-18 17:28:11 +00:00 |
|
Feng Shijie
|
599e1f5b32
|
rename example
|
2025-08-17 17:51:18 +00:00 |
|
Feng Shijie
|
7899fb4a8d
|
remove additional check when e8m0->float
|
2025-08-15 06:20:46 +00:00 |
|
Feng Shijie
|
714b341797
|
eliminate repeat dequant
|
2025-08-14 09:34:12 +00:00 |
|
Feng Shijie
|
53e8c0c533
|
Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm
|
2025-08-13 16:51:49 +00:00 |
|
Feng Shijie
|
5de6208952
|
update f16xMXF4
|
2025-08-13 16:16:48 +00:00 |
|
Feng Shijie
|
732ebdee8b
|
update scale-preshuffle for MXF4
|
2025-08-13 10:48:53 +00:00 |
|
Feng Shijie
|
edb58d0680
|
update
|
2025-08-11 11:24:34 +00:00 |
|
Feng Shijie
|
cc9c7b9e58
|
optimize gemm2 atomic_add pattern
|
2025-08-11 08:38:23 +00:00 |
|
Feng Shijie
|
200a11afc8
|
update scale for mxfp4
|
2025-08-11 07:59:47 +00:00 |
|
Gino Lu
|
8ba1c708dc
|
Add e8m0 scaled convert into CK_TILE (#2617)
* first commit
* remove redundant code
* modify according to comments.
* fix type_convert error with scaled_type_convert
|
2025-08-10 07:42:42 +00:00 |
|
Feng Shijie
|
f788d3d629
|
add mixed_prec fp16xfp4
|
2025-08-08 20:19:16 +00:00 |
|
Feng Shijie
|
3dea10a277
|
debug mixed_prec flatmm
|
2025-08-07 09:22:04 +00:00 |
|
Feng Shijie
|
6d3cbc7c0e
|
add moe_flatmm
|
2025-08-06 08:33:33 +00:00 |
|
coderfeli
|
c0cb4d036d
|
fix split k
|
2025-08-06 02:45:31 +00:00 |
|
Feng Shijie
|
90e910f3a7
|
fix flatmm with scaling when WarpTileM == 32
|
2025-08-04 07:16:36 +00:00 |
|
Feng Shijie
|
aa5e008fa5
|
optimize scaling epilogue
|
2025-08-01 11:01:23 +00:00 |
|
Feng Shijie
|
3f43b841d4
|
prune debug message
|
2025-07-30 06:37:26 +00:00 |
|
Feng Shijie
|
2e5d4c74cd
|
fix compile error
|
2025-07-30 04:52:08 +00:00 |
|
Feng Shijie
|
c117a1986a
|
Add persistent option on flatmm for tuning
|
2025-07-29 15:42:58 +00:00 |
|
AMD-dteng
|
a587701117
|
update pipeline v1: add atomic IGLP schedule
|
2025-07-29 14:59:32 +00:00 |
|