2192 Commits

Author SHA1 Message Date
lalala-sh
10a288c3a2 opt moe sorting (#2822)
* opt moe sorting for 2k

* rm duplicated clear

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>
2025-09-18 08:57:20 +08:00
coderfeli
6a600f0cff fix unittest 2025-09-10 10:14:22 +00:00
lalala-sh
72d13f3f55 merge felix/sorting 2025-09-10 09:31:50 +00:00
coderfeli
e6ffff0fbf Merge remote-tracking branch 'origin/feat-mixed_input_flatmm' into wip_355 2025-09-09 11:02:07 +00:00
coderfeli
41bc59f5b1 merge nkpad 2025-09-09 10:59:14 +00:00
Feng Shijie
4b49839c63 fix the tensor_view shape of C in gemm1-gate-up 2025-09-09 07:41:32 +00:00
Feng Shijie
08bae35abd fix backward compatibility 2025-09-09 04:27:33 +00:00
Feng Shijie
b09b6cdce9 fix scale_m gather load for a8w8 moe 2025-09-09 04:18:48 +00:00
Feng Shijie
027f5311c6 support persist-mode for moe-gemm 2025-09-09 03:52:03 +00:00
Feng Shijie
a725bc08a7 fix gate-up mismatch when OutputNRepeat > 1 2025-09-09 03:52:03 +00:00
Feng Shijie
9efde19691 Refactor scale and bias function, encapsulate scale/bias pointer to a tensor view and load data by tilewise operation 2025-09-09 03:52:03 +00:00
coderfeli
946be930dd update 2025-09-07 08:22:07 +00:00
lalala-sh
adc8db73e0 only padding N in epilogue 2025-09-05 02:50:47 +00:00
Feng Shijie
5c484a5672 Add bias for f16xf4 moe_flatmm 2025-08-28 08:02:50 +00:00
Feng Shijie
dd6539f366 update case construction 2025-08-27 13:39:47 +00:00
Feng Shijie
65b702454c support swiglu activation and use rcpf to accelerate silu 2025-08-26 12:32:29 +00:00
root
d05eed931d add line to last 2025-08-22 04:01:59 -05:00
root
d69cab7f0c adjust A_LDS descriptor to avoid bank conflict 2025-08-22 03:20:46 -05:00
root
65989e940c enable hotloop 2025-08-21 09:46:52 -05:00
Feng Shijie
c378e9bdf8 support atomic_pk_add_bf16 on gfx950 2025-08-21 09:12:21 +00:00
Feng Shijie
85976b0b87 use int64_t as expert stride to avoid overflow 2025-08-21 06:58:55 +00:00
Feng Shijie
9fbcc8f8a4 use v4i32 as the storage type for B to avoid repack operation 2025-08-20 13:53:32 +00:00
Feng Shijie
81899bd920 add pk_fp4_t and e8m0_t support for amd_buffer_load_impl 2025-08-20 06:40:03 +00:00
Feng Shijie
c27eb0771a optimize cvt_pkf4_to_f16 implementation 2025-08-20 04:39:14 +00:00
Feng Shijie
3ca0bd500a optimize A_LDS descriptor to avoid bank conflict 2025-08-19 14:56:46 +00:00
Feng Shijie
f7f0306eea fix gate-up when GU_NRepeat > 1 2025-08-18 18:43:37 +00:00
Feng Shijie
be55c0f9cb add fp16xf4 moe 2025-08-18 17:28:11 +00:00
Feng Shijie
599e1f5b32 rename example 2025-08-17 17:51:18 +00:00
Feng Shijie
7899fb4a8d remove additional check when e8m0->float 2025-08-15 06:20:46 +00:00
Feng Shijie
714b341797 eliminate repeat dequant 2025-08-14 09:34:12 +00:00
Feng Shijie
53e8c0c533 Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm 2025-08-13 16:51:49 +00:00
Feng Shijie
5de6208952 update f16xMXF4 2025-08-13 16:16:48 +00:00
Feng Shijie
732ebdee8b update scale-preshuffle for MXF4 2025-08-13 10:48:53 +00:00
Feng Shijie
edb58d0680 update 2025-08-11 11:24:34 +00:00
Feng Shijie
cc9c7b9e58 optimize gemm2 atomic_add pattern 2025-08-11 08:38:23 +00:00
Feng Shijie
200a11afc8 update scale for mxfp4 2025-08-11 07:59:47 +00:00
Feng Shijie
87aed564dc update case construction 2025-08-11 07:56:14 +00:00
Feng Shijie
8b85fa6cf2 update granularity control 2025-08-11 06:03:06 +00:00
Feng Shijie
1b8c7097b8 fix TileConfig 2025-08-11 03:42:46 +00:00
Gino Lu
8ba1c708dc Add e8m0 scaled convert into CK_TILE (#2617)
* first commit

* remove redundant code

* modify according to comments.

* fix type_convert error with scaled_type_convert
2025-08-10 07:42:42 +00:00
Feng Shijie
f788d3d629 add mixed_prec fp16xfp4 2025-08-08 20:19:16 +00:00
Feng Shijie
3dea10a277 debug mixed_prec flatmm 2025-08-07 09:22:04 +00:00
lalala-sh
0ba513b148 Merge pull request #2626 from ROCm/felix/flatmm_fix_splitk
fix split k
2025-08-06 16:49:47 +08:00
Feng Shijie
6d3cbc7c0e add moe_flatmm 2025-08-06 08:33:33 +00:00
coderfeli
c0cb4d036d fix split k 2025-08-06 02:45:31 +00:00
Feng Shijie
90e910f3a7 fix flatmm with scaling when WarpTileM == 32 2025-08-04 07:16:36 +00:00
Feng Shijie
aa5e008fa5 optimize scaling epilogue 2025-08-01 11:01:23 +00:00
Feng Shijie
ac5908c0bb fix wrong config for fp8 scaling 2025-08-01 07:28:38 +00:00
Feng Shijie
3f43b841d4 prune debug message 2025-07-30 06:37:26 +00:00
Feng Shijie
2e5d4c74cd fix compile error 2025-07-30 04:52:08 +00:00