aska-0096
bb5bdff61c
remove unnecessary files
2025-05-30 08:39:25 +00:00
Ding, Yi
6cba96e510
Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns
2025-05-30 05:46:31 +00:00
Ding, Yi
50956c6c7b
Merge remote-tracking branch 'origin/wjx/moe_v3_aiter' into gfx950-mxfp4
2025-05-30 03:56:35 +00:00
Ding, Yi
69418725a6
Merge remote-tracking branch 'origin/moe_bs_fp8_no_asm' into gfx950-mxfp4
2025-05-30 03:15:47 +00:00
aska-0096
d563dac424
fix performance bug of bpreshuffle f8 gemm
2025-05-29 10:02:46 +00:00
valarLip
ccddc5215e
recover example
2025-05-29 09:09:40 +00:00
aska-0096
c3d52993c4
update the flag name for f8blockscale
2025-05-29 08:47:34 +00:00
OscarXu
6be76c53b6
No asm ver. for merging moe blockscale fp8 into mainline
2025-05-29 03:38:56 -05:00
aska-0096
0db8d71dc1
Remove debug infos; Enable flags for blockscale f8
2025-05-29 08:21:54 +00:00
OscarXu
52d68c9529
flag and barrier fix for compiler branch MainOpSelV3
2025-05-29 03:13:11 -05:00
Ding, Yi
f9ccd1a378
Fix bf8 config
2025-05-29 02:20:47 +00:00
Ding, Yi
2b4b189a5f
Fix fp8 config
2025-05-29 02:18:02 +00:00
OscarXu
653bc83f8a
Remove rocm6.3 workaround flags and macro
2025-05-28 21:05:21 -05:00
Ding, Yi
35b436c0d9
Clang-format after 2 merges
2025-05-28 11:16:00 +00:00
Ding, Yi
aecac410d0
Merge remote-tracking branch 'origin/f8blk_scale_opt' into wip-f4-mergemoe-2
2025-05-28 11:15:22 +00:00
OscarXu
772debdf8f
Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format
2025-05-28 18:29:06 +08:00
Ding, Yi
ad7fd89c1d
Merge remote-tracking branch 'origin/feiw/mxfp4_moe_2Stages' into wip-f4
2025-05-28 09:28:26 +00:00
Ding, Yi
857ef9f8c4
Merge preshuffle device
2025-05-28 07:02:28 +00:00
aska-0096
78d0fd4e65
add vmcnt guard for async copy
2025-05-28 03:47:46 +00:00
Ding, Yi
b99c50a5d5
pad ascale
2025-05-28 03:35:33 +00:00
Ding, Yi
cf5b4c11a2
Pad shuffled a scale only
2025-05-28 02:37:14 +00:00
aska-0096
65255e12fb
Unconditional Ascale padding
2025-05-28 01:55:23 +00:00
mtgu0705
2f0ee8ccb1
change the gemm1 tile from 64x128x128 to 128x64x128
2025-05-27 20:43:38 -05:00
mtgu0705
52b764d59f
update MX moe GEMM1 hotloopscheduling
2025-05-27 20:43:22 -05:00
aska-0096
63c9388881
Pad the M for scale buffer unconditionally
2025-05-27 11:52:12 +00:00
aska-0096
9da2995163
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 10:23:21 +00:00
aska-0096
04f7265c19
refactor the pipeline
2025-05-27 10:14:45 +00:00
Ding, Yi
d3015785cb
Fix 'Merge gemm_mx_common.hpp'
2025-05-27 09:08:02 +00:00
aska-0096
71e7346bf4
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 07:32:16 +00:00
aska-0096
137e28d151
temp save, 4.4~4.5
2025-05-27 07:31:16 +00:00
Ding, Yi
85ac576109
Merge gemm_mx_common.hpp
2025-05-27 06:13:03 +00:00
Ding, Yi
123053b685
Merge remote-tracking branch 'origin/wip-f4-wp' into wip-f4
2025-05-27 03:36:38 +00:00
aska-0096
d1d56e89ef
fix the correctness issue
2025-05-26 09:29:36 +00:00
Ding, Yi
40af523e2c
Add rotating to mx examples
2025-05-26 05:05:54 +00:00
Andriy Roshchenko
f03da29b65
Merge branch origin/wip-f4 into andriy/wip-f4
2025-05-23 22:14:30 +00:00
Andriy Roshchenko
1c91f6bf1e
Fix example_gemm_mx build
2025-05-23 22:00:07 +00:00
aska-0096
574d65efed
temp save
2025-05-23 14:51:24 +00:00
feifei14119
2e39bf06f7
fix typo
2025-05-23 11:23:01 +00:00
mtgu0705
2216ff0521
update mx moe gemm1 gemm2 TF and BW calculation
2025-05-23 05:29:39 -05:00
mtgu0705
d6bfdc9d7d
update mx moe gemm1_bns tile size to 64x128x256
2025-05-23 05:10:45 -05:00
feifei14119
ce4e7b39da
gemm1 func pass
2025-05-23 09:26:38 +00:00
aska-0096
a4dae9eb86
optimize offset math in dma
2025-05-22 08:15:31 +00:00
aska-0096
7f7c4d35c7
lds conflict free + buffer load lds
2025-05-22 08:04:52 +00:00
Andriy Roshchenko
e302ab8f0c
Merge branch origin/develop into wip-fp4
2025-05-22 06:31:47 +00:00
Lin, Qun
97709c4aa1
correct preShuffleBuffer
we should use packed k to do shuffle.
2025-05-22 01:09:13 -05:00
OscarXu
fc9ef98e7b
Add gemm2 64x128x128 asm. Fix BF16 ref.
2025-05-21 16:57:57 +08:00
SamiAario-AMD
380bca2b85
Fix 11_add_rmsnorm2d_rdquant (#2207)
2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749
Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake
* clang format
2025-05-20 15:05:08 -07:00
mtgu0705
eb588bc083
update the TFlops calculation in the example
2025-05-20 10:00:13 -05:00
Sami Remes
d1e6f0982d
[CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm
* Some helper functions for persistent kernel case
* Get max occupancy grid using device properties
* Implement tile loop in main entry point to grouped gemm
* Enable GridSize() on device
* Handle offset tile index using real current block index
* Add persistent kernel choice to grouped gemm example
* Use a for-loop for iterating over the group
* Reduce VGPR spills by early-exit
* Enable persistent kernel choice in grouped_gemm example
* Add persistent kernel option to grouped_gemm test
* Fix formatting with remod.py
* Remove GridUpdateBlocks as blocks are now iteratively computed
* Add comment about VGPR spilling
* Fix formatting
* Use CK_TILE_HOST instead of __host__
* Enable all Row/Col combinations in grouped gemm unit test
* Add some KBatch=2 cases to grouped gemm tests
* Fix SplitK for grouped gemm
* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm
* Add type traits
* Split examples to regular and tileloop
* Formatting
* Use hipExtStreamGetCUMask to get current active CUs for the given stream
* Align test and example kernel config, and disable validation for splitk repeats
* Remove debug options from CMakeLists.txt
* Separate the code paths for persistent/non-persistent in test
* Fix formatting
* Address review comments
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00