mtgu0705
40ed20a30d
updated the codes
2025-06-04 02:04:28 -05:00
mtgu0705
5117e99822
Merge remote-tracking branch 'origin/mx_moe_f4_scaleshuffle_Bnoshuffle' into mxfp4_moe_blockscale_buf2lds
2025-06-04 00:14:18 -05:00
Ding, Yi
dd8284dd63
Fix unknown compiler flag
2025-06-04 03:54:39 +00:00
aska-0096
86c8bef5d7
Refactor thread_copy_lds_direct_load; fix gfx942 direct lds load example; fix f16_pki4 example
2025-06-03 14:54:30 +00:00
Ding, Yi
0cfb09c6c8
Fix target_compile_options for disabled target on gfx942
2025-06-03 06:52:52 +00:00
Ding, Yi
331ccb8ca2
Merge remote-tracking branch 'origin/develop' into gfx950-mxfp4
2025-06-03 05:38:56 +00:00
Ding, Yi
f9bf27548e
Generate random tensor values with multiple threads
2025-06-03 02:40:24 +00:00
Khushbu Agarwal
2e38eb4f1c
Rotating buffer PR CI fix (#2257)
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"
This reverts commit bbdaf79a52.
* fix regression
2025-06-02 10:25:01 -07:00
aska-0096
dd24786f78
Enable splitk for mxfp4; clang format;
2025-06-02 12:23:01 +00:00
aska-0096
5696e3c9f5
fix the cmake issue
2025-05-30 15:51:58 +00:00
aska-0096
bb5bdff61c
remove unnecessary files
2025-05-30 08:39:25 +00:00
Ding, Yi
6cba96e510
Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns
2025-05-30 05:46:31 +00:00
Ding, Yi
50956c6c7b
Merge remote-tracking branch 'origin/wjx/moe_v3_aiter' into gfx950-mxfp4
2025-05-30 03:56:35 +00:00
Ding, Yi
69418725a6
Merge remote-tracking branch 'origin/moe_bs_fp8_no_asm' into gfx950-mxfp4
2025-05-30 03:15:47 +00:00
slippedJim
57f497452a
remove restriction of group mode hd192 no lse (#2252)
Co-authored-by: Jim <jimguo12@amd.com>
2025-05-30 10:14:21 +08:00
aska-0096
d563dac424
fix performance bug of bpreshuffle f8 gemm
2025-05-29 10:02:46 +00:00
Po Yen Chen
28cd0dffc9
[CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251)
* Add constraint on traits/tile/pipeline
* Use kM0=128 if max_seqlen_q == 8192
* Re-format codegen script
* Remove redundant attr name postfix
* Fix import error: default field in dataclass
* Use kK0=64 & kK1=64 to hide latency
* Use CU utilization to decide tile size
2025-05-29 18:36:33 +09:00
valarLip
ccddc5215e
recover example
2025-05-29 09:09:40 +00:00
aska-0096
c3d52993c4
update the flag name for f8blockscale
2025-05-29 08:47:34 +00:00
OscarXu
6be76c53b6
No asm ver. for merging moe blockscale fp8 into mainline
2025-05-29 03:38:56 -05:00
aska-0096
0db8d71dc1
Remove debug infos; Enable flags for blockscale f8
2025-05-29 08:21:54 +00:00
OscarXu
52d68c9529
flag and barrier fix for compiler branch MainOpSelV3
2025-05-29 03:13:11 -05:00
Ding, Yi
f9ccd1a378
Fix bf8 config
2025-05-29 02:20:47 +00:00
Ding, Yi
2b4b189a5f
Fix fp8 config
2025-05-29 02:18:02 +00:00
OscarXu
653bc83f8a
Remove rocm6.3 workaround flags and macro
2025-05-28 21:05:21 -05:00
Illia Silin
bbdaf79a52
Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)
This reverts commit 99857e10e6.
2025-05-28 09:46:52 -06:00
Ding, Yi
35b436c0d9
Clang-format after 2 merges
2025-05-28 11:16:00 +00:00
Ding, Yi
aecac410d0
Merge remote-tracking branch 'origin/f8blk_scale_opt' into wip-f4-mergemoe-2
2025-05-28 11:15:22 +00:00
OscarXu
772debdf8f
Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format
2025-05-28 18:29:06 +08:00
Ding, Yi
ad7fd89c1d
Merge remote-tracking branch 'origin/feiw/mxfp4_moe_2Stages' into wip-f4
2025-05-28 09:28:26 +00:00
Ding, Yi
857ef9f8c4
Merge preshuffle device
2025-05-28 07:02:28 +00:00
Khushbu Agarwal
99857e10e6
[CK_tile] Add rotating buffer feature for universal gemm (#2200)
* Add rotating buffer feature for universal gemm
* adding changes in tile_engine
* Updated code to merge kernel_launch
* removing comments
* Enable rotating buffer changes to flatmm
* Created diff launch_kernel function for rotating buffer
* Simplified calculation using macros
* merge code with new changes in tile_engine
* clang formatted
* Redefine macros
2025-05-27 23:00:58 -07:00
Aviral Goel
c52649ad57
Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234)
* added catch statements to examples
* clang format
2025-05-27 22:32:42 -07:00
aska-0096
78d0fd4e65
add vmcnt guard for async copy
2025-05-28 03:47:46 +00:00
Ding, Yi
b99c50a5d5
pad ascale
2025-05-28 03:35:33 +00:00
Ding, Yi
cf5b4c11a2
Pad shuffled a scale only
2025-05-28 02:37:14 +00:00
aska-0096
65255e12fb
Unconditional Ascale padding
2025-05-28 01:55:23 +00:00
mtgu0705
2f0ee8ccb1
change the gemm1 tile from 64x128x128 to 128x64x128
2025-05-27 20:43:38 -05:00
mtgu0705
52b764d59f
update MX moe GEMM1 hotloopscheduling
2025-05-27 20:43:22 -05:00
aska-0096
63c9388881
Pad the M for scale buffer unconditionally
2025-05-27 11:52:12 +00:00
aska-0096
9da2995163
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 10:23:21 +00:00
aska-0096
04f7265c19
refactor the pipeline
2025-05-27 10:14:45 +00:00
Ding, Yi
d3015785cb
Fix 'Merge gemm_mx_common.hpp'
2025-05-27 09:08:02 +00:00
aska-0096
71e7346bf4
Merge branch 'wip-f4' of https://github.com/ROCm/composable_kernel into wip-f4
2025-05-27 07:32:16 +00:00
aska-0096
137e28d151
temp save, 4.4~4.5
2025-05-27 07:31:16 +00:00
Ding, Yi
85ac576109
Merge gemm_mx_common.hpp
2025-05-27 06:13:03 +00:00
Ding, Yi
123053b685
Merge remote-tracking branch 'origin/wip-f4-wp' into wip-f4
2025-05-27 03:36:38 +00:00
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d.
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
aska-0096
d1d56e89ef
fix the correctness issue
2025-05-26 09:29:36 +00:00
Ding, Yi
40af523e2c
Add rotating to mx examples
2025-05-26 05:05:54 +00:00