Commit Graph

634 Commits

Author SHA1 Message Date
lalala-sh
7e10e4d1d9 Merge branch 'develop' into wjx/mxfp4_moe_2Stages 2025-05-28 05:01:13 -05:00
root
b0c58577db Merge remote-tracking branch 'origin/feiw/mxfp4_moe_2Stages' into wjx/mxfp4_moe_2Stages 2025-05-28 06:31:56 +00:00
lalala-sh
50764c6662 Merge branch 'develop' into wjx/mxfp4_moe_2Stages 2025-05-28 01:25:55 -05:00
mtgu0705
2f0ee8ccb1 change the gemm1 tile from 64x128x128 to 128x64x128 2025-05-27 20:43:38 -05:00
mtgu0705
52b764d59f update MX moe GEMM1 hotloop scheduling 2025-05-27 20:43:22 -05:00
Zzz9990
ece38b9d7a [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit aa9847e42d.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
feifei14119
2e39bf06f7 fix typo 2025-05-23 11:23:01 +00:00
mtgu0705
2216ff0521 update mx moe gemm1 gemm2 TF and BW calculation 2025-05-23 05:29:39 -05:00
mtgu0705
d6bfdc9d7d update mx moe gemm1_bns tile size to 64x128x256 2025-05-23 05:10:45 -05:00
feifei14119
ce4e7b39da gemm1 func pass 2025-05-23 09:26:38 +00:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
SamiAario-AMD
380bca2b85 Fix 11_add_rmsnorm2d_rdquant (#2207) 2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
mtgu0705
eb588bc083 update the TFlops calculation in the example 2025-05-20 10:00:13 -05:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
mtgu0705
582dc7f842 enable fp4 moe no weight preshuffle, function pass 2025-05-20 08:18:22 -05:00
mtgu0705
589e1dfea9 init mx fp4 B no preshuffle version 2025-05-20 04:40:22 -05:00
mtgu0705
680de28f77 commit with debug info 2025-05-19 21:47:35 -05:00
Aviral Goel
c4929225f6 remove debug statements from CMakeLists (#2204) 2025-05-19 17:31:04 -07:00
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00
mtgu0705
2e6fafaf75 updated code, build passed. 2025-05-18 22:29:32 -05:00
mtgu0705
a4b5a374b9 Merge remote-tracking branch 'origin/wip-f4-pk' into mx_moe_f4_scale_shuffle 2025-05-17 09:49:24 -05:00
mtgu0705
d4aaf9d0d0 Merge remote-tracking branch 'origin/moe_mx_fp4_for_aiter' into mx_moe_f4_scale_shuffle 2025-05-17 09:30:48 -05:00
mtgu0705
eeeba8901f update code 2025-05-17 09:28:26 -05:00
OscarXu
6fb2b54ff4 mx_fp4 default parameter change 2025-05-17 01:03:28 -05:00
mtgu0705
94fb9190be init moe mx f4 scale shuffle 2025-05-16 14:46:09 -05:00
aska-0096
248e287866 generalize the pipeline scheduling. 2025-05-16 10:41:59 +00:00
aska-0096
a0379d81e7 modify the way we represent fp4 2025-05-16 09:44:04 +00:00
OscarXu
ec8d00d58d mx_moe_fp4 ready for aiter with clang-format. 2025-05-16 04:09:26 -05:00
OscarXu
39ff3fbf05 v3 function pass 2025-05-16 03:42:48 -05:00
OscarXu
c5be9a501b v1 function pass. 2025-05-16 03:16:38 -05:00
aska-0096
a1bec7670a tempsave 2025-05-16 08:14:56 +00:00
OscarXu
efd3c24587 minor fix 2025-05-16 01:02:51 -05:00
Po Yen Chen
8cb0474b3d Use only qr_async pipeline for batch_prefill (#2195) 2025-05-15 11:47:29 -07:00
OscarXu
f70f778e27 v1 compile pass. Function not ready 2025-05-15 08:01:56 -05:00
Ding, Yi
9009d75c7a Pack e8m0 as int32_t 2025-05-15 09:12:17 +00:00
aska-0096
062e16d54a Improve the pipeline 2025-05-15 09:08:36 +00:00
OscarXu
68dbe558df compile error fix 2025-05-15 16:55:20 +08:00
OscarXu
c0babbca62 Merge remote-tracking branch 'origin/fp4_gu_moe' into fp4_gu_moe_gemm1 2025-05-15 16:12:19 +08:00
OscarXu
17922821ec Add gemm1 v1 2025-05-15 16:11:43 +08:00
mtgu0705
4e2ec31e4d rename moe block selector and pipeline 2025-05-15 02:12:50 -05:00
mtgu0705
dfba3c11e7 fix the bug, 128x128x256 tile function passed 2025-05-15 00:11:10 -05:00
mtgu0705
7cfd1db335 update debug 2025-05-14 21:41:28 -05:00
mtgu0705
efdd420742 debug save 2025-05-14 09:33:24 -05:00
mtgu0705
102151ebcf temp save 2025-05-14 08:13:47 -05:00
mtgu0705
2700b217be 16x16x128 input size blockscale function passed 2025-05-14 03:20:59 -05:00
Ding, Yi
4ba9fe186c Use random scale for init1 2025-05-14 05:42:39 +00:00
mtgu0705
1bbb50b212 mfma using asm, device result correct, host result needs to be checked 2025-05-13 20:57:34 -05:00