Commit Graph

641 Commits

Author SHA1 Message Date
zhiming.ding
178ef223d2 Merge remote-tracking branch 'origin/develop' into shared/big_day_merge_asm_fix 2025-06-03 01:44:30 +00:00
Khushbu Agarwal
2e38eb4f1c Rotating buffer PR CI fix (#2257)
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)"

This reverts commit bbdaf79a52.

* fix regression
2025-06-02 10:25:01 -07:00
slippedJim
57f497452a remove restriction of group mode hd192 no lse (#2252)
Co-authored-by: Jim <jimguo12@amd.com>
2025-05-30 10:14:21 +08:00
Po Yen Chen
28cd0dffc9 [CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251)
* Add constraint on traits/tile/pipeline

* Use kM0=128 if max_seqlen_q == 8192

* Re-format codegen script

* Remove redundant attr name postfix

* Fix import error: default field in dataclass

* Use kK0=64 & kK1=64 to hide latency

* Use CU utilization to decide tile size
2025-05-29 18:36:33 +09:00
Illia Silin
bbdaf79a52 Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200)" (#2256)
This reverts commit 99857e10e6.
2025-05-28 09:46:52 -06:00
lalala-sh
7e10e4d1d9 Merge branch 'develop' into wjx/mxfp4_moe_2Stages 2025-05-28 05:01:13 -05:00
root
b0c58577db Merge remote-tracking branch 'origin/feiw/mxfp4_moe_2Stages' into wjx/mxfp4_moe_2Stages 2025-05-28 06:31:56 +00:00
lalala-sh
50764c6662 Merge branch 'develop' into wjx/mxfp4_moe_2Stages 2025-05-28 01:25:55 -05:00
Khushbu Agarwal
99857e10e6 [CK_tile] Add rotating buffer feature for universal gemm (#2200)
* Add rotating buffer feature for universal gemm

* adding changes in tile_engine

* Updated code to merge kernel_launch

* removing comments

* Enable rotating buffer changes to flatmm

* Created diff launch_kernel function for rotating buffer

* Simplified calculation using macros

* merge code with new changes in tile_engine

* clang formatted

* Redefine macros
2025-05-27 23:00:58 -07:00
Aviral Goel
c52649ad57 Add catch blocks in example GEMM apps to enable better error handling (Issue: 1928) (#2234)
* added catch statements to examples

* clang format
2025-05-27 22:32:42 -07:00
mtgu0705
2f0ee8ccb1 change the gemm1 tile from 64x128x128 to 128x64x128 2025-05-27 20:43:38 -05:00
mtgu0705
52b764d59f update MX moe GEMM1 hotloopscheduling 2025-05-27 20:43:22 -05:00
Zzz9990
ece38b9d7a [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit aa9847e42d.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1 fix the buffer intrinsic names for clang >=20 (#2228) 2025-05-23 14:58:25 -07:00
feifei14119
2e39bf06f7 fix typo 2025-05-23 11:23:01 +00:00
mtgu0705
2216ff0521 update mx moe gemm1 gemm2 TF and BW calculation 2025-05-23 05:29:39 -05:00
mtgu0705
d6bfdc9d7d update mx moe gemm1_bns tile size to 64x128x256 2025-05-23 05:10:45 -05:00
feifei14119
ce4e7b39da gemm1 func pass 2025-05-23 09:26:38 +00:00
Illia Silin
1b846143c6 Revert "Update the buffer load/store intrinsic names for clang>=20. (#2192)" (#2227)
This reverts commit 58f9e9ffbc.
2025-05-22 15:41:17 -07:00
SamiAario-AMD
380bca2b85 Fix 11_add_rmsnorm2d_rdquant (#2207) 2025-05-20 15:15:28 -07:00
Thomas Ning
1386924749 Add the instances for small sized GEMM in preshuffle and improve CMake Flag (#2212)
* Add small instance, add the bug fix, & improve the example CMake

* clang format
2025-05-20 15:05:08 -07:00
mtgu0705
eb588bc083 update the TFlops calculation in the example 2025-05-20 10:00:13 -05:00
Sami Remes
d1e6f0982d [CK_TILE] Grouped GEMM tile loop (#2146)
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm

* Some helper functions for persistent kernel case

* Get max occupancy grid using device properties

* Implement tile loop in main entry point to grouped gemm

* Enable GridSize() on device

* Handle offset tile index using real current block index

* Add persistent kernel choice to grouped gemm example

* Use a for-loop for iterating over the group

* Reduce VGPR spills by early-exit

* Enable persistent kernel choice in grouped_gemm example

* Add persistent kernel option to grouped_gemm test

* Fix formatting with remod.py

* Remove GridUpdateBlocks as blocks are now iteratively computed

* Add comment about VGPR spilling

* Fix formatting

* Use CK_TILE_HOST instead of __host__

* Enable all Row/Col combinations in grouped gemm unit test

* Add some KBatch=2 cases to grouped gemm tests

* Fix SplitK for grouped gemm

* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm

* Add type traits

* Split examples to regular and tileloop

* Formatting

* Use hipExtStreamGetCUMask to get current active CUs for the given stream

* Align test and example kernel config, and disable validation for splitk repeats

* Remove debug options from CMakeLists.txt

* Separate the code paths for persistent/non-persistent in test

* Fix formatting

* Address review comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-05-20 17:18:57 +03:00
mtgu0705
582dc7f842 enable fp4 moe no weight preshuffle, function pass 2025-05-20 08:18:22 -05:00
mtgu0705
589e1dfea9 init mx fp4 B no preshuffle version 2025-05-20 04:40:22 -05:00
mtgu0705
680de28f77 commit with debug info 2025-05-19 21:47:35 -05:00
Aviral Goel
c4929225f6 remove debug statements from CMakeLists (#2204) 2025-05-19 17:31:04 -07:00
jefyang1
f18170064d Use new mfma instructions for FP8 on gfx950 (#2202)
* Add logic to use new mfma instructions for fp8 bf8

* Fix example_gemm_xdl_fp8_pk_i4_bpreshuffle_v3 on gfx950 and run clang format

* Update include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Fix intrin_mfma f8 calls due to merge mistake

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
2025-05-19 17:29:51 -07:00
jefyang1
b8b12bb81e Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950 (#2203)
* Fix example_grouped_gemm_multiple_d_xdl_fp16 on gfx950

* Run clang format
2025-05-19 14:25:50 -07:00
mtgu0705
2e6fafaf75 updated code, build passed. 2025-05-18 22:29:32 -05:00
mtgu0705
a4b5a374b9 Merge remote-tracking branch 'origin/wip-f4-pk' into mx_moe_f4_scale_shuffle 2025-05-17 09:49:24 -05:00
mtgu0705
d4aaf9d0d0 Merge remote-tracking branch 'origin/moe_mx_fp4_for_aiter' into mx_moe_f4_scale_shuffle 2025-05-17 09:30:48 -05:00
mtgu0705
eeeba8901f update code 2025-05-17 09:28:26 -05:00
OscarXu
6fb2b54ff4 mx_fp4 default parameter change 2025-05-17 01:03:28 -05:00
mtgu0705
94fb9190be init moe mx f4 scale shuffle 2025-05-16 14:46:09 -05:00
aska-0096
248e287866 generalize the pipeline scheduling. 2025-05-16 10:41:59 +00:00
aska-0096
a0379d81e7 modify the way we represent fp4 2025-05-16 09:44:04 +00:00
OscarXu
ec8d00d58d mx_moe_fp4 ready for aiter with clang-format. 2025-05-16 04:09:26 -05:00
OscarXu
39ff3fbf05 v3 function pass 2025-05-16 03:42:48 -05:00
OscarXu
c5be9a501b v1 function pass. 2025-05-16 03:16:38 -05:00
aska-0096
a1bec7670a tempsave 2025-05-16 08:14:56 +00:00
OscarXu
efd3c24587 minor fix 2025-05-16 01:02:51 -05:00
Po Yen Chen
8cb0474b3d Use only qr_async pipeline for batch_prefill (#2195) 2025-05-15 11:47:29 -07:00
OscarXu
f70f778e27 v1 compile pass. Function not ready 2025-05-15 08:01:56 -05:00
Ding, Yi
9009d75c7a Pack e8m0 as int32_t 2025-05-15 09:12:17 +00:00
aska-0096
062e16d54a Improve the pipeline 2025-05-15 09:08:36 +00:00
OscarXu
68dbe558df compile error fix 2025-05-15 16:55:20 +08:00
OscarXu
c0babbca62 Merge remote-tracking branch 'origin/fp4_gu_moe' into fp4_gu_moe_gemm1 2025-05-15 16:12:19 +08:00
OscarXu
17922821ec Add gemm1 v1 2025-05-15 16:11:43 +08:00
mtgu0705
4e2ec31e4d rename moe block selector and pipeline 2025-05-15 02:12:50 -05:00