Commit Graph

2114 Commits

Author SHA1 Message Date
Feng Shijie
cc9c7b9e58 optimize gemm2 atomic_add pattern 2025-08-11 08:38:23 +00:00
Feng Shijie
87aed564dc update case construction 2025-08-11 07:56:14 +00:00
Feng Shijie
8b85fa6cf2 update granularity control 2025-08-11 06:03:06 +00:00
Feng Shijie
1b8c7097b8 fix TileConfig 2025-08-11 03:42:46 +00:00
Feng Shijie
6d3cbc7c0e add moe_flatmm 2025-08-06 08:33:33 +00:00
Feng Shijie
90e910f3a7 fix flatmm with scaling when WarpTileM == 32 2025-08-04 07:16:36 +00:00
Feng Shijie
aa5e008fa5 optimize scaling epilogue 2025-08-01 11:01:23 +00:00
Feng Shijie
ac5908c0bb fix wrong config for fp8 scaling 2025-08-01 07:28:38 +00:00
Feng Shijie
3f43b841d4 prune debug message 2025-07-30 06:37:26 +00:00
Feng Shijie
2e5d4c74cd fix compile error 2025-07-30 04:52:08 +00:00
Feng Shijie
c117a1986a Add persistent option on flatmm for tuning 2025-07-29 15:42:58 +00:00
AMD-dteng
a587701117 update pipeline v1: add atomic IGLP schedule 2025-07-29 14:59:32 +00:00
lalala-sh
f9e48148d2 fix error log throwing 2025-07-29 14:59:18 +00:00
Feng Shijie
1b6d7cf407 crz idea 2025-07-28 08:24:51 +00:00
Feng Shijie
5473f06461 Add permuteN optimization when NRepeat % 2 == 0 on flatmm 2025-07-27 11:57:38 +00:00
sjfeng
bfb9f4002f try to remove c_shuffle_lds 2025-07-27 17:24:08 +08:00
Feng Shijie
1264f4d2ab fix loop-dim mismatch and improve c_shuffle alu parallelism 2025-07-25 07:41:48 +00:00
lalala-sh
1239d8a546 merge flatmm-scale 2025-07-24 08:46:51 +00:00
lalala-sh
4066454483 revert delete of inc file 2025-07-24 16:19:58 +08:00
solin
68390988c9 reorg flatmm code 2025-07-24 04:38:16 +00:00
Feng Shijie
b908f5e803 fix flatmm syntax error on gfx950 2025-07-23 19:12:31 +00:00
Feng Shijie
5a1183ebbd support flatmm scaling 2025-07-23 19:04:22 +00:00
valarLip
89fa639207 merge flatmm pipe v0 from dteng_flatmm_opt 2025-07-23 09:50:33 +00:00
lalala-sh
3f7d848dd3 build pass 2025-07-23 15:38:12 +08:00
lalala-sh
6dacf833da fix bug 2025-07-23 07:20:26 +00:00
lalala-sh
7e1bd4b839 sync 2025-07-23 15:01:53 +08:00
valarLip
46a538e39e adaptive scheduler instead of macro definition 2025-07-23 13:40:43 +08:00
lalala-sh
9aa3396a79 fix tail handler bug 2025-07-17 08:40:35 +00:00
lalala-sh
fb76450e63 merge from dteng_flatmm_opt 2025-07-16 10:12:19 +00:00
MHYangAMD
3499fe67ff [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1.  Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Revert generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>
2025-07-16 14:05:26 +08:00
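
For reference, RMSNorm scales each row by the reciprocal root-mean-square of its elements; the accuracy work in the PR above concerns how the sum-of-squares reduction is carried out across warps. Below is a minimal host-side sketch of the math only (illustrative, not the CK_TILE pipeline):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hedged reference sketch of RMSNorm over one row of length n:
//   y[i] = x[i] / sqrt(mean(x^2) + eps) * gamma[i]
// Accumulating in double gives a stable host reference; the GPU pipeline's
// accuracy depends on the order of the cross-warp reduction instead.
void rmsnorm_row(const std::vector<float>& x,
                 const std::vector<float>& gamma,
                 std::vector<float>& y,
                 float eps = 1e-6f)
{
    double sum_sq = 0.0;
    for(float v : x)
        sum_sq += static_cast<double>(v) * v;
    const float inv_rms =
        1.0f / std::sqrt(static_cast<float>(sum_sq / x.size()) + eps);
    for(std::size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * inv_rms * gamma[i];
}
```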
rahjain-amd
6b09f0823e add missing condition for bf16 (#2502)
Without this change, DataType = unknown:
``` sh
Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s,
```

After this change:
```sh
Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s,
```
2025-07-15 21:25:56 +05:30
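
The fix above amounts to a missing branch in a data-type dispatch. A hedged sketch of the kind of condition involved (function and string names hypothetical, not the repo's actual example code):

```cpp
#include <string>

// Hedged sketch of a precision-string dispatch; without the bf16 branch
// the fall-through returns "unknown", matching the log output above.
std::string get_data_type_name(const std::string& prec)
{
    if(prec == "fp16")
        return "fp16";
    else if(prec == "bf16") // the missing condition this commit adds
        return "bf16";
    else if(prec == "fp8")
        return "fp8";
    return "unknown";
}
```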
carlushuang
cfe211cc60 [CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need to use local tokens to compute

* support extra chain local_token

* update

* update

* refine some main

* update

* support dispatch_policy

* fix example 15
2025-07-15 09:42:18 +08:00
Gino Lu
141bf2d54d [CK_TILE] Add pk_fp4 data type (#2422)
* [draft] Add pk_fp4 and test

* Add hw conversion for fp4

* Refine test code and pk_fp4 constructor.

* fix test indent

* modify according to comment.

* fix clang-format

* modify according to comments.

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-14 20:35:06 +08:00
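
pk_fp4 is a packed type carrying two 4-bit floats (e2m1) per byte. A hedged sketch of the packing layout only (illustrative, not CK_TILE's actual pk_fp4 type):

```cpp
#include <cstdint>

// Hedged sketch: two 4-bit elements stored in one byte.
// Low nibble = element 0, high nibble = element 1.
struct pk_fp4_sketch
{
    uint8_t data;

    uint8_t unpack(int i) const
    {
        return (i == 0) ? (data & 0x0F) : (data >> 4);
    }

    void pack(uint8_t lo4, uint8_t hi4)
    {
        data = static_cast<uint8_t>((lo4 & 0x0F) | ((hi4 & 0x0F) << 4));
    }
};
```

The hardware conversion mentioned in the bullets would translate each nibble to and from wider float types with dedicated instructions rather than host-side bit manipulation.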
Andriy Roshchenko
25b359d630 MX GEMM - Add FP6 GEMM Test (#2488)
* Add F6 GEMM MX Test

* Add BF6 GEMM MX Test
2025-07-11 15:32:12 -06:00
Andriy Roshchenko
518dc21ae8 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8
2025-07-11 13:07:05 -06:00
Khushbu Agarwal
d239b91fd5 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example

* some more renames

* clang formatted
2025-07-11 08:27:55 -07:00
Qianfeng
45904b8fd7 Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* convert i_nhead_ type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>
2025-07-11 18:14:47 +08:00
Aviral Goel
a26ba690fd fix(precommit_install): fix bug for bare metal machines (#2448)
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-07-10 11:00:47 -06:00
Andres Lugo
aadeffde18 Update FMHA recipe for Pytorch SDPA integration (#2480)
* Add receipts in splitk and appendk

* remove grouped

* Remove logits

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-07-10 09:00:23 -07:00
Illia Silin
1b66f3f4a3 Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short

* revert to atomic add using casting
2025-07-10 07:18:56 -07:00
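
The atomicCAS approach mentioned above follows a well-known emulation pattern: locate the aligned 32-bit word containing the 16-bit element, splice the updated half back in, and retry until the compare-and-swap succeeds. A hedged sketch for unsigned short (not the repo's code; an fp16 variant would bit-cast the payload and add in floating point):

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

// Hedged sketch: 16-bit atomic add emulated with 32-bit atomicCAS.
// The loop retries whenever another thread modified the containing
// 32-bit word between the read and the compare-and-swap.
__device__ unsigned short atomic_add_u16(unsigned short* address,
                                         unsigned short val)
{
    const auto addr      = reinterpret_cast<uintptr_t>(address);
    auto* base           = reinterpret_cast<unsigned int*>(addr & ~uintptr_t{3});
    const unsigned shift = (addr & 2) ? 16u : 0u;

    unsigned int old = *base, assumed;
    do
    {
        assumed        = old;
        const auto cur = static_cast<unsigned short>((assumed >> shift) & 0xFFFFu);
        const auto sum = static_cast<unsigned short>(cur + val);
        const unsigned int next = (assumed & ~(0xFFFFu << shift)) |
                                  (static_cast<unsigned int>(sum) << shift);
        old = atomicCAS(base, assumed, next);
    } while(old != assumed);

    return static_cast<unsigned short>((old >> shift) & 0xFFFFu); // previous value
}
```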
Illia Silin
d9b37c7121 Fix blockscale fp8 gemm examples (#2476)
* fix blockscale fp8 gemm examples

* refactor the compiler flags

* fix hip version calculation
2025-07-10 07:12:13 -07:00
shay-li77
d814fefe18 support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis

* format code

* add notes for param y_ratio

* fix comments error

* support template and mdiv for ratio mask

* refactor y-ratio mask constructor

* optimize coordinate calculation

* add SimplifiedRatioAttentionMask
2025-07-09 23:18:55 +08:00
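
A y-direction step length lets one step along the query (y) axis unlock more than one column along the key (x) axis. A hedged sketch of such a predicate (illustrative only; per the bullets above, the real mask is templated and uses mdiv, i.e. magic-number division, to keep integer division off the hot path):

```cpp
// Hedged sketch, not the CK_TILE implementation: a causal-style mask where
// each query row y unlocks y_ratio key columns; y_ratio == 1 reduces to the
// ordinary causal mask.
struct RatioCausalMaskSketch
{
    int y_ratio;

    bool is_masked(int y, int x) const { return x >= (y + 1) * y_ratio; }
};
```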
Yi DING
032ca60015 [CK_TILE] Avoid compile kernel in host pass (#2475) 2025-07-09 22:27:54 +08:00
Po Yen Chen
ad9863fe05 [CK_TILE] Low CU utilization optimization for fMHA fwd kernels (#2402)
* Wrap tile size mapping as class method

* Wrap pipeline generation as class method

* Add constraint as kernel dispatching criteria

* Support multiple tile sizes for a (hdim, hdim_v) combination

* Use smaller tile size if CU utilization is low

* Use an integer as the key of the tile size map

* Fix type error

* Simply override parent class method return value

* Add attribute to eliminate warning

* Allow using environment variables to turn on/off custom factory

* Unify param naming style

* Add missing HIP runtime include directive

* Fix os.environ.get() usage
2025-07-09 22:01:33 +08:00
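
The core idea behind the low-utilization fix is dispatch-time tile selection: when the grid produced by the default tile size yields fewer blocks than the GPU has compute units, pick a smaller tile so more blocks (and hence more CUs) run concurrently. A hedged sketch under assumed tile sizes and parameters (not the PR's generator code):

```cpp
#include <hip/hip_runtime.h>

// Hedged sketch: choose a smaller M tile when the big tile cannot
// occupy all compute units. Tile sizes are illustrative.
int choose_tile_m(long long batch_nhead_blocks, int seqlen_q)
{
    hipDeviceProp_t prop;
    (void)hipGetDeviceProperties(&prop, 0);
    const int num_cu = prop.multiProcessorCount;

    const int big_tile_m   = 128;
    const int small_tile_m = 64;

    // number of workgroups launched with the big tile
    const long long blocks =
        batch_nhead_blocks * ((seqlen_q + big_tile_m - 1) / big_tile_m);

    return (blocks < num_cu) ? small_tile_m : big_tile_m;
}
```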
Vidyasagar Ananthan
e391b025a0 New ninja tracing script (#2472)
* Adding ninja log JSON conversion utility

* Updating to match old ninjatracing

* Updating Jenkins to use new ninjatracing

* Ensuring v7 works

* Removing old ninjatracing from dockerfile
2025-07-08 22:36:50 -07:00
Illia Silin
93420ecf89 Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474)
This reverts commit 112b47e885.
2025-07-08 19:01:26 -07:00
Illia Silin
112b47e885 Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471)
* add template for fp16 atomic add

* add template for unsigned short atomic add

* use atomicCAS in atomic add for fp16 and unsigned short
2025-07-08 18:09:30 -04:00
Vidyasagar Ananthan
33d704a6f9 Separating ninja build tracing and setting flag to false (#2470)
* Separating ninja build tracing and setting flag to false

* Add ftime-tracing flag

* Fix conditional issue

* Try adding a script block

* Embed Clang analysis in ftime trace block
2025-07-08 10:52:00 -07:00
Haocong WANG
5557eadce6 [CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* cmake depends & bwd filter order fix

* revert unexpected reformat

* Avoid change fmha bwd filter order for downstream compatibility

* Revert unexpected changes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
Illia Silin
e033a1b4bf fix compilation errors with clang20 (#2464) 2025-07-07 19:40:30 -07:00