1043 Commits

Author SHA1 Message Date
Qianfeng Zhang
b0ae27046f Fix the integer overflow in total_flops calculation 2025-04-17 10:34:13 +00:00
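
The commit above does not show the corrected expression, so the following is only a sketch of the general class of fix: widening every operand to 64 bits before multiplying. The function name `total_flops_wide`, its parameters, and the factor of 4 (two GEMMs at 2 FLOPs per multiply-add) are assumptions for illustration, not code from the repository.

```cpp
#include <cstdint>

// Hypothetical FLOP count for an attention-like problem (assumed formula):
// two GEMMs, each costing 2 * seqlen * seqlen * hdim FLOPs per head.
// Starting the product with a 64-bit value promotes every later operand
// to 64 bits, so the intermediate results cannot wrap at 32 bits.
inline std::int64_t total_flops_wide(int batch, int nhead, int seqlen, int hdim)
{
    return std::int64_t{4} * batch * nhead * seqlen * seqlen * hdim;
}
```

With `batch=32, nhead=16, seqlen=8192, hdim=128` the product is 2^44, far beyond what a 32-bit `int` can hold, which is the kind of input where a narrow intermediate would overflow.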
Qianfeng Zhang
6086ead2f9 Add scripts for comparing with triton 2025-04-17 10:33:44 +00:00
Andriy Roshchenko
da54464cce MX GEMM - Add MX BF8 example (#2071)
* Add MX GEMM example for MX BF8

* Verified MX FP8 with 16x16x128 scale builtin

* Verify MX BF8 GEMM with BF16 output
2025-04-16 15:25:02 -06:00
BingYuan.Zhou
eaf1f0bf3b [flatmm] implement basic fp16 flatmm (#2089)
* [flatmm] implement basic fp16 flatmm

* fix CI build fail

---------

Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>
2025-04-16 16:51:17 +08:00
Qianfeng Zhang
1351d9cd1b Use exp2() to calculate exp() for better performance 2025-04-16 06:54:06 +00:00
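
The commit above relies on the identity exp(x) = 2^(x * log2(e)). A minimal host-side sketch of that identity follows; the names `kLog2e` and `exp_via_exp2` are hypothetical, and the actual kernel code may fold the log2(e) factor into an earlier scaling step instead.

```cpp
#include <cmath>

// log2(e), used to convert a natural-exponent argument to a base-2 one.
constexpr float kLog2e = 1.4426950408889634f;

// Computes exp(x) through exp2(): exp(x) == exp2(x * log2(e)).
inline float exp_via_exp2(float x)
{
    return std::exp2(x * kLog2e);
}
```

The result agrees with `std::exp` up to floating-point rounding; the extra multiply can introduce a small error for large `|x|`.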
Qianfeng Zhang
226a254723 Remove the comparison of row/col to max_uih_len in masking 2025-04-16 04:35:42 +00:00
felix
c5975529bb add preshuffle gemm fp16 (#2036)
* add preshuffle gemm fp16

* clang format and test ok

* Update gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp

remove useless comments in example

* Update gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp

remove 2

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-04-16 10:53:21 +08:00
joyeamd
94d47b1680 fmha hdim256 vectorize improve (#2086)
For hdim 256, the buffer load is not vectorized when seqlen % 256 != 0 and hdim % 256 == 0; this commit addresses that case.
2025-04-16 09:21:04 +08:00
Andriy Roshchenko
7106976a72 MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods
2025-04-15 17:17:07 -06:00
Qianfeng Zhang
d1749b3aae Use kM0=128 kN0=64 to completely remove the vgprs spilling 2025-04-15 15:08:46 +00:00
Qianfeng Zhang
3cd1b13e46 Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations 2025-04-15 14:40:55 +00:00
Qianfeng Zhang
cad1356170 Use packed cast_tile for fp16 2025-04-15 14:29:30 +00:00
Qianfeng Zhang
fff13b6c76 Update to partially reduce the register spilling 2025-04-15 07:44:33 +00:00
Mingtao Gu
56378f810f CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* fix pk_i4_v3 test failures in Ubuntu env.

* fix pk_i4_t test failures on Ubuntu.

* some fixed.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-04-14 16:58:57 +08:00
Qianfeng Zhang
c2e6ab8516 Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce call of s_barrier() 2025-04-13 11:00:22 +00:00
Qianfeng Zhang
238e78d82e Update the in pipeline codes 2025-04-13 09:43:58 +00:00
Qianfeng Zhang
53e567977e Fix in calculation of total_flops and update benchmark scripts 2025-04-13 08:50:00 +00:00
jakpiase
6c61f4d237 [CK_TILE] Add 2:4 structured sparsity support for fp16 gemm (#1957)
* add structured sparsity fp16 support for gemm

* added reviewer suggestions

* update changelog

* update changelog

* add reviewers suggestions

* Minor fix

* clang fix

* fix doxygen
2025-04-11 12:18:26 +02:00
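
For context on the commit above: in 2:4 structured sparsity, every aligned group of four consecutive values holds at most two non-zero entries. The checker below is a minimal host-side sketch of that constraint only; `is_2_4_sparse` is a hypothetical helper, not part of CK_TILE, and it says nothing about how the kernel compresses or indexes the data.

```cpp
#include <cstddef>
#include <vector>

// Returns true if every aligned group of four consecutive values in `row`
// contains at most two non-zero entries (the 2:4 pattern).
// Rows whose length is not a multiple of four are rejected.
inline bool is_2_4_sparse(const std::vector<float>& row)
{
    if(row.size() % 4 != 0)
        return false;

    for(std::size_t i = 0; i < row.size(); i += 4)
    {
        int nonzero = 0;
        for(std::size_t j = 0; j < 4; ++j)
            nonzero += (row[i + j] != 0.0f) ? 1 : 0;

        if(nonzero > 2)
            return false;
    }
    return true;
}
```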
slippedJim
5f885d2b7a add fmha fwd splitkv receipt for aiter c++ api (#2068)
* add s_randval for c++ api

* Fix bug of bias in splitkv

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
2025-04-10 23:21:13 +08:00
Illia Silin
3e6d21adeb enable gfx115x support (#2065) 2025-04-09 10:06:42 -07:00
Qianfeng Zhang
71697d9cb9 Add output of estimated TFLOPS 2025-04-09 14:50:18 +00:00
Qianfeng Zhang
1766e6d3be Update to the scripts and error thresholds 2025-04-09 10:34:37 +00:00
Qianfeng Zhang
dd2cd2cbcb Tune the input initialization to avoid overflow in silu 2025-04-09 10:03:32 +00:00
Qianfeng Zhang
86c0e45987 Add benchmark_hstu_attention.sh 2025-04-09 08:28:05 +00:00
Qianfeng Zhang
9cb2dca958 Add several verification test cases 2025-04-08 16:38:35 +00:00
Qianfeng Zhang
561d490990 Fix in kernel and forward dispatch for jagged mode 2025-04-08 16:37:52 +00:00
Qianfeng Zhang
dc2f72a09f Fix in hstu-attention pipeline (which makes some test cases pass) 2025-04-08 15:53:08 +00:00
Qianfeng Zhang
dbcf38aae9 Fixes and updates 2025-04-07 15:29:23 +00:00
slippedJim
5a22b61de5 Add new receipt (#2055) 2025-04-07 14:18:01 +08:00
Thomas Ning
50d1f8ff90 Add the MI355 support for CK TILE GEMM (#2046)
* Get the root cause of the ck tile gemm failing on mi355

* Fix the ck tile gemm on MI355

* delete the debug info
2025-04-03 11:48:54 -07:00
Qianfeng Zhang
10e72d3362 Change in HstuBlockMasking and kernel/reference codes for using masking 2025-04-03 14:46:12 +00:00
Qianfeng Zhang
733734553b Fix and change in example 2025-04-03 14:44:36 +00:00
aledudek
9329432f6c Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments
2025-04-03 13:35:43 +02:00
Qianfeng Zhang
121a950df5 Add hstu attention kernel implementation, instances and interfaces (building succeeded) 2025-04-03 08:20:54 +00:00
Muhammed Emin Ozturk
dd4c12b155 f8/bf16 GEMM Stream-K (#1879) 2025-03-31 20:30:17 -06:00
Qianfeng Zhang
83f29243df fix the jagged mode tensor access in reference_hstu_attention 2025-03-29 12:55:40 +00:00
Qianfeng Zhang
4a0fc292d0 Initial reference implementation of hstu attention 2025-03-28 16:26:43 +00:00
rocking
8a20b62e91 Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-28 21:58:06 +08:00
felix
a82f338fb9 hotfix fix sorting int64 (#2025)
* fix sorting int64

* clang format

* fix example issue

* update WA issue #

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2025-03-28 11:31:52 +08:00
felix
36d50de50e ckmoe: change cmake; use smaller shape for i4 (#2027)
* change cmake; use smaller shape for i4

* fix pki4 run

* fix typo

* fix runtime arch logic for moe_gemm2 example

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-03-27 09:04:31 -07:00
Illia Silin
23a949706c Disable all pk_i4 tests for all targets except gfx942/950. (#2022)
* only build gemm_fp8_pk_i4 examples for gfx942/950

* fix cmake logic

* moved the architecture check to IsSupported function

* Revert "moved the architecture check to IsSupported function"

This reverts commit 056d2a08b3.

* disable all pk_i4 tests for targets other than gfx942/950

* fix cmake logic
2025-03-26 15:15:57 -07:00
Illia Silin
99b2bbc1d6 Make sure gemm_fp8_pk_i4 examples only build and run on gfx942/950. (#2010)
* only build gemm_fp8_pk_i4 examples for gfx942/950

* fix cmake logic

* moved the architecture check to IsSupported function

* Revert "moved the architecture check to IsSupported function"

This reverts commit 056d2a08b3.
2025-03-25 14:43:38 -07:00
Andriy Roshchenko
72d888821c MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent

* Specialize `nnvb_data_t_selector` for E8M0 exponent

* Remove partial specializations for `scalar_type` of `non_native_vector_base` template

* Reword command line helper string

* Create MX GEMM examples for different scales
2025-03-25 15:33:03 -06:00
ruanjm
d49abdaa87 [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug fix
2025-03-25 20:09:45 +08:00
Andriy Roshchenko
6660dc6b8e Introduce MX GEMM for FP8 data type (#2000) 2025-03-24 15:41:07 -06:00
carlushuang
6c08c5c46d add mask support in hdim=192/128 (#1999) 2025-03-21 18:28:43 +08:00
BingYuan.Zhou
5a0d693b86 fix ck_tile/basic_gemm build error (#1988) 2025-03-20 22:01:14 -07:00
felix
902dbe89ad change cmake (#2006)
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-20 19:25:11 -07:00
carlushuang
e3c9886cdf [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-03-20 11:00:29 -07:00
jakpiase
0e91d32c61 [CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms

* added reviewer comments

* fixed grouped gemm tests
2025-03-20 11:17:04 +01:00