Qianfeng Zhang
b0ae27046f
Fix the integer overflow in total_flops calculation
2025-04-17 10:34:13 +00:00
Qianfeng Zhang
6086ead2f9
Add scripts for comparing with triton
2025-04-17 10:33:44 +00:00
Andriy Roshchenko
da54464cce
MX GEMM - Add MX BF8 example (#2071)
* Add MX GEMM example for MX BF8
* Verified MX FP8 with 16x16x128 scale builtin
* Verify MX BF8 GEMM with BF16 output
2025-04-16 15:25:02 -06:00
BingYuan.Zhou
eaf1f0bf3b
[flatmm] implement basic fp16 flatmm (#2089)
* [flatmm] implement basic fp16 flatmm
* fix CI build fail
---------
Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>
2025-04-16 16:51:17 +08:00
Qianfeng Zhang
1351d9cd1b
Use exp2() to calculate exp() for better performance
2025-04-16 06:54:06 +00:00
Qianfeng Zhang
226a254723
Remove the comparison of row/col against max_uih_len in masking
2025-04-16 04:35:42 +00:00
felix
c5975529bb
add preshuffle gemm fp16 (#2036)
* add preshuffle gemm fp16
* clang format and test ok
* Update gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp
remove useless comments in example
* Update gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp
remove 2
---------
Co-authored-by: coderfeli <coderfeli@163.com>
2025-04-16 10:53:21 +08:00
joyeamd
94d47b1680
fmha hdim256 vectorize improve (#2086)
For hdim 256, the buffer load is not vectorized when seqlen % 256 != 0 and hdim % 256 == 0; this commit tries to address that case.
2025-04-16 09:21:04 +08:00
Andriy Roshchenko
7106976a72
MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions
* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order
* Add constexpr and synchronize return type for `get_exponent_value`
* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`
* Add support for microscaling instructions in `XdlopsGemm`
* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper
* Remove software implementation of MX GEMM
* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction
* Update README
* Updated CHANGELOG
* Remove unused static methods
2025-04-15 17:17:07 -06:00
Qianfeng Zhang
d1749b3aae
Use kM0=128 kN0=64 to completely remove the vgprs spilling
2025-04-15 15:08:46 +00:00
Qianfeng Zhang
3cd1b13e46
Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations
2025-04-15 14:40:55 +00:00
Qianfeng Zhang
cad1356170
Use packed cast_tile for fp16
2025-04-15 14:29:30 +00:00
Qianfeng Zhang
fff13b6c76
Update to partially reduce the register spilling
2025-04-15 07:44:33 +00:00
Mingtao Gu
56378f810f
CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* Fix pk_i4_v3 test failures in the Ubuntu environment.
* Fix pk_i4_t test failures on Ubuntu.
* Some fixes.
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-04-14 16:58:57 +08:00
Qianfeng Zhang
c2e6ab8516
Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce calls to s_barrier()
2025-04-13 11:00:22 +00:00
Qianfeng Zhang
238e78d82e
Update the pipeline code
2025-04-13 09:43:58 +00:00
Qianfeng Zhang
53e567977e
Fix the total_flops calculation and update the benchmark scripts
2025-04-13 08:50:00 +00:00
jakpiase
6c61f4d237
[CK_TILE] Add 2:4 structured sparsity support for fp16 gemm (#1957)
* add structured sparsity fp16 support for gemm
* added reviewer suggestions
* update changelog
* update changelog
* add reviewers suggestions
* Minor fix
* clang fix
* fix doxygen
2025-04-11 12:18:26 +02:00
slippedJim
5f885d2b7a
add fmha fwd splitkv receipt for aiter c++ api (#2068)
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
2025-04-10 23:21:13 +08:00
Illia Silin
3e6d21adeb
enable gfx115x support (#2065)
2025-04-09 10:06:42 -07:00
Qianfeng Zhang
71697d9cb9
Add output of estimated TFLOPS
2025-04-09 14:50:18 +00:00
Qianfeng Zhang
1766e6d3be
Update to the scripts and error thresholds
2025-04-09 10:34:37 +00:00
Qianfeng Zhang
dd2cd2cbcb
Tune the input initialization to avoid overflow in silu
2025-04-09 10:03:32 +00:00
Qianfeng Zhang
86c0e45987
Add benchmark_hstu_attention.sh
2025-04-09 08:28:05 +00:00
Qianfeng Zhang
9cb2dca958
Add several verification test cases
2025-04-08 16:38:35 +00:00
Qianfeng Zhang
561d490990
Fix the kernel and forward dispatch for jagged mode
2025-04-08 16:37:52 +00:00
Qianfeng Zhang
dc2f72a09f
Fix the hstu-attention pipeline (some test cases now pass)
2025-04-08 15:53:08 +00:00
Qianfeng Zhang
dbcf38aae9
Fixes and updates
2025-04-07 15:29:23 +00:00
slippedJim
5a22b61de5
Add new receipt (#2055)
2025-04-07 14:18:01 +08:00
Thomas Ning
50d1f8ff90
Add the MI355 support for CK TILE GEMM (#2046)
* Get the root cause of the ck tile gemm failing on mi355
* Fix the ck tile gemm on MI355
* delete the debug info
2025-04-03 11:48:54 -07:00
Qianfeng Zhang
10e72d3362
Change HstuBlockMasking and the kernel/reference code to use masking
2025-04-03 14:46:12 +00:00
Qianfeng Zhang
733734553b
Fixes and changes in the example
2025-04-03 14:44:36 +00:00
aledudek
9329432f6c
Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm
* Post-merge documentation and naming changes
* Build fix and updated changelog
* Revised comments
2025-04-03 13:35:43 +02:00
Qianfeng Zhang
121a950df5
Add hstu attention kernel implementation, instances and interfaces (building succeeded)
2025-04-03 08:20:54 +00:00
Muhammed Emin Ozturk
dd4c12b155
f8/bf16 GEMM Stream-K (#1879)
2025-03-31 20:30:17 -06:00
Qianfeng Zhang
83f29243df
fix the jagged mode tensor access in reference_hstu_attention
2025-03-29 12:55:40 +00:00
Qianfeng Zhang
4a0fc292d0
Initial reference implementation of hstu attention
2025-03-28 16:26:43 +00:00
rocking
8a20b62e91
Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-28 21:58:06 +08:00
felix
a82f338fb9
hotfix: fix sorting int64 (#2025)
* fix sorting int64
* clang format
* fix example issue
* update WA issue #
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2025-03-28 11:31:52 +08:00
felix
36d50de50e
ckmoe: change cmake; use smaller shape for i4 (#2027)
* change cmake; use smaller shape for i4
* fix pki4 run
* fix typo
* fix runtime arch logic for moe_gemm2 example
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2025-03-27 09:04:31 -07:00
Illia Silin
23a949706c
Disable all pk_i4 tests for all targets except gfx942/950. (#2022)
* only build gemm_fp8_pk_i4 examples for gfx942/950
* fix cmake logic
* moved the architecture check to IsSupported function
* Revert "moved the architecture check to IsSupported function"
This reverts commit 056d2a08b3.
* disable all pk_i4 tests for targets other than gfx942/950
* fix cmake logic
2025-03-26 15:15:57 -07:00
Illia Silin
99b2bbc1d6
Make sure gemm_fp8_pk_i4 examples only build and run on gfx942/950. (#2010)
* only build gemm_fp8_pk_i4 examples for gfx942/950
* fix cmake logic
* moved the architecture check to IsSupported function
* Revert "moved the architecture check to IsSupported function"
This reverts commit 056d2a08b3.
2025-03-25 14:43:38 -07:00
Andriy Roshchenko
72d888821c
MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent
* Specialize `nnvb_data_t_selector` for E8M0 exponent
* Remove partial specializations for `scalar_type` of `non_native_vector_base` template
* Reword command line helper string
* Create MX GEMM examples for different scales
2025-03-25 15:33:03 -06:00
ruanjm
d49abdaa87
[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug fix
2025-03-25 20:09:45 +08:00
Andriy Roshchenko
6660dc6b8e
Introduce MX GEMM for FP8 data type (#2000)
2025-03-24 15:41:07 -06:00
carlushuang
6c08c5c46d
add mask support in hdim=192/128 (#1999)
2025-03-21 18:28:43 +08:00
BingYuan.Zhou
5a0d693b86
fix ck_tile/basic_gemm build error (#1988)
2025-03-20 22:01:14 -07:00
felix
902dbe89ad
change cmake (#2006)
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-20 19:25:11 -07:00
carlushuang
e3c9886cdf
[CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992)
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-03-20 11:00:29 -07:00
jakpiase
0e91d32c61
[CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms
* added reviewer comments
* fixed grouped gemm tests
2025-03-20 11:17:04 +01:00