Commit Graph

1496 Commits

Author SHA1 Message Date
Sami Aario
1ee1307d49 Candidate fix 13 2026-02-24 11:26:05 +00:00
Sami Aario
1a5ec9efdb Candidate fix 12 2026-02-24 10:06:23 +00:00
Sami Aario
f35c9da001 Candidate fix 11 2026-02-23 13:54:53 +00:00
Sami Aario
3dad12583b Candidate fix 10 2026-02-23 13:18:40 +00:00
Sami Aario
aebc095d0e Candidate fix 9 2026-02-23 12:22:47 +00:00
Sami Aario
c82c7fe2b3 Candidate fix 8 2026-02-23 10:35:51 +00:00
Sami Aario
c79ab1f84a Candidate fix 7 2026-02-23 09:03:20 +00:00
Sami Aario
1de8bc9501 Candidate fix 6 2026-02-20 16:25:36 +00:00
Sami Aario
a6ffc9c6e5 Candidate fix 5 2026-02-20 13:40:12 +00:00
Sami Aario
e2a85ee7a0 Candidate fix 4 2026-02-20 13:40:12 +00:00
Sami Aario
5d40ac6c1b Candidate fix 3 2026-02-20 13:40:12 +00:00
Sami Aario
709608f843 Candidate fix 2 2026-02-20 13:40:12 +00:00
Sami Aario
de1a228b34 Candidate fix 2026-02-20 13:40:12 +00:00
Sami Aario
8b462b04ce Clear load_tile_transpose_convert_with_offset 2026-02-20 13:40:12 +00:00
Sami Aario
3ec60914ad Add include statements added by remod.py 2026-02-03 13:52:41 +00:00
Sami Aario
31c91a9535 Formatting changes 2026-02-03 13:52:41 +00:00
Sami Aario
ad2d10a633 Switch to an implementation of DetermineWarpPrecType that explicitly defines the A and B types
- This is for improved clarity and finer control of the datatypes to use
2026-02-03 13:52:41 +00:00
Sami Aario
298fd29fba Add and use load_tile_transpose_convert for mixed precision transpose loading 2026-02-03 13:52:41 +00:00
Sami Aario
7fef648bca Refactor type conversions out of MakeBLdsBlockDescriptor, WIP! 2026-02-03 13:52:41 +00:00
Sami Aario
1b610f4aaf Add type conversions to V4 pipeline, WIP! 2026-02-03 13:52:40 +00:00
Sami Aario
3a792017fb Add functionality and tests for fp16 x fp8 and fp8 x fp16 2026-02-03 13:52:40 +00:00
Sami Aario
f8c4868a59 Add functionality and tests for bf16 x fp8 and fp8 x bf16 2026-02-03 13:52:40 +00:00
Sami Aario
3f4a85146c Add MFMA warp gemm for float, float, float, 32, 32, 16 2026-02-03 13:52:40 +00:00
Sami Aario
7f22e8c66a Add and use load_with_type_convert 2026-02-03 13:52:40 +00:00
Sami Aario
b41ed6e371 Introduce DetermineWarpPrecType for determining warp GEMM precision types 2026-02-03 13:52:40 +00:00
Sami Aario
f2fcc4a461 Add NumAccess as a template parameter to WarpGemmAttributeMfma::get_warp_dstr_encoding 2026-02-03 13:52:40 +00:00
Sami Aario
933e09f6c3 Rename the parameters of load_interleaved_pk_type and load_and_convert_tile 2026-02-03 13:52:40 +00:00
SamiAario-AMD
8c8715904e Merge branch 'develop' into LWPCK-3549-cleanups 2026-02-03 13:28:08 +02:00
Max Podkorytov
3f04d27b68 Remove concrete performance numbers from BUILD_TIME_OPTIMIZATION.md (#3702)
Replace specific benchmark numbers with qualitative descriptions since
measurements vary across environments and may become outdated.

Co-authored-by: Claude <noreply@anthropic.com>
2026-02-03 03:54:18 -07:00
Illia Silin
8b56ffb6ae Fix one more lifetimebound error. (#3703)
* fix staging compiler errors

* fix clang format
2026-02-02 18:25:56 -08:00
Aviral Goel
3e77721755 feat: add split_k support for block scale gemm bquant mode. (#3653)
* WIP: add splitk to bquant

* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types

* chore: remove temporary test script

* fix: incorrect tile window length for split bq tensor window

* chore: improve comments

* test: add unit tests to cover bquant splitk functionality

* fix: conflict resolution by renaming variables
2026-02-02 14:41:53 -08:00
Zoltán Lakatos
301eb5cf08 Implement device grouped gemm fixed nk multi abd for rdna4 (#3619)
* device struct implementation

* added xdl grouped multi abd fixed nk testing

* wmma implementation fixed

* avoid unnecessary device mem allocation and code cleanups

* cleanup instances definitions

* wmma examples added

* code cleanups

* fix clang format

* typo and compilation fixes related to reference gemm

* fix compilation error due to std::remove_cvref_t

* added missing hip_check_error includes

* correction to example instances

* review comments addressed

* removed split-k from testing

* code formatting

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 13:58:11 -08:00
Jan Patrick Lehr
069500464d [Compiler] Addressing new compiler warnings (#3640)
* [Compiler] Addressing new compiler warnings

Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.

The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.

* Update some more instances

* Adds file-level ignores via clang diagnostic pragma

The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.

It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.

* This adds the remaining instances

For a build on gfx90a.

* fix clang format

* Adding couple more instances from gfx1200 build

* Fixed another few instances

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 09:39:48 -08:00
Sami Aario
4eceb2fc69 Fix a build break 2026-02-02 15:24:35 +00:00
Sami Aario
aa247e2d63 Fix a build break 2026-02-02 14:30:39 +00:00
Sami Aario
70be645270 Fix a build break 2026-02-02 10:27:58 +00:00
Sami Aario
348b555cc3 Merge remote-tracking branch 'origin/develop' into LWPCK-3549-cleanups 2026-02-02 10:00:44 +00:00
ZheWang
e6bcd192d4 Mx fp6 flatmm (#3601)
* add fp6 data-type and support sync/async dwordx3 load/store

* clang-format

* pre-commit

* 1st commit

* default mnk pass ut

* fix a distribution

* fix

* fix bdram distr

* update

* pass ut

* improve perf

* update

* clean code

* resolve copilot comment

* resolve comment

* clang-format

---------

Co-authored-by: ZheWang <zhewan@amd.com>
2026-02-02 16:04:40 +08:00
Po Yen Chen
8c1788757a [CK_TILE] Fix incompatible vector type arguments for the intrinsic calls (#3672)
* Change call to the intrinsics

* fix clang format

* Undo changes under include/ck/utility

* Use named variable as vector size

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-01-30 12:02:49 -08:00
ApoorvaKalyani
70d71b1514 Test fix for gemm_b_scale_xdl_v3. (#3674) 2026-01-30 10:34:54 -07:00
jiangyon.ren
4d2f8c111e [CK_TILE][FMHA] Add sparse attention VSA (#3341)
* add sparse attention VSA

* fix the pre-commit

* Add jenga test and pre-commit

* add bf16 for vsa

* add jenga support bf16

* remove lse arg

* split kernel code to block & kernel

* fix the pre-commit

* fix the pre-commit

* fix the copyrights

* fix the copyright

* fix the copyright & rename block to pipeline

* fix the copyright and pipeline

* remove lse & dropout & add fmt

* fix the jenga&VSA code review

* remove the useless code & resolved the comments

* remove useless code

* remove useless code

* Clean up code

* Remove more unused code

* Re-format .hpp

* Refactor codegen scripts

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2026-01-31 00:59:47 +08:00
Kiefer van Teutem
2377a62837 Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529)
* Adding remaining flavors for grouped conv fwd

As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu

* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.

* Do not build f8 / bf8 only flavor tests on RDNA3

* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.

* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.

* adding int8 and fp16 overloads to the elementwise operations

* fixed copilot nits

* Addressing review comments:

- removed unnecessary examples for dynamic op
- removed unnecessary conv specializations for all the flavors
- removed spurious bilinear and scale source files

* clang-format

* reduced no of tests

---------

Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
2026-01-30 17:02:14 +01:00
Erwin Terpstra
6a6177a246 [CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603)
* chore: split block scale example instances in more separate files to speed up compile times

* wip: fp4 scaffolding for abquant

* feat: add fp4 decoding-while-loading to abquant pipeline

* feat: add support for fp4 CPU verification in abquant

* chore: add time tracking to reference calculation

* feat: add a4w4 test for blockscale gemm

* feat: optimize reference calculation by preconverting values to AccType

* feat: add fp4 to fp8 look-up table

* fix: reference to wrong ComputeDataType field in QuantProblem

* feat: type utilities for determining MFMA compute types

* feat: packed fp4 for abquant weight preshuffle

* feat: add separate tests for a4w4 base case, padding and preshuffleB

* fix: fp4 conversion on gfx950 attempting to use non-supported method

* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size

* chore: add fp4 preshuffleb mode to block scale example

* chore: sanity check for packed types being 1 byte

* chore: clarify tensor dimension indices with constants

* chore: replace traits check with specialized check for packed types

* style: some minor refactoring and cleanup

* fix: correct conversion table for FNUZ fp8

* chore: add fp4 instances to main abquant instances again

* chore: use same initialization branch for int4 and fp4

* chore: add missing initialization for fp4 in block scale gemm example

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-30 04:40:50 -07:00
Zoltán Lakatos
565fea2645 fix undefined behaviour in softmax kernel (#3683)
Co-authored-by: root <zoltan.lakatos@streamhpc.com>
2026-01-30 15:22:54 +08:00
MHYangAMD
6ff0737843 Fix redundant cast in model sensitive rmsnorm (#3681)
* Fix redundant cast

* Fix linting
2026-01-30 10:52:19 +08:00
Enrico Degregori
f16d9100e4 Multi AB support for wave transfer (#3578)
* Add multi AB support to wave transfer

* Improvements to multi ABD examples

* Add instances and use intrawave v1 instead of interwave

* Apply changes to other transfers

* Wave transfer: add support for multiple internal vgpr buffers

* Fix compilation error gfx11
2026-01-29 10:29:40 -08:00
Johannes Graner
fabac7e2c3 [Conv] Enable bwd weight splitk autodeduction with cap (#3656)
* Enable bwd weight splitk autodeduction with cap

* Fix error threshold calculations

* Add missing logic to wmma multiple d kernel

* Fix threshold calculation

* Update test with new applicability
2026-01-29 17:40:28 +00:00
Khushbu Agarwal
9b168082b7 [CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629)
* initial commit

* preshuffleQuant support for ABQuant

* fix mxfp4 to use correct QuantGroupSize

* addressing review comments and separated Preshufflequant for A and B

* updated grouped gemm example for updated traits definition

* fix for CI failure

* updated grouped_gemm_abquant test for updated traits definition

* updated grouped_gemm_abquant test for updated traits definition
2026-01-28 19:45:09 -08:00
Jeff Huang
e3556fed04 Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657)
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
  GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-29 07:18:41 +08:00
Bartłomiej Kocot
83b58bb0c3 Grouped Conv Bwd Weight Direct Load (#3648)
* Grouped Conv Bwd Weight Direct Load

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

* Implement group merging for bwd_weight and add instances

* Link direct load instances

* builder fixes

* fix

* fixes

* fix

---------

Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
2026-01-28 15:31:54 -06:00