Commit Graph

722 Commits

Author SHA1 Message Date
Sami Aario
1ee1307d49 Candidate fix 13 2026-02-24 11:26:05 +00:00
Sami Aario
1a5ec9efdb Candidate fix 12 2026-02-24 10:06:23 +00:00
Sami Aario
f35c9da001 Candidate fix 11 2026-02-23 13:54:53 +00:00
Sami Aario
3dad12583b Candidate fix 10 2026-02-23 13:18:40 +00:00
Sami Aario
aebc095d0e Candidate fix 9 2026-02-23 12:22:47 +00:00
Sami Aario
c82c7fe2b3 Candidate fix 8 2026-02-23 10:35:51 +00:00
Sami Aario
c79ab1f84a Candidate fix 7 2026-02-23 09:03:20 +00:00
Sami Aario
1de8bc9501 Candidate fix 6 2026-02-20 16:25:36 +00:00
Sami Aario
a6ffc9c6e5 Candidate fix 5 2026-02-20 13:40:12 +00:00
Sami Aario
e2a85ee7a0 Candidate fix 4 2026-02-20 13:40:12 +00:00
Sami Aario
5d40ac6c1b Candidate fix 3 2026-02-20 13:40:12 +00:00
Sami Aario
709608f843 Candidate fix 2 2026-02-20 13:40:12 +00:00
Sami Aario
de1a228b34 Candidate fix 2026-02-20 13:40:12 +00:00
Sami Aario
8b462b04ce Clear load_tile_transpose_convert_with_offset 2026-02-20 13:40:12 +00:00
Sami Aario
3ec60914ad Add include statements added by remod.py 2026-02-03 13:52:41 +00:00
Sami Aario
31c91a9535 Formatting changes 2026-02-03 13:52:41 +00:00
Sami Aario
ad2d10a633 Switch to an implementation of DetermineWarpPrecType that explicitly defines the A and B types
- This is for improved clarity and finer control of the datatypes to use
2026-02-03 13:52:41 +00:00
Sami Aario
298fd29fba Add and use load_tile_transpose_convert for mixed precision transpose loading 2026-02-03 13:52:41 +00:00
Sami Aario
7fef648bca Refactor type conversions out of MakeBLdsBlockDescriptor, WIP! 2026-02-03 13:52:41 +00:00
Sami Aario
1b610f4aaf Add type conversions to V4 pipeline, WIP! 2026-02-03 13:52:40 +00:00
Sami Aario
3a792017fb Add functionality and tests for fp16 x fp8 and fp8 x fp16 2026-02-03 13:52:40 +00:00
Sami Aario
f8c4868a59 Add functionality and tests for bf16 x fp8 and fp8 x bf16 2026-02-03 13:52:40 +00:00
Sami Aario
3f4a85146c Add MFMA warp gemm for float, float, float, 32, 32, 16 2026-02-03 13:52:40 +00:00
Sami Aario
7f22e8c66a Add and use load_with_type_convert 2026-02-03 13:52:40 +00:00
Sami Aario
b41ed6e371 Introduce DetermineWarpPrecType for determining warp GEMM precision types 2026-02-03 13:52:40 +00:00
Sami Aario
f2fcc4a461 Add NumAccess as a template parameter to WarpGemmAttributeMfma::get_warp_dstr_encoding 2026-02-03 13:52:40 +00:00
Sami Aario
933e09f6c3 Rename the parameters of load_interleaved_pk_type and load_and_convert_tile 2026-02-03 13:52:40 +00:00
SamiAario-AMD
8c8715904e Merge branch 'develop' into LWPCK-3549-cleanups 2026-02-03 13:28:08 +02:00
Aviral Goel
3e77721755 feat: add split_k support for block scale gemm bquant mode. (#3653)
* WIP: add splitk to bquant

* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types

* chore: remove temporary test script

* fix: incorrect tile window length for split bq tensor window

* chore: improve comments

* test: add unit tests to cover bquant splitk functionality

* fix: conflict resolution by renaming variables
2026-02-02 14:41:53 -08:00
Jan Patrick Lehr
069500464d [Compiler] Addressing new compiler warnings (#3640)
* [Compiler] Addressing new compiler warnings

Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.

The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.

* Update some more instances

* Adds file-level ignores via clang diagnostic pragma

The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.

It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.

* This adds the remaining instances

For a build on gfx90a.

* fix clang format

* Adding couple more instances from gfx1200 build

* Fixed another few instances

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-02 09:39:48 -08:00
Sami Aario
4eceb2fc69 Fix a build break 2026-02-02 15:24:35 +00:00
Sami Aario
aa247e2d63 Fix a build break 2026-02-02 14:30:39 +00:00
Sami Aario
70be645270 Fix a build break 2026-02-02 10:27:58 +00:00
Sami Aario
348b555cc3 Merge remote-tracking branch 'origin/develop' into LWPCK-3549-cleanups 2026-02-02 10:00:44 +00:00
ZheWang
e6bcd192d4 Mx fp6 flatmm (#3601)
* add fp6 data-type and support sync/async dwordx3 load/store

* clang-format

* pre-commit

* 1st commit

* default mnk pass ut

* fix a distribution

* fix

* fix bdram distr

* update

* pass ut

* improve perf

* update

* clean code

* resolve copilot comment

* resolve comment

* clang-format

---------

Co-authored-by: ZheWang <zhewan@amd.com>
2026-02-02 16:04:40 +08:00
Po Yen Chen
8c1788757a [CK_TILE] Fix incompatible vector type arguments for the intrinsic calls (#3672)
* Change call to the intrinsics

* fix clang format

* Undo changes under include/ck/utility

* Use named variable as vector size

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-01-30 12:02:49 -08:00
jiangyon.ren
4d2f8c111e [CK_TILE][FMHA] Add sparse attention VSA (#3341)
* add sparse attention VSA

* fix the pre-commit

* Add jenga test and pre-commit

* add bf16 for vsa

* add jenga support bf16

* remove lse arg

* split kernel code to block & kernel

* fix the pre-commit

* fix the pre-commit

* fix the copyrights

* fix the copyright

* fix the copyright & rename block to pipeline

* fix the copyright and pipeline

* remove lse & dropout & add fmt

* fix the jenga&VSA code review

* remove the useless code & resolved the comments

* remove useless code

* remove useless code

* Clean up code

* Remove more unused code

* Re-format .hpp

* Refactor codegen scripts

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2026-01-31 00:59:47 +08:00
Erwin Terpstra
6a6177a246 [CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603)
* chore: split block scale example instances in more separate files to speed up compile times

* wip: fp4 scaffolding for abquant

* feat: add fp4 decoding-while-loading to abquant pipeline

* feat: add support for fp4 CPU verification in abquant

* chore: add time tracking to reference calculation

* feat: add a4w4 test for blockscale gemm

* feat: optimize reference calculation by preconverting values to AccType

* feat: add fp4 to fp8 look-up table

* fix: reference to wrong ComputeDataType field in QuantProblem

* feat: type utilities for determining MFMA compute types

* feat: packed fp4 for abquant weight preshuffle

* feat: add separate tests for a4w4 base case, padding and preshuffleB

* fix: fp4 conversion on gfx950 attempting to use non-supported method

* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size

* chore: add fp4 preshuffleb mode to block scale example

* chore: sanity check for packed types being 1 byte

* chore: clarify tensor dimension indices with constants

* chore: replace traits check with specialized check for packed types

* style: some minor refactoring and cleanup

* fix: correct conversion table for FNUZ fp8

* chore: add fp4 instances to main abquant instances again

* chore: use same initialization branch for int4 and fp4

* chore: add missing initialization for fp4 in block scale gemm example

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-30 04:40:50 -07:00
MHYangAMD
6ff0737843 Fix redundant cast in model sensitive rmsnorm (#3681)
* Fix redundant cast

* Fix linting
2026-01-30 10:52:19 +08:00
Khushbu Agarwal
9b168082b7 [CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629)
* initial commit

* preshuffleQuant support for ABQuant

* fix mxfp4 to use correct QuantGroupSize

* addressing review comments and separated Preshufflequant for A and B

* updated grouped gemm example for updated traits definition

* fix for CI failure

* updated grouped_gemm_abquant test for updated traits definition

* updated grouped_gemm_abquant test for updated traits definition
2026-01-28 19:45:09 -08:00
Jeff Huang
e3556fed04 Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657)
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
  GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-29 07:18:41 +08:00
Sami Aario
fc1b683d18 Fix a build break 2026-01-28 15:44:02 +00:00
SamiAario-AMD
d0e9dc510e Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-28 17:14:23 +02:00
Yi DING
8e3d84aba3 [CK_TILE] ABQuant New Preshuffle (#3638)
* Refactor

* Gemm quant improvement

* Change preshuffle

* Fix

* Fix grouped gemm ut

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-27 23:46:49 -08:00
damien-lejeune
91e32f305f [CK Tile] multi reduce improvements (#3607)
* WIP: refactoring

* Swap operation/data nested loops order

* Improve memory coalescing

* Add comments

* Enforce same identity element for the reduce operations

* Re-add compile time constant

* Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
2026-01-27 12:56:09 -08:00
Illia Silin
b26cb596b0 fix some syntax errors (#3658) 2026-01-27 09:59:39 -08:00
SamiAario-AMD
72fa29bad5 Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-27 15:39:38 +02:00
Bartłomiej Kocot
3d67e6c492 [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624)
* [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err

* Update test_grouped_convnd_fwd_tile.cpp

* Update test_grouped_convnd_fwd_tile.cpp

* Update conv_tuning_params.hpp

* clang format fix

* Update CMakeLists.txt
2026-01-27 11:04:11 +02:00
Aviral Goel
b8751e505d feat: Add Interwave scheduler for aquant memory pipeline (#3540)
* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipeline in aquant

* test: add unit test for aquant memory pipeline

* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipeline in aquant

* fix: compilation error on gfx950

* chore: remove debug statements from the code

* test: resolve merge conflict

* test: remove non rcr unit tests from test suite
2026-01-26 11:27:42 -08:00
Thomas Ning
3900e1e7ce Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634)
* Finished the work

* Fix the pipeline
2026-01-26 10:29:28 -08:00