Commit Graph

232 Commits

Author SHA1 Message Date
Sami Aario
348b555cc3 Merge remote-tracking branch 'origin/develop' into LWPCK-3549-cleanups 2026-02-02 10:00:44 +00:00
ZheWang
e6bcd192d4 Mx fp6 flatmm (#3601)
* add fp6 data-type and support sync/async dwordx3 load/store

* clang-format

* pre-commit

* 1st commit

* default mnk pass ut

* fix a distrubution

* fix

* fix bdram distr

* update

* pass ut

* improve perf

* update

* clean code

* resolve copilot comment

* reslove comment

* clang-format

---------

Co-authored-by: ZheWang <zhewan@amd.com>
2026-02-02 16:04:40 +08:00
jiangyon.ren
4d2f8c111e [CK_TILE][FMHA] Add sparse attention VSA (#3341)
* add sparse attention VSA

* fix the pre-commit

* Add jenga test and pre-commit

* add bf16 for vsa

* add jenga support bf16

* remove lse arg

* split kernel code to block & kernel

* fix the pre-commit

* fix the pre-commit

* fix the copyrights

* fix the copyright

* fix the copyright & rename block to pipeline

* fix the copyright and pipeline

* remove lse & dropout & add fmt

* fix the jenga&VSA code review

* remove the useless code & resolved the comments

* remove useless code

* remove useless code

* Clean up code

* Remove more unused code

* Re-format .hpp

* Refactor codegen scripts

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2026-01-31 00:59:47 +08:00
Erwin Terpstra
6a6177a246 [CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603)
* chore: split block scale example instances in more separate files to speed up compile times

* wip: fp4 scaffolding for abquant

* feat: add fp4 decoding-while-loading to abquant pipeline

* feat: add support for fp4 CPU verification in abquant

* chore: add time tracking to reference calculation

* feat: add a4w4 test for blockscale gemm

* feat: optimize reference calculation by preconverting values to AccType

* feat: add fp4 to fp8 look-up table

* fix: reference to wrong ComputeDataType field in QuantProblem

* feat: type utilities for determining MFMA compute types

* feat: packed fp4 for abquant weight preshuffle

* feat: add separate tests for a4w4 base case, padding and preshuffleB

* fix: fp4 conversion on gfx950 attempting to use non-supported method

* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size

* chore: add fp4 preshuffleb mode to block scale example

* chore: sanity check for packed types being 1 byte

* chore: clarify tensor dimension indices with constants

* chore: replace traits check with specialized check for packed types

* style: some minor refactoring and cleanup

* fix: correct conversion table for FNUZ fp8

* chore: add fp4 instances to main abquant instances again

* chore: use same initialization branch for int4 and fp4

* chore: add missing initialization for fp4 in block scale gemm example

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-30 04:40:50 -07:00
Jeff Huang
e3556fed04 Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657)
- Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather
- Add is_gather_dim() and get_gather_index() for multi-dim page lookup
- Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match
  GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1})
- Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread
- Add static_assert for RowMajor VLayout requirement in batch prefill

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-29 07:18:41 +08:00
SamiAario-AMD
d0e9dc510e Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-28 17:14:23 +02:00
Yi DING
8e3d84aba3 [CK_TILE] ABQuant New Preshuffle (#3638)
* Refactor

* Gemm quant improvement

* Change preshuffle

* Fix

* Fix grouped gemm ut

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-27 23:46:49 -08:00
Illia Silin
b26cb596b0 fix some syntax errors (#3658) 2026-01-27 09:59:39 -08:00
SamiAario-AMD
72fa29bad5 Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-27 15:39:38 +02:00
ltqin
67f0b74ec6 Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit de5a1d730d.

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-23 09:03:22 -08:00
Po Yen Chen
de5a1d730d Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit dd0b4294af.
2026-01-22 21:21:19 -08:00
kensclin
31a35ecab4 GEMM Blockscale ABQuant Optimization (#3620)
* GEMM Blockscale ABQuant Optimization

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix precommit error

* clean

* Fix

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2026-01-22 09:39:38 -08:00
ltqin
dd0b4294af Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel

* add block scale to kernel

* add smoke test

* format

* Revert "format"

This reverts commit 356c3c9706.

* only format my code

* format py

* fix auto not allowd in function prototype

* change instance tttt to ttff

* fix structured binding issue

* change s_acc elementwise op

* async pipeline add block scale

* add quantation P using shift exp2

* precompute (m - shift) once per row

* change blk scale seqstrt ptr name

* fix some name

* fix for  deduction guide

* fix some comments

* add P scale to qr_ksvs_pipeline

* add comment to idx_identity

* change the method of calculating descale block index

* unify naming style: use block_scale_ as name prefix

* unify naming style

* update the CHANGELOG.md

* Add FP8 block scale quantization support for FMHA forward kernel

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-21 20:58:26 -08:00
SamiAario-AMD
4b26eac92f Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-21 10:32:45 +02:00
Max Podkorytov
91b4102a59 Add persistent async input scheduler for GEMM kernels (#3520)
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
  chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks

Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
  chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
2026-01-20 10:37:09 -08:00
SamiAario-AMD
d71fe5b98c Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-19 17:16:37 +02:00
Adam Osewski
1a6d1b59ef [CK_BUILDER] Convolution forward transfer concepts. (#3535)
* Rename member variable to better reflect its actuall meaning.

* Add transfer checks for conv fwd xdl.

* Validate tensor layouts & vector size conv fwd v3.

* Add combined transfer concepts.

* Add transfer concepts for conv fwd factories.

* Fix clang format

* Add helper instruction to get max mem vector instruction width.

* Apply review comments.

* Rename thread cluster access(->arrange) order concept

* FIx merge artifacts.

* Add generic access order limits into block transfer concept.
2026-01-19 10:54:10 +01:00
SamiAario-AMD
ea4e543555 Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-14 10:59:05 +02:00
Thomas Ning
00c46785a8 Shuffle fix for gfx950 (#3491)
* solve compiler issue

* solve the gfx950 mfma shuffle regression

* refactor jenkinsfile to handle arch name better

* [CK TILE] set divisor to count of thread along k dimension

* fix the compiler error

* solve degradation

* Finish the multiplies fix

* fix the scales

* solve compilation error

* solve the composes

* solve the error of tile sweeper

* fix the test and example

* fix for gfx950

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2026-01-13 09:21:29 -08:00
damien-lejeune
4216d43da8 Dlejeune/ck tile 2d multiple reductions (#3147)
* WIP

* Add Unit tests for the Multi Reduction Kernel

* clang format

* Rename multiblock to threadwise

* Multiblock WIP

* Fix multi reduce multi block unit tests

* Multi Reduce Tile Engine: WIP

* refactoring + try addressing precision error

* Fix multiops examples

* Cleanup

* Clean up tile engine's reduce op

* Update changelog

* Fix remod/clang

* Fix dates

* Fix documentation & missing file

* Fix comments

* Use the update_tile api in the multi-block kernel

* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests

* Add TileParitioner

* Cleanup

* Add warning when no data to process, in the example

* Refactoring Reduce kernel Tile Partioner + cleanup

* Move the tile partioner to its own file

* Add missing includes

* Fix copyright header with update_amd_copyright_headers.py

* Fix change of interface in Reduce2dProblem

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-01-09 11:16:37 +01:00
Sami Aario
321611081f Remove an unused overload of load_tile_transpose_with_offset 2026-01-07 19:44:00 +00:00
Sami Aario
8fc4030a57 Add an instance of load_tile_transpose that takes a reference to the output tensor as an input 2026-01-07 19:44:00 +00:00
Sami Aario
9559a93432 Make explicit that the tile window argument to load_tile_with_elementwise and the two load methods it uses are tuples 2026-01-07 16:21:32 +00:00
Sami Aario
825d17c3d7 Fix a comment 2026-01-07 16:21:32 +00:00
Sami Aario
4d77856be5 Make some functions return void explicitly instead of auto 2026-01-07 16:21:31 +00:00
Cong Ma
d7497d2694 [CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512)
Refactor function amd_buffer_load_invalid_element_return_zero to avoid
the inefficient ASM code generated by compiler.

Compiler generates suboptimal assembly for ternary operator, causing excessive VGPR usage

Tested compilers:
- Rocm 7.0.1
- Rocm 7.1.1

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-07 00:05:56 -08:00
joyeamd
b78563b3d3 Merge some updates for ck_tile headers (#3342)
* fix some issues from internal branch

* update cshuffle_epilogue

* update cshuffle_epilogue

* update cshuffle

* update warp_gemm
2026-01-05 23:39:00 -08:00
Estevan Vedovelli
1224bc0a82 Add support to gfx1153 and fix gfx115X WMMA config (#3496)
* Support for gfx115X

* Changes for gfx115X

* Add gfx1153

* Update changelog

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-05 10:03:30 -08:00
Illia Silin
14668a56e3 remove the LLVM_MAIN_REVISION usage (#3487) 2025-12-24 16:49:35 -08:00
Jan Patrick Lehr
9bd67c2cf2 [CK-TILE] Guard against compiler lexer diagnostic (#3444)
* [CK-TILE] Guard against compiler lexer diagnostic

A recent change to Clang added a lexer-level diagnostic about that C2y
language feature. Since that is lexer level, the `__extension__`
compiler built-in does not work as it is only respected *after* the
lexer when parsing.

This change adds guarding pragmas to disable the diagnostic in the
lexer and not lead to warnings being treated as errors.

* Fixing still existing build issue

Once the one warning was removed, another one poppoed up. Both are
related to the same c2y feature. Thus, ignoring both.

* clang-format handling

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-19 17:32:20 -08:00
Illia Silin
2d9c962e2c get LLVM_MAIN_REVISION macro from compiler header (#3469) 2025-12-19 14:57:12 -08:00
Yi DING
2220cbaba7 [CK_TILE] MX Flatmm Use Byte Pointer Arithmetic for A Tensor (#3446)
* A as bytes

* Reformat with static_for_product
2025-12-19 10:28:13 +08:00
yadaish
c0ee71d735 Dev/a8w4 and a8w8splitk (#3447)
* Ck moe bs splitk pr (#3440)

* splitk kick-off. Compilation fail

* splitk hack pass

* fix scale offset calc.

* clang-format for a8w8_moe_blk_gemm1 splitk change

* fix testcase error

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>

* Zan/moe a8w4 (#3441)

* update

* update

* update ck moe a8w4

* update

* update

* update

* compile pass

* update

* update

* python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready

* support new a8w4 kernel

* update

* update ck_tile

* re format

* update

* update

* fix conflict

* fix build

* update ck_tile moe

* fix clang format

* fix the problem

* fix accruacy issue

* fix

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: huaiguxu <145733371+huaiguxu@users.noreply.github.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-12-19 09:26:52 +08:00
Bartłomiej Kocot
700b2ec9c0 Update AMD buffer coherency (#3403)
* Update AMD buffer coherency [AICK-421]

* fixes

* fix

* fixes

* fixes

* Add backward compatilibity

* fix

* fixes

* fix

* fix

* fix

* Update grouped_convolution_backward_weight_kernel.hpp
2025-12-18 10:16:22 +01:00
ltqin
92653168c2 flashattention fwd add (80, 96) instance (#3415)
* add hdim (96,96) instance

* change to (80,96)

* format py

* remove 96 in optdim

* when N=6 change to llvm_amdgcn_raw_buffer_load_i32x3
2025-12-17 09:16:11 -08:00
Aviral Goel
5e2d25e20f build: reduce build time for bquant tests by splitting into multiple cpp & support on other gfx10 case (#3395)
* build: reduce build time for bqaunt unit tests by splitting into multiple cpp

* reduce the test case & add the gfx10 support

* fix: copyright header for new file

* chore: add copyright to pass the CI

* build: Hot fix to reduce massive build time by just disabling the instances

* Update include/ck_tile/core/config.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: khushbu <khuagarw@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-15 07:19:29 -08:00
Illia Silin
b2925ee207 Fix compilation errors with latest clang22 version. (#3396)
* remove target attributes from deduction guides

* switch CK_TILE_HOST_DEVICE_EXTERN based on clang version
2025-12-11 08:09:29 -08:00
eliotwang
715671e419 Bf16*fp4 gemm (#2801)
* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* support bf16*mxfp4 gemm

* rebase bf16*fp4 example to develop branch

* Clean up commented debug code in GEMM kernel

* rename example folder

* rebase to new develop

* fix clang format

* update code according to reviewer's comment

* Update README.md

* update code according to reviewer's comment

* update code according to reviewer's comment

* Update CMakeLists.txt

* Update README.md

* Update CMakeLists.txt

* Delete files

* Delete files

* Add unit tests

* Update test_gemm_quant_base.hpp

* merge bf16*fp4 example to develop branch

* fix clang format

* fix clang format

* Update CMakeLists.txt

* fix ci test

* fix clang format

* resolve conflicts

---------

Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-11 07:20:29 -08:00
Zzz9990
1aa93ef551 [CK_TILE MOE] add NT & preshuffle permute to cktile MOE (#3377)
* update coherence
---------

Co-authored-by: Zzz9990 <Zzz9990>
2025-12-10 10:03:28 +08:00
Thomas Ning
86a84ae611 Add the gfx1011 support on CK Tile with the SGPR builtin reading protection (#3350)
* Finish the fixes

* add the gfx1010 support macro

* Fix the compilation error
2025-12-05 14:18:30 -08:00
Po Yen Chen
05292b3604 [CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153)
* Let fmha_fwd_v3() compatible with fmha_fwd()

* Decouple get_fwd_blobs() and FmhaFwdKernel

* Decouple compatibility checks from get_fwd_blobs()

* Extract product feature checks out from get_fwd_blobs()

* Remove duplicated code in factories and redundant checks

* Remove FmhaFwdKernel<>::GetName()

* Let FmhaFwdApiPool support pipelines with different mask_impl

* Add tile setting for fmha fwd v3 pipeline

* Add fwd v3 instances to tile_example_fmha_fwd manually

* Remove unused function import

* Undo irrelevant changes

* Remove fwd v3 instances from tile_example_fmha_fwd

* Finish fmha fwd v3 kernel instance codegen

* Fix formatting

* Remove unused F_idx attribute

* Add is_generic_attention_mask<> traits

* Add constraints to the fmha fwd v3 pipeline

* Unify traits & problem used for fmha fwd v3

* Unify kernel launch code for fmha fwd v2 & v3

* Unify kernel template selection logic

* Use same kernel codegen template for both v2 & v3

* Rename api() property as render() method

* Allow specifying filter for fmha fwd api pool

* Allow specifying function name when rendering api pool items

* Separate fmha fwd v3 kernel dispatching logic from v2

* Remove lambda assignment

* Add simple v2/v3 dispatch logic

* Stop generating empty if-clauses

Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them.

* Use "".join() to concatenate fmha fwd api string content

* Add more feature checks for fmha fwd v3 pipeline

* Check features before dispatch to fmha_fwd_v3()

* Add more feature checks for fmha_fwd_v3()

* Add missing filter call

* Use Tuple to reserve the dtype orders

* Fix wrong pipeline matching logic

* Add fmha fwd v3 group mode instances

* Add functor_transform<>

* Add type constraints to make_tile_window()

* Remove fmha fwd v3 example

* Fix wrong product(aiter mha_fwd()) config

* Fix wrong fmha fwd v2/v3 selection logic

* Fix formatting

* Add comment to warning v3 kernel users

* Fix wrong codegen logics

* Remove unnecessary param

* Fix format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-05 10:31:12 +08:00
Yi DING
f211156ce6 [CK_Tile] Flatmm MX Cleanup & Explicite Offset Calculation (#3286) 2025-12-02 14:21:12 +08:00
Yi DING
9ed9539ddf [CK_TILE] Disable cast_tile_pk_fp16bf16_fp32 as It Causes Extra spills on Recent Compilers (#3327) 2025-12-01 14:48:22 +08:00
msaffari-amd
f875ab0bbc Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic-add (#3236)
* Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic

* correct clang-format

* removed unused rtol_atol variable from example code

* clang format correction

* remove unused varable max_accumulated_value from example
2025-11-28 09:43:01 +01:00
Matthias Gehre
678298d4c7 Add support for gfx1153 (#3306) 2025-11-27 08:48:00 +01:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
Christopher Millette
b9c6cb1452 First look at mfma / wmma unification (#2704)
* First look at mfma / wmma unification

* Refactor

* Re-org file structure

* Restructure transform selection and WaveWiseMma class

* Update license files. Add missing gfx1151 support. Change wave size for HOST to 1. Update datatypes naming consistency

* Fixes default MmaSelector implentation

* Adds unit tests for amdgcn_mma and arch

* Consolidate common arch id checks to constexpr functions. Strongly type ids as amdgcn_target_arch_id object.

* Refactor is_any_value_of

* Fixes mma_selector logic

* Fix typo

* Add mma selector test for tile decomposition

* Fix compilation of mma.hpp

* Revert back to c++17 compatibility

* Fix compiler error by returning index_t from get_warp_size()

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes compiler error for missing is_wave32() function

* Fixes compiler error for host wave_size() should be 64

* Fixes compiler errors where __cpp_concepts is not defined

* Fixes compiler errors where __cpp_concepts is not defined

* Fix test failure for host is wave64 by default

---------

Co-authored-by: Chris Millette <you@example.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-24 09:39:59 -08:00
Yi DING
8b284a63a4 [CK_TILE] Refine FP32 => FP16/BF16 Conversion (#3215)
* [CK_TILE] Refine FP32 => FP16/BF16 Conversion

* Thank you Copilot

* Rename fix

* Fix example

* Fix accu checking

* Fix

* Fix
2025-11-20 10:50:26 -08:00
Gavin Zhao
07314ac543 Add support for RDNA1 GPUs (#3220)
* Allow compilation for RDNA1 (__gfx101__)

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* More RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* Even more RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* cmake: skip build quantization for unsupported arches

* add gfx10-1-generic support as well

* add gfx1013 and complete gfx10-1-generic

* fix clang format

* enable DL kernels on gfx101x

---------

Signed-off-by: Gavin Zhao <git@gzgz.dev>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-11-20 10:45:57 -08:00
Yi DING
47e2ed838e [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~
2025-11-20 10:35:15 +08:00