Commit Graph

3095 Commits

Author SHA1 Message Date
Bartłomiej Kocot
a07c8e38bd Fix grouped conv bwd data wmma check (#3562) 2026-01-14 11:04:37 -08:00
Khushbu Agarwal
118afa455c [CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* G128 working for both prefill and decode shapes for preshufflequant

* clean up after merging with develop

* fixing group 64 for decode shapes

* non preshufflequant working for group size 128

* enable preshuffleb and preshufflequant with variour group sizes

* reduce build time by splitting example into diff datatype files

* Adding tests for preshuffleQuant

* address review comment

* fix for gfx1201

* compile time fix for gfx1201

* clang formatted

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-14 10:00:19 -08:00
Ville Pietilä
1fc5a3f3ac Build CK on Windows (#3458)
* CMakeLists.txt hack for Windows.

* Add Windows build instructions.

* Fix  type issue with variadic min function.

* Use std::common_type to fix the variadic min/max functions.

* Enable CPU guard compilation on Windows.

* Suppress warnings related to std::getenv on Windows platform.

* Git ignore the output directory on Windows platform.

* Powershell script for running tests and generating reports.

* Improve test logging.

* Disable non-conv tests.

* Fix Debug build on Windows.

* More debug build changes.

* Update Windows build instructions.

* Enable all tests.

* Test fixes.

* Suppress not found linker options warning.

* Update unsigned long literals and format specifiers to work correctly in Windows

* Fix conv 3D bwd weight bilinear tests on Windows.

* Revert changes on .gitignore.

* Clean-up CMake project file for Windows builds.

* clang-format

* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.

* Fix building examples on Windows.

* Update Readme.

* Remove the suppression of the deprecated warnings.

* Remove Windows specific min/max implementations from CK Tile math core.

* Remove unnecessary no-op on Windows.

---------

Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>
2026-01-14 07:31:45 -08:00
Johannes Graner
f173642087 [CK] Refactor GPU verification kernel to gather error stats on GPU (#3551)
* Refactor GPU verification kernel to gather erorr stats on GPU

* Check if result is all zero

* non-negative error count doesn't need custom Atomics

* Remove unnecessary AtomicMaxFloat function

* Simpler warp reduction, remove passed flag

* Move verification header to include

* Fix header path in test

* Fix block reduction loop
2026-01-14 16:04:50 +01:00
Johannes Graner
3ccb15ea02 [CK Profiler] Initialize tensors on GPU in CK profiler (#3550)
* Initialize tensors on GPU in CK profiler

* Kick CI
2026-01-14 16:04:14 +01:00
Linjun-AMD
717ed0b59f [CK_TILE][FMHA] Enable gpt-oss sink (#3490)
* Enable gptoss sink

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add gptoss sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update CHANGELOG.md

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix test args error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update test_fmha_fwd.cpp

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Revert "update sink test"

This reverts commit 970b4f1686.

* update sink test

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update valid sink_v in splitkv pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* fix lse error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix clangformat error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix aiter scale error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qr_ks_vs.hpp

* div scale_s for sink_value

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* update sink_value with bias

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Fix typo in dropout parameter in fmha_batch_prefill_kernel

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update example_fmha_fwd.cpp

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* optimized some code

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* fix splitkv error

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update sink reference

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update fmha_fwd_runner.hpp

* Update smoke_test_fwd_sink.sh

---------

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-01-14 21:32:06 +08:00
Enrico Degregori
693ff3bbb3 Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465)
- Add support for direct store in epilogue instead of cshuffle
 - Add padding support for wave transfer without transpose
 - Add wave transfer with interleaved layout to support direct store
 - Enable new functionalities on GEMMs
 - Add optional new functionality support for grouped convolution fwd
 - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)
2026-01-14 11:02:19 +01:00
Thrupti Raj Lakshmana Gowda
51027474af [CK TILE ENGINE] CI fix for Basic Tile Engine (#3554)
* memory op changes

* memory op changes

* Fixing TILE_ENGINE_BASIC in Tile Engine

* Removing gfx90a from Tile Engine Run

* [CK TILE ENGINE] increasing ci configs for BASIC case

* Setting RUN_TILE_ENGINE_BASIC_TESTS to ON by default

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-01-13 16:20:30 -08:00
Thomas Ning
00c46785a8 Shuffle fix for gfx950 (#3491)
* solve compiler issue

* solve the gfx950 mfma shuffle regression

* refactor jenkinsfile to handle arch name better

* [CK TILE] set divisor to count of thread along k dimension

* fix the compiler error

* solve degradation

* Finish the multiplies fix

* fix the scales

* solve compilation error

* solve the composes

* solve the error of tile sweeper

* fix the test and example

* fix for gfx950

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2026-01-13 09:21:29 -08:00
Ville Pietilä
9908a87c31 [CK_BUILDER] Add bwd weight factories (#3509)
* Add placeholder test.

* Initial conv bwd weight factory.

* Conv builder test refactoring.

* Add missing pieces to bwd weight factory.

* Improve compile time erros message when no matching factory is found.

* Use amcro to ensure automatic macthing between concepts are their string representations.

* Improve compile time diagnostics.

* Small improvements.

* Improve missing member/wrong type compile-time errors.

* Improve compile time diagnostics.

* Concept bug fixes.

* Remove debug assert.

* Update algorithm signature diagnostics.

* Factory bug fixes.

* First functional version of bwd weight conv factory.

* Refactor handing of GEMM-K batch template parameter in conv bwd weight factory.

* Concept improvements.

* Improve concept diagnostics.

* Introduve a common size type for concepts.

* Update compiletime diagnostics to use the size type.

* Update conv specialization enum.

* Fix fwd conv builder tests.

* Fix smoke tests.

* Separate bwd weigth and bwd data tests into separate targets.

* Clean-up CK Tile builder tests.

* Add bwd weight XDL CShuffle V3 factory.

* Build conv bwd weigth v3 instances successfully.

* Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.

* Test fix.

* Add instance traits for bwd weight algorithms.

* Add unit tests for instance strings.

* Build new instance traits unit tests but exclude WMMA for now.

* Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.

* Conv bwd weight DL factory.

* Final implementation for bwd weight DL factory.

* Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance.

* Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle

* Treat ref algorithm the same way as real algorithms in the dispatcher.

* Refactor large tensor support and WMMA configuration.

* Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3.

* Update Readme.

* Fix WMMA bwd weight tests.

* Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3.

* Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle.

* Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle.

* Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3

* Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and  compute types for input and output tensor in bwd weigth convs.

* Fix fwd factories after refactoring.

* clang-format

* Move compile-time diagnostics to a separate branch.

* Fix ref algorithm dispatching.

* Fix smoke tests.

* clang-format

* Fix factory for regular WMMA conv bwd weight.

* Clarify builder Readme.

* Remove obsolete test file.

* Fix test after merge.

* clang-format

* Remove the C++26 extensions.

* Unify conv elementwise ops and layout definitions for fwd and bwd directions.

* Remove old layout and elementwise ops.

* Unify handling of conv tensor types between fwd and bwd directions.

* Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank.

* Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms.

* clang-format

---------

Co-authored-by: Ville Pietilä <>
2026-01-13 18:12:38 +02:00
Po Yen Chen
710fa1fd3d fix incorrect List import in reduce_parameter.py (#3555) 2026-01-13 20:03:05 +05:30
Erwin Terpstra
eb041079a3 Implement grouped gemm tile loop for RDNA4 (#3304)
* feat: grouped gemm tile loop support for RDNA4

* fix: removed extra parameter from grouped gemm example instance

* fix: FP8 check incorrectly enabling FP8 on RDNA3
2026-01-13 07:14:23 +01:00
Jeff Huang
141f77aa12 [CK Tile] Fix FMHA LSE calculation and potential division by zero (#3326)
This commit addresses numerical stability issues in the BlockFmhaPipelineQRKSVS pipeline when bias has -inf masking values:
1. Explicitly handle the case where the accumulated exponential sum (l) is zero. In this case, the LSE is now correctly set to negative infinity, preventing log(0) errors.
2. Extend the zero-check protection in the normalization step to cover the ELEMENTWISE_BIAS case, preventing potential division by zero.
2026-01-13 13:52:26 +08:00
Jeff Huang
c9f112b026 [FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545)
- Enable page_size=1 support in batch prefill codegen (linear layout only).
- Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly.
- Relax `kPageBlockSize` alignment assertion for the page_size=1 case.
2026-01-13 12:04:43 +08:00
ZheWang
a575acb245 fix mxfp8-gemm example failure (#3531)
Co-authored-by: ZheWang <zhewan@amd.com>
2026-01-13 10:26:45 +08:00
Aviral Goel
5aaa031350 WIP: extract MakeALdsDescriptor() from child to parent class for code readability (#3392)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-12 09:51:58 -08:00
Aviral Goel
e809861d49 refactor: remove Default scheduler implementation as it not used anymore (#3542)
* refactor: remove Default scheduler implementation as it not used anymore

* refactor: remove dead code from gemm universal kernel

* chore: add descriptive comments about amd intrinsic hardware sync instructions

* fix: label existing memory pipeline for aquant as intrawave
2026-01-12 09:51:06 -08:00
Johannes Graner
18c2ff6019 [CK profiler] Perform verification on GPU when using GPU reference (#3482)
* Simple verification kernel for ckProfiler

* Verification kernel unit tests

* Explicit synchronization

* Address review comments
2026-01-12 12:12:41 +01:00
kabrahamAMD
20f66c1e6b adressed review comments from PR3459 (#3526)
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
2026-01-12 09:47:00 +01:00
Robin Voetter
b352a68606 ck-builder: tensor input/output reflection (#3536)
This adds some utilities to automatically generate UniqueInputs,
UniqueOutputs, alloc_inputs, alloc_outputs, and validate, based
on a Inputs::reflect() and Outputs::reflect().
2026-01-12 09:45:53 +01:00
yadaish
32408c8bc0 moe fp8 blockscale use nt (#3524)
* nt on fp8 blockscale

* some improve and tests needs to be fixed

* update

* fix format

* revert useless change

* revert any change in amd_buffer_coherence
2026-01-12 10:48:10 +08:00
damien-lejeune
4216d43da8 Dlejeune/ck tile 2d multiple reductions (#3147)
* WIP

* Add Unit tests for the Multi Reduction Kernel

* clang format

* Rename multiblock to threadwise

* Multiblock WIP

* Fix multi reduce multi block unit tests

* Multi Reduce Tile Engine: WIP

* refactoring + try addressing precision error

* Fix multiops examples

* Cleanup

* Clean up tile engine's reduce op

* Update changelog

* Fix remod/clang

* Fix dates

* Fix documentation & missing file

* Fix comments

* Use the update_tile api in the multi-block kernel

* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests

* Add TileParitioner

* Cleanup

* Add warning when no data to process, in the example

* Refactoring Reduce kernel Tile Partioner + cleanup

* Move the tile partioner to its own file

* Add missing includes

* Fix copyright header with update_amd_copyright_headers.py

* Fix change of interface in Reduce2dProblem

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-01-09 11:16:37 +01:00
Robin Voetter
e3884bbf05 [CK_BUILDER] Debug utilities (#3528)
* ck-builder: make toString to_string

We are using snake case for CK-Builder

* ck-builder: add debug.hpp with tensor descriptor printing function

This adds some initial functionality to debug.hpp, a header which will
be used to house some debug utilities.

* ck-builder: abstract nd-iteration

Abstracting this makes it easier to test, clearer, and allows us to
use it elsewhere (such as in debug.hpp soon)

* ck-builder: tensor printing

* ck-builder: rename INT32 to I32

This makes it more in line with the other data type definitions.
2026-01-08 10:14:13 +01:00
Thrupti Raj Lakshmana Gowda
770a14494e Removing memop from chshuffle (#3530) 2026-01-07 23:34:43 -08:00
Johannes Graner
ee2c35b92d [CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169)
* Take split_k into account when checking 2GB tensor limit.

* Revert "Take split_k into account when checking 2GB tensor limit."

This reverts commit adf35c91be.

* Optimize grouped conv bwd wei split_k off calc

(cherry picked from commit 6f61dd56c5)

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

(cherry picked from commit b33877c10f)

* Fix tensor descriptors and stride calculations

* Don't miss half of the elements

* Fix buffer size calculations

* Disable hack if stride not divisible by k_batch

* Clean up comments

* Disallow hack in non-contiguous edge cases

* Index -> Dim

* Fix broken test

* Refactor applicability checks into separate function

* fix missed variable name

* Fix variable name in info print

* update V3 2GB check

* No more regression, use templates instead

* Code deduplication

* Regression fix for cshuffle

* arch-guarded atomic_add implementations for gfx11

* Similar for half(4|8)_t as well

* Only use both offset hacks at the same time

* Revert "arch-guarded atomic_add implementations for gfx11"

This reverts commit 3883fe6935.
This reverts commit 5311ec608d.

* Reapply "arch-guarded atomic_add implementations for gfx11"

This reverts commit 1972adeddc.

* Only remove float4 atomic_add

* Refactor to single flag

* Consolidate template parameters

* Consolidate flag in transformers

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2026-01-08 08:02:02 +01:00
Bartłomiej Kocot
bc497beffb [CK TILE] Fix grouped conv kernels splitk and double lds (#3527) 2026-01-08 07:59:38 +01:00
Bartłomiej Kocot
f449a5faaa Disable fp32 atomic adds on gfx11 (#3510)
* Disable fp32 atomic adds on gfx11

* Fixes is supported
2026-01-07 15:32:04 -08:00
Enrico Degregori
aad4cf0985 Wmma support for gemm_bias_add_reduce (#3316)
* Add tests for gemm_bias_add_reduce

* Initial working implementation

* Generalize implementation of reduce epilogue

* Add tests for all layouts

* Add instances

* Fix test archs

* Fix xdl bug

* Remove library/profiler duplications

* Fix num_byted error profiler

* Fix typos

* Fix copyright
2026-01-07 10:27:16 -08:00
Erwin Terpstra
f9c6ba0403 Implement grouped gemm fastgelu for RDNA4 (#3303)
* Implement grouped gemm fastgelu for RDNA4

* chore: some cleanup and minor inconsistencies in grouped gemm profiler

* chore: clarified logic and reporting of supported instance warnings
2026-01-07 10:20:44 -08:00
John Shumway
a7d6b1e700 Add unit test coverage for conversion to convolution traits (#3515)
Our concept-base conversions are fragile and too complex. We want to refactor to straightforward functions
for each intance trace class template. This change adds unit test coverage to make that refactoring safer.
2026-01-07 07:44:21 -08:00
Johannes Graner
0a474aa62f [CI, CK examples] Disable time_kernel for CI tests and examples (#3464)
* Disable kernel timing in tests

* default time_kernel = false in old CK examples
2026-01-07 16:30:57 +01:00
BrianHarrisonAMD
e8cc75aefb Enable offload-compress for Windows if avaliable (#3521) 2026-01-07 07:05:03 -08:00
Cong Ma
d7497d2694 [CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512)
Refactor function amd_buffer_load_invalid_element_return_zero to avoid
the inefficient ASM code generated by compiler.

Compiler generates suboptimal assembly for ternary operator, causing excessive VGPR usage

Tested compilers:
- Rocm 7.0.1
- Rocm 7.1.1

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-07 00:05:56 -08:00
Khushbu Agarwal
aaa35f0bbf [CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* clean up after merging with develop

* added comments for the tile window and tile distribution encoding

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-06 12:46:59 -08:00
kyle-256
76696ace44 [CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452)
* update grouped_gemm blockwise kernel

* update config

* update kernel

* update examples

* remove test code for now

* sync test files with origin/develop

* update example

* fix code lint

* fix code-lint

* update test code

* run clang format

* run pre-commit

* update api
2026-01-06 12:36:04 -08:00
kensclin
2309c86054 [CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495)
* [CK_TILE] add preshuffleB mode for ABQuant GEMM

* fix precommit error

* use template method call for cvt_scale_to_fp32

* fix precommit error

* add test code

* fix precommit error

* switch abquant  gemmconfig to default

* Add changelog.md

* fix precommit error

* fix conflict
2026-01-06 12:35:01 -08:00
John Shumway
960ef551bf Fix build error from extra comma (#3516)
The newer rocm compiler gives an error with a trailing comma in testing::AllOf.
2026-01-06 11:08:54 -08:00
Illia Silin
2ffbf7f476 add tabulate package to aiter docker (#3519) 2026-01-06 09:36:54 -08:00
Robin Voetter
1c433c64ec [CK_BUILDER] Integrate reference conv with testing (#3511)
* ck-builder: explicitly delete forward declarations

Before, these functions were seen as a forward declaration for an existing function.
If no actual implementation overload could be found, these would be selected and
a linker error or warning would be generated. By marking these functions as explicitly
deleted, they incorrect invocations are generated as compile error instead.

* ck-builder: ckt::run plumbing for reference conv

This implements the ckt::run plumbing for the reference convolution
implementation and sets up the first complete end-to-end test.

* ck-builder: make validation system check for all-zeros

When both the actual and reference output are both all zero bits,
there is probably something wrong in the test framework.

* ck-builder: proper implementation+tests for TensorDescriptor::is_packed

* ck-builder: fix typos
2026-01-06 09:29:06 +01:00
joyeamd
b78563b3d3 Merge some updates for ck_tile headers (#3342)
* fix some issues from internal branch

* update cshuffle_epilogue

* update cshuffle_epilogue

* update cshuffle

* update warp_gemm
2026-01-05 23:39:00 -08:00
joyeamd
2b563ad048 Joye/revise wp pipeline (#3493)
* [CK_TILE] unify double and single lds implementation (#108)

Unify LDS buffer management API for single and double buffering modes

This change consolidates the Local Data Store (LDS) buffer management by:

Merging single and double LDS buffer APIs into a unified interface
Implementing ping-pong address calculation in pipeline when double LDS is enabled
Computing pong buffer addresses dynamically using base address offsets

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update wp_pipeline

* fix a c++17 issue

* update for ci errors

* fix ci issues

* include a header to fix ci errors

* fix some rebase issues

* update with rebase

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-05 13:49:26 -08:00
Estevan Vedovelli
1224bc0a82 Add support to gfx1153 and fix gfx115X WMMA config (#3496)
* Support for gfx115X

* Changes for gfx115X

* Add gfx1153

* Update changelog

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-05 10:03:30 -08:00
Bartłomiej Kocot
bbf0b1a3b3 Fix large tensor grouped conv bwd data test (#3513) 2026-01-05 09:42:02 -08:00
Robin Voetter
e6e7dc2910 [CK_BUILDER] validation (#3471)
This pull request builds on #3267 by proving the "validation" infrastructure, the means to compare a set of `Outputs`.

The design of the validation infrastructure is relatively straight forward:
- Each SIGNATURE should come with a `validate()` implementation, which should be implemented in a similar way that the other functions/types from `testing.hpp` are implemented.
- `validate()` returns a `ValidationReport`, which is a structure that keeps all relevant information about comparing the tensors from two `Outputs`. Note that crucially, `validate()` should not do any reporting by itself. Rather, glue logic should be implemented by the user to turn `ValidationReport` into a relevant error message.
- You can see this clue code for CK-Builder itself in `testing_utils.hpp`, its `MatchesReference()`. This functionality is relatively barebones right now, it will be expanded upon in a different PR to keep the scope of this one down.

The comparison is done on the GPU (using an atomic for now), to keep tests relatively quick. Some notable items from this PR:
- To help compare the tensors and with writing tests, I've written a generic function `tensor_foreach` which invokes a callback on every element of a tensor.
- For that it was useful that the `TensorDescriptor` has a rank which is known at compile-time, so I've changed the implementation of `TensorDescriptor` for that. I felt like it was a better approach than keeping it dynamic, for multiple reasons:
  - This is C++ and we should use static typing where possible and useful. This way, we don't have to implement runtime assertions about the tensor rank.
  - We know already know the rank of tensors statically, as it can be derived from the SIGNATURE.
  - It simpifies the implementation of `tensor_foreach` and other comparison code.
- There are a lot of new tests for validating the validation implementation, validating validation validation tests (Only 3 recursive levels though...). For a few of those functions, I felt like it would be useful to expose them to the user.
- Doc comments everywhere.
2026-01-05 04:57:34 -08:00
Jeff Huang
cc75a1dc5f [FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter

* add is_sglang_layout to  parameters

* add kv_offset_array_transform to batch async for page size 16

* add kv_last_page_lens to kernel

* change kv layout to [num_total_pages, page_block_size, hdim]

* format

* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
  batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)

* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024

* fix python format

* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]

* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcode values
2. Makes batch prefill kernel traits structures inherent from fmha fwd
   traits
3. Add some static check for Page size, vector size, hdim, ..., etc.

* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration

Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.

**Changes:**
*   Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
*   Updated Kernel, Pipeline, and Traits to template on these Enums.
*   Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
*   Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
*   Updated CodeGen scripts to support new parameters.

This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.

* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets is in same page within a tile.

* fix vgpr spill count

* remove unnecessary t2s functions

* add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py

* support linear kv cache layout

* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.

* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.

* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.

* update changelog

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2026-01-05 18:41:47 +08:00
Max Podkorytov
e339101e9c [CK-Tile] move out memory operation from cshuffle epilogue class (#3359)
* initial poc

* factor out common parts in operator()

* cv4

* rest of the universal gemm pipelines

* fix test

* remove boilerplate from tile engine

* fix example

* fix example

* format

* fix tests build for gemm

* remove base pipeline codegen from gemm instance builder

* unify v3 logic with the rest of universal gemm pipelines

* fix build for multi abd test

* fix test gemm multi d

* fix build for weight preshuffle

* fix grouped gemm test

* fix grouped gemm multi d test

* fix grouped gemm preshuffle

* fix grouped gemm example except for quant

* fix gemm preshuffle

* fix splitk 2 stage example

* fix batched gemm example

* fix multid example

* fix multiabd example

* fix batched gemm test

* fixup

* fix examples build

* fix grouped gemm test build

* fix smoke builder

* hacky poc

* fix tile engine

* kill the lambda

* maybe fix test build

* more fixes

* clang-format

* save temp

* clang-format

* mostly fix examples

* clang-format

* remove dead code

* more cleanup

* fix fmha bwd build (default epilogue set/add appears to be broken)

* fix default epilogue tests but not correctness

* clang-format

* fix bquant

* clang-format

* cleanup dead code

* rearrange make windows for readability

* restore changes to IsSupportedArgument

* fix smoke-builder

* clang-format

* fixup rename class

* build fixes

* clang-format

* fix builder

* fixup

* remove set from builder tests

* fix test

* clang-format

* re-refactor the kernels

* clang-format

* fix header license

* remove memory operation from conv bwd test

* clang-format

* clang-format example,include

* clang-format test

* build fixes

* clang-format

* solve compilation error

* fix the CI

* solve compilation error

* clang format

* solve merge conflict

* solve merge conflict

* solve the gfx11 error

* solve test error

* moar build fixes

* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-04 03:28:14 -08:00
John Afaganis
ec23be0b9d Update unsigned long literals and format specifiers to work correctly in Windows (#3483)
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
2026-01-02 22:16:41 -07:00
John Shumway
4670df5ca6 [CK_BUILDER] Remove cmath include (#3508)
Remove the dependency from device_tensor_generator.hpp and fix a typo from a previous force push. The changes replace standard library math functions with their ck::math equivalents and define PI as a local constant instead of computing it using std::acos.

Key changes:

* Removed #include header dependency
* Replaced std::acos(-1.0) with hardcoded PI constant 3.141592653f
* Replaced std::sqrt, std::cos, and std::sin with ck::math equivalents
2026-01-02 16:58:35 -08:00
John Shumway
355ce9230d Remove non-standard M_PI (#3507)
Just use PI=acos(-1.0) as a local static constexpr. This has been causing build issues on windows.
2026-01-02 14:21:46 -08:00
John Shumway
1da340031c Enable math defines for MSVC. (#3503)
The symbol M_PI is breaking the build on Windows.  The _USE_MATH_DEFINES macro enables M_PI and other math constants on Windows. (I'm guessing this is more idomatic than the old trick of using PI=acos(-1.0).)

https://learn.microsoft.com/en-us/cpp/c-runtime-library/math-constants?view=msvc-170

Co-authored-by: BradPepersAMD <Brad.Pepers@amd.com>
2026-01-02 14:36:42 -05:00