Commit Graph

2900 Commits

Author SHA1 Message Date
Sami Aario
d75d38bf05 Add DstDataType as a template parameter to load_tile_with_elementwise, and use it for type conversion 2026-01-08 07:53:33 +00:00
Sami Aario
2e798d15e1 Add functionality and tests for fp16 x fp8 and fp8 x fp16 2026-01-08 07:53:33 +00:00
Sami Aario
7fdf8222c2 Add functionality and tests for bf16 x fp8 and fp8 x bf16 2026-01-08 07:53:33 +00:00
Sami Aario
0b0ddf1a38 Add MFMA warp gemm for float, float, float, 32, 32, 16 2026-01-08 07:53:33 +00:00
Sami Aario
7bb452d9b8 Refactor type conversions out of MakeBLdsBlockDescriptor, WIP! 2026-01-08 07:53:33 +00:00
Sami Aario
fc82ebc174 Introduce DetermineWarpPrecType for determining warp GEMM precision types 2026-01-08 07:53:33 +00:00
SamiAario-AMD
e62c96f1dd Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-08 09:48:15 +02:00
Thrupti Raj Lakshmana Gowda
770a14494e Removing memop from chshuffle (#3530) 2026-01-07 23:34:43 -08:00
SamiAario-AMD
0a4388d4cc Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-08 09:08:21 +02:00
Johannes Graner
ee2c35b92d [CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169)
* Take split_k into account when checking 2GB tensor limit.

* Revert "Take split_k into account when checking 2GB tensor limit."

This reverts commit adf35c91be.

* Optimize grouped conv bwd wei split_k off calc

(cherry picked from commit 6f61dd56c5)

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

(cherry picked from commit b33877c10f)

* Fix tensor descriptors and stride calculations

* Don't miss half of the elements

* Fix buffer size calculations

* Disable hack if stride not divisible by k_batch

* Clean up comments

* Disallow hack in non-contiguous edge cases

* Index -> Dim

* Fix broken test

* Refactor applicability checks into separate function

* fix missed variable name

* Fix variable name in info print

* update V3 2GB check

* No more regression, use templates instead

* Code deduplication

* Regression fix for cshuffle

* arch-guarded atomic_add implementations for gfx11

* Similar for half(4|8)_t as well

* Only use both offset hacks at the same time

* Revert "arch-guarded atomic_add implementations for gfx11"

This reverts commit 3883fe6935.
This reverts commit 5311ec608d.

* Reapply "arch-guarded atomic_add implementations for gfx11"

This reverts commit 1972adeddc.

* Only remove float4 atomic_add

* Refactor to single flag

* Consolidate template parameters

* Consolidate flag in transformers

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2026-01-08 08:02:02 +01:00
Bartłomiej Kocot
bc497beffb [CK TILE] Fix grouped conv kernels splitk and double lds (#3527) 2026-01-08 07:59:38 +01:00
Bartłomiej Kocot
f449a5faaa Disable fp32 atomic adds on gfx11 (#3510)
* Disable fp32 atomic adds on gfx11

* Fixes is supported
2026-01-07 15:32:04 -08:00
SamiAario-AMD
b91efe5b07 Merge branch 'develop' into LWPCK-3549-cleanups 2026-01-07 21:44:58 +02:00
Sami Aario
2edd077b50 Adjust whitespace with clang-format 2026-01-07 19:44:00 +00:00
Sami Aario
ca17ac3358 When possible, use the overload of load_tile_transpose that does not require assignment 2026-01-07 19:44:00 +00:00
Sami Aario
321611081f Remove an unused overload of load_tile_transpose_with_offset 2026-01-07 19:44:00 +00:00
Sami Aario
8fc4030a57 Add an instance of load_tile_transpose that takes a reference to the output tensor as an input 2026-01-07 19:44:00 +00:00
Sami Aario
63a455952a No need to specify DstDataType in load_and_convert_tile as WarpTile knows its DataType 2026-01-07 19:44:00 +00:00
Sami Aario
3d55a1e682 No need to specify SrcDataType in load_and_convert_tile as WarpWindow knows its DataType 2026-01-07 19:44:00 +00:00
Sami Aario
514035e6cf In BQuantGemmPipelineAgBgCrCompV3, always convert BDatatype pk_int4_t to ADataType regardless of BLayout 2026-01-07 19:44:00 +00:00
Enrico Degregori
aad4cf0985 Wmma support for gemm_bias_add_reduce (#3316)
* Add tests for gemm_bias_add_reduce

* Initial working implementation

* Generalize implementation of reduce epilogue

* Add tests for all layouts

* Add instances

* Fix test archs

* Fix xdl bug

* Remove library/profiler duplications

* Fix num_byted error profiler

* Fix typos

* Fix copyright
2026-01-07 10:27:16 -08:00
Erwin Terpstra
f9c6ba0403 Implement grouped gemm fastgelu for RDNA4 (#3303)
* Implement grouped gemm fastgelu for RDNA4

* chore: some cleanup and minor inconsistencies in grouped gemm profiler

* chore: clarified logic and reporting of supported instance warnings
2026-01-07 10:20:44 -08:00
Sami Aario
9af4498194 Remove the defaults for SrcDataType and DstDataType in GemmPipelineAgBgCrImplBase::GlobalPrefetch 2026-01-07 16:21:32 +00:00
Sami Aario
9633d3f5bb In GetAWindows and GetBWindows, use DataType from LDS tensor view 2026-01-07 16:21:32 +00:00
Sami Aario
9559a93432 Make explicit that the tile window argument to load_tile_with_elementwise and the two load methods it uses are tuples 2026-01-07 16:21:32 +00:00
Sami Aario
cfa11f2d1f Rename InterleavedPKTypeLoader to ConverterLoader, and load_int4_tile to load_and_convert_tile 2026-01-07 16:21:32 +00:00
Sami Aario
3a094e2f8b Include ck_tile/core.hpp in load_interleaved_pk_type.hpp for better IDE integration 2026-01-07 16:21:32 +00:00
Sami Aario
74533b4755 Rename load_interleaved_pk_type to load_and_convert_tile 2026-01-07 16:21:32 +00:00
Sami Aario
994b8f4c22 Minor refactoring of load_interleaved_pk_type 2026-01-07 16:21:32 +00:00
Sami Aario
ca71cd75fc Reduce the scope of KPack in MakeALdsBlockDescriptor 2026-01-07 16:21:32 +00:00
Sami Aario
825d17c3d7 Fix a comment 2026-01-07 16:21:32 +00:00
Sami Aario
bda5a7aa2d Add braces 2026-01-07 16:21:31 +00:00
Sami Aario
969156985b Use decltype for consistency in Interwave variant of BlockGemmImpl 2026-01-07 16:21:31 +00:00
Sami Aario
4d77856be5 Make some functions return void explicitly instead of auto 2026-01-07 16:21:31 +00:00
John Shumway
a7d6b1e700 Add unit test coverage for conversion to convolution traits (#3515)
Our concept-base conversions are fragile and too complex. We want to refactor to straightforward functions
for each intance trace class template. This change adds unit test coverage to make that refactoring safer.
2026-01-07 07:44:21 -08:00
Johannes Graner
0a474aa62f [CI, CK examples] Disable time_kernel for CI tests and examples (#3464)
* Disable kernel timing in tests

* default time_kernel = false in old CK examples
2026-01-07 16:30:57 +01:00
BrianHarrisonAMD
e8cc75aefb Enable offload-compress for Windows if avaliable (#3521) 2026-01-07 07:05:03 -08:00
Cong Ma
d7497d2694 [CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512)
Refactor function amd_buffer_load_invalid_element_return_zero to avoid
the inefficient ASM code generated by compiler.

Compiler generates suboptimal assembly for ternary operator, causing excessive VGPR usage

Tested compilers:
- Rocm 7.0.1
- Rocm 7.1.1

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-07 00:05:56 -08:00
Khushbu Agarwal
aaa35f0bbf [CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and correcponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* clean up after merging with develop

* added comments for the tile window and tile distribution encoding

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-06 12:46:59 -08:00
kyle-256
76696ace44 [CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452)
* update grouped_gemm blockwise kernel

* update config

* update kernel

* update examples

* remove test code for now

* sync test files with origin/develop

* update example

* fix code lint

* fix code-lint

* update test code

* run clang format

* run pre-commit

* update api
2026-01-06 12:36:04 -08:00
kensclin
2309c86054 [CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495)
* [CK_TILE] add preshuffleB mode for ABQuant GEMM

* fix precommit error

* use template method call for cvt_scale_to_fp32

* fix precommit error

* add test code

* fix precommit error

* switch abquant  gemmconfig to default

* Add changelog.md

* fix precommit error

* fix conflict
2026-01-06 12:35:01 -08:00
John Shumway
960ef551bf Fix build error from extra comma (#3516)
The newer rocm compiler gives an error with a trailing comma in testing::AllOf.
2026-01-06 11:08:54 -08:00
Illia Silin
2ffbf7f476 add tabulate package to aiter docker (#3519) 2026-01-06 09:36:54 -08:00
Robin Voetter
1c433c64ec [CK_BUILDER] Integrate reference conv with testing (#3511)
* ck-builder: explicitly delete forward declarations

Before, these functions were seen as a forward declaration for an existing function.
If no actual implementation overload could be found, these would be selected and
a linker error or warning would be generated. By marking these functions as explicitly
deleted, they incorrect invocations are generated as compile error instead.

* ck-builder: ckt::run plumbing for reference conv

This implements the ckt::run plumbing for the reference convolution
implementation and sets up the first complete end-to-end test.

* ck-builder: make validation system check for all-zeros

When both the actual and reference output are both all zero bits,
there is probably something wrong in the test framework.

* ck-builder: proper implementation+tests for TensorDescriptor::is_packed

* ck-builder: fix typos
2026-01-06 09:29:06 +01:00
joyeamd
b78563b3d3 Merge some updates for ck_tile headers (#3342)
* fix some issues from internal branch

* update cshuffle_epilogue

* update cshuffle_epilogue

* update cshuffle

* update warp_gemm
2026-01-05 23:39:00 -08:00
joyeamd
2b563ad048 Joye/revise wp pipeline (#3493)
* [CK_TILE] unify double and single lds implementation (#108)

Unify LDS buffer management API for single and double buffering modes

This change consolidates the Local Data Store (LDS) buffer management by:

Merging single and double LDS buffer APIs into a unified interface
Implementing ping-pong address calculation in pipeline when double LDS is enabled
Computing pong buffer addresses dynamically using base address offsets

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update wp_pipeline

* fix a c++17 issue

* update for ci errors

* fix ci issues

* include a header to fix ci errors

* fix some rebase issues

* update with rebase

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-05 13:49:26 -08:00
Estevan Vedovelli
1224bc0a82 Add support to gfx1153 and fix gfx115X WMMA config (#3496)
* Support for gfx115X

* Changes for gfx115X

* Add gfx1153

* Update changelog

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-05 10:03:30 -08:00
Bartłomiej Kocot
bbf0b1a3b3 Fix large tensor grouped conv bwd data test (#3513) 2026-01-05 09:42:02 -08:00
Robin Voetter
e6e7dc2910 [CK_BUILDER] validation (#3471)
This pull request builds on #3267 by proving the "validation" infrastructure, the means to compare a set of `Outputs`.

The design of the validation infrastructure is relatively straight forward:
- Each SIGNATURE should come with a `validate()` implementation, which should be implemented in a similar way that the other functions/types from `testing.hpp` are implemented.
- `validate()` returns a `ValidationReport`, which is a structure that keeps all relevant information about comparing the tensors from two `Outputs`. Note that crucially, `validate()` should not do any reporting by itself. Rather, glue logic should be implemented by the user to turn `ValidationReport` into a relevant error message.
- You can see this clue code for CK-Builder itself in `testing_utils.hpp`, its `MatchesReference()`. This functionality is relatively barebones right now, it will be expanded upon in a different PR to keep the scope of this one down.

The comparison is done on the GPU (using an atomic for now), to keep tests relatively quick. Some notable items from this PR:
- To help compare the tensors and with writing tests, I've written a generic function `tensor_foreach` which invokes a callback on every element of a tensor.
- For that it was useful that the `TensorDescriptor` has a rank which is known at compile-time, so I've changed the implementation of `TensorDescriptor` for that. I felt like it was a better approach than keeping it dynamic, for multiple reasons:
  - This is C++ and we should use static typing where possible and useful. This way, we don't have to implement runtime assertions about the tensor rank.
  - We know already know the rank of tensors statically, as it can be derived from the SIGNATURE.
  - It simpifies the implementation of `tensor_foreach` and other comparison code.
- There are a lot of new tests for validating the validation implementation, validating validation validation tests (Only 3 recursive levels though...). For a few of those functions, I felt like it would be useful to expose them to the user.
- Doc comments everywhere.
2026-01-05 04:57:34 -08:00
Jeff Huang
cc75a1dc5f [FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter

* add is_sglang_layout to  parameters

* add kv_offset_array_transform to batch async for page size 16

* add kv_last_page_lens to kernel

* change kv layout to [num_total_pages, page_block_size, hdim]

* format

* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
  batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)

* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024

* fix python format

* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]

* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcode values
2. Makes batch prefill kernel traits structures inherent from fmha fwd
   traits
3. Add some static check for Page size, vector size, hdim, ..., etc.

* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration

Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.

**Changes:**
*   Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
*   Updated Kernel, Pipeline, and Traits to template on these Enums.
*   Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
*   Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
*   Updated CodeGen scripts to support new parameters.

This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.

* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets is in same page within a tile.

* fix vgpr spill count

* remove unnecessary t2s functions

* add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py

* support linear kv cache layout

* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.

* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.

* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.

* update changelog

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2026-01-05 18:41:47 +08:00