Commit Graph

1371 Commits

Author SHA1 Message Date
Sami Aario
ca17ac3358 When possible, use the overload of load_tile_transpose that does not require assignment 2026-01-07 19:44:00 +00:00
Sami Aario
321611081f Remove an unused overload of load_tile_transpose_with_offset 2026-01-07 19:44:00 +00:00
Sami Aario
8fc4030a57 Add an instance of load_tile_transpose that takes a reference to the output tensor as an input 2026-01-07 19:44:00 +00:00
Sami Aario
63a455952a No need to specify DstDataType in load_and_convert_tile as WarpTile knows its DataType 2026-01-07 19:44:00 +00:00
Sami Aario
3d55a1e682 No need to specify SrcDataType in load_and_convert_tile as WarpWindow knows its DataType 2026-01-07 19:44:00 +00:00
Sami Aario
514035e6cf In BQuantGemmPipelineAgBgCrCompV3, always convert BDatatype pk_int4_t to ADataType regardless of BLayout 2026-01-07 19:44:00 +00:00
Sami Aario
9af4498194 Remove the defaults for SrcDataType and DstDataType in GemmPipelineAgBgCrImplBase::GlobalPrefetch 2026-01-07 16:21:32 +00:00
Sami Aario
9633d3f5bb In GetAWindows and GetBWindows, use DataType from LDS tensor view 2026-01-07 16:21:32 +00:00
Sami Aario
9559a93432 Make explicit that the tile window argument to load_tile_with_elementwise and the two load methods it uses are tuples 2026-01-07 16:21:32 +00:00
Sami Aario
cfa11f2d1f Rename InterleavedPKTypeLoader to ConverterLoader, and load_int4_tile to load_and_convert_tile 2026-01-07 16:21:32 +00:00
Sami Aario
3a094e2f8b Include ck_tile/core.hpp in load_interleaved_pk_type.hpp for better IDE integration 2026-01-07 16:21:32 +00:00
Sami Aario
74533b4755 Rename load_interleaved_pk_type to load_and_convert_tile 2026-01-07 16:21:32 +00:00
Sami Aario
994b8f4c22 Minor refactoring of load_interleaved_pk_type 2026-01-07 16:21:32 +00:00
Sami Aario
ca71cd75fc Reduce the scope of KPack in MakeALdsBlockDescriptor 2026-01-07 16:21:32 +00:00
Sami Aario
825d17c3d7 Fix a comment 2026-01-07 16:21:32 +00:00
Sami Aario
bda5a7aa2d Add braces 2026-01-07 16:21:31 +00:00
Sami Aario
969156985b Use decltype for consistency in Interwave variant of BlockGemmImpl 2026-01-07 16:21:31 +00:00
Sami Aario
4d77856be5 Make some functions return void explicitly instead of auto 2026-01-07 16:21:31 +00:00
Cong Ma
d7497d2694 [CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512)
Refactor function amd_buffer_load_invalid_element_return_zero to avoid
the inefficient ASM code generated by the compiler.

The compiler generates suboptimal assembly for the ternary operator, causing excessive VGPR usage.

Tested compilers:
- ROCm 7.0.1
- ROCm 7.1.1

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-07 00:05:56 -08:00
Khushbu Agarwal
aaa35f0bbf [CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and corresponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* clean up after merging with develop

* added comments for the tile window and tile distribution encoding

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
2026-01-06 12:46:59 -08:00
kyle-256
76696ace44 [CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452)
* update grouped_gemm blockwise kernel

* update config

* update kernel

* update examples

* remove test code for now

* sync test files with origin/develop

* update example

* fix code lint

* fix code-lint

* update test code

* run clang format

* run pre-commit

* update api
2026-01-06 12:36:04 -08:00
kensclin
2309c86054 [CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495)
* [CK_TILE] add preshuffleB mode for ABQuant GEMM

* fix precommit error

* use template method call for cvt_scale_to_fp32

* fix precommit error

* add test code

* fix precommit error

* switch abquant gemmconfig to default

* Add changelog.md

* fix precommit error

* fix conflict
2026-01-06 12:35:01 -08:00
joyeamd
b78563b3d3 Merge some updates for ck_tile headers (#3342)
* fix some issues from internal branch

* update cshuffle_epilogue

* update cshuffle_epilogue

* update cshuffle

* update warp_gemm
2026-01-05 23:39:00 -08:00
joyeamd
2b563ad048 Joye/revise wp pipeline (#3493)
* [CK_TILE] unify double and single lds implementation (#108)

Unify LDS buffer management API for single and double buffering modes

This change consolidates the Local Data Store (LDS) buffer management by:

Merging single and double LDS buffer APIs into a unified interface
Implementing ping-pong address calculation in pipeline when double LDS is enabled
Computing pong buffer addresses dynamically using base address offsets

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update wp_pipeline

* fix a c++17 issue

* update for ci errors

* fix ci issues

* include a header to fix ci errors

* fix some rebase issues

* update with rebase

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-05 13:49:26 -08:00
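The ping-pong address calculation described in commit 2b563ad048 can be pictured with a minimal sketch. The struct, its field names, and the byte-offset layout are assumptions made for illustration only; they are not the ck_tile API. The idea shown is that a single interface serves both modes, and the pong buffer address is derived from the base address plus one buffer size.

```cpp
#include <cstddef>

// Hypothetical illustration; none of these names come from ck_tile.
struct LdsBuffersSketch
{
    std::size_t base_offset;  // byte offset of the first (ping) buffer
    std::size_t buffer_bytes; // size of one buffer in bytes
    bool double_buffered;     // true when double LDS is enabled

    // Offset of the buffer used by pipeline iteration `iter`.
    constexpr std::size_t offset_for(std::size_t iter) const
    {
        // Single buffering always uses slot 0. Double buffering alternates
        // between slot 0 (ping) and slot 1 (pong = base + buffer_bytes).
        const std::size_t slot = double_buffered ? (iter % 2) : 0;
        return base_offset + slot * buffer_bytes;
    }
};

// Compile-time checks of the sketch.
static_assert(LdsBuffersSketch{256, 1024, true}.offset_for(1) == 1280, "pong follows ping");
static_assert(LdsBuffersSketch{256, 1024, false}.offset_for(1) == 256, "single buffer reuses ping");
```

Computing the pong offset on demand means callers do not need a second base pointer, which is one plausible reason to merge the two APIs.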
Estevan Vedovelli
1224bc0a82 Add support to gfx1153 and fix gfx115X WMMA config (#3496)
* Support for gfx115X

* Changes for gfx115X

* Add gfx1153

* Update changelog

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-05 10:03:30 -08:00
Jeff Huang
cc75a1dc5f [FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter

* add is_sglang_layout to  parameters

* add kv_offset_array_transform to batch async for page size 16

* add kv_last_page_lens to kernel

* change kv layout to [num_total_pages, page_block_size, hdim]

* format

* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
  batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)

* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024

* fix python format

* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]

* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hard-coded values
2. Make batch prefill kernel traits structures inherit from fmha fwd
   traits
3. Add some static checks for page size, vector size, hdim, etc.

* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration

Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.

**Changes:**
*   Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
*   Updated Kernel, Pipeline, and Traits to template on these Enums.
*   Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
*   Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
*   Updated CodeGen scripts to support new parameters.

This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.

* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets are in the same page within a tile.

* fix vgpr spill count

* remove unnecessary t2s functions

* add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py

* support linear kv cache layout

* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.

* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.

* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.

* update changelog

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2026-01-05 18:41:47 +08:00
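The refactor in commit cc75a1dc5f replaces a single boolean with two enums so that memory layout and lookup-table style vary independently. A minimal sketch of that pattern follows. The enum names and enumerators are taken from the commit message; the exact declarations, the traits struct, and the offset helper are assumptions for illustration and do not reproduce the repository code.

```cpp
#include <cstddef>

// Enum names and enumerators as listed in the commit message; the
// `enum class` form is an assumption.
enum class BlockAttentionKVCacheMemoryLayoutEnum { VECTORIZED, LINEAR };
enum class BlockAttentionKVCacheLookupTableEnum { SGLANG_1D, VLLM_2D };

// Hypothetical traits type templated on both enums. With one boolean only
// two of the four combinations could be expressed.
template <BlockAttentionKVCacheMemoryLayoutEnum kKVMemoryLayout,
          BlockAttentionKVCacheLookupTableEnum kKVLookupTable>
struct BatchPrefillTraitsSketch
{
    static constexpr bool kIsLinearLayout =
        kKVMemoryLayout == BlockAttentionKVCacheMemoryLayoutEnum::LINEAR;
    static constexpr bool kUses2DLookupTable =
        kKVLookupTable == BlockAttentionKVCacheLookupTableEnum::VLLM_2D;
};

// Hypothetical helper: element offset of the first value in a row for a
// cache stored as [num_total_pages, page_block_size, hdim].
constexpr std::size_t paged_row_offset(std::size_t physical_page,
                                       std::size_t row_in_page,
                                       std::size_t page_block_size,
                                       std::size_t hdim)
{
    return (physical_page * page_block_size + row_in_page) * hdim;
}

using LinearVllmSketch =
    BatchPrefillTraitsSketch<BlockAttentionKVCacheMemoryLayoutEnum::LINEAR,
                             BlockAttentionKVCacheLookupTableEnum::VLLM_2D>;
static_assert(LinearVllmSketch::kIsLinearLayout && LinearVllmSketch::kUses2DLookupTable,
              "both properties are selected independently");
```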
Max Podkorytov
e339101e9c [CK-Tile] move out memory operation from cshuffle epilogue class (#3359)
* initial poc

* factor out common parts in operator()

* cv4

* rest of the universal gemm pipelines

* fix test

* remove boilerplate from tile engine

* fix example

* fix example

* format

* fix tests build for gemm

* remove base pipeline codegen from gemm instance builder

* unify v3 logic with the rest of universal gemm pipelines

* fix build for multi abd test

* fix test gemm multi d

* fix build for weight preshuffle

* fix grouped gemm test

* fix grouped gemm multi d test

* fix grouped gemm preshuffle

* fix grouped gemm example except for quant

* fix gemm preshuffle

* fix splitk 2 stage example

* fix batched gemm example

* fix multid example

* fix multiabd example

* fix batched gemm test

* fixup

* fix examples build

* fix grouped gemm test build

* fix smoke builder

* hacky poc

* fix tile engine

* kill the lambda

* maybe fix test build

* more fixes

* clang-format

* save temp

* clang-format

* mostly fix examples

* clang-format

* remove dead code

* more cleanup

* fix fmha bwd build (default epilogue set/add appears to be broken)

* fix default epilogue tests but not correctness

* clang-format

* fix bquant

* clang-format

* cleanup dead code

* rearrange make windows for readability

* restore changes to IsSupportedArgument

* fix smoke-builder

* clang-format

* fixup rename class

* build fixes

* clang-format

* fix builder

* fixup

* remove set from builder tests

* fix test

* clang-format

* re-refactor the kernels

* clang-format

* fix header license

* remove memory operation from conv bwd test

* clang-format

* clang-format example,include

* clang-format test

* build fixes

* clang-format

* solve compilation error

* fix the CI

* solve compilation error

* clang format

* solve merge conflict

* solve merge conflict

* solve the gfx11 error

* solve test error

* moar build fixes

* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2026-01-04 03:28:14 -08:00
John Afaganis
ec23be0b9d Update unsigned long literals and format specifiers to work correctly in Windows (#3483)
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
2026-01-02 22:16:41 -07:00
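The LP64 versus LLP64 issue described in commit ec23be0b9d can be illustrated with a short sketch. The helper name and the constant are hypothetical and not taken from the repository; only the standard `<cstdint>` and `<cinttypes>` facilities are assumed.

```cpp
#include <cassert>
#include <cinttypes> // PRIu64, UINT64_C
#include <cstdint>
#include <cstdio>
#include <string>

// UINT64_C yields a literal that is at least 64 bits wide on every platform.
// `1UL << 32` would shift a 32-bit value by its full width on LLP64 targets
// such as Windows, which is undefined behavior.
constexpr std::uint64_t four_gib = UINT64_C(1) << 32;

// Hypothetical helper: formats a 64-bit unsigned value portably.
inline std::string format_u64(std::uint64_t value)
{
    char buffer[32];
    // PRIu64 expands to the correct conversion specifier for uint64_t,
    // unlike a hard-coded "%lu", which expects a 32-bit argument on Windows.
    std::snprintf(buffer, sizeof(buffer), "%" PRIu64, value);
    return std::string(buffer);
}

inline void check_format_u64()
{
    assert(format_u64(four_gib) == "4294967296");
}
```

Using fixed-width types moves the width guarantee into the type itself, so the code no longer depends on how a given platform defines `long`.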
John Shumway
4670df5ca6 [CK_BUILDER] Remove cmath include (#3508)
Remove the <cmath> dependency from device_tensor_generator.hpp and fix a typo from a previous force push. The changes replace standard library math functions with their ck::math equivalents and define PI as a local constant instead of computing it using std::acos.

Key changes:

* Removed #include <cmath> header dependency
* Replaced std::acos(-1.0) with hardcoded PI constant 3.141592653f
* Replaced std::sqrt, std::cos, and std::sin with ck::math equivalents
2026-01-02 16:58:35 -08:00
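As a rough illustration of the approach in commit 4670df5ca6, a hard-coded constant removes the need for any math header. The constant value is the one quoted in the commit message; the namespace and the conversion function are hypothetical and exist only to show the constant in use.

```cpp
// Hypothetical illustration; these names are not from device_tensor_generator.hpp.
namespace pi_sketch {

// Hard-coded single-precision constant, as quoted in the commit message.
// No header is required, and the value is usable in constant expressions.
constexpr float PI = 3.141592653f;

// Example use of the constant without calling into a math library.
constexpr float degrees_to_radians(float degrees)
{
    return degrees * (PI / 180.0f);
}

static_assert(PI > 3.14159f && PI < 3.14160f, "PI constant out of range");

} // namespace pi_sketch
```

Unlike `std::acos(-1.0)`, which is not required to be usable in a constant expression before C++26, a literal works in `constexpr` contexts on every compiler.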
John Shumway
355ce9230d Remove non-standard M_PI (#3507)
Just use PI=acos(-1.0) as a local static constexpr. M_PI has been causing build issues on Windows.
2026-01-02 14:21:46 -08:00
John Shumway
1da340031c Enable math defines for MSVC. (#3503)
The symbol M_PI is breaking the build on Windows. The _USE_MATH_DEFINES macro enables M_PI and other math constants on Windows. (I'm guessing this is more idiomatic than the old trick of using PI=acos(-1.0).)

https://learn.microsoft.com/en-us/cpp/c-runtime-library/math-constants?view=msvc-170

Co-authored-by: BradPepersAMD <Brad.Pepers@amd.com>
2026-01-02 14:36:42 -05:00
Ville Pietilä
6e8c401e33 [CK_BUILDER] Instance traits for conv bwd weight algorithms (#3498)
Added instance traits for the following bwd weight conv algorithms

DeviceGroupedConvBwdWeight_Xdl_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Xdl_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_DL
DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
Also added unit tests for the instance traits of those bwd weight algorithms that are currently exposed by the narrow CK build for MIOpen.
---------

Co-authored-by: Ville Pietilä <>
2025-12-31 15:41:15 -08:00
kabrahamAMD
f86bbb1aef [CK_Builder] [testing] Integrate device random generators (#3427)
Implemented device random number generators for ck tensors.
Includes tests and integration to ck builder testing interface.
2025-12-30 10:03:05 -08:00
Bartłomiej Kocot
2b8302eb6d Fix grouped conv wrw kernels names (#3494) 2025-12-30 16:45:39 +01:00
ApoorvaKalyani
53a1e4f551 Grouped convolution backward data WMMA v3 implementation (#3460)
* Added device level implementation for bwd_data_wmma_v3.

* Added first instance of bwd_data_wmma_v3(f16).

* Add support for bwd data in gridwise implementation

Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight

* Initial device implementation of bwd data

* Remove unused template parameters in device impl

* Add one instance for different layout

initial check of device implementation

* Add tests for splitk and for different layouts

* Appended more instances to wmma_v3_f16.

* Added conv_2d bf16 wmma_v3 instances.

* Added conv_3d_bf16 wmma_v3_instances.

* Added conv_3d_f16_wmma_v3_instances.

* Added SplitN test cases for wmma.

* Conv3d_bwd_data_scale_wmma_v3 instances.

* Conv3d_bwd_data_bilinear_wmma_v3_instances

* Renaming the device level instances file to a common name, since it is defined for different DataTypes.

* Renaming the instances and fixing typo

* Added the test cases to regression test list

* NCHW support for wmma_v3

* Examples for bf16 and f16 bwd_data_wmma_v3

* Added transpose conditions for device impl

* fixing bugs

* Added the gemm_args array implementation

* WIP debug conv bwd

* fix splitk

* Grouped gemm fix

* Update CmakeLists with EOF

* Added more instances for tests

* Fixed the run time error in examples and removed 3d conv examples.

* Fixed a typo.

* Updated CMakeLists to remove the deleted 3d convolution files

* Added print error statements for unsupported argument

* Added the merge conflict related changes

* Fixed compilation error

* Fixed the InstanceFactory duplication error.

* Removed the print statements and added logs to Arg function

* All the merge conflict related errors resolved

* Added d_tensor tests.

* Added the missing example types of wmma_v3

* Merge error fix

* Corrected the instance name

* Reverted the bias relu change

* Reverted the transpose load local change

* Updated the regression test list with bwd_data_scale

* Revert "Reverted the transpose load local change"

This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.

* Revert "Merge error fix"

This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.

* Reverting the local change

* Added merge error fix

* Build error fix due to merge conflicts

* Added bias_relu example for wmma_v3

* Modified the main method in dtensor tests

* Updated the dtensor tests to pick all the shapes

* Updated the dtensor test shapes.

* Updated the mem operations in tests.

* Added reference func

* Fixed typos in device impl

* Added new header file and modified the include file for 3d tests

* Renamed the test file and added reference func call.

* clang format fix

* Added ignore params

* Modified device impl and tests

* Removed debug print statements and updated dtensor test shapes

* Fixing merge conflicts

* Fixing more merge conflicts

* Fixed copyrights

* Updated the tuned instances to bilinear and scale.

* Adding tuned instances to vanilla wmma_v3

* Removed all unused instances and modified test layouts.

* Cleaned up all instances, reverted back fwd fp16 instances and updated tuned fp16 instances.

* Fix clang format

* Updated tuned f16/-generic instances

* Formatting the instances file

* Fixed copyrights and clang issues

* Nonsense commit to force git to force

* Removed the transpose instances

* Added verified generic instances

* Fixing namespace errors

* Added todo for failing shapes

* Formatting instance file

* Fix instance list formatting

* Removing unnecessary formats

* Renamed the common file

* Unification of xdl and wmma bwd_data tests

* Updated Cmake

* Added all layout types and deleted code.

* Updated Cmake to add the condition to all tests.

---------

Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
2025-12-30 16:25:08 +01:00
yadaish
dae85ead64 [CK_TILE] support split-k a16w4 gemm1 (#3389)
* initial version to support moe gemm1 split-k

* add missing args

* fix build warning

* update reference

* for split-k disable bias and weight

* remove debug log

* fix format

* fix div by zero errors

* fix cmake config

* update

* resolve conflicts

* remove useless changes

* reformat

* fix

* remove useless changes

* fix ci

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>
2025-12-29 23:05:35 +08:00
Kiefer van Teutem
88ae445580 Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470)
* Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests.

* Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong.
2025-12-29 12:58:29 +01:00
Yi DING
b0ea67e377 [CK_TILE] MX FLATMM Fix M Padding (#3489)
* Fix M Padding

* Fix tensor desc ele space size
2025-12-29 09:09:12 +08:00
Erwin Terpstra
e08efa551f [CK_TILE] Grouped gemm quant tensor layouts (#3414)
* feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time

* chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times

* chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems

* fix: missing check for transposed load in bquant pipeline

* chore: lower unit test tensors dimensions a bit for faster tests

* chore: set grouped gemm example M/N warp tile to 16

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-12-24 23:01:23 -08:00
Illia Silin
14668a56e3 remove the LLVM_MAIN_REVISION usage (#3487) 2025-12-24 16:49:35 -08:00
kensclin
7f68f3c4fa Enable padding blockscale for abquant (#3453)
* Enable padding blockscale for abquant

* run clang-format

* Reduce unnecessary testing

* remove cout
2025-12-24 09:12:40 -08:00
Po Yen Chen
1c3151963b [CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425)
* Add fp8bf16 support for batch_prefill

* Fix wrong scale_s re-compute logic in batch_prefill

* Fix wrong scale_s re-compute logic in fmha fwd

* Fix batch_prefill codegen error

* Remove no-longer used GetName() function

* Add fp8 logits=True instances

* Update CHANGELOG.md
2025-12-24 10:34:06 +08:00
jakpiase
c0797c1671 [CK_TILE] Minor splitk bugfix for gemms and conv (#3387)
* fix for splitk if splitk < grid

* add different splitk implementation

* minor bugfix for streamk gemm

* Add test

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-12-24 00:10:13 +01:00
Johannes Graner
e1381d6a71 [CK grouped gemm] Fix grouped gemm two stage HasMainK0BlockLoop (#3466)
* Re-enable two stage kernel

* Only disable on HasMainKBlockLoop mismatch

* Address PR comments
2025-12-23 11:33:09 +01:00
jakpiase
ead81d1b0b [CK_TILE] Add splitk support to ck tile conv bwd data (#3353)
* add splitk support to ck tile conv bwd data

* add reviewers suggestions

* minor fix

* removed splitkbatchoffset struct
2025-12-23 10:03:42 +01:00
Lyu, Xudong
8b73633e65 fix: handle void return type in TailHandler error path with ROCm6 compiler (clang++) (#3477)
Replace `decltype(TailHandler<>(...)){}` with direct function call
to fix compilation error when return type is void.

Co-authored-by: Yi DING <yi.ding@amd.com>
2025-12-23 15:03:18 +08:00
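The fix in commit 8b73633e65 can be sketched as follows. `TailNumber`, `dispatch_tail`, and the handlers are hypothetical stand-ins; the real `TailHandler` has a different signature. The sketch assumes the original error path tried to return a value-initialized object of the handler's return type, which spells `void{}` when the handler returns void and is rejected by some compilers.

```cpp
#include <cassert>

// Hypothetical stand-ins for the pipeline's tail dispatch.
enum class TailNumber { Odd, Even };

template <typename Handler>
auto dispatch_tail(Handler&& handler, int tail_count)
{
    if(tail_count == 1)
        return handler(TailNumber::Odd);
    if(tail_count == 2)
        return handler(TailNumber::Even);

    // Error path. Writing `return decltype(handler(TailNumber::Odd)){};`
    // here fails to compile on some compilers when the handler returns void.
    // Returning the result of a direct call has the right type in both the
    // void and the non-void case.
    return handler(TailNumber::Odd);
}

inline void example()
{
    int calls = 0;
    auto void_handler = [&calls](TailNumber) { ++calls; };
    dispatch_tail(void_handler, 3); // error path with a void return type
    assert(calls == 1);

    auto int_handler = [](TailNumber t) { return t == TailNumber::Even ? 2 : 1; };
    assert(dispatch_tail(int_handler, 2) == 2);
    assert(dispatch_tail(int_handler, 3) == 1);
}
```

A `return` statement whose operand is a void expression is valid in a function whose deduced return type is void, which is why the direct call covers both cases.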
Bartłomiej Kocot
2955d77f3c Fix grouped conv fwd wmma porting (#3479)
* Fix grouped conv fwd wmma porting

* add more limitations
2025-12-22 21:32:48 +01:00
Jan Patrick Lehr
9bd67c2cf2 [CK-TILE] Guard against compiler lexer diagnostic (#3444)
* [CK-TILE] Guard against compiler lexer diagnostic

A recent change to Clang added a lexer-level diagnostic about that C2y
language feature. Since that is lexer level, the `__extension__`
compiler built-in does not work as it is only respected *after* the
lexer when parsing.

This change adds guarding pragmas to disable the diagnostic in the
lexer and not lead to warnings being treated as errors.

* Fixing still existing build issue

Once the one warning was removed, another one popped up. Both are
related to the same C2y feature. Thus, ignoring both.

* clang-format handling

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-12-19 17:32:20 -08:00
Bartłomiej Kocot
cbc8335964 Improve XDL to WMMA porting for grouped conv fwd (#3456)
Refactors the way the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, especially to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles. 
The changes use MXdlPerWave instead of NXdlPerWave to increase the number of waves along the M dimension.
2025-12-19 15:58:51 -07:00
Illia Silin
2d9c962e2c get LLVM_MAIN_REVISION macro from compiler header (#3469) 2025-12-19 14:57:12 -08:00