Commit Graph

2880 Commits

Author SHA1 Message Date
Aviral Goel
8dceee271e WIP: extract MakeALdsDescriptor() from child to parent class for code readability (#3392)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 5aaa031350]
2026-01-12 09:51:58 -08:00
Aviral Goel
23a1768487 refactor: remove Default scheduler implementation as it is not used anymore (#3542)
* refactor: remove Default scheduler implementation as it is not used anymore

* refactor: remove dead code from gemm universal kernel

* chore: add descriptive comments about amd intrinsic hardware sync instructions (see the sketch after this list)

* fix: label existing memory pipeline for aquant as intrawave
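
For context on the hardware-sync instructions those comments describe, here is a minimal sketch using clang's AMDGCN builtins (illustrative only; the actual comments and call sites live in the kernel sources):

```cpp
#include <hip/hip_runtime.h>

// Minimal sketch of the AMD hardware-sync intrinsics the comments describe
// (clang AMDGCN builtins, usable in HIP device code).
__device__ void workgroup_sync()
{
    // s_waitcnt 0: wait for all outstanding memory/LDS counters to drain
    // before other waves may observe this wave's writes.
    __builtin_amdgcn_s_waitcnt(0);

    // s_barrier: hold every wave in the workgroup here until all arrive,
    // making prior LDS writes visible across the whole workgroup.
    __builtin_amdgcn_s_barrier();

    // sched_barrier(0): forbid the compiler scheduler from moving any
    // instruction across this point, pinning the ordering above.
    __builtin_amdgcn_sched_barrier(0);
}
```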

[ROCm/composable_kernel commit: e809861d49]
2026-01-12 09:51:06 -08:00
Johannes Graner
32e0beb399 [CK profiler] Perform verification on GPU when using GPU reference (#3482)
* Simple verification kernel for ckProfiler

* Verification kernel unit tests

* Explicit synchronization

* Address review comments

[ROCm/composable_kernel commit: 18c2ff6019]
2026-01-12 12:12:41 +01:00
kabrahamAMD
529fbdc771 addressed review comments from PR3459 (#3526)
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>

[ROCm/composable_kernel commit: 20f66c1e6b]
2026-01-12 09:47:00 +01:00
Robin Voetter
61e6e155b0 ck-builder: tensor input/output reflection (#3536)
This adds some utilities to automatically generate UniqueInputs,
UniqueOutputs, alloc_inputs, alloc_outputs, and validate, based
on Inputs::reflect() and Outputs::reflect().
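
The PR body doesn't show the pattern itself; one common C++ way to get this kind of member reflection is a `reflect()` that returns a tuple of references which generic code walks with `std::apply`. A minimal sketch under that assumption (all names here are hypothetical, not the actual ck-builder API):

```cpp
#include <cstddef>
#include <cstdio>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for ck-builder types; not the real API.
struct Tensor { std::vector<float> data; };

struct Inputs
{
    Tensor a, b;
    // Expose the members as a tuple of references so generic utilities
    // (alloc_inputs, validate, ...) need no per-signature code.
    auto reflect() { return std::tie(a, b); }
};

// Generic allocation driven purely by reflection.
template <typename Io>
void alloc_all(Io& io, std::size_t elements)
{
    std::apply([&](auto&... tensors) { (tensors.data.resize(elements), ...); },
               io.reflect());
}

int main()
{
    Inputs in;
    alloc_all(in, 1024);
    std::printf("allocated %zu elements per tensor\n", in.a.data.size());
}
```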

[ROCm/composable_kernel commit: b352a68606]
2026-01-12 09:45:53 +01:00
yadaish
981c891757 moe fp8 blockscale use nt (#3524)
* nt on fp8 blockscale

* some improvements; tests still need to be fixed

* update

* fix format

* revert useless change

* revert any change in amd_buffer_coherence

[ROCm/composable_kernel commit: 32408c8bc0]
2026-01-12 10:48:10 +08:00
damien-lejeune
58d8d793b1 Dlejeune/ck tile 2d multiple reductions (#3147)
* WIP

* Add Unit tests for the Multi Reduction Kernel

* clang format

* Rename multiblock to threadwise

* Multiblock WIP

* Fix multi reduce multi block unit tests

* Multi Reduce Tile Engine: WIP

* refactoring + try addressing precision error

* Fix multiops examples

* Cleanup

* Clean up tile engine's reduce op

* Update changelog

* Fix remod/clang

* Fix dates

* Fix documentation & missing file

* Fix comments

* Use the update_tile api in the multi-block kernel

* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests

* Add TilePartitioner

* Cleanup

* Add a warning in the example when there is no data to process

* Refactoring Reduce kernel Tile Partitioner + cleanup

* Move the tile partitioner to its own file

* Add missing includes

* Fix copyright header with update_amd_copyright_headers.py

* Fix change of interface in Reduce2dProblem

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 4216d43da8]
2026-01-09 11:16:37 +01:00
Robin Voetter
1a4deaded3 [CK_BUILDER] Debug utilities (#3528)
* ck-builder: make toString to_string

We are using snake case for CK-Builder

* ck-builder: add debug.hpp with tensor descriptor printing function

This adds some initial functionality to debug.hpp, a header which will
be used to house some debug utilities.

* ck-builder: abstract nd-iteration

Abstracting this makes it easier to test, clearer, and allows us to
use it elsewhere (such as in debug.hpp soon)

* ck-builder: tensor printing

* ck-builder: rename INT32 to I32

This makes it more in line with the other data type definitions.

[ROCm/composable_kernel commit: e3884bbf05]
2026-01-08 10:14:13 +01:00
Thrupti Raj Lakshmana Gowda
f8d1442908 Removing memop from chshuffle (#3530)
[ROCm/composable_kernel commit: 770a14494e]
2026-01-07 23:34:43 -08:00
Johannes Graner
9d6add54e5 [CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169)
* Take split_k into account when checking 2GB tensor limit.

* Revert "Take split_k into account when checking 2GB tensor limit."

This reverts commit adf35c91be.

* Optimize grouped conv bwd wei split_k off calc

(cherry picked from commit 2115642ee59050dabd81393c1b8f03b34adc05aa)

* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp

(cherry picked from commit 900d4d4b466f5730ae1189370d3c96267c35ea69)

* Fix tensor descriptors and stride calculations

* Don't miss half of the elements

* Fix buffer size calculations

* Disable hack if stride not divisible by k_batch

* Clean up comments

* Disallow hack in non-contiguous edge cases

* Index -> Dim

* Fix broken test

* Refactor applicability checks into separate function

* fix missed variable name

* Fix variable name in info print

* update V3 2GB check

* No more regression, use templates instead

* Code deduplication

* Regression fix for cshuffle

* arch-guarded atomic_add implementations for gfx11

* Similar for half(4|8)_t as well

* Only use both offset hacks at the same time

* Revert "arch-guarded atomic_add implementations for gfx11"

This reverts commit 3883fe6935.
This reverts commit 5311ec608d.

* Reapply "arch-guarded atomic_add implementations for gfx11"

This reverts commit 1972adeddc.

* Only remove float4 atomic_add

* Refactor to single flag

* Consolidate template parameters

* Consolidate flag in transformers

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: ee2c35b92d]
2026-01-08 08:02:02 +01:00
Bartłomiej Kocot
5e0d3e77b9 [CK TILE] Fix grouped conv kernels splitk and double lds (#3527)
[ROCm/composable_kernel commit: bc497beffb]
2026-01-08 07:59:38 +01:00
Bartłomiej Kocot
dcc6ce0e22 Disable fp32 atomic adds on gfx11 (#3510)
* Disable fp32 atomic adds on gfx11

* Fixes is supported

[ROCm/composable_kernel commit: f449a5faaa]
2026-01-07 15:32:04 -08:00
Enrico Degregori
5a3fc30228 Wmma support for gemm_bias_add_reduce (#3316)
* Add tests for gemm_bias_add_reduce

* Initial working implementation

* Generalize implementation of reduce epilogue

* Add tests for all layouts

* Add instances

* Fix test archs

* Fix xdl bug

* Remove library/profiler duplications

* Fix num_bytes error in profiler

* Fix typos

* Fix copyright

[ROCm/composable_kernel commit: aad4cf0985]
2026-01-07 10:27:16 -08:00
Erwin Terpstra
2379b5e6e0 Implement grouped gemm fastgelu for RDNA4 (#3303)
* Implement grouped gemm fastgelu for RDNA4

* chore: some cleanup and minor inconsistencies in grouped gemm profiler

* chore: clarified logic and reporting of supported instance warnings

[ROCm/composable_kernel commit: f9c6ba0403]
2026-01-07 10:20:44 -08:00
John Shumway
a89756823c Add unit test coverage for conversion to convolution traits (#3515)
Our concept-based conversions are fragile and too complex. We want to refactor to straightforward functions
for each instance traits class template. This change adds unit test coverage to make that refactoring safer.

[ROCm/composable_kernel commit: a7d6b1e700]
2026-01-07 07:44:21 -08:00
Johannes Graner
acf98936bc [CI, CK examples] Disable time_kernel for CI tests and examples (#3464)
* Disable kernel timing in tests

* default time_kernel = false in old CK examples

[ROCm/composable_kernel commit: 0a474aa62f]
2026-01-07 16:30:57 +01:00
BrianHarrisonAMD
edc3e4a870 Enable offload-compress for Windows if available (#3521)
[ROCm/composable_kernel commit: e8cc75aefb]
2026-01-07 07:05:03 -08:00
Cong Ma
cdd9dafe6a [CK TILE] Refactor function amd_buffer_load_invalid_element_return_zero (#3512)
Refactor the function amd_buffer_load_invalid_element_return_zero to avoid
the inefficient ASM code generated by the compiler.

The compiler generates suboptimal assembly for the ternary operator, causing excessive VGPR usage (see the sketch below).

Tested compilers:
- ROCm 7.0.1
- ROCm 7.1.1
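
The PR doesn't include the diff here; as an illustration of the general pattern (replacing a per-element ternary with branchless masking; hypothetical names, not the actual CK Tile code):

```cpp
#include <cstdint>
#include <cstring>

// Ternary form: the compiler may keep both the loaded value and the zero
// alternative live per element, which can inflate VGPR usage.
float load_or_zero_ternary(float loaded, bool element_valid)
{
    return element_valid ? loaded : 0.0f;
}

// Masked form: build an all-ones/all-zeros mask from the predicate and
// AND it with the raw bits, so no second live operand is needed.
float load_or_zero_masked(float loaded, bool element_valid)
{
    std::uint32_t bits;
    std::memcpy(&bits, &loaded, sizeof(bits));
    bits &= 0u - static_cast<std::uint32_t>(element_valid); // 0x0 or 0xFFFFFFFF
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```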

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: d7497d2694]
2026-01-07 00:05:56 -08:00
Khushbu Agarwal
c33704febc [CK_Tile] Support for various group sizes Preshuffle quant for 2d block scale gemm (#3445)
* formatted

* formatted

* formatting

* formatting

* formatting

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* enable prefill shapes

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Add support for rowcol and tensor GEMM operations

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update README

* adding preshuffle quant as new parameter and its associated new files

* remove debugging statements

* adding test

* enable preshuffle quant with permuteN

* updating readme and corresponding gemmconfigs

* updating cmake file

* fixing CI failures for grouped quant gemm

* debugging permuteN

* debugging

* debugging PermuteN

* initial commit

* resolving merge conflicts

* adding test cases

* initial commit with prints

* debugging

* fine-grained working

* debugging medium grained

* fixing the tile window

* formatting

* enabling prefill shapes

* working prefill shapes

* formatted

* clean up

* code cleanup

* bug fix after merging with develop

* clean up after merging with develop

* added comments for the tile window and tile distribution encoding

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>

[ROCm/composable_kernel commit: aaa35f0bbf]
2026-01-06 12:46:59 -08:00
kyle-256
9489e197c3 [CKTILE] Support A/B Quantization in Blockscale Grouped Gemm (#3452)
* update grouped_gemm blockwise kernel

* update config

* update kernel

* update examples

* remove test code for now

* sync test files with origin/develop

* update example

* fix code lint

* fix code-lint

* update test code

* run clang format

* run pre-commit

* update api

[ROCm/composable_kernel commit: 76696ace44]
2026-01-06 12:36:04 -08:00
kensclin
df198bd5af [CK_TILE] add preshuffleB mode for ABQuant GEMM (#3495)
* [CK_TILE] add preshuffleB mode for ABQuant GEMM

* fix precommit error

* use template method call for cvt_scale_to_fp32

* fix precommit error

* add test code

* fix precommit error

* switch abquant gemmconfig to default

* Add changelog.md

* fix precommit error

* fix conflict

[ROCm/composable_kernel commit: 2309c86054]
2026-01-06 12:35:01 -08:00
John Shumway
946a6e7df0 Fix build error from extra comma (#3516)
The newer ROCm compiler gives an error with a trailing comma in testing::AllOf.
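
For illustration, a minimal GoogleTest sketch of the construct the fix targets (not the actual CK test):

```cpp
#include <gmock/gmock.h>
#include <gtest/gtest.h>

TEST(TrailingComma, AllOf)
{
    using ::testing::AllOf;
    using ::testing::Ge;
    using ::testing::Le;

    // Fine everywhere:
    EXPECT_THAT(5, AllOf(Ge(0), Le(10)));

    // With a trailing comma inside the matcher list, i.e.
    //   EXPECT_THAT(5, AllOf(Ge(0), Le(10),));
    // the newer ROCm clang reports an error: C++ function argument lists,
    // unlike initializer lists, do not permit trailing commas.
}
```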

[ROCm/composable_kernel commit: 960ef551bf]
2026-01-06 11:08:54 -08:00
Illia Silin
acb2292b46 add tabulate package to aiter docker (#3519)
[ROCm/composable_kernel commit: 2ffbf7f476]
2026-01-06 09:36:54 -08:00
Robin Voetter
ffc30531ac [CK_BUILDER] Integrate reference conv with testing (#3511)
* ck-builder: explicitly delete forward declarations

Before, these functions were seen as a forward declaration for an existing function.
If no actual implementation overload could be found, these would be selected and
a linker error or warning would be generated. By marking these functions as
explicitly deleted, incorrect invocations produce a compile error instead (see
the sketch after this list).

* ck-builder: ckt::run plumbing for reference conv

This implements the ckt::run plumbing for the reference convolution
implementation and sets up the first complete end-to-end test.

* ck-builder: make validation system check for all-zeros

When the actual and reference outputs are both all zero bits,
there is probably something wrong in the test framework.

* ck-builder: proper implementation+tests for TensorDescriptor::is_packed

* ck-builder: fix typos
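
A minimal sketch of the deleted-declaration technique from the first bullet (hypothetical names, not the ck-builder signatures):

```cpp
#include <cstdio>

// Generic entry point. Left as a plain declaration, a call that matches no
// real overload only fails at link time; marked deleted, the same mistake
// is rejected at compile time, at the call site.
template <typename Kernel>
void run(Kernel) = delete; // no generic implementation exists

struct SupportedKernel {};
void run(SupportedKernel) { std::puts("running supported kernel"); }

int main()
{
    run(SupportedKernel{}); // OK: exact non-template overload wins
    // run(42);             // error: use of deleted function 'run<int>'
}
```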

[ROCm/composable_kernel commit: 1c433c64ec]
2026-01-06 09:29:06 +01:00
joyeamd
e36567f015 Merge some updates for ck_tile headers (#3342)
* fix some issues from internal branch

* update cshuffle_epilogue

* update cshuffle_epilogue

* update cshuffle

* update warp_gemm

[ROCm/composable_kernel commit: b78563b3d3]
2026-01-05 23:39:00 -08:00
joyeamd
9516169aaf Joye/revise wp pipeline (#3493)
* [CK_TILE] unify double and single lds implementation (#108)

Unify LDS buffer management API for single and double buffering modes

This change consolidates the Local Data Share (LDS) buffer management by:

Merging single and double LDS buffer APIs into a unified interface
Implementing ping-pong address calculation in pipeline when double LDS is enabled
Computing pong buffer addresses dynamically using base address offsets
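
A minimal sketch of the ping-pong addressing described above (hypothetical field names; assumes the pong buffer sits at a fixed byte offset from the ping base):

```cpp
#include <cstddef>
#include <cstdint>

// Unified single/double LDS buffer addressing: with double buffering the
// "pong" address is computed dynamically from the "ping" base plus an
// offset, so both modes share one API.
template <bool kDoubleLds>
struct LdsBufferView
{
    std::byte*  base;         // start of the LDS allocation
    std::size_t buffer_bytes; // size of one buffer

    std::byte* buffer_for_iteration(std::uint32_t iter) const
    {
        if constexpr(kDoubleLds)
            return base + (iter & 1u) * buffer_bytes; // even iters: ping, odd: pong
        else
            return base; // single buffering: always the same buffer
    }
};
```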

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update wp_pipeline

* fix a c++17 issue

* update for ci errors

* fix ci issues

* include a header to fix ci errors

* fix some rebase issues

* update with rebase

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 2b563ad048]
2026-01-05 13:49:26 -08:00
Estevan Vedovelli
604ba0e9cf Add support to gfx1153 and fix gfx115X WMMA config (#3496)
* Support for gfx115X

* Changes for gfx115X

* Add gfx1153

* Update changelog

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 1224bc0a82]
2026-01-05 10:03:30 -08:00
Bartłomiej Kocot
502914e556 Fix large tensor grouped conv bwd data test (#3513)
[ROCm/composable_kernel commit: bbf0b1a3b3]
2026-01-05 09:42:02 -08:00
Robin Voetter
14a149bab6 [CK_BUILDER] validation (#3471)
This pull request builds on #3267 by providing the "validation" infrastructure, the means to compare a set of `Outputs`.

The design of the validation infrastructure is relatively straightforward:
- Each SIGNATURE should come with a `validate()` implementation, implemented in a similar way to the other functions/types from `testing.hpp`.
- `validate()` returns a `ValidationReport`, a structure that keeps all relevant information about comparing the tensors from two `Outputs`. Crucially, `validate()` should not do any reporting by itself; rather, glue logic should be implemented by the user to turn `ValidationReport` into a relevant error message.
- You can see this glue code for CK-Builder itself in `testing_utils.hpp`, in its `MatchesReference()`. This functionality is relatively barebones right now; it will be expanded in a different PR to keep this one's scope down.

The comparison is done on the GPU (using an atomic for now), to keep tests relatively quick. Some notable items from this PR:
- To help compare the tensors, and to help with writing tests, I've written a generic function `tensor_foreach` which invokes a callback on every element of a tensor (a sketch of the idea follows below).
- For that it was useful for `TensorDescriptor` to have a rank which is known at compile time, so I've changed its implementation accordingly. This felt like a better approach than keeping it dynamic, for multiple reasons:
  - This is C++ and we should use static typing where possible and useful. This way, we don't have to implement runtime assertions about the tensor rank.
  - We already know the rank of tensors statically, as it can be derived from the SIGNATURE.
  - It simplifies the implementation of `tensor_foreach` and other comparison code.
- There are a lot of new tests for validating the validation implementation, validating validation validation tests (Only 3 recursive levels though...). For a few of those functions, I felt like it would be useful to expose them to the user.
- Doc comments everywhere.
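
The PR doesn't quote the signature; a minimal sketch of the described `tensor_foreach` idea (compile-time rank, callback per element; hypothetical types, and it assumes all lengths are non-zero):

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

// Rank-static nd-iteration: the rank is a template parameter, so no
// runtime assertions about it are needed.
template <std::size_t Rank, typename F>
void tensor_foreach(const std::array<std::size_t, Rank>& lengths, F&& callback)
{
    std::array<std::size_t, Rank> idx{};
    if constexpr(Rank == 0) { callback(idx); return; }
    for(;;)
    {
        callback(idx);
        // Odometer-style increment, innermost dimension first.
        std::size_t d = Rank;
        while(d-- > 0)
        {
            if(++idx[d] < lengths[d]) break; // no carry: keep iterating
            idx[d] = 0;                      // wrapped: carry into next dim
            if(d == 0) return;               // outermost wrapped: done
        }
    }
}

int main()
{
    tensor_foreach<2>({2, 3}, [](const auto& idx) {
        std::printf("(%zu, %zu)\n", idx[0], idx[1]);
    });
}
```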

[ROCm/composable_kernel commit: e6e7dc2910]
2026-01-05 04:57:34 -08:00
Jeff Huang
4f3995a3e3 [FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter

* add is_sglang_layout to parameters

* add kv_offset_array_transform to batch async for page size 16

* add kv_last_page_lens to kernel

* change kv layout to [num_total_pages, page_block_size, hdim]

* format

* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
  batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)

* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024

* fix python format

* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]

* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcoded values
2. Make batch prefill kernel traits structures inherit from fmha fwd
   traits
3. Add some static checks for page size, vector size, hdim, etc.

* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration

Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.

**Changes:**
*   Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
*   Updated Kernel, Pipeline, and Traits to template on these Enums.
*   Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
*   Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
*   Updated CodeGen scripts to support new parameters.

This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations (a condensed sketch follows this list).

* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets are in the same page within a tile.

* fix vgpr spill count

* remove unnecessary t2s functions

* add fp8 support for receipt 200 and 600 in fmha_batch_prefill.py

* support linear kv cache layout

* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer to the lookup table.

* 1. merge multiple transforms into a single transform.
2. add static check to make sure vlayout is row-major.

* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.

* update changelog
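
The enum and template names below come straight from the refactor bullet above; a condensed sketch of the decoupled configuration (simplified, not the actual header):

```cpp
// Two orthogonal KV-cache dimensions, per the refactor bullet: how pages
// are laid out in memory, and which lookup table maps a sequence position
// to a physical page.
enum class BlockAttentionKVCacheMemoryLayoutEnum { VECTORIZED, LINEAR };
enum class BlockAttentionKVCacheLookupTableEnum  { SGLANG_1D, VLLM_2D };

template <BlockAttentionKVCacheMemoryLayoutEnum kKVMemoryLayout,
          BlockAttentionKVCacheLookupTableEnum  kKVLookupTable>
struct BatchPrefillTraits
{
    // Independent template parameters mean any layout/paging combination
    // can be instantiated without a combinatorial set of boolean flags.
    static constexpr bool kIsLinearLayout =
        kKVMemoryLayout == BlockAttentionKVCacheMemoryLayoutEnum::LINEAR;
};

// e.g. the SGLang-style configuration with the vectorized layout:
using SglangVectorized =
    BatchPrefillTraits<BlockAttentionKVCacheMemoryLayoutEnum::VECTORIZED,
                       BlockAttentionKVCacheLookupTableEnum::SGLANG_1D>;
```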

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: cc75a1dc5f]
2026-01-05 18:41:47 +08:00
Max Podkorytov
6cf89bbca9 [CK-Tile] move out memory operation from cshuffle epilogue class (#3359)
* initial poc

* factor out common parts in operator()

* cv4

* rest of the universal gemm pipelines

* fix test

* remove boilerplate from tile engine

* fix example

* fix example

* format

* fix tests build for gemm

* remove base pipeline codegen from gemm instance builder

* unify v3 logic with the rest of universal gemm pipelines

* fix build for multi abd test

* fix test gemm multi d

* fix build for weight preshuffle

* fix grouped gemm test

* fix grouped gemm multi d test

* fix grouped gemm preshuffle

* fix grouped gemm example except for quant

* fix gemm preshuffle

* fix splitk 2 stage example

* fix batched gemm example

* fix multid example

* fix multiabd example

* fix batched gemm test

* fixup

* fix examples build

* fix grouped gemm test build

* fix smoke builder

* hacky poc

* fix tile engine

* kill the lambda

* maybe fix test build

* more fixes

* clang-format

* save temp

* clang-format

* mostly fix examples

* clang-format

* remove dead code

* more cleanup

* fix fmha bwd build (default epilogue set/add appears to be broken)

* fix default epilogue tests but not correctness

* clang-format

* fix bquant

* clang-format

* cleanup dead code

* rearrange make windows for readability

* restore changes to IsSupportedArgument

* fix smoke-builder

* clang-format

* fixup rename class

* build fixes

* clang-format

* fix builder

* fixup

* remove set from builder tests

* fix test

* clang-format

* re-refactor the kernels

* clang-format

* fix header license

* remove memory operation from conv bwd test

* clang-format

* clang-format example,include

* clang-format test

* build fixes

* clang-format

* solve compilation error

* fix the CI

* solve compilation error

* clang format

* solve merge conflict

* solve merge conflict

* solve the gfx11 error

* solve test error

* moar build fixes

* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: e339101e9c]
2026-01-04 03:28:14 -08:00
John Afaganis
077d75cea0 Update unsigned long literals and format specifiers to work correctly in Windows (#3483)
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
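
A minimal sketch of the before/after pattern (representative, not the exact CK lines):

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main()
{
    // Before: 'unsigned long' is 64-bit on LP64 Linux but 32-bit on LLP64
    // Windows, so this pair is non-portable:
    //   unsigned long offset = 1ul << 40; // undefined/truncated on Windows
    //   std::printf("%lu\n", offset);     // wrong specifier there

    // After: explicit 64-bit type, 64-bit literal, and the matching
    // <cinttypes> format macro behave identically on both ABIs.
    std::uint64_t offset = UINT64_C(1) << 40;
    std::printf("%" PRIu64 "\n", offset);
}
```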

[ROCm/composable_kernel commit: ec23be0b9d]
2026-01-02 22:16:41 -07:00
John Shumway
9e9cadefb5 [CK_BUILDER] Remove cmath include (#3508)
Remove the dependency from device_tensor_generator.hpp and fix a typo from a previous force push. The changes replace standard library math functions with their ck::math equivalents and define PI as a local constant instead of computing it using std::acos.

Key changes:

* Removed the `#include <cmath>` header dependency
* Replaced std::acos(-1.0) with hardcoded PI constant 3.141592653f
* Replaced std::sqrt, std::cos, and std::sin with ck::math equivalents

[ROCm/composable_kernel commit: 4670df5ca6]
2026-01-02 16:58:35 -08:00
John Shumway
853f3c6776 Remove non-standard M_PI (#3507)
Just use PI=acos(-1.0) as a local static constexpr. This has been causing build issues on Windows.

[ROCm/composable_kernel commit: 355ce9230d]
2026-01-02 14:21:46 -08:00
John Shumway
86b1f5749b Enable math defines for MSVC. (#3503)
The symbol M_PI is breaking the build on Windows. The _USE_MATH_DEFINES macro enables M_PI and other math constants on Windows. (I'm guessing this is more idiomatic than the old trick of using PI=acos(-1.0).)

https://learn.microsoft.com/en-us/cpp/c-runtime-library/math-constants?view=msvc-170
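
Taken together, the three PI-related commits above converge on two portable options; a minimal sketch of both (illustrative, not the exact CK code):

```cpp
// Option 1 (#3503): opt in to MSVC's math constants before the first
// <cmath>/<math.h> include, which makes M_PI available on Windows too.
#define _USE_MATH_DEFINES
#include <cmath>
#include <cstdio>

// Option 2 (#3507/#3508): avoid M_PI entirely and keep a local constant,
// which needs no macro and no <cmath> at all.
static constexpr float kPi = 3.141592653f;

int main()
{
    std::printf("M_PI = %f, kPi = %f\n", M_PI, kPi);
}
```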

Co-authored-by: BradPepersAMD <Brad.Pepers@amd.com>

[ROCm/composable_kernel commit: 1da340031c]
2026-01-02 14:36:42 -05:00
Joseph Macaranas
506a19a7e7 Update TheRock CI SHA 20260102 (#3506)
- TheRock CI compilation passed with the changes.

[ROCm/composable_kernel commit: cc1392a405]
2026-01-02 14:23:43 -05:00
Ville Pietilä
ba9dbd433a [CK_BUILDER] Instance traits for conv bwd weight algorithms (#3498)
Added instance traits for the following bwd weight conv algorithms

DeviceGroupedConvBwdWeight_Xdl_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_Wmma_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Xdl_CShuffle
DeviceGroupedConvBwdWeight_TwoStage_Wmma_CShuffleV3
DeviceGroupedConvBwdWeight_DL
DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle
DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
Also added unit tests for instance traits of those bwd weight algorithms that are currently exposed by the narrow CK build for MIOpen.
---------

Co-authored-by: Ville Pietilä <>

[ROCm/composable_kernel commit: 6e8c401e33]
2025-12-31 15:41:15 -08:00
DarylHawkinsAMD
67b61ccf5c Temporarily disable kernel instances that won't build on gfx1101 on Windows (#3499)
## Proposed changes

This source file won't build for gfx1101 on Windows. It builds successfully on other gfx110X architectures, and also builds successfully on gfx1101 on Linux.

This is the compile error:

```
[composable_kernel] FAILED: library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj 
[composable_kernel] ccache B:\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_WMMA -DCK_USE_XDL -DDPP_KERNELS -DLLVM_MAIN_REVISION=524190 -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -IC:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/library/include -IC:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/include -IB:/build/ml-libs/composable_kernel/build/include -IB:/build/base/half/stage/include -isystem B:/build/core/clr/dist/include -DWIN32 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_WARNINGS -DNOMINMAX -fms-extensions -fms-compatibility -D_ENABLE_EXTENDED_ALIGNED_STORAGE  -Wno-documentation-unknown-command -Wno-documentation-pedantic -Wno-unused-command-line-argument -Wno-explicit-specialization-storage-class -Wno-ignored-attributes -Wno-unknown-attributes -Wno-duplicate-decl-specifier --hip-path=B:/build/core/clr/dist --hip-device-lib-path=B:/build/core/clr/dist/lib/llvm/amdgcn/bitcode -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=c++20   -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -x hip --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx1103 --offload-arch=gfx1100 --offload-arch=gfx1101 --offload-arch=gfx1102 --offload-arch=gfx1103 -MD -MT library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj -MF library\src\tensor_operation_instance\gpu\grouped_conv3d_bwd_weight_bilinear\CMakeFiles\device_grouped_conv3d_bwd_weight_bilinear_instance.dir\wmma\device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj.d -o library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/CMakeFiles/device_grouped_conv3d_bwd_weight_bilinear_instance.dir/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp.obj -c 
C:/home/runner/_work/TheRock/TheRock/ml-libs/composable_kernel/library/src/tensor_operation_instance/gpu/grouped_conv3d_bwd_weight_bilinear/wmma/device_grouped_conv3d_bwd_weight_wmma_bilinear_ndhwgc_gkzyxc_ndhwgk_f16_instance.cpp
[composable_kernel] error: Illegal instruction detected: Operand has incorrect register class.

[composable_kernel] V_CMP_NE_U32_e32 0, $src_private_base, implicit-def $vcc, implicit $exec

[composable_kernel] 1 error generated when compiling for gfx1101.
```

This appears to be a compiler bug and we'll follow up to get a proper fix landed, but for the purposes of landing some work to enable gfx1151 support in TheRock we'd like to disable building of these kernels on this architecture temporarily.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

- [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally
- [X] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run.
- [x] I have added inline documentation which enables the maintainers with understanding the motivation
- [X] I have removed the stale documentation which is no longer relevant after this pull request
- [X] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
- [X] I have run `clang-format` on all changed files
- [X] Any dependent changes have been merged


[ROCm/composable_kernel commit: f3e4d46faa]
2025-12-31 13:12:45 -07:00
kabrahamAMD
d7f7c1b6db [CK_Builder] [testing] Integrate device random generators (#3427)
Implemented device random number generators for ck tensors.
Includes tests and integration into the ck builder testing interface.

[ROCm/composable_kernel commit: f86bbb1aef]
2025-12-30 10:03:05 -08:00
Bartłomiej Kocot
c5245882c3 Fix grouped conv wrw kernels names (#3494)
[ROCm/composable_kernel commit: 2b8302eb6d]
2025-12-30 16:45:39 +01:00
ApoorvaKalyani
bac1ccbf8b Grouped convolution backward data WMMA v3 implementation (#3460)
* Added device level implementation for bwd_data_wmma_v3.

* Added first instance of bwd_data_wmma_v3(f16).

* Add support for bwd data in gridwise implementation

Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data, and bwd
weight.

* Initial device implementation of bwd data

* Remove unused template parameters in device impl

* Add one instance for different layout

initial check of device implementation

* Add tests for splitk and for different layouts

* Appended more instances to wmma_v3_f16.

* Added conv_2d bf16 wmma_v3 instances.

* Added conv_3d_bf16 wmma_v3_instances.

* Added conv_3d_f16_wmma_v3_instances.

* Added SplitN test cases for wmma.

* Conv3d_bwd_data_scale_wmma_v3 instances.

* Conv3d_bwd_data_bilinear_wmma_v3_instances

* Renaming the device level instances file to a common name, since it is defined for different DataTypes.

* Renaming the instances and fixing typo

* Added the test cases to regression test list

* NCHW support for wmma_v3

* Examples for bf16 and f16 bwd_data_wmma_v3

* Added transpose conditions for device impl

* fixing bugs

* Added the gemm_args array implementation

* WIP debug conv bwd

* fix splitk

* Grouped gemm fix

* Update CmakeLists with EOF

* Added more instances for tests

* Fixed the run time error in examples and removed 3d conv examples.

* Fixed a typo.

* Updated CMakeLists to remove the deleted 3d convolution files

* Added print error statements for unsupported arguments

* Added the merge conflict related changes

* Fixed compilation error

* Fixed the InstanceFactory duplication error.

* Removed the print statements and added logs to Arg function

* All the merge conflict related errors resolved

* Added d_tensor tests.

* Added the missing example types of wmma_v3

* Merge error fix

* Corrected the instance name

* Reverted the bias relu change

* Reverted the transpose load local change

* Updated the regression test list with bwd_data_scale

* Revert "Revereted the transpose load local change"

This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.

* Revert "Merge error fix"

This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.

* Reverting the local change

* Added merge error fix

* Build error fix due to merge conflicts

* Added bias_relu example for wmma_v3

* Modified the main method in dtensor tests

* Updated the dtensor tests to pick all the shapes

* Updated the dtensor test shapes.

* Updated the mem operations in tests.

* Added reference func

* Fixed typos in device impl

* Added new header file and modified the include file for 3d tests

* Renamed the test file and added reference func call.

* clang format fix

* Added ignore params

* Modified device impl and tests

* Removed debug print statements and updated dtensor test shapes

* Fixing merge conflicts

* Fixing more merge conflicts

* Fixed copyrights

* Updated the tuned instances to bilinear and scale.

* Adding tuned instances to vanilla wmma_v3

* Removed all unused instances and modified test layouts.

* Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances.

* Fix clang format

* Updated tuned f16/generic instances

* Formatting the instances file

* Fixed copyrights and clang issues

* Nonsense commit to force git to force

* Removed the transpose instances

* Added verified generic instances

* Fixing namespace errors

* Added todo for failing shapes

* Formatting instance file

* Fix instance list formatting

* Removing unnecessary formats

* Renamed the common file

* Unification of xdl and wmma bwd_data tests

* Updated Cmake

* Added all layout types and deleted code.

* Updated Cmake to add the condition to all tests.

---------

Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>

[ROCm/composable_kernel commit: 53a1e4f551]
2025-12-30 16:25:08 +01:00
yadaish
fc3ffa0d75 [CK_TILE] support split-k a16w4 gemm1 (#3389)
* initial version to support moe gemm1 split-k

* add missing args

* fix build warning

* update reference

* for split-k disable bias and weight

* remove debug log

* fix format

* fix div by zero errors

* fix cmake config

* update

* resolve conflicts

* remove useless changes

* reformat

* fix

* remove useless changes

* fix ci

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: root <root@smci355-ccs-aus-m01-25.cs-aus.dcgpu>

[ROCm/composable_kernel commit: dae85ead64]
2025-12-29 23:05:35 +08:00
JH-Leon-KIM-AMD
3772cf9dd4 [CK_BUILDER] Add GPU Reference Algorithm to CK Builder (#3381)
* [CK_BUILDER] Integrate GPU reference as ConvAlgorithm

Add GPU reference as a ConvAlgorithm specialization, enabling:
- Unified Builder API for reference and optimized kernels
- Future ckProfiler integration for validation
- First step toward numerical validation in Builder tests

Changes:
- Add ConvAlgorithmSpecialization::REFERENCE enum
- Add ConvAlgorithm_Reference struct
- Add IsReferenceAlgorithm concept
- Create 3 reference factories (Forward, BwdData, BwdWeight)
- Wire into conv_dispatcher
- Add proof-of-concept test (passing)

Test result: Can instantiate reference through Builder API

* Add GPU reference execution tests

- Reference kernel executes through Builder (459ms)
- Both reference and optimized can instantiate
- Tests passing

Next: Implement utilities for comparison

* Optimized Builder kernel execution works

- MakeArgument pattern implemented
- Builder-generated kernel executes successfully
- Tests passing (451ms execution)

Next: Add comparison

* VALIDATION COMPLETE: Builder == Reference

Builder-generated kernel output matches GPU reference!

Test: Validate_Optimized_vs_Reference_Forward_2D_FP16
Result: PASS ✓

This proves CK Builder generates correct code!

* Update to new Builder API

All tests passing

* Rename test file for clarity

test_builder_kernel_execution -> test_builder_kernel_validation

* Add all 3 directions support

- Forward, Backward Data, Backward Weight
- All reference factories working
- Dispatcher wired for all directions
- 9 tests passing

Tests:
- test_reference_execution: 3 tests (all directions)
- test_optimized_execution: 3 tests (all directions)
- test_builder_kernel_validation: 3 tests (fwd validated, bwd placeholders)

* Add backward direction support

- Backward data and weight dispatcher wiring
- Fix factories for new API
- All 3 directions tested
- 9 tests passing

* Refactor: Change IsReferenceAlgorithm from concept to consteval function

Address review feedback: Use consteval function in dispatcher instead of
concept, matching the pattern for other algorithms (Tile, XDL, WMMA, DL).

- Remove IsReferenceAlgorithm concept from conv_algorithm_concepts.hpp
- Add IsReferenceAlgorithm() consteval function to conv_dispatcher.hpp
- Update dispatcher to use function call: IsReferenceAlgorithm<T>()
- Remove redundant algorithm checks from reference factory requires clauses

All tests passing (9/9).

* Move Tile algorithm check outside direction block to support all directions

* Implement MakeInvokerPointer interface and add random input validation

- Implement full Argument/Invoker structs for old CK interface (not just nullptr)
- Refactor with reference_common.hpp to reduce code duplication
- Add random input validation tests: Builder vs direct GPU reference (all directions)
- Fix layout: GNHWC -> NHWGC to match reference kernel expectations
- All 12 tests pass with IDENTICAL results on random input

* Move ConvAlgorithm_Reference to test/impl/conv_algorithm_types.hpp

Keep types.hpp for data types only (enums), move algorithm descriptors
to conv_algorithm_types.hpp as suggested by review.

* Add static_assert to ensure reference factories only accept PassThrough operations

Reference implementation doesn't support fused elementwise operations.
Add compile-time validation to fail early with clear error message if
non-PassThrough operations are specified on input, weight, or output.

* Add InstanceTraits support for reference kernels

- Store SIGNATURE/ALGORITHM/VERSION in Instance for reflection
- Create shared ReferenceCommonTraits base for common properties
- Add 3 direction-specific InstanceTraits specializations in one file
- Include data type and layouts in instance_string output

* Remove optimized kernel validation tests from reference-only branch

* Use existing layout helper and organize reference tests

Use LayoutToCK from conv_tensor_layout.hpp and move reference InstanceTraits
test to validation folder.

* Merge develop branch

Fix DataType switch for new mixed precision types.

* Fix comment spacing for CI

* Convert IsReferenceAlgorithm from function to concept

* Add reference tests to CI smoke tests

* Consolidate 3 reference factories into single unified factory

---------

Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>

[ROCm/composable_kernel commit: a0acc83a72]
2025-12-29 16:11:08 +02:00
Kiefer van Teutem
04d4dd1ada Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470)
* Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests.

* Tentative fix for bwd weight bilinear bf16bf16bf16; it seems the bilinear elementwise overload for this case (bf16, f32 accumulation, bf16) was wrong.

[ROCm/composable_kernel commit: 88ae445580]
2025-12-29 12:58:29 +01:00
Yi DING
9045cafc8c [CK_TILE] MX FLATMM Fix M Padding (#3489)
* Fix M Padding

* Fix tensor desc ele space size

[ROCm/composable_kernel commit: b0ea67e377]
2025-12-29 09:09:12 +08:00
joyeamd
38a547df56 enable f8 tests (#3488)
[ROCm/composable_kernel commit: a3916a8d16]
2025-12-27 00:21:56 -08:00
Yi DING
d80a3f9c70 [CK_TILE] Align FMHA BWD Reference with Kernel Implementation (#3486)
[ROCm/composable_kernel commit: 7ce532eac7]
2025-12-25 16:12:36 +08:00
Erwin Terpstra
bd73699148 [CK_TILE] Grouped gemm quant tensor layouts (#3414)
* feat: add RRR, CRR, CCR layouts for a/b quant grouped gemm tests and examples. Refactor example setup to improve compile time

* chore: split out bquant preshuffle test, and reduce tile size to 128 to temporarily solve slow compile times

* chore: set m/n warp tile to 16 as configurations with 32 seem to have some support problems

* fix: missing check for transposed load in bquant pipeline

* chore: lower unit test tensor dimensions a bit for faster tests

* chore: set grouped gemm example M/N warp tile to 16

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: e08efa551f]
2025-12-24 23:01:23 -08:00
Illia Silin
21d679acab remove the LLVM_MAIN_REVISION usage (#3487)
[ROCm/composable_kernel commit: 14668a56e3]
2025-12-24 16:49:35 -08:00
Thrupti Raj Lakshmana Gowda
b17fa5656f [CK TILE ENGINE] CI configuration with basic cases (#3475)
* [CK TILE ENGINE] Adding GEMM BASIC TEST in Jenkins

* fix RUN_TILE_ENGINE_BASIC_TESTS name typo

* [CK Tile Engine] Updating basic CI

* Resolving merging issues

* Resolving merging issues

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 62a8ec155f]
2025-12-24 10:45:56 -08:00