[CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight
instances (#7937)
## Motivation
CK Tile had no pipeline competitive with old CK's wavelet on the
RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a
wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive
kernel for spatial small-K shapes.
## Technical Details
- New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`):
workgroup split into math waves (LDS read + MFMA) and load waves (DRAM
read + LDS write).
- VGPR role-split: `operator()` has two top-level mutually-exclusive
`is_math` branches so the allocator overlays both roles onto the same
physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy.
Correctness depends on identical `block_sync_lds` counts on both arms
plus a matching load-wave barrier stub in the epilogue
(`cshuffle_epilogue.hpp`).
- Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`):
`kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub.
Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36
C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27%
(googlenet K=320 by 16-23%); the role-split roughly halves the parity
gap vs old CK on the 13x13 fp16 shape.
## Test Plan
- `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16,
`split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a
broad 2D sweep.
- Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64}
(barrier-parity / deadlock check).
- `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet
instances.
## Test Result
- All wavelet instances CPU-verify correct across the split-K sweep; no
hangs (dual-arm barrier sequence matches).
- Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best
non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better
vs old CK on the majority of spatial shapes.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Integrate CK Tile Dispatcher code generation into
CK Tile Profiler (#7284)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
CK Tile is going to be delivered to hipDNN via CK Dispatcher. Currently
the CK Tile Profiler using CK Builder for generating the profiled
instances from the configuration files that identify the instances that
old CK exposes. We need to replace this instance generation with the CK
Tile Dispatcher codegen.
## Technical Details
The old CK Profiler config files are converted to JSON files that the CK
Tile Dispatcher can digest. The conversion script for configurations is
stored to source control in case we need to update the JSON
configurations later. The dispatcher generates instance libraries per
conv direction (fwd, bwd data, and bwd weight) that are linked to the CK
Profiler executable. I also implemented codegne for the stream-K and
depthwise conv instances. The proposed solution replaces the CK Builder
codegen with the CK Tile Dispatcher codegen.
There are two new methods that are exposed via the dispatcher backend
- `is_supported` - required to enabled the profiler workflow where we
check the applicability of the kernel instance before running it.
- `get_instance_string` - this mainly for verification. This provide the
CK Builder instance string for verifying that the old CK Builder based
profiler and the new CK Tile Dispatcher based profiler have the same
instances.
The rules that limit the generated instances are now collected to a
single location under the dispacther. The CK Builder codegen uses these,
which ensures that the two codegen pipelines are in sync. The next step
(different PR) is to remove the CK Builder codegen pipeline altogether.
## Test Plan
Verified that the old CK Builder based profiler and the new CK Tile
Dispatcher based profiler have the same instances, that is, the
Dispatcher based codgen can generate the same instances as the old CK
Builder.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] suppress compiler warnings while building pytorch. (#7760)
## Motivation
Recently added compiler flags that are required to suppress false
warnings by latest staging compiler are not recognized by older compiler
versions and are triggering an avalanche of warnings. Previous attempt
to suppress them by using -Wno-unknown-warning-option flag didn't help,
because that flag wasn't recognized either and just added more warnings.
I've verified that current approach by checking the clang version
actually works as intended and makes the warnings go away.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] Grouped Convolution Global Load/Store support (#7631)
## Motivation
Grouped Convolution Global Load/Store support to cover large tensor
cases.
## Technical Details
Utilize global load for grouped convolution forwad kernels. Update
Indexes to use int64.
## Test Plan
- test utils
- test conv kernels in next pr with instances
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1255
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
[CK] upgrade CI to rocm7.13 as default compiler (#7612)
## Motivation
Upgrade the default docker and compiler version in CI to rocm7.13.
In order to pass all the checks I had to also clean up a lot of
non-ascii characters in the source code comments and modify a couple of
tests that were affected by a new compiler logic.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
[CK Tile] Enable hardware OOB buffer load offset trick by default (#7466)
## Summary
Enables `CK_TILE_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK`
inside `config.hpp`.
### Background
When loading from global memory with out-of-bound (OOB) check, CK Tile
must suppress invalid lanes. The previous default used a software
branch:
```cpp
// Old path (oob_conditional_check, no trick)
if(!src_thread_element_valid) { return zeros; }
return amd_buffer_load_impl(...);
```
This generates divergent control flow, the compiler emits exec-mask save/restore and per-lane comparison SALU instructions one set per buffer load that touches a padded dimension.
### Change
With the trick enabled, invalid lanes are suppressed entirely in hardware:
```cpp
// New path (trick enabled)
uint32_t shift = src_thread_element_valid ? 0 : 0x80000000;
return amd_buffer_load_impl(resource, shift + offset, 0);
```
The `0x80000000` offset overflows the buffer descriptor's declared size, causing the hardware to silently return zero for that lane - no branch, no exec mask manipulation. This matches the behavior of old CK XDL kernels, which use an unconditional load followed by a `v_cndmask` select.
### Expected impact
Eliminates ALU overhead from OOB validity branches which reduces the kernel execution time, especially for memory-bound cases.
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
[CK TILE] Increase default kPerXdl for grouped convolution instances (#7465)
## Summary
Increases the default `kPerXdl` used in CK Tile grouped convolution
instance generation for forward, backward-data, and backward-weight
operations.
### Changes in `generate_instances.py`
- **Larger default `kPerXdl` for all fp16/bf16 tile sizes**:
`get_k_mfma()` now
returns `32` for `m/nPerXdl = 16` and `16` for `m/nPerXdl = 32`.
- **Cap `kPerXdl` to `kPerBlock`**: All three parsers
(`parse_fwd_instances`, `parse_bwd_weight_instances`,
`parse_bwd_data_instances`) now clamp the computed value with `min(...,
k_per_block)` to prevent generating invalid instances where `kPerXdl >
kPerBlock`.
### Expected impact
Higher `kPerXdl` increases the number of MFMA instructions issued per
warp per inner-loop iteration, improving arithmetic intensity and
reducing pipeline stall overhead for memory-bound shapes.
[CK Tile] Support multi-vector reads in static encoding patterns (#7528)
## Motivation
The thread-raked / warp-raked / block-raked static tile distribution
patterns in `ck_tile` silently produce wrong results when the contiguous
tile dimension is larger than `warp_size * vector_size`, because the
encoding has no per-thread iteration dimension along X.
Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in
`ConvConfigComputeV3`, the grouped convolution backward-weight example
reports about 50 percent wrong values, with errors starting exactly at
the `X0*X1 = 64` boundary. The second pass over the contiguous dim is
never performed.
This PR extends the encoding so multi-vector reads in the contiguous
tile dimension are supported, while keeping every existing call site
bit-for-bit identical.
## Technical Details
Three files changed.
### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`
Add a per-thread X iteration dimension in all three raked
specializations:
- `X0 = min(warp_size, XPerTile / X1)` — threads in X dim
- `X1 = min(LargestVec, VecSize)` — vector size per access
- `X2 = XPerTile / (X0 * X1)` — number of X-iters per thread (new)
`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both
`make_2d_static_tile_distribution()` and
`make_shuffled_2d_static_tile_distribution()`.
The new encoding places `X2` in the middle of the Ys iteration list,
which preserves reverse symmetry between the regular `<..., X2, X1>` and
shuffled `<X1, X2, ...>` encodings.
Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.
### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`
Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch
alongside the existing `NDimY == 2` branch. The original branch is
byte-for-byte unchanged.
Both branches dispatch to the same `transpose_tile2d_impl_in_thread`,
whose body has always been NDimY-generic (iterates with `static_for<0,
NDimY, 1>` and `number<NDimY>{}`).
### 3.
`experimental/grouped_convolution_tile_instances/generate_instances.py`
Removed the two now-obsolete skip guards in `parse_bwd_weight_instances`
and `parse_bwd_data_instances`:
```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
continue
```
Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating,
irregular-load shapes, scalar-per-vector > tile size) are kept
untouched.
### Compatibility
Strict. Every existing caller has `X2 == 1` and therefore hits the
original encoding path verbatim. No upstream config or pipeline behavior
changes.
## Test Plan
The grouped convolution example is the natural exerciser since
`GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for
both A and B tiles, and all three conv directions share the same
`ConvConfigComputeV3`.
For each test below we ran:
```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data [-prec={fp16,bf16}]
```
with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both
code paths:
| Test | M / N / K | VecA/B/C | A path | B path | dtype |
|------|-------------|----------|------------|----------------|-------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir)|
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3
dir) |
A larger T4a (256/256/128) was attempted to stress both A and B with
X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128
KB > 64 KB limit), independent of this PR.
For the generator change we ran:
```
python3 generate_instances.py --mode profiler --direction all
```
and verified `Skipping instance ... with multiple warps per continous
tile dim` no longer appears (count went from non-zero to 0); other skip
categories are unchanged.
`clang-format-18` was applied to both modified `.hpp` files (matches the
repo's `.clang-format`).
## Test Result
- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`.
Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3
encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3
`correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs
`correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining;
other skips unchanged.
Sample run output (T4b, bf16, bwd_data):
```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
[CK][CK Tile] Grouped Conv Backward Weight Streamk instances (#5904)
## Motivation
Add streamk instance to grouped convolution backward weight profiler.
## Technical Details
- New instances for grouped conv backward weight with streamk
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
[CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32) (#6838)
## Motivation
CK currently has no kernel optimized for depthwise convolution
(G=C_in=C_out, C=K=1 per group) and existing generic paths perform
poorly for this workload. This PR adds a dedicated depthwise conv
forward kernel in CK Tile.
## Technical Details
Adds a dedicated depthwise conv2d forward op to CK Tile that performs
direct convolution rather than falling back to the generic GEMM path.
The kernel is templatized by filter size, stride, and data type, and
compiled into ~60 instances covering common configurations (kernel
3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and
RDNA (gfx1100/gfx1200) architectures.
## Test Plan
- [x] Correctness and performance validated on gfx942, gfx950, and
gfx1100, with ckProfiler `grouped_conv_fwd` as baseline.
- [ ] MI300A (gfx942) and gfx1200 validation.
## Submission Checklist
- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1137
---------
Co-authored-by: GenDu <Gen.Du@amd.com>
[CK] add composable kernel support on gfx1250 (#6978)
## Motivation
Add composable kernel support on gfx1250.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>
[CK] Suppress new staging compiler errors (#7384)
## Motivation
This should make new builds with staging compiler pass.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Add conv bwd data tests (#5646)
## Motivation
This PR adds tests for CK Tile's convolution backward data operation to
enable functionality regression tracking and error-detection.
## Technical Details
Currently only NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK(2 dim and 3
dim channel-last) layouts are being tested, since only they are
implemented in CK Tile. Current tests support FP16, BF16 and FP32
datatypes and various different convolutions scenarios. The tested
instances are listed in
`experimental/grouped_convolution_tile_instances` directory.
## Test Result
All implemented tests are working properly and passing.
---------
Co-authored-by: Ville Pietilä <>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
[CK] fix clang lifetime bound error in ck_builder. (#5571)
## Motivation
This resolves the compilation error with latest develop compiler branch.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516)
## Motivation
We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things
- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.
## Technical Details
Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.
---------
Co-authored-by: Ville Pietilä <>
[CK][CK Tile] Move grouped conv cpp instances to build dir (#5609)
## Motivation
Move grouped conv .cpp instances to build dir. Fix generate instances
script.
## Technical Details
Avoid CI problem when instances in experimental directory are not
removed
## Test Plan
test_grouped_convnd_*_tile
## Test Result
Pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470)
## Motivation
Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile
window.
## Technical Details
- Fix dram step for KM/KN layouts in V1 pipeline
- Disable instances which use more threads than warp size in continous
dim (not supported in ck tile yet)
- Use 1x1 specialization for explicit gemm
- Use two stage for vectorsize =1 and sizeof(datatype) ==2
- remove not needed check sinze GetVectorSizeA/B check if vector size is
fixed
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
passed locally
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-966
[CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454)
## Motivation
Flush cache to get more stable results during profiling old ck and ck
tile.
## Technical Details
Flush cache before each kernel call and one more first run.
## Test Plan
test_grouped_conv_bwd_weight_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-966
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
[CK][CK Tile] Grouped Convolution Backward Weight set of fixes (#5387)
## Motivation
Grouped Convolution Backward Weight split k fixes for CK tile kernels
## Technical Details
- get k batch from kargs to get deduced k batch
- multiply zeroing size by data type size
- disable v6 (producing a incorrect results)
## Test Plan
test_grouped_convnd_bwd_weight_tile
## Test Result
Pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Ville Pietilä <>
[CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK Tile profiler (#5237)
## Motivation
PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
the missing bwd weight two-stage kernels to the CK Profiler.
## Technical Details
Extended the CK Tile conv builder factory to build also the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instance to accept the two-stage flag as part of the algorithm
configuration.
## Test Plan
Added units tests for CK Builder that verify the two-stage kernel
construction.
## Test Result
If CI passes, the added unit tests are passing.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Ville Pietilä <>
[CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114)
## Motivation
Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details
- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
pass
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-786
[CK_BUILDER] Add DeviceGroupedConvFwdMultipleABD_Wmma_CShuffle_V3 to CK Builder (#5284)
Add factory, InstanceTraits, and conv traits support for the WMMA V3
forward convolution kernel, enabling the CK Builder to generate and
dispatch this kernel variant used by MIOpen on gfx11/gfx12 GPUs.
## Motivation
As reported in issue #4944, MIOpen includes WMMA V3 forward convolution
kernels, so this PR adds support for those kernels similarly to other
supported kernels.
## Technical Details
This follows the same implementation as the other kernels. I added some
support for reflection, but I left a few todos since we need to
generalize our convolution traits to generalize across WMMA/MFMA and
CK/CKTile.
## Test Plan
Added faster tests to `ninja smoke-builder` that check the
instance-traits logic, and I added longer tests that instantiate
kernels, following the existing pattern in other kernals.
## Test Result
I tested all code with `ninja check-builder` on a gfx1101 build and ran
on gfx1101.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
[CK][CK Tile] Add grouped conv backward weight tile test and fix tr load in BASE_V1 pipeline (#5115)
## Motivation
Test grouped conv backward weight from ck tile and fix incorrect values.
## Technical Details
- Add test for CI
- Add daily tests
- Fix transpose load in BASE_V1 pipeline
## Test Plan
test_grouped_convnd_backward_weight_tile
## Test Result
in progress
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-783
[CK_BUILDER] Clean up ConvDescription output formatting (#5085)
The `ConvDescription::getDetailedDescription()` output had several
issues that made it harder to read and potentially misleading:
1. **Bug fix**: The LDS padding field was incorrectly displaying
`dst_scalar_per_vector_k1` instead of the actual `lds_padding` value
2. **Noise reduction**: Optional parameters that weren't set were
printing unhelpful messages like "Struct does not contain optional
gemm_padding argument" — these add clutter without providing value to
the reader
3. **Formatting inconsistencies**: Trailing spaces after colons (e.g.,
`"Warp Gemm parameters: "`) and a stray trailing `×` in tile dimensions
4. **Missing thread cluster lengths**: The threads per axis are not
shown.
**Changes**:
- **Fixed the LDS padding bug** by using
`traits_.a_tile_transfer.transfer_params.lds_padding` and
`traits_.b_tile_transfer.transfer_params.lds_padding` instead of
duplicating `dst_scalar_per_vector_k1`
- **Simplified optional parameter handling**: Changed from printing
"Struct does not contain..." messages to simply omitting absent optional
values. Also switched from `.value_or()` to direct dereference (`*`)
since we're already inside an `if` check
- **Cleaned up formatting**: Removed trailing spaces after colons and
the extra `×` at the end of tile dimension lists
- **Added missing thread cluster lengths**: Added X×Y×Z" display for
both A and B tile transfer sections.
- **Fixed typo**: "Do Padd Gemm" → "Do Pad Gemm"
- **Fixed typo**: "scr" → "src"
- **Fixed typo**: "tensros" → "tensors"
- `ninja smoke-builder` ✓
- `ninja check-builder` ✓
The test file updates reflect the corrected expected output, which now
shows the actual `lds_padding` values (0 or 1), shows thread cluster
lenths, and omits the verbose "Struct does not contain..." lines.
**Note**: This PR follows PR #5083.
Proof of concept for removing forward declarations (#5135)
## Motivation
Currently, we forward declare CK device operation templates in
CK-Builder's reflection code:
9b168082b7/experimental/builder/include/ck_tile/builder/reflect/instance_traits_device_grouped_conv_bwd_weight_xdl_cshuffle.hpp (L13-L57)
This is mainly required to break a circular dependency in reflection.
The architecture of that is as follows:
MyDeviceOp implements GetInstanceString(). This is typically defined
directly in the class definition (no forward declaration).
GetInstanceString() calls instance_string<MyDeviceOp>()
instance_string<MyDeviceOp>() calls
InstanceTraits<MyDeviceOp>::instance_string()
InstanceTraits has a specialization for MyDeviceOp which implements
instance_string()
So order for GetInstanceString() to work properly, InstanceTraits must
already be defined. And for InstanceTraits to be defined, the device op
needs to be defined. In order to do that, we are currently using
aforementioned forward declaration.
## Technical Details
C++'s lazy template evaluation is used by calling into an as-of-yet
undefined function static member function of
`InstanceTraits<MyDeviceOp>` in `GetInstanceString()`, and then
specializing `InstanceTraits` only _after that_. The caveat here is that
both the device op itself as well as the instance traits specialization
must be in scope, otherwise there would be an undefined function error.
In practise, we can solve that either by placing the instance traits
directly into the file that defines `MyDeviceOp`, or possibly by using a
`.inc` file to keep the concerns separated.
## Test Plan
The results were verified by running the existing regression tests for
CK Builder
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Márton Bidlek <marton.bidlek@streamhpc.com>
[CK_BUILDER] Simplify the TreeFormatter. (#5083)
My original design wrote each line streaming, so developers had to keep
track of the indentation depth and remember when to use `writelast` for
the last element at a depth. This was a source of a lot of cosmetic
output errors, and that is likely to get more complicated as we add
optional branches.
We switch to a tree-building interface with a simple `add` method. The
only cost is that we have to defer string building, which is a good
tradeoff for our use case.
Tested with `ninja smoke-builder`.
[CK_BUILDER] Add BILINEAR and ADD_CLAMP elementwise operation mappings to CK builder (#5026)
## Motivation
The CK kernels that MIOpen consumes use the BILINEAR and ADD_CLAMP
operations. The operation mappings in the CK Builder API need to be in
place to be able to instantiate those kernels using the builder.
## Technical Details
Add the BILINEAR and ADD_CLAMP operation mappings to the builder
## Test Plan
* Added builder tests for new helpers
## Test Result
* New tests pass locally, waiting for test run
## Submission Checklist
- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_TILE] Add CK Tile bwd weight profiler (#4797)
## Motivation
To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.
## Technical Details
- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.
---------
Co-authored-by: Ville Pietilä <>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
[CK_BUILDER] Update developer notes in the CK Builder source directories (#5038)
## Motivation
This PR updates the developer notes for the CK Tile builder. It captures the current state of the implementation in more detail, and frames the description around the need to have true facade.
There is no functional change, only better alignment of developer notes with the current code.
This doc clearly explains the current technical debt: that we have created many facades that expose
the implementation details. There is an expanded section on reflection that explains how unified
reflection will help clarify the unified builder design.
Additional changes are just better accounting for the current state of the code, including previously
undocumented operations. A few typos and cosmetic issues are cleaned up, too.
[CK_BUILDER] ck builder conv transfer fix (#4750)
## Motivation
This PR fixes how CK Builder is validating transfer vector size and adds
proper validation for LDS transfer vector size as well.
## Changes:
* [__source vector dim__] -- Before this PR the data transfer validation
logic didn't allow to set the source vectorized dimension to 1. However
there are CK instances that are doing this when the group merging is
used. This is used only for
`DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle` kernel.
* [__valid vector size__] -- Before this PR the validation logic
concerned only single instruction maximum vector size. However our
buffer loading logic has implemented support for loading more values
through multiple buffer instructions. This again was discovered to be
used in some of the convolution instances. Thus this behavior was
reflected in validation logic.
* [__valid LDS vector size__] -- Before this PR the LDS vector size
validation was done in the same way as VMEM. This PR adds proper LDS
vector size validation based on the available LDS instruction sizes.
## Test Plan
Run CK BUILDER conv fwd factories tests
## Test Result
All CK BUILDER conv fwd factories work (except DL one & ck tile since
they're not yet added now)
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK_Builder] added bwd data kernels to builder factory (#4582)
This PR adds bwd data wmma and xdl kernels to the ck builder, their
instance and conv traits as well as tests for the above.
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
Co-authored-by: John Shumway <jshumway@amd.com>
[CK] Small improvements for grouped conv backward weight (#4872)
## Motivation
Improvements for CK Tile convolution builder run function and atol/rtol
calculations.
## Technical Details
- Add preprocessing function for wrw when k_batch is larger than 1 for
builder run function
- Divide num acums by number of groups to get real number of accums
## Test Plan
CI wrw tests
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-783
[CK][CK TILE] Improve oob check (#4791)
## Motivation
Improve OOB checks. Remove permutes which have been generated by thread
buffer zero clear. at now in assembly there is only condmask instead of
permute + condmask.
Change number of KPack for generated instances
## Technical Details
Remove permute instructions from assembly
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: jakpiase <jakpia21@gmail.com>
[CK_BUILDER] Fix two staging-compiler errors in CK builder code (#4598)
This PR fixes two compiler warnings that report as errors with the
latest compiler:
1. In tensor descriptor, the `operator[]` accessor needs a
`[[clang::lifetimebound]]` attribute.
2. In the unit tests for device buffer, there is a test that explicitly
checks for an error on a pointer that went out of scope, so it needs a
to disable `-Wlifetime-safety-permissive` in the test code.
I ran the CK `smoke-builder` tests with the staging compiler to verify.
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
[CK_BUILDER] Move some smoke tests that require GPU
Since the CK builder is focused on compile time logic, let's keep the
`smoke-builder` target CPU-only so that it can be ran without a CPU.
Alternatively, we could define a `smoke-cpu-builder` or some special
subtarget, but it's probably simpler to just stick to CPU for this. (My
thinking is that in general GPU testing will be heavier than the smoke
tests. Further, the GPU testing code will likely move outside of the
builder once builder code is moved out of experimental.)
This PR clarifies that CPU-only intention for `smoke-builder` and moves
some GPU testing code to `smoke-regression`.
[CK][CK TILE] Add has hot loop check for pipeline v1
## Motivation
Add has hot loop check for pipeline v1 (v1 basic and v1 basic async).
Enable more tests which have been fixed by this change.
## Technical Details
Hot loop has been executed without num loop check.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
Passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-651
AICK-663
[CK][CK Tile] Temporary disable grouped conv fwd tile comp
async instances (#4457)
## Motivation
[CK][CK Tile] Temporary disable grouped conv fwd tile comp async
instances due to the failures
## Technical Details
disable configs to not comple these instances
## Test Plan
test_grouped_convnd_fwd_Tile
## Test Result
pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[CK] CK Tile grouped convolution direct load
## Motivation
CK Tile grouped convolution forward direct load support.
## Technical Details
Basic pipeline for direct load and new instances for forward for v1 and
v4 pipelines.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-130
[CK_TILE] Add support and tests for V6 pipeline in conv fwd
(#4357)
Added support for conv v6 pipeline in ck tile's convolution forward
kernel. CK Tile v6 pipeline is the equivalent to old ck's V5 pipeline
and should be faster than other pipelines for some cases. This PR also
adds tests inside profiler that's currently inside experimental
directory, so now we should be able to detect regressions easier.
Force merging because I verified this fix manually:
git checkout develop
git pull
ninja smoke-builder (failed to build, as expected)
git checkout rvoetter/ckb-fix
ninja smoke-builder (passed!)
* ck-builder: tensor copy function
This function copies one tensor to another, so that the memory
layout can be changed between them.
* ck-builder: fix ck::bhalf literals
These types don't work properly.
* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it
This reduces the amount of duplicated code a bit.
* ck-builder: add flat tensor iterator
This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.
* ck-builder: integrate validation with ck gpu verification
By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.
* ck-builder: add check_by_accumulations
This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.
* ck: fix test_gpu_verification GenerateRandomData for bhalf
is_integer_it<bhalf_t> yields true, but it is not actually
an integer.
* ck: make gpu_verification kernels be proper persistent kernels
Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.
* ck: clean up gpu_verification.hpp using block_reduce
This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.
* ck-builder: doc typos
* ck-builder: update testing readme with validation interface.
* ck-builder: rebase fixes + review comments
* ck-builder: fix device integer generation with float types
Passing bfloat here causes a nans due to type_convert performing
a bitcast.
* ck: another bhalf_t bug
CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.
* added reflection for conv_fwd_multiple_d_wmma_cshuffle.hpp
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle
* added reflection for device_grouped_conv_bwd_weight_xdl_cshuffle v3
* added reflection of max_transpose parameters
* fix printing of std optional parameters
* fix use of undefined ck::index
* added conv traits for device_grouped_conv_bwd_weight_multiple_d_xdl_cshuffle
* added xdl two stage instance to reflection
* added additional variables
* added reflection for grouped_conv_bwd_weight_multiple_d_wmma_cshuffle, _v3, grouped_conv_two_stage_wmma_cshuffle_v3,
* added reflection for device_grouped_conv_bwd_weigh_wmma_cshuffle_v3
* added reflection for bwd_weight_wmma_cshuffle
* added comments back in
* add printed output for optional parameters
* update README
* fix typo
* added num_gemm_k_prefetch_stage and small fixes
* modified test string due to reflection of new parameter
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
* ck-builder: restructure testing conv
In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.
* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp
This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.
* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3
Turns out that the supplied layout isn't actually supported...
* ck-builder: ck and reference conv integration for bwd weight
* ck-builder: ck bwd weight execution test
* ck-builder: ckt::run support for ck-tile bwd weight
* ck-builder: ck tile bwd weight execution test
* ck-builder: extra debug printing in MatchesReference
* ck-builder: make ckt::run return RunResult
This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.
* ck-builder: RunResult matcher
Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.
* ck-builder: doc fixes
* ck-builder: add missing headers
* ck-builder: remove SPATIAL_DIM parameter from ConvTensorLayouts
This information is already in the SIGNATURE, so its pointless to pass it
separately. This streamlines the interface of those functions a bit. Also
touches up the style of those files in general.
* ck-builder: implement reference conv using old ck
The old ck implementation is more featureful and better tested.
* ck-builder: replace test_reference_execution reference with old ck
This strips out the ck-tile gpu reference implementation completely.
* ck-builder: clean up test_reference_execution
- Remove unneccesary messages
- Replace EXPECT_TRUE(true) with EXPECT_NO_THROW()