Commit Graph

29 Commits

Author SHA1 Message Date
Bartłomiej Kocot
7c2b979de2 [rocm-libraries] ROCm/rocm-libraries#8573 (commit 04c9f1d)
[CK][CK Tile] Drop profiler for experimental builder codegen
 (#8573)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Switch to dispatcher profiler for ck tile conv.

## Technical Details

- Switch to dispatcher profiler for ck tile conv.
- Drop profiler for experimental codegen
- Minor fixes for bwd data printing
- Minor fixes for 3d conv in dispatcher codegen

## Test Plan

test_grouped_conv*tile

## Test Result

Passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-19 09:38:44 +00:00
Ville Pietilä
60b276647b [rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39)
[CK Tile] Rule-based configuration generation in CK
 Dispatcher codegen (#8157)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK Tile Dispatcher code generation for CK Tile Profiler relies on
flat JSON files to list the generated configurations. This approach has
the following problems

- The JSON files are verbose
- The JSON files get easily out of sync with the CK Builder .config
files from which they were generated from.
- The JSON file based configuration make it hard to list explicitly the
rules that govern the instance generation.

## Technical Details

Replaced the JSON files with a rule based configuration. To preserve the
existing functionality, the `profiler` and the `tests` instance sets are
generated directly from the CK Builder config files. The JSON config
files are removed from source control, and the "on-the-fly" generation
guarantees that the Dispatcher codegen uses up to date configurations.

This is PR introduces six different rule sets for the CK Tile Dispatcher
code generation

1. `profiler`: matches with the old JSON set of profiler configurations.
2. `tests`: matches with the old JSON set of tests configurations.
3. `full`: full configuration set created from a rule-based config
selection
4. `full-tests`: a subset of `full` for generating configurations for
convolution integration tests.
5. `tiny`: a subset of `full-tests` to produce the minimal set of
configurations to test the Dispatcher codegen.
6. `default`: the default rules, which corresponds to the existing
heuristic rules for configuration selection. This ensures that ML based
kernel selection doesn't get broken.

The main use of the `full` rule set is to define a reasonable solution
space for the possible implicit GEMM configurations. We start from the
configurations that allowed by the device architecture. The `full` rule
set defines the relevant tile sizes for each convolution direction. From
the tile size we have a curated mapping to the number of waves over the
different GEMM axes, i.e., we describe how many waves each GEMM
dimensions corresponds to. The GEMM-K wave tile dimension can be
computed from the other parameters and does not need to be listed
explicitly.

An orthogonal axis to the tiling strategy is the vectorization strategy.
This mainly defined by the data type and hardware as in general, we want
to use the maximum possible load widths. The maximum sizes for each
convolution direction variant are defined by the implicit GEMM matrix
dimensions. For cases where have a low number of channels per
convolution group, we need smaller vector load sizes. These are captured
by the `VecStrategy` enumeration in the codegen rules.

The problem with the rule based configuration selection is that we "over
generate" configurations. The old JSON configurations compose
approximately 25% of all configuration that the `full` rule set creates.
The additional configurations are valid, but they many not provide any
performance benefits. Hence, we keep the `profiler` and `tests` rule set
for now to avoid building an excessive amount configurations by default.
The `full` rule set can be taken into use by specifying CMake
configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the
`tests` rule set is used, i.e., we don't change the existing bahaviour.

## Test Plan

Added a new stage in the CI/CD pipeline that ensures the Dispatcher
codegen rules are up to date. Otherwise the functionality is covered by
the existing CI/CD tests. There are no functional changes to the
convolution kernels. Only how the different instances are generated.

## Test Result

If the CK Tile conv instances build without errors, the Dispatcher
codegen is generating valid code. If all tests in CI/CD pipeline are
passing, the Dispatcher codegen generates valid instances.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-18 01:22:50 +00:00
Johannes Graner
01cca38c8e [rocm-libraries] ROCm/rocm-libraries#8220 (commit 4c04a3a)
[CK Tile] WAVELET pipeline for backward-data grouped
 convolution (#8220)

## Motivation

On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was
~18% behind classic
CK, with the gap concentrated in the K=2376 3x3 detection-head family
where bwd_data spends
most of its time. The WAVELET GEMM pipeline already gives uplift for
forward and
backward-weight conv; this ports it to backward-data and consolidates
the now-shared
machinery across all three directions.

## Technical Details

- Backward-data wavelet support in the tile kernel: launch extra load
waves when the
pipeline exposes `LaunchBlockSize`, and split the epilogue into math
waves (run the
  CShuffle epilogue) and load waves (`RunBarrierStub`).
- Register 7 WAVELET instances (fp16 and bf16), tuned for
backward-data's tall-skinny GEMM
rather than the forward tile shapes: a big-M `256/128/64` workhorse, a
`VecA=4` variant for
the `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped
(depthwise-style)
  shapes.
- Implement the native backward-data instance parser in
`generate_instances.py`.
- Deduplicate the wavelet machinery shared by forward, backward-data,
and backward-weight:
`GroupedConvLaunchBlockSize`, `is_wavelet_pipeline`, and
`RunWaveletAwareEpilogue` in
`grouped_convolution_utils.hpp`; the three native instance parsers
collapse to one
  parameterized parser. The three kernels now call the shared helpers.

## Test Plan

- Rebuild the full profiler instance pools for all three directions
(fp16/bf16/fp32,
nhwgc/ndhwgc) to exercise the shared helpers across every instantiation.
- Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`,
`test_grouped_convnd_bwd_data_tile`,
  `test_grouped_convnd_bwd_weight_tile`.
- Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK
and the
non-wavelet tile pool (`profile_wavelet_bwd_data.py`); correctness
spot-checked with
GPU-reference verification on the new big-M and NumGroupsToMerge
instances.

## Test Result

- GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6.
- Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 ->
1.11, i.e. the tile
path goes from ~12% slower than classic to ~8% faster. The largest
single backward-data
shape (256x100x100->2376) moves from 11% slower than classic to 12.5%
faster.
- The dedup refactor preserves behavior (net -174 lines across the
kernels/generator),
  confirmed by the full rebuild and the GTests above.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-13 00:10:50 +00:00
Johannes Graner
0b3c297ee2 [rocm-libraries] ROCm/rocm-libraries#8009 (commit 26ab70d)
[CK Tile] Add WAVELET pipeline for forward grouped
 convolution (#8009)

## Motivation

CK Tile forward grouped convolution trails classic CK on 3x3
convolutions whose
output-channel count is not divisible by 8, where the narrow output
store limits
the compute CShuffle epilogue. This ports the WAVELET pipeline (added
for
backward-weight in #7937) to the forward kernel to close that gap.

## Technical Details

- Kernel (`grouped_convolution_forward_kernel.hpp`): WAVELET
load/math-wave wiring,
mirroring the backward-weight implementation; the non-WAVELET path is
unchanged.
- Generator: implement `parse_native_fwd_instance`, the forward
native-instance parser.
- Registered WAVELET instances: profiler bf16 3 / fp16 5, tests 1 each.

WAVELET requires input channels divisible by 8 (it does not apply to
depthwise).
The bf16/fp16 instance asymmetry is intentional and measured: the VecC=8
tiles
never beat the compute pool in bf16 but win about 20% of divisible-by-8
3x3 shapes
in fp16, so VecC=8 is registered for fp16 only.

## Test Plan

- Correctness (CPU reference) for every registered profiler instance,
across VecC variants.
- Per-shape best-instance performance sweep over the 34 RetinaNet shapes
(bf16) and
a 200-shape cross-model sweep (bf16 and fp16), compared against classic
CK.

## Test Result

- Correctness: PASS for all instances.
- RetinaNet (bf16, vs classic CK): faster on 28 of 34 shapes, geomean
+19.5%; the
not-divisible-by-8 shapes up to 3.7x. One 1x1 stride-2 shape stays ~20%
behind
  classic CK, unrelated to WAVELET.
- Cross-model (200 shapes): WAVELET wins 3x3 not-divisible-by-8 in both
dtypes
(up to 61% over the next-best compute instance); for divisible-by-8 3x3
it wins
  about 20% of shapes in fp16 (3-11%) and none in bf16.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 08:57:39 +00:00
Johannes Graner
b7c8fb164f [rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d)
[CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight
 instances (#7937)

## Motivation

CK Tile had no pipeline competitive with old CK's wavelet on the
RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a
wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive
kernel for spatial small-K shapes.

## Technical Details

- New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`):
workgroup split into math waves (LDS read + MFMA) and load waves (DRAM
read + LDS write).
- VGPR role-split: `operator()` has two top-level mutually-exclusive
`is_math` branches so the allocator overlays both roles onto the same
physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy.
Correctness depends on identical `block_sync_lds` counts on both arms
plus a matching load-wave barrier stub in the epilogue
(`cshuffle_epilogue.hpp`).
- Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`):
`kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub.

Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36
C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27%
(googlenet K=320 by 16-23%); the role-split roughly halves the parity
gap vs old CK on the 13x13 fp16 shape.

## Test Plan

- `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16,
`split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a
broad 2D sweep.
- Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64}
(barrier-parity / deadlock check).
- `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet
instances.

## Test Result

- All wavelet instances CPU-verify correct across the split-K sweep; no
hangs (dual-arm barrier sequence matches).
- Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best
non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better
vs old CK on the majority of spatial shapes.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-02 08:51:17 +00:00
Ville Pietilä
78d657c4f7 [rocm-libraries] ROCm/rocm-libraries#7284 (commit e7d25b2)
[CK_TILE] Integrate CK Tile Dispatcher code generation into
 CK Tile Profiler (#7284)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

CK Tile is going to be delivered to hipDNN via CK Dispatcher. Currently
the CK Tile Profiler using CK Builder for generating the profiled
instances from the configuration files that identify the instances that
old CK exposes. We need to replace this instance generation with the CK
Tile Dispatcher codegen.

## Technical Details
The old CK Profiler config files are converted to JSON files that the CK
Tile Dispatcher can digest. The conversion script for configurations is
stored to source control in case we need to update the JSON
configurations later. The dispatcher generates instance libraries per
conv direction (fwd, bwd data, and bwd weight) that are linked to the CK
Profiler executable. I also implemented codegne for the stream-K and
depthwise conv instances. The proposed solution replaces the CK Builder
codegen with the CK Tile Dispatcher codegen.

There are two new methods that are exposed via the dispatcher backend

- `is_supported` - required to enabled the profiler workflow where we
check the applicability of the kernel instance before running it.
- `get_instance_string` - this mainly for verification. This provide the
CK Builder instance string for verifying that the old CK Builder based
profiler and the new CK Tile Dispatcher based profiler have the same
instances.

The rules that limit the generated instances are now collected to a
single location under the dispacther. The CK Builder codegen uses these,
which ensures that the two codegen pipelines are in sync. The next step
(different PR) is to remove the CK Builder codegen pipeline altogether.

## Test Plan

Verified that the old CK Builder based profiler and the new CK Tile
Dispatcher based profiler have the same instances, that is, the
Dispatcher based codgen can generate the same instances as the old CK
Builder.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-28 21:03:37 +00:00
jakpiase
309d823056 [rocm-libraries] ROCm/rocm-libraries#7466 (commit cc2861f)
[CK Tile] Enable hardware OOB buffer load offset trick by default (#7466)

## Summary

Enables `CK_TILE_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK`
inside `config.hpp`.

  ### Background

When loading from global memory with out-of-bound (OOB) check, CK Tile
must suppress invalid lanes. The previous default used a software
branch:

  ```cpp
  // Old path (oob_conditional_check, no trick)
  if(!src_thread_element_valid) { return zeros; }
      return amd_buffer_load_impl(...);
```

  This generates divergent control flow, the compiler emits exec-mask save/restore and per-lane comparison SALU instructions one set per buffer load that touches a padded dimension.

  ### Change

  With the trick enabled, invalid lanes are suppressed entirely in hardware:
```cpp
  // New path (trick enabled)
uint32_t shift = src_thread_element_valid ? 0 : 0x80000000;
return amd_buffer_load_impl(resource, shift + offset, 0);
```
  The `0x80000000` offset overflows the buffer descriptor's declared size, causing the hardware to silently return zero for that lane - no branch, no exec mask manipulation. This matches the behavior of old CK XDL kernels, which use an unconditional load followed by a `v_cndmask` select.

  ### Expected impact

  Eliminates ALU overhead from OOB validity branches which reduces the kernel execution time, especially for memory-bound cases.

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2026-05-21 12:06:40 +02:00
jakpiase
9cf49cd322 [rocm-libraries] ROCm/rocm-libraries#7465 (commit 81f1cf0)
[CK TILE] Increase default kPerXdl for grouped convolution instances  (#7465)

## Summary
Increases the default `kPerXdl` used in CK Tile grouped convolution
instance generation for forward, backward-data, and backward-weight
operations.
### Changes in `generate_instances.py`
- **Larger default `kPerXdl` for all fp16/bf16 tile sizes**:
`get_k_mfma()` now
returns `32` for `m/nPerXdl = 16` and `16` for `m/nPerXdl = 32`.
- **Cap `kPerXdl` to `kPerBlock`**: All three parsers
(`parse_fwd_instances`, `parse_bwd_weight_instances`,
`parse_bwd_data_instances`) now clamp the computed value with `min(...,
k_per_block)` to prevent generating invalid instances where `kPerXdl >
kPerBlock`.
### Expected impact
Higher `kPerXdl` increases the number of MFMA instructions issued per
warp per inner-loop iteration, improving arithmetic intensity and
reducing pipeline stall overhead for memory-bound shapes.
2026-05-21 12:05:09 +02:00
JH-Leon-KIM-AMD
720ceb6500 [rocm-libraries] ROCm/rocm-libraries#7528 (commit b4cae6f)
[CK Tile] Support multi-vector reads in static encoding patterns  (#7528)

## Motivation

The thread-raked / warp-raked / block-raked static tile distribution
patterns in `ck_tile` silently produce wrong results when the contiguous
tile dimension is larger than `warp_size * vector_size`, because the
encoding has no per-thread iteration dimension along X.

Concretely, with `M_Tile=N_Tile=128`, `VectorSize{A,B,C}=1` in
`ConvConfigComputeV3`, the grouped convolution backward-weight example
reports about 50 percent wrong values, with errors starting exactly at
the `X0*X1 = 64` boundary. The second pass over the contiguous dim is
never performed.

This PR extends the encoding so multi-vector reads in the contiguous
tile dimension are supported, while keeping every existing call site
bit-for-bit identical.

## Technical Details

Three files changed.

### 1. `include/ck_tile/core/algorithm/static_encoding_pattern.hpp`

Add a per-thread X iteration dimension in all three raked
specializations:

- `X0 = min(warp_size, XPerTile / X1)` — threads in X dim
- `X1 = min(LargestVec, VecSize)` — vector size per access
- `X2 = XPerTile / (X0 * X1)` — number of X-iters per thread (new)

`X2` is gated with `if constexpr (X2 == 1) { old } else { new }` in both
`make_2d_static_tile_distribution()` and
`make_shuffled_2d_static_tile_distribution()`.

The new encoding places `X2` in the middle of the Ys iteration list,
which preserves reverse symmetry between the regular `<..., X2, X1>` and
shuffled `<X1, X2, ...>` encodings.

Patterns updated: `thread_raked`, `warp_raked`, `block_raked`.

### 2. `include/ck_tile/core/tensor/transpose_tile.hpp`

Added a parallel `else if constexpr (... && NDimY == 3 && ...)` branch
alongside the existing `NDimY == 2` branch. The original branch is
byte-for-byte unchanged.

Both branches dispatch to the same `transpose_tile2d_impl_in_thread`,
whose body has always been NDimY-generic (iterates with `static_for<0,
NDimY, 1>` and `number<NDimY>{}`).

### 3.
`experimental/grouped_convolution_tile_instances/generate_instances.py`

Removed the two now-obsolete skip guards in `parse_bwd_weight_instances`
and `parse_bwd_data_instances`:

```python
if m_per_block > (warp_size * a_scalar_per_vector) or n_per_block > (warp_size * b_scalar_per_vector):
    print(f"Skipping instance {instance_id} with multiple warps per continous tile dim since it's not supported yet.")
    continue
```

Other unrelated skips (V5 / V6 / ASYNC_V4 pipeline gating,
irregular-load shapes, scalar-per-vector > tile size) are kept
untouched.

### Compatibility

Strict. Every existing caller has `X2 == 1` and therefore hits the
original encoding path verbatim. No upstream config or pipeline behavior
changes.

## Test Plan

The grouped convolution example is the natural exerciser since
`GroupedConvUniversalPipelineAgBgCrPolicy` selects `thread_raked` for
both A and B tiles, and all three conv directions share the same
`ConvConfigComputeV3`.

For each test below we ran:

```
./build/bin/tile_example_grouped_conv_bwd_weight [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_fwd        [-prec={fp16,bf16}]
./build/bin/tile_example_grouped_conv_bwd_data   [-prec={fp16,bf16}]
```

with `ConvConfigComputeV3` tile/vector parameters tweaked to cover both
code paths:

| Test | M / N / K | VecA/B/C | A path | B path | dtype |

|------|-------------|----------|------------|----------------|-------------|
| T1 | 16/64/32 | 4/8/4 | old (X2=1) | old (X2=1) | fp16 |
| T2 | 128/128/64 | 2/2/2 | old (X2=1) | old (X2=1) | fp16 |
| T3 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 |
| T5 | 256/256/64 | 1/1/1 | old (X2=1) | new (X2=4) | fp16 (3 dir)|
| T4b | 128/128/128 | 1/1/1 | new (X2=2) | new (X2=2) | fp16 + bf16 (3
dir) |

A larger T4a (256/256/128) was attempted to stress both A and B with
X2>1 on bigger tiles but was blocked by the gfx942 hardware LDS cap (128
KB > 64 KB limit), independent of this PR.

For the generator change we ran:

```
python3 generate_instances.py --mode profiler --direction all
```

and verified `Skipping instance ... with multiple warps per continous
tile dim` no longer appears (count went from non-zero to 0); other skip
categories are unchanged.

`clang-format-18` was applied to both modified `.hpp` files (matches the
repo's `.clang-format`).

## Test Result

- T1 and T2 (compat-strict, every X2 is 1, old code path): `correct`.
Confirms existing callers are unaffected.
- T3 (X2=4 on B only): `correct`. First true exercise of the new NDimY=3
encoding + transpose branch.
- T5 (T3 across `fwd` + `bwd_data` + `bwd_weight`, fp16): all 3
`correct`.
- T4b (X2>1 on both A and B, fp16 + bf16, all 3 directions): all 6 runs
`correct`.
- Generator: 0 `multiple warps per continous tile dim` skips remaining;
other skips unchanged.

Sample run output (T4b, bf16, bwd_data):

```
shape: tile_gemm_shape_128x128x128x4_1x4x1_16x16x32
pipeline: pipeline_AgBgCrCompV3_128x128x128_256_1x1x1_1x4_1x1x1_..._DoubleSmemBuffer_0
Vector size A: 1, Vector size B: 1, Vector size C: 1
0.934907 ms, 8.34683 TFlops, 34.3178 GB/s
Relative error threshold: 0.00390625 Absolute error threshold: 0.25
The CPU verification result is: correct
```

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-20 17:25:22 +03:00
Bartłomiej Kocot
cc5c79a1e7 [rocm-libraries] ROCm/rocm-libraries#5904 (commit f4e261a)
[CK][CK Tile]  Grouped Conv Backward Weight Streamk instances (#5904)

## Motivation

Add streamk instance to grouped convolution backward weight profiler.

## Technical Details

- New instances for grouped conv backward weight with streamk

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Graner, Johannes <johannes.graner@amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-05-16 10:49:18 +02:00
Bartłomiej Kocot
067e5e0ca4 [rocm-libraries] ROCm/rocm-libraries#6838 (commit ff7a665)
[CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32) (#6838)

## Motivation

CK currently has no kernel optimized for depthwise convolution
(G=C_in=C_out, C=K=1 per group) and existing generic paths perform
poorly for this workload. This PR adds a dedicated depthwise conv
forward kernel in CK Tile.

## Technical Details

Adds a dedicated depthwise conv2d forward op to CK Tile that performs
direct convolution rather than falling back to the generic GEMM path.
The kernel is templatized by filter size, stride, and data type, and
compiled into ~60 instances covering common configurations (kernel
3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and
RDNA (gfx1100/gfx1200) architectures.

## Test Plan

- [x] Correctness and performance validated on gfx942, gfx950, and
gfx1100, with ckProfiler `grouped_conv_fwd` as baseline.
- [ ] MI300A (gfx942) and gfx1200 validation.

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1137

---------

Co-authored-by: GenDu <Gen.Du@amd.com>
2026-05-15 15:47:55 +02:00
Bartłomiej Kocot
9348b8eb82 [rocm-libraries] ROCm/rocm-libraries#5646 (commit 05680a4)
[CK_TILE] Add conv bwd data tests (#5646)

## Motivation
This PR adds tests for CK Tile's convolution backward data operation to
enable functionality regression tracking and error-detection.

## Technical Details
Currently only NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK(2 dim and 3
dim channel-last) layouts are being tested, since only they are
implemented in CK Tile. Current tests support FP16, BF16 and FP32
datatypes and various different convolutions scenarios. The tested
instances are listed in
`experimental/grouped_convolution_tile_instances` directory.

## Test Result
All implemented tests are working properly and passing.

---------

Co-authored-by: Ville Pietilä <>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
2026-04-21 21:49:19 +00:00
Ville Pietilä
9e28c5ffea [rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda)
[CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516)

## Motivation

We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things

- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.

## Technical Details

Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.

---------

Co-authored-by: Ville Pietilä <>
2026-03-25 14:34:13 +00:00
Bartłomiej Kocot
8acb392dcd [rocm-libraries] ROCm/rocm-libraries#5609 (commit 95afb2c)
[CK][CK Tile] Move grouped conv cpp instances to build dir (#5609)

## Motivation

Move grouped conv .cpp instances to build dir. Fix generate instances
script.

## Technical Details

Avoid CI problem when instances in experimental directory are not
removed

## Test Plan

test_grouped_convnd_*_tile

## Test Result

Pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-20 13:19:31 +00:00
Bartłomiej Kocot
e92c207845 [rocm-libraries] ROCm/rocm-libraries#5470 (commit fe3405d)
[CK][CK Tile] Fix dram step for KM/KN layouts in V1 pipeline (#5470)

## Motivation

Fix v1 pipeline for KM/KN layouts by passing correct step for dram tile
window.

## Technical Details

- Fix dram step for KM/KN layouts in V1 pipeline
- Disable instances which use more threads than warp size in continous
dim (not supported in ck tile yet)
- Use 1x1 specialization for explicit gemm
- Use two stage for vectorsize =1 and sizeof(datatype) ==2
- remove not needed check sinze GetVectorSizeA/B check if vector size is
fixed

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-966
2026-03-19 11:59:44 +00:00
Bartłomiej Kocot
8ebabd19d2 [rocm-libraries] ROCm/rocm-libraries#5387 (commit 0c259bd)
[CK][CK Tile] Grouped Convolution Backward Weight set of fixes (#5387)

## Motivation

Grouped Convolution Backward Weight split k fixes for CK tile kernels

## Technical Details

- get k batch from kargs to get deduced k batch
- multiply zeroing size by data type size
- disable v6 (producing a incorrect results)

## Test Plan

test_grouped_convnd_bwd_weight_tile

## Test Result

Pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Ville Pietilä <>
2026-03-13 10:18:19 -06:00
Ville Pietilä
3d298d13e4 [rocm-libraries] ROCm/rocm-libraries#5237 (commit ef10dc6)
[CK_TILE, CK_BUILDER] Add two-stage bwd weight kernels to CK Tile profiler (#5237)

## Motivation

PR #4797 added CK Tile bwd weight kernels to the CK Profiler. The
two-stage kernels were not supported in the initial PR. This PR adds the
the missing bwd weight two-stage kernels to the CK Profiler.

## Technical Details

Extended the CK Tile conv builder factory to build also the elementwise
ops required for the two-stage kernels. Extended the CK Builder for CK
Tile instance to accept the two-stage flag as part of the algorithm
configuration.

## Test Plan

Added units tests for CK Builder that verify the two-stage kernel
construction.

## Test Result

If CI passes, the added unit tests are passing.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Ville Pietilä <>
2026-03-12 19:20:15 -06:00
Bartłomiej Kocot
3741885b52 [rocm-libraries] ROCm/rocm-libraries#5114 (commit 59b8cb5)
[CK][CK Tile] Improvements for grouped conv fwd tile profiling (#5114)

## Motivation

Improve profiling for grouped convolution forward for better comparison
between CK and CK Tile
## Technical Details

- Include preprocessing time for ck tile
- Add flush cache for conv fwd profiler
- Switch configs to builder reflect
- Add KPerXdl deduce
- Add non-grouped ported instances

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

pass

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-786
2026-03-11 23:38:15 +01:00
Bartłomiej Kocot
e262252c4c [rocm-libraries] ROCm/rocm-libraries#5115 (commit a21861e)
[CK][CK Tile] Add grouped conv backward weight tile test and fix tr load in BASE_V1 pipeline (#5115)

## Motivation

Test grouped conv backward weight from ck tile and fix incorrect values.

## Technical Details

- Add test for CI
- Add daily tests
- Fix transpose load in BASE_V1 pipeline

## Test Plan

test_grouped_convnd_backward_weight_tile

## Test Result

in progress

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-783
2026-03-10 03:03:04 +00:00
Ville Pietilä
68b0f420ba [rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400)
[CK_TILE] Add CK Tile bwd weight profiler (#4797)

## Motivation

To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.

## Technical Details

- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.

---------

Co-authored-by: Ville Pietilä <>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2026-03-04 21:49:42 +00:00
Bartłomiej Kocot
d244aaa1c0 [rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6)
[CK][CK TILE] Improve oob check (#4791)

## Motivation

Improve OOB checks. Remove permutes which have been generated by thread
buffer zero clear. at now in assembly there is only condmask instead of
permute + condmask.

Change number of KPack for generated instances

## Technical Details

Remove permute instructions from assembly

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: jakpiase <jakpia21@gmail.com>
2026-02-24 22:40:48 +01:00
Bartłomiej Kocot
2dd2f114b3 [rocm-libraries] ROCm/rocm-libraries#4407 (commit adde219)
[CK][CK TILE] Add has hot loop check for pipeline v1

## Motivation

Add has hot loop check for pipeline v1 (v1 basic and v1 basic async).
Enable more tests which have been fixed by this change.

## Technical Details

Hot loop has been executed without num loop check.

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

Passed

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-651
AICK-663
2026-02-11 13:43:01 +00:00
Bartłomiej Kocot
6d6ee8f023 [rocm-libraries] ROCm/rocm-libraries#4457 (commit 258a459)
[CK][CK Tile] Temporary disable grouped conv fwd tile comp
 async instances  (#4457)

## Motivation

[CK][CK Tile] Temporary disable grouped conv fwd tile comp async
instances due to the failures

## Technical Details

disable configs to not comple these instances

## Test Plan

test_grouped_convnd_fwd_Tile

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-02-11 01:52:59 +00:00
Bartłomiej Kocot
27e0a34e0f [rocm-libraries] ROCm/rocm-libraries#4406 (commit 61f9f90)
[CK] CK Tile grouped convolution direct load

## Motivation

CK Tile grouped convolution forward direct load support.

## Technical Details

Basic pipeline for direct load and new instances for forward for v1 and
v4 pipelines.

## Test Plan

test_grouped_convnd_fwd_tile

## Test Result

CI pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-130
2026-02-09 21:09:42 +00:00
jakpiase
731afe535a [rocm-libraries] ROCm/rocm-libraries#4357 (commit ff3e982)
[CK_TILE] Add support and tests for V6 pipeline in conv fwd
 (#4357)

Added support for conv v6 pipeline in ck tile's convolution forward
kernel. CK Tile v6 pipeline is the equivalent to old ck's V5 pipeline
and should be faster than other pipelines for some cases. This PR also
adds tests inside profiler that's currently inside experimental
directory, so now we should be able to detect regressions easier.
2026-02-08 19:57:53 +00:00
Bartłomiej Kocot
1ae83137eb Enable Grouped Conv Tile Fwd Tests daily (#3680) 2026-01-31 15:55:25 -07:00
Bartłomiej Kocot
3d67e6c492 [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624)
* [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err

* Update test_grouped_convnd_fwd_tile.cpp

* Update test_grouped_convnd_fwd_tile.cpp

* Update conv_tuning_params.hpp

* clang format fix

* Update CMakeLists.txt
2026-01-27 11:04:11 +02:00
Robin Voetter
cc75948d1c [CK_BUILDER] conv bwd weight testing (#3618)
* ck-builder: restructure testing conv

In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.

* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp

This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.

* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3

Turns out that the supplied layout isn't actually supported...

* ck-builder: ck and reference conv integration for bwd weight

* ck-builder: ck bwd weight execution test

* ck-builder: ckt::run support for ck-tile bwd weight

* ck-builder: ck tile bwd weight execution test

* ck-builder: extra debug printing in MatchesReference

* ck-builder: make ckt::run return RunResult

This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.

* ck-builder: RunResult matcher

Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.

* ck-builder: doc fixes

* ck-builder: add missing headers
2026-01-26 23:50:15 +01:00
Bartłomiej Kocot
0727e85e52 [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BULDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes
2026-01-19 22:29:01 -07:00