composable_kernel/experimental/builder/include/ck_tile/builder/factory
Johannes Graner b7c8fb164f [rocm-libraries] ROCm/rocm-libraries#7937 (commit abe276d)
[CK Tile] Add conv Wavelet GEMM pipeline and bwd_weight instances (#7937)

## Motivation

CK Tile had no pipeline competitive with old CK's wavelet on the
RetinaNet K=36 C=256 3x3 conv bwd_weight class. This adds a
wave-specialized "wavelet" GEMM pipeline so CK Tile has a competitive
kernel for spatial small-K shapes.

## Technical Details

- New wavelet GEMM pipeline (`gemm_pipeline_ag_bg_cr_wavelet.hpp`):
workgroup split into math waves (LDS read + MFMA) and load waves (DRAM
read + LDS write).
- VGPR role-split: `operator()` has two top-level mutually-exclusive
`is_math` branches so the allocator overlays both roles onto the same
physical VGPRs, cutting arch VGPR ~33-40% and raising occupancy.
Correctness depends on identical `block_sync_lds` counts on both arms
plus a matching load-wave barrier stub in the epilogue
(`cshuffle_epilogue.hpp`).
- Kernel dispatch (`grouped_convolution_backward_weight_kernel.hpp`):
`kIsWavelet` path, `LaunchBlockSize`, load-wave barrier stub.

Uplift: wavelet is the fastest CK Tile pipeline on the RetinaNet K=36
C=256 3x3 family, beating the best non-wavelet CK Tile kernel by 10-27%
(googlenet K=320 by 16-23%); the role-split roughly halves the parity
gap vs old CK on the 13x13 fp16 shape.

## Test Plan

- `ckProfiler grouped_conv_bwd_weight`, NHWGC layout, fp16/bf16,
`split_k=all`, CPU verify on RetinaNet K=36 shapes (7x7, 13x13) and a
broad 2D sweep.
- Correctness: `-v=1` across `split_k` in {-1,1,2,4,8,16,32,64}
(barrier-parity / deadlock check).
- `test_grouped_convnd_bwd_weight` over the tests `.conf` wavelet
instances.

## Test Result

- All wavelet instances CPU-verify correct across the split-K sweep; no
hangs (dual-arm barrier sequence matches).
- Wavelet wins the RetinaNet K=36 C=256 3x3 family (10-27% over best
non-wavelet CK Tile) and googlenet K=320 (16-23%); at parity-or-better
vs old CK on the majority of spatial shapes.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-06-02 08:51:17 +00:00

Convolution Builder Factory Directory

This directory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations.

See the main builder documentation for an overview.

Design Overview

The factory system operates in two phases:

  1. Algorithm Classification: Predicate concepts in conv_dispatcher.hpp inspect the algorithm descriptor to determine which kernel variant it satisfies. The predicates are evaluated in a specific order using if constexpr:

    • Cross-direction (checked first, supports all convolution directions):

      • ReferenceAlgorithm — simple reference implementation for validation
      • TileAlgorithm — CK Tile backend, dispatches via ConvTileFactory
    • Forward direction (old CK):

      • FwdXdlV3Algorithm — newer XDL pipeline using block GEMM structure
      • FwdXdlAlgorithm — standard XDL using AMD XDLops instructions
      • FwdWmmaAlgorithm — WMMA variant for gfx11/gfx12 hardware
      • FwdDlAlgorithm — vectorized dot-product kernel (non-XDLops)
      • LargeTensorAlgorithm — XDL with extended tensor support
    • Backward weight direction (old CK):

      • BwdXdlAlgorithm, BwdXdlV3Algorithm, BwdTwoStageXdlAlgorithm, BwdDlAlgorithm, BwdMultiDXdlAlgorithm, BwdWmmaV3Algorithm, BwdTwoStageWmmaV3Algorithm, BwdWmmaAlgorithm, BwdMultiDWmmaV3Algorithm
    • Backward data direction: Currently supports only Reference and Tile algorithms. Optimized old CK kernels are not yet implemented.

  2. Factory Instantiation: Each factory transforms builder descriptors into backend-specific template parameters and instantiates the corresponding kernel.

Key Files

  • conv_dispatcher.hpp: Entry point with make_conv_instance() function. Contains dispatch logic and algorithm classification predicates. Start here to understand the overall flow.

  • Forward factories (old CK): conv_fwd_v3_factory.hpp, conv_fwd_xdl_factory.hpp, conv_fwd_wmma_factory.hpp, conv_fwd_dl_factory.hpp, conv_fwd_large_tensor_factory.hpp

  • Backward weight factories (old CK): conv_bwd_weight_xdl_factory.hpp, conv_bwd_weight_xdl_v3_factory.hpp, conv_bwd_weight_two_stage_xdl_factory.hpp, conv_bwd_weight_dl_factory.hpp, conv_bwd_weight_multi_d_xdl_factory.hpp, conv_bwd_weight_wmma_v3_factory.hpp, conv_bwd_weight_two_stage_wmma_v3_factory.hpp, conv_bwd_weight_wmma_factory.hpp, conv_bwd_weight_multi_d_wmma_v3_factory.hpp

  • Cross-direction factories: reference_factory.hpp (reference implementation), conv_tile_factory.hpp (CK Tile backend)

  • helpers/: Transformation utilities that map builder types to backend-specific parameters. Organized into helpers/ck/ (old CK mappings) and helpers/ck_tile/ (CK Tile mappings).

Usage

#include "ck_tile/builder/factory/conv_dispatcher.hpp"

// Uses latest version by default (currently "0.1.0")
auto kernel = make_conv_instance<SIGNATURE, ALGORITHM>();

// Or pin to a specific version
auto kernel_v0 = make_conv_instance<SIGNATURE, ALGORITHM, "0.0.0">();

The dispatcher automatically selects the appropriate factory at compile time.

Factory Architecture and the Unification Gap

Each factory is a self-contained facade: it accepts builder descriptors and produces a kernel instance, but it does so with its own algorithm descriptor shape and its own parameter mapping logic. The 16+ factories share no common infrastructure for parameter transformation.

Old CK factories (e.g., ConvFwdXdlV3Factory) flatten all algorithm parameters into a single device operation template instantiation with approximately 49 template arguments. The factory's primary job is mapping builder enum values (layouts, data types, elementwise ops) to CK's internal types. Within old CK, the XDL and WMMA factories duplicate much of this mapping logic despite sharing the same underlying parameter concepts.

The CK Tile factory (ConvTileFactory) composes modern objects — a traits type, a tile partitioner, a GEMM pipeline, and an epilogue pipeline — each with its own configuration. This results in approximately 31 parameters distributed across four composed types rather than one flat template.

Both factory paths produce a kernel Instance type that satisfies the same usage interface (construction, argument setup, invocation), so the dispatcher hides the backend difference from the caller at the point of use. The algorithm descriptor accepted by each factory still differs, however, so the caller (MIOpen) must build a different descriptor shape for each variant. The unification burden therefore currently falls on the caller, not the dispatcher. Collapsing these per-variant descriptors into a single algorithm format that the dispatcher decomposes internally is the key step toward making the builder a true unified facade.