
Composable Kernel Builder Design Documentation

This directory contains the builder framework for Composable Kernel, which provides a compile-time, type-safe interface for constructing convolution operations with various configurations.

Table of Contents

  • Convolution Signature
    • Overview
    • Architecture
    • Core Components
    • Concepts and Validation
    • Tensor Descriptors
    • Optional Signature Fields
  • Convolution Algorithm
    • Convolution Factory
Convolution Signature

Overview

The convolution signature system provides a compile-time description of grouped convolution operations. A signature is a collection of properties that fully characterize a convolution kernel's mathematical and operational behavior, enabling:

  • Compile-time validation: Ensures type safety and correctness before kernel instantiation
  • Kernel selection: Matches user requirements to optimized implementations
  • Specialization: Enables optimized code paths for specific configurations
  • Composability: Supports building complex operations from simpler components

The signature leverages modern C++20 features, particularly concepts, to provide expressive, self-documenting interfaces with compile-time guarantees.

Architecture

The signature system is organized into a hierarchical structure:

┌─────────────────────────────────────────────────────────┐
│                    ConvSignature                        │
├─────────────────────────────────────────────────────────┤
│ Properties:                                             │
│   • spatial_dim: int           (1D, 2D, or 3D)          │
│   • direction: ConvDirection   (Fwd/BwdData/BwdWeight)  │
│   • data_type: DataType        (default data type)      │
│   • accumulation_data_type: DataType                    │
│   • input: ConvTensor          ──┐                      │
│   • weight: ConvTensor         ──│                      │
│   • output: ConvTensor         ──│                      │
└──────────────────────────────────┼──────────────────────┘
                                   │
                                   ▼
              ┌─────────────────────────────────────────┐
              │           ConvTensor                    │
              ├─────────────────────────────────────────┤
              │ ╔═════════════════════════════════════╗ │
              │ ║ TensorConfig (required)             ║ │
              │ ╠═════════════════════════════════════╣ │
              │ ║  • layout: ConvLayout               ║ │
              │ ║  • data_type: DataType (optional)   ║ │
              │ ║  • compute_type: DataType (optional)║ │
              │ ╚═════════════════════════════════════╝ │
              │                                         │
              │ ┌─────────────────────────────────────┐ │
              │ │ TensorOperation (optional)          │ │
              │ ├─────────────────────────────────────┤ │
              │ │  • elementwise_operation            │ │
              │ │  • auxiliary_operand_configs[]      │ │
              │ │    (each is also ConvTensor)  ◄───────┼─┐
              │ └─────────────────────────────────────┘ │ │
              └─────────────────────────────────────────┘ │
                                                          │
                                 Recursive ───────────────┘

Key Design Points:

  • ConvSignature contains three ConvTensor instances (input, weight, output)
  • All tensors share the same ConvTensor structure
  • Each ConvTensor has:
    • TensorConfig (required): Defines layout as well as optional data and compute type overrides
    • TensorOperation (optional): Defines fused elementwise operations
  • Auxiliary operands (e.g., bias) in TensorOperation also use the ConvTensor type
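The hierarchy above can be sketched as nested aggregate structs. This is a minimal, self-contained illustration; the enum values and struct names are stand-ins, not the actual Builder headers.

```cpp
#include <optional>

// Hypothetical stand-ins for the Builder's enums; the real definitions
// live in the ck_tile::builder headers.
enum class DataType { FP32, FP16, BF16 };
enum class ConvLayout { GNHWC, GKYXC, GNHWK };
enum class ConvDirection { FORWARD, BACKWARD_DATA, BACKWARD_WEIGHT };

struct TensorConfig {
    ConvLayout layout;                    // required
    std::optional<DataType> data_type{};  // optional override of the signature default
};

struct ConvTensor {
    TensorConfig config;  // required; the optional TensorOperation is omitted here
};

struct ConvSignature {
    int spatial_dim;
    ConvDirection direction{ConvDirection::FORWARD};
    DataType data_type;
    ConvTensor input;
    ConvTensor weight;
    ConvTensor output;
};

// A 2D FP16 forward convolution; direction falls back to FORWARD.
constexpr ConvSignature sig{.spatial_dim = 2,
                            .data_type   = DataType::FP16,
                            .input       = {{ConvLayout::GNHWC}},
                            .weight      = {{ConvLayout::GKYXC}},
                            .output      = {{ConvLayout::GNHWK}}};

static_assert(sig.direction == ConvDirection::FORWARD);
```

Because every field with a sensible default is either defaulted or optional, a signature only needs to spell out what differs from the common case.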

Core Components

1. Signature Level

The top-level signature contains global properties that apply to the entire convolution operation:

template <typename T>
concept ConvSignatureDescriptor = requires(T t) {
    { t.spatial_dim } -> std::convertible_to<unsigned int>;  // 1, 2, or 3
    { t.data_type } -> std::convertible_to<DataType>;        // Default data type
    { t.input } -> ConvTensorDescriptor;
    { t.weight } -> ConvTensorDescriptor;
    { t.output } -> ConvTensorDescriptor;
    requires ConvolutionDirectionWellDefinedIfProvided<T>;   // Optional direction
};

Properties:

  • spatial_dim: Dimensionality of the convolution (1D, 2D, or 3D)
  • direction: Operation type (optional, defaults to FORWARD)
    • FORWARD: Standard forward convolution
    • BACKWARD_DATA: Gradient computation w.r.t. input
    • BACKWARD_WEIGHT: Gradient computation w.r.t. weights
  • data_type: Default data type for all tensors (FP32, FP16, BF16, FP8, I8, U8)
  • accumulation_data_type: Type used for internal accumulation

2. Tensor Level

Each tensor (input, weight, output) has its own descriptor:

template <typename T>
concept ConvTensorDescriptor = requires(T t) {
    { t.config } -> TensorConfigDescriptor;
    requires ElementwiseOpWellDefinedIfProvided<T>;
};

A tensor descriptor encapsulates:

  • Configuration: Layout and data type information
  • Operation (optional): Fused elementwise operations on this tensor

3. Tensor Configuration

Describes the memory layout and data types:

template <typename T>
concept TensorConfigDescriptor = requires(T t) {
    { t.layout } -> std::convertible_to<ConvLayout>;
    { t.data_type } -> std::convertible_to<DataType>;  // Optional override
};

Layout Types (dimension-specific):

  • 1D Convolution:

    • Input: GNCW, GNWC, NWGC, NGCW, G_NW_C_strided
    • Weight: GKXC, GKCX, KXGC, G_K_X_C_strided
    • Output: GNKW, GNWK, NWGK, NGKW, G_NW_K_strided
  • 2D Convolution:

    • Input: GNCHW, GNHWC, NHWGC, NGCHW, G_NHW_C_strided
    • Weight: GKYXC, GKCYX, KYXGC, G_K_YX_C_strided
    • Output: GNKHW, GNHWK, NHWGK, NGKHW, G_NHW_K_strided
  • 3D Convolution:

    • Input: GNCDHW, GNDHWC, NDHWGC, NGCDHW, G_NDHW_C_strided
    • Weight: GKZYXC, GKCZYX, KZYXGC, G_K_ZYX_C_strided
    • Output: GNKDHW, GNDHWK, NDHWGK, NGKDHW, G_NDHW_K_strided

Where:

  • G = Groups
  • N = Batch size
  • C = Input channels
  • K = Output channels (filters)
  • W, H, D = Width, Height, Depth (spatial dimensions)
  • X, Y, Z = Filter dimensions

4. Tensor Operations

Describes fused elementwise operations applied to a tensor:

template <typename T>
concept TensorOperatorDescriptor = requires(T t) {
    { t.elementwise_operation } -> std::convertible_to<ElementwiseOperation>;
    requires AuxiliaryOperandConfigsWellDefinedIfProvided<T>;
};

Supported Operations:

  • PASS_THROUGH: No operation (identity)
  • SCALE: Multiply by a scalar
  • CLAMP: Clamp values to a range
  • BIAS_BNORM_CLAMP: Bias addition + batch normalization + clamp
  • SCALEADD_SCALEADD_RELU: Fused scale-add operations + ReLU activation

Auxiliary Operands: Some operations require additional tensor inputs (e.g., bias tensors, scaling factors). These are specified through auxiliary_operand_configs, which is an array of TensorConfigDescriptor objects describing the layout and data type of each auxiliary input.
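A sketch of what a tensor operation with one auxiliary operand could look like, assuming a fixed-size array of configs; the `G_K` bias layout tag and the struct shapes are hypothetical:

```cpp
#include <array>
#include <cstddef>

enum class ElementwiseOperation { PASS_THROUGH, SCALE, CLAMP, BIAS_BNORM_CLAMP };
enum class ConvLayout { GNHWK, G_K };  // G_K: hypothetical per-channel bias layout tag
enum class DataType { FP16, FP32 };

struct TensorConfig {
    ConvLayout layout;
    DataType data_type;
};

// Hypothetical shape of a tensor operation carrying auxiliary operand configs.
template <std::size_t NumAux>
struct TensorOperation {
    ElementwiseOperation elementwise_operation;
    std::array<TensorConfig, NumAux> auxiliary_operand_configs;
};

// Output operation fusing bias + batch norm + clamp, with an FP32 bias operand.
constexpr TensorOperation<1> out_op{
    .elementwise_operation     = ElementwiseOperation::BIAS_BNORM_CLAMP,
    .auxiliary_operand_configs = {{{ConvLayout::G_K, DataType::FP32}}}};

static_assert(out_op.auxiliary_operand_configs.size() == 1);
static_assert(out_op.auxiliary_operand_configs[0].data_type == DataType::FP32);
```

Keeping the auxiliary operand count as a template parameter lets the number and types of extra inputs be validated at compile time, in the same spirit as the rest of the signature.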

Concepts and Validation

The signature system uses C++20 concepts for compile-time validation at multiple levels:

Constraint Concepts

// Spatial dimension must be 1, 2, or 3
template <auto N>
concept ConvSpatialDim = std::is_integral_v<decltype(N)> && (N == 1 || N == 2 || N == 3);

// Valid data types for convolution
template <DataType T>
concept ValidConvDataType = 
    (T == DataType::FP32) || (T == DataType::FP16) || (T == DataType::BF16) ||
    (T == DataType::FP8) || (T == DataType::I8) || (T == DataType::U8);

Validation Concept

// Validates a complete signature
template <auto Sig>
concept ValidConvSignature = requires {
    requires ConvSpatialDim<Sig.spatial_dim>;
    requires ValidConvDataType<Sig.data_type>;
};

Tensor Descriptors

The layout, data type, and elementwise operation are described per tensor. This multi-level hierarchy allows:

  • Flexibility: Each tensor can have independent layout and data type
  • Reusability: Common configurations can be shared across different signatures
  • Extensibility: New properties can be added to specific levels without affecting others
  • Clarity: Separates concerns (global properties vs. tensor-specific properties)

Optional Signature Fields

Several fields in the signature are optional:

  • direction: Defaults to FORWARD if not specified, reducing boilerplate for the common case
  • Tensor data_type: Falls back to signature's default, allowing mixed-precision with minimal specification
  • Tensor operation: Defaults to PASS_THROUGH, supporting both fused and non-fused operations with the same interface

This design follows the principle of "make the common case simple, the complex case possible."
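The fallback for the tensor-level data type can be captured in one resolution rule. The helper below is a hypothetical sketch of that rule, not the Builder's actual API:

```cpp
#include <optional>

enum class DataType { FP32, FP16 };

// Hypothetical resolution rule: a tensor-level data_type, when present,
// overrides the signature-level default.
constexpr DataType EffectiveType(std::optional<DataType> tensor_type,
                                 DataType signature_default) {
    return tensor_type.value_or(signature_default);
}

static_assert(EffectiveType(std::nullopt, DataType::FP16) == DataType::FP16);    // fallback
static_assert(EffectiveType(DataType::FP32, DataType::FP16) == DataType::FP32);  // override
```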

Convolution Algorithm

Convolution Factory

The convolution factory builds an instance from the convolution signature and the convolution algorithm description. These descriptions are dispatched to the relevant algorithm-specific factory for instance creation. The factory design is described in a separate README.