Files
composable_kernel/experimental/builder/include/ck_tile/builder/testing/README.md
Robin Voetter 42048bdb7d [CK_BUILDER] Integrate CKB validation with CK verification (#3649)
* ck-builder: tensor copy function

This function copies one tensor to another, so that the memory
layout can be changed between them.

* ck-builder: fix ck::bhalf literals

These types don't work properly.

* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it

This reduces the amount of duplicated code a bit.

* ck-builder: add flat tensor iterator

This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.

* ck-builder: integrate validation with ck gpu verification

By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.

* ck-builder: add check_by_accumulations

This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.

* ck: fix test_gpu_verification GenerateRandomData for bhalf

is_integer_it<bhalf_t> yields true, but it is not actually
an integer.

* ck: make gpu_verification kernels be proper persistent kernels

Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.

* ck: clean up gpu_verification.hpp using block_reduce

This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.

* ck-builder: doc typos

* ck-builder: update testing readme with validation interface.

* ck-builder: rebase fixes + review comments

* ck-builder: fix device integer generation with float types

Passing bfloat here causes a nans due to type_convert performing
a bitcast.

* ck: another bhalf_t bug

CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.
2026-01-28 17:41:02 +01:00

17 KiB

CK-Builder Testing Utilities

This directory contains testing utilities designed to simplify the process of writing unit tests for GPU kernels built with ck_tile::builder. These utilities enable a clean, expressive Given-When-Then (Given-When-Then) testing pattern that separates test setup, execution, and validation.

See the main builder documentation for an overview of the CK-Builder API components.

Overview

Testing GPU kernels typically involves significant boilerplate: allocating device memory, initializing test data, launching kernels, and validating results. The utilities in this directory abstract away these repetitive tasks, allowing you to focus on defining test cases and verifying correctness.

The core components are:

  • Args: A struct template that holds runtime parameters for a specific test case.
  • Input and Output: Helper classes that groups operation inputs and outputs.
  • run(): Invokes an algorithm on the GPU.
  • validate(): A utility that performs on-GPU validation and integrates with GoogleTest/GoogleMock.

Together, these components enable a structured approach to kernel testing that mirrors the Given-When-Then pattern commonly used in behavior-driven development.

The Given-When-Then Testing Pattern

The Given-When-Then pattern organizes tests into three distinct phases:

  1. Given: Set up the preconditions and test data
  2. When: Execute the action being tested
  3. Then: Verify the expected outcome

This structure makes tests easier to read, write, and maintain. Each phase has a clear purpose, and the testing utilities are designed to support this workflow.

Given: Defining the Test Case

The "Given" phase establishes the context for your test. This includes both the compile-time characteristics of the kernel and the runtime parameters for the specific test case.

Operation Signature

The "signature" defines the mathematical contract that the kernel must satisfy. It specifies compile-time properties such as:

  • Spatial dimensionality (1D, 2D, or 3D)
  • Convolution direction (Forward, Backward Data, Backward Weight)
  • Tensor memory layout (e.g., NHWC, NCHW)
  • Data types (FP32, FP16, BF16, etc.)
  • Fused element-wise operations (e.g., Bias, ReLU)

The format of the signature struct is enforced at compile time using C++20 concepts by the CK-Builder API, ensuring type safety and enabling compile-time optimizations. The design of these concepts and the required constraints are discussed in the CK Builder design description.

// Define our custom signature struct.
struct ConvSignature {
    int spatial_dim = 2;
    ck_tile::builder::ConvDirection direction =
        ck_tile::builder::ConvDirection::FORWARD;
    ck_tile::builder::GroupConvLayout2D layout =
        ck_tile::builder::GroupConvLayout2D::NHWGC_GKYXC_NHWGK;
    ck_tile::builder::DataType data_type =
        ck_tile::builder::DataType::FP16;
    ck_tile::builder::ElementwiseOperation elementwise_operation =
        ck_tile::builder::ElementwiseOperation::PASS_THROUGH;
};

// Double-check that out structure is well-defined according to the CK-Builder API.
static_assert(ck_tile::builder::ConvSignatureDescriptor<ConvSignature>);

// Instantiate the signature with a configuration. These values are again checked
// by the CK-Builder API when a device operation is built.
constexpr auto SIGNATURE = ConvSignature{
    .spatial_dim = 2,
    .direction = ck_tile::builder::ConvDirection::FORWARD,
    .layout = ck_tile::builder::GroupConvLayout2D::NHWGC_GKYXC_NHWGK,
    .data_type = ck_tile::builder::DataType::FP16,
    .elementwise_operation = ck_tile::builder::ElementwiseOperation::PASS_THROUGH,
};

Run-time Arguments

The Args struct template provides the runtime parameters for your test case. It is parameterized by the SIGNATURE and contains fields for tensor dimensions, strides, dilations, and other dynamic properties. Note that the exact parameters required for each Args depends on the SIGNATURE: For example, a SIGNATURE that represents a forward convolution requires specifying the number of batches, groups, input- and output-channels, filter dimensions, filter strides, and so on. A SIGNATURE that represents a simple GEMM operation may instead require only the dimensions of the A-, B- and C-matrices.

    ck_tile::builder::test::Args<SIGNATURE> args = {
    .lengths = {
        .batch_size      = 128,
        .groups          = 1,
        .input_channels  = 64,
        .output_channels = 128,
        .image           = {.height = 56, .width = 56},
        .filter          = {.height = 3,  .width = 3},
    },
    .filter_strides  = {.height = 1, .width  = 1},
    .filter_dilation = {.height = 1, .width  = 1},
    .input_left_pad  = {.width  = 1, .height = 1},
    .input_right_pad = {.width  = 1, .height = 1},
};

Tensor Memory Management

Tensor memory is passed using the Inputs<SIGNATURE> and Outputs<SIGNATURE> structures. These group all inputs and outputs for an operation. Note that these structures do not "own" the memory inside: They only logically group the inputs so that they can be passed as a common type. The amount of inputs and outputs may differ depending on the SIGNATURE, and this avoids having to pass additional values and accept additional parameters in those situations.

The exact fields in Inputs and Outputs depend again on the particular SIGNATURE that they are constructed with. In general, these structures are intended to be freely constructible from external data and only serve to group relevant information. Automatic memory management can be performed using the UniqueInputs<SIGNATURE> and UniqueOutputs<SIGNATURE> structures instead. The alloc_inputs and alloc_outputs functions are used to initialize these types: They take an Args structure and allocate the appropriate amounts of memory. .get() is used to return an instance of the appropriate Input or Output.

auto inputs = ck_tile::builder::test::allocate_inputs(args);
auto outputs = ck_tile::builder::test::allocate_outputs(args);

Note that these functions merely allocate memory: After allocation, the memory is still uninitialized.

Tensor Memory Initialization

Operation inputs can be initialized by using ck_tile::builder::test::init_inputs(). Crucially, this operation accepts all inputs, as well as the args structure. This is because initializing tensor memory is a context-dependent operation: We need to understand the operation in detail in order to generate inputs which do not overflow, do not generate NaNs or all zeros, etc. Passing the args allows init_inputs to generate a good test for the operation at hand.

When: Executing the Kernel

The "When" phase is where the kernel to be tested is actually executed. This involves selecting an algorithm and using the Builder to generate the kernel.

Operation Algorithm

The "algorithm" defines the implementation strategy for the kernel. It specifies low-level details such as:

  • Thread block dimensions and tile sizes
  • GEMM implementation (XDL or WMMA)
  • Data transfer vectorization
  • Pipeline scheduling

As with the signature struct, the format of the algorithm struct is enforced at compile time using C++20 concepts by the CK-Builder API. The design of these concepts and the required constraints are discussed in the CK Builder factory design description.

// Define our custom algorithm struct.
struct ConvAlgorithm {
    // Thread block configuration
    ThreadBlock thread_block;

    // Gridwise GEMM configuration
    GridwiseXdlGemm gridwise_gemm;

    // Block transfer configuration
    Transfer transfer;

    // Additional tuning parameters
    // ...
};

// Double-check that our algorithm is well-defined according to the CK-Builder API.
static_assert(ck_tile::builder::ConvAlgorithmDescriptor<ConvAlgorithm>);

// Instantiate the algorithm with a configuration. Like with the signature struct
// the CK-Builder API will check that the values are correct when a device
// operation is built.
constexpr auto ALGORITHM = ConvAlgorithm{
    .thread_block = /* ... */;
    .gridwise_gem = /* ... */;
    .transfer = /* ... */;
    // ...
};

Building the Kernel

The Builder combines the signature (what to compute) with the algorithm (how to compute it) to generate a kernel type which represents the operation. The implementation details, including invocation method, depend on the particular signature and algorithm.

using Conv = ck_tile::builder::ConvBuilder<SIGNATURE, ALGORITHM>::Instance;
auto conv = Conv{};

Invoking the Kernel

After creating the kernel instance, it can be invoked by passing the instance, the arguments, the inputs, and the outputs to run(). This operation writes results into the buffers in outputs.

ck_tile::builder::test::run(conv, args, inputs.get(), outputs.get());

Then: Verifying the Results

The "Then" phase validates that the kernel produced the expected output. This is done by running a reference kernel and comparing the results.

Building the Reference Kernel

The reference kernel is just another kernel instance of the builder, one that's been externally verified to produce the correct results. As this kernel is also running on the GPU, we can use it to perform tests far more quickly than when comparing the outputs to a CPU-based reference implementation.

In order to obtain an instance of the reference kernel, the correct ALGORITHM needs to be passed to the Builder.

struct ReferenceAlgorithm {
    ck_tile::builder::ConvAlgorithmSpecialization specialization;
};
static_assert(ck_tile::builder::ConvAlgorithmDescriptor<ReferenceAlgorithm>);
constexpr auto REFERENCE_ALGORITHM = ReferenceAlgorithm{
    .specialization = ck_tile::builder::ConvAlgorithmSpecialization::REFERENCE;
};
using ReferenceConv = ck_tile::builder::ConvBuilder<SIGNATURE, REFERENCE_ALGORITHM>::Instance;
auto reference_conv = ReferenceConv{};

This instance can then be invoked using ck_tile::builder::test::run(), the same as the kernel to be tested. Note that another instance of the Outputs structure needs to be passed here in order to store the results.

auto reference_outputs = ck_tile::builder::test::allocate_outputs(args);
ck_tile::builder::test::run(conv, args, inputs.get(), reference_outputs.get());

Validating Results

In order to actually verify that the results of the executed device operation are correct, they are compared against the reference output obtained in the previous step. This is done by calling validate() with the runtime arguments of the operation, as well as both the actual and reference output. This then yields a ValidationReport, a type which holds information about which tensors of the output are considered to be equivalent and which are considered to be different. Actually comparing the tensor elements is performed on the GPU to keep the tests fast.

const auto report = ck_tile::builder::test::validate(args, outputs.get(), reference_outputs.get());

ValidationReport::get_errors() returns a vector of tensors from both outputs which are considered to be incorrect, each error case exposes some information about what went wrong.

for (const auto& e : report.get_errors()) {
    std::cout << "error: " << e.tensor_name << " was incorrect!" << std::endl;
}

GoogleTest/GoogleMock integration is provided using the MatchesReference matcher. This invokes validate() internally, and then turns the result into a GoogleMock-comparible value. Note that this function is closely tied to GoogleMock and the test setup that CK-Builder uses internally, and so is not exposed through the CK-Builder public interface.

EXPECT_THAT(outputs.get(), MatchesReference(args, reference_outputs.get()));

Complete Example

Here's a complete test that demonstrates the Given-When-Then pattern:

#include <gtest/gtest.h>
#include "ck_tile/builder/conv_signature_concepts.hpp"
#include "ck_tile/builder/conv_algorithm_concepts.hpp"
#include "ck_tile/builder/conv_builder.hpp"
#include "ck_tile/testing/tensor_memory_manager.hpp"
#include "ck_tile/testing/validator.hpp"
#include "testing_utils.hpp"

// Define the convolution signature
struct ConvSignature {
    int spatial_dim = 2;
    ck_tile::builder::ConvDirection direction =
        ck_tile::builder::ConvDirection::FORWARD;
    ck_tile::builder::GroupConvLayout2D layout =
        ck_tile::builder::GroupConvLayout2D::NHWGC_GKYXC_NHWGK;
    ck_tile::builder::DataType data_type =
        ck_tile::builder::DataType::FP16;
    ck_tile::builder::ElementwiseOperation elementwise_operation =
        ck_tile::builder::ElementwiseOperation::PASS_THROUGH;
};
static_assert(ck_tile::builder::ConvSignatureDescriptor<ConvSignature>);
constexpr auto SIGNATURE = ConvSignature{
    .spatial_dim = 2,
    .direction = ck_tile::builder::ConvDirection::FORWARD,
    .layout = ck_tile::builder::GroupConvLayout2D::NHWGC_GKYXC_NHWGK,
    .data_type = ck_tile::builder::DataType::FP16,
    .elementwise_operation = ck_tile::builder::ElementwiseOperation::PASS_THROUGH,
};

// Define the convolution algorithm
struct ConvAlgorithm {
    // Algorithm configuration details...
    // (Omitted for brevity)
};
static_assert(ck_tile::builder::ConvAlgorithmDescriptor<ConvAlgorithm>);
constexpr auto ALGORITHM = ConvAlgorithm{/* ... */};

// Define the reference convolution algorithm
struct ReferenceAlgorithm {
    ck_tile::builder::ConvAlgorithmSpecialization specialization;
};
static_assert(ck_tile::builder::ConvAlgorithmDescriptor<ReferenceAlgorithm>);
constexpr auto REFERENCE_ALGORITHM = ReferenceAlgorithm{
    .specialization = ck_tile::builder::ConvAlgorithmSpecialization::REFERENCE;
};

// The actual test
TEST(ConvolutionTest, Forward2D_FP16) {
    // ===== GIVEN: Set up the test case =====

    // Define runtime parameters
    ck_tile::builder::test::Args<ConvSignature> args = {
        .lengths = {
            .batch_size      = 128,
            .groups          = 1,
            .input_channels  = 64,
            .output_channels = 128,
            .image           = {.height = 56, .width = 56},
            .filter          = {.height = 3,  .width = 3},
        },
        .filter_strides  = {.height = 1, .width  = 1},
        .filter_dilation = {.height = 1, .width  = 1},
        .input_left_pad  = {.width  = 1, .height = 1},
        .input_right_pad = {.width  = 1, .height = 1},
    };

    // Allocate GPU memory
    auto inputs = ck_tile::builder::test::allocate_inputs(args);
    auto outputs = ck_tile::builder::test::allocate_outputs(args);
    auto reference_outputs = ck_tile::builder::test::allocate_outputs(args);

    // Initialize inputs
    ck_tile::builder::test::init_inputs(args, inputs);

    // ===== WHEN: Execute the kernel =====

    // Build the kernel
    using Conv = ck_tile::builder::ConvBuilder<SIGNATURE, ALGORITHM>::Instance;
    auto conv = Conv{};

    // Compute actual results
    ck_tile::builder::test::run(conv, args, inputs.get(), outputs.get());

    // ===== THEN: Verify the results =====

    // Build the reference kernel
    using ReferenceConv = ck_tile::builder::ConvBuilder<SIGNATURE, REFERENCE_ALGORITHM>::Instance;
    auto reference_conv = ReferenceConv{};

    // Compute reference results
    ck_tile::builder::test::run(conv, args, inputs.get(), reference_outputs.get());

    // Check the results
    EXPECT_THAT(outputs.get(), ck_tile::test::MatchesReference(args, reference_outputs.get()));
}

Benefits of This Approach

  1. Clarity: The Given-When-Then structure makes tests self-documenting. Each phase has a clear purpose.

  2. Reduced Boilerplate: The utilities handle memory management, initialization, and validation, eliminating repetitive code.

  3. Type Safety: The use of C++20 concepts ensures that signatures and algorithms are well-formed at compile time.

  4. Flexibility: The Args struct can be easily extended to support different test scenarios, Inputs and Outputs can be modified to support additional tensors where necessary, and alternatives to init_inputs() can be provided to support additional testing strategies.

  5. Integration: validate() integrates seamlessly with GoogleTest/GoogleMock through MatchesReference, providing familiar assertion syntax.

  6. Maintainability: Changes to the testing infrastructure are localized to the utility classes, not scattered across individual tests.

Future Enhancements

Potential improvements to the testing utilities include:

  • Performance benchmarking utilities
  • Automatic test case generation from parameter ranges
  • Enhanced error reporting with visual diffs
  • Support for multi-GPU testing scenarios