mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 02:02:46 +00:00

Files

Illia Silin b05040b919 [rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f)

[CK] Fix latest batch of staging compiler warnings

## Motivation

Suppress the new batch of clang lifetimebound and invalidation warnings
with the latest staging compiler.

2026-05-08 14:15:31 +00:00

algorithm.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

check_err.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

conv_common.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

convolution_host_tensor_descriptor_helper.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

convolution_parameter.hpp

[Compiler] Addressing new compiler warnings (#3640 )

2026-02-02 09:39:48 -08:00

device_memory.hpp

[CK_Builder] [testing] Integrate device random generators (#3427 )

2025-12-30 10:03:05 -08:00

device_tensor_generator.hpp

[CK_BUILDER] Integrate CKB validation with CK verification (#3649 )

2026-01-28 17:41:02 +01:00

fill.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

gpu_verification.hpp

[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f)

2026-05-08 14:15:31 +00:00

host_common_util.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

host_gemm.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

host_tensor_generator.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

host_tensor.hpp

[Compiler] Addressing new compiler warnings (#3640 )

2026-02-02 09:39:48 -08:00

iterator.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

literals.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

numeric.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

ranges.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

README.md

[rocm-libraries] ROCm/rocm-libraries#4277 (commit 4348901)

2026-02-10 21:27:27 +00:00

thread.hpp

chore(copyright): update copyright header for include directory (#3224 )

2025-11-18 10:17:18 -08:00

README.md

CK Library Utility

This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility.

Quick Start

Use GPU validation for all new tests (10-100x faster than CPU validation)
Let the system compute tolerances automatically based on data types
Only transfer error statistics, not full tensors

File-to-Purpose Quick Reference

Need to...	Use this file	Key function/class
Validate on GPU (recommended)	`gpu_verification.hpp`	`gpu_verify()`
Validate on CPU (legacy/debugging)	`check_err.hpp`	`check_err()`
Compute tolerances automatically	`check_err.hpp`	`get_relative_threshold<>()`
Allocate GPU memory	`device_memory.hpp`	`DeviceMem`
Create CPU tensors	`host_tensor.hpp`	`Tensor<T>`
Generate test data on GPU	`device_tensor_generator.hpp`	`FillUniformRandFp()`
Generate test data on CPU (legacy)	`host_tensor_generator.hpp`	`GeneratorTensor_*`
Set up convolution parameters	`convolution_parameter.hpp`	`ConvParam`
Create tensor descriptors	`host_tensor.hpp`	`HostTensorDescriptor`

Core Validation Tools

GPU Validation (Recommended)

gpu_verification.hpp - Complete on-device verification

gpu_verify(): Compares device tensors entirely on GPU
- Automatic tolerance computation based on data types
- Only transfers error statistics (~12 bytes), not tensors
- Detailed error reporting (count, max error, percentage)
- Supports all CK data types (fp32, fp16, bf16, fp8, int8, etc.)
gpu_reduce_max(): Computes max(abs(tensor)) on GPU for tolerance scaling
Grid-stride kernels with LDS reduction for optimal performance

Performance: 10-100x faster than CPU validation for large tensors.

Example usage:

// Explicit tolerance (rtol = 1e-5, atol = 1e-6)
bool pass_explicit = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision (K_dim is the number of accumulations)
bool pass_auto = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);

See: test/gpu_verification/test_gpu_verification.cpp
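The reduction that an on-device check performs can be modeled on the host. The following is a minimal sketch, not the code in gpu_verification.hpp: the `ErrorStats` struct, the function names, and the failure criterion `|out - ref| > atol + rtol * |ref|` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical statistics record: the only data a GPU-first check would need
// to copy back to the host. Names and layout are assumptions for illustration.
struct ErrorStats
{
    std::uint64_t mismatch_count = 0;    // elements outside tolerance
    float max_abs_error          = 0.0f; // largest finite |out - ref| seen
};

// Host-side model of an element-wise tolerance check.
// Assumed criterion: an element fails when |out - ref| > atol + rtol * |ref|,
// or when either value is not finite.
inline ErrorStats collect_error_stats(const std::vector<float>& out,
                                      const std::vector<float>& ref,
                                      float rtol,
                                      float atol)
{
    assert(out.size() == ref.size());
    ErrorStats stats;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        const float err = std::fabs(out[i] - ref[i]);
        const bool bad  = !std::isfinite(out[i]) || !std::isfinite(ref[i]) ||
                         err > atol + rtol * std::fabs(ref[i]);
        if(bad)
            ++stats.mismatch_count;
        if(std::isfinite(err) && err > stats.max_abs_error)
            stats.max_abs_error = err;
    }
    return stats;
}

// Percentage of failing elements, derived on the host from the count alone.
inline double mismatch_percentage(const ErrorStats& stats, std::size_t size)
{
    return size == 0 ? 0.0
                     : 100.0 * static_cast<double>(stats.mismatch_count) /
                           static_cast<double>(size);
}
```

A 64-bit count plus a 32-bit maximum error is 8 + 4 = 12 bytes of payload, which is one plausible reading of the "~12 bytes" figure; the struct itself may be padded by the compiler.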

Tolerance Computation

check_err.hpp - Automatic tolerance calculation

get_relative_threshold<ComputeType, OutType, AccType>(): Computes relative tolerance from mantissa bits
get_absolute_threshold<ComputeType, OutType, AccType>(): Computes absolute tolerance scaled by magnitude
Type-specific overloads for all CK data types
Accumulation-aware error bounds

Theory: Based on IEEE 754 floating-point arithmetic and error propagation analysis.

Legacy CPU Validation

check_err.hpp - CPU-based error checking (legacy)

Overloaded check_err() functions for different data types
Type-aware default tolerances
Detailed error reporting (first 5 mismatches, statistics)

Note: Requires full tensor transfer to CPU - slow for large tensors. Use gpu_verification.hpp for new tests.

See: test/convnd_fwd/convnd_fwd_naive.cpp for legacy CPU validation patterns

Numerical Validation Strategy

TL;DR: CK computes tolerances from IEEE 754 precision limits, not arbitrary values. FP32 gets ~1e-5 relative tolerance, FP16 gets ~1e-3, etc. The system accounts for accumulation effects in matrix operations.

CK implements a theoretically-grounded approach to numerical validation that goes beyond simple fixed tolerances. The validation system is designed around three core principles:

1. Type-Aware Tolerance Computation

Rather than using arbitrary threshold values, CK computes tolerances based on the data types involved:

Relative tolerance: Derived from mantissa bits as 2^(-mantissa_bits) * 0.5
Absolute tolerance: Scaled by value magnitude as 2^(exponent - mantissa_bits) * 0.5
Multi-type analysis: Considers compute type, output type, and accumulator type separately
Conservative bounds: Takes maximum error across all data paths
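The two formulas above can be written out directly. This is a sketch under stated assumptions: the function names are hypothetical, the mantissa width is passed as a plain integer rather than derived from a CK type, and the exponent in the absolute formula is taken to be log2 of the largest expected magnitude.

```cpp
#include <cassert>
#include <cmath>

// Relative tolerance from the mantissa width: 2^(-mantissa_bits) * 0.5.
inline double relative_threshold(int mantissa_bits)
{
    assert(mantissa_bits >= 0);
    return std::ldexp(0.5, -mantissa_bits);
}

// Absolute tolerance scaled by magnitude: 2^(exponent - mantissa_bits) * 0.5.
// Assumption: exponent = log2(max_magnitude), where max_magnitude is the
// largest absolute value expected in the tensor being checked.
inline double absolute_threshold(int mantissa_bits, double max_magnitude)
{
    assert(mantissa_bits >= 0 && max_magnitude > 0.0);
    const double exponent = std::log2(max_magnitude);
    return std::pow(2.0, exponent - mantissa_bits) * 0.5;
}
```

With 23, 10, and 7 mantissa bits (FP32, FP16, BF16), `relative_threshold` evaluates to 2^-24 (about 6.0e-8), 2^-11 (about 4.9e-4), and 2^-8 (about 3.9e-3). These are bounds for a single rounding step, before any accumulation effects are considered.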

2. Algorithm-Aware Validation

Different algorithms have different error characteristics:

Accumulation effects: Matrix operations (GEMM, convolution) accumulate errors proportional to the number of operations
Precision cascades: Mixed-precision operations require careful tolerance selection based on the weakest link
Operation-specific bounds: Tolerances scale with problem size (e.g., K dimension in GEMM)

The validation system accepts number_of_accumulations to adjust tolerances for algorithmic context.
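One plausible way to fold the accumulation count into a tolerance is sketched below. The combination rule is an assumption rather than CK's verified implementation: accumulator rounding error is scaled linearly by `number_of_accumulations`, and the result is the maximum over the compute, output, and accumulator paths. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Half-ulp relative error for a type with the given mantissa width.
inline double half_ulp(int mantissa_bits)
{
    assert(mantissa_bits >= 0);
    return std::ldexp(0.5, -mantissa_bits);
}

// Hypothetical combination rule (an assumption for illustration):
//  - compute and output types each contribute one rounding error,
//  - the accumulator contributes one rounding error per accumulation,
//  - the conservative bound is the largest of the three.
inline double combined_relative_threshold(int compute_mantissa_bits,
                                          int output_mantissa_bits,
                                          int acc_mantissa_bits,
                                          int number_of_accumulations)
{
    assert(number_of_accumulations >= 1);
    const double compute_error = half_ulp(compute_mantissa_bits);
    const double output_error  = half_ulp(output_mantissa_bits);
    const double acc_error     = half_ulp(acc_mantissa_bits) * number_of_accumulations;
    return std::max({compute_error, output_error, acc_error});
}
```

For an FP16 GEMM with an FP32 accumulator and K = 1024, the accumulator term is 2^-24 * 2^10 = 2^-14, so the FP16 rounding error of 2^-11 dominates. At K = 65536 the accumulator term grows to 2^-8 and becomes the bound, which illustrates why the tolerance depends on problem size.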

3. Data Type Characteristics

Each data type has inherent precision limits that inform validation:

Data Type	Mantissa Bits	Typical rtol	Typical atol
FP32	23	1e-5	3e-6
TF32	10	5e-4	5e-4
FP16	10	1e-3	1e-3
BF16	7	1e-1	1e-3
FP8	3-4	1e-3	1e-3
BF8	2-3	1e-3	1e-3
FP4	2	0.5	0.5
INT8/INT32	N/A	0	0

GPU-First Validation Philosophy

Modern CK testing emphasizes pure GPU validation to eliminate performance bottlenecks:

Traditional CPU-Based Approach (Legacy)

GPU Kernel → Transfer to CPU → CPU Verification
            ↑ BOTTLENECK: PCIe transfer of entire tensor

Problem: Transferring multi-GB tensors over PCIe can take 10-100x longer than the computation being tested
Impact: Test suites become I/O bound rather than compute bound
Limitation: Cannot efficiently test large-scale problems

Modern GPU-First Approach (Recommended)

GPU Kernel → GPU Reference → GPU Verification → Transfer scalars only
                                               ↑ Only ~12 bytes transferred

Advantage: All data stays on GPU, only error statistics transfer to CPU
Performance: 10-100x faster for large tensors
Scalability: Enables testing of multi-GB tensors efficiently
Completeness: Detailed error reporting (count, max error, percentage) without full transfer
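The size gap between the two approaches is easy to quantify. The helpers below are a back-of-the-envelope sketch: the function names are hypothetical, and the bandwidth is a caller-supplied assumption rather than a measured figure for any particular system.

```cpp
#include <cassert>
#include <cstddef>

// Bytes copied device-to-host when the whole output tensor is verified on the CPU.
inline std::size_t full_transfer_bytes(std::size_t num_elements, std::size_t element_size)
{
    return num_elements * element_size;
}

// Idealized copy time in seconds; ignores latency and launch overhead.
// bandwidth_bytes_per_s is an assumption supplied by the caller.
inline double transfer_seconds(std::size_t bytes, double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return static_cast<double>(bytes) / bandwidth_bytes_per_s;
}
```

A tensor of 2^28 FP32 elements occupies 1 GiB, so the CPU path copies roughly 90 million times more data than a 12-byte statistics record. How much wall-clock time that costs depends on the interconnect and on whether the copy overlaps with other work.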

When to Use Each Approach

Use GPU-First Validation When:

Testing production kernels (performance matters)
Working with large tensors (>1MB)
Running extensive test suites
Validating at scale

Use CPU-Based Validation When:

Debugging specific values (need to inspect individual elements)
Working with tiny tensors (<1KB)
Maintaining backward compatibility
Implementing CPU reference algorithms

Testing Workflow Comparison

Modern GPU-First Workflow (Recommended)

// 1. Allocate device memory only
DeviceMem input_dev(size), output_dev(size), reference_dev(size);

// 2. Initialize on GPU (no CPU involvement)
input_dev.FillUniformRandFp<float>(-1.0f, 1.0f);

// 3. Run kernel under test
run_kernel(input_dev, output_dev, params);

// 4. Run reference on GPU
run_reference_kernel(input_dev, reference_dev, params);

// 5. Verify on GPU (only transfers ~12 bytes of error stats)
bool pass = gpu_verify<float>(output_dev, reference_dev, rtol, atol, size);
if (!pass) {
    std::cout << "Validation failed!" << std::endl;
    return false;
}

Key advantage: Zero tensor transfers - all data stays on GPU.

Legacy CPU-Based Workflow

// 1. Create host tensors (allocates CPU memory)
Tensor<float> input_host(dims), output_host(dims), reference_host(dims);

// 2. Generate on CPU
input_host.GenerateTensorValue(GeneratorTensor_3<float>{-1.0f, 1.0f});

// 3. Allocate device memory
DeviceMem input_dev(size), output_dev(size);

// 4. Transfer to device (slow for large tensors)
input_dev.ToDevice(input_host.data());

// 5. Run kernel
run_kernel(input_dev, output_dev, params);

// 6. Transfer back to CPU (slow for large tensors)
output_dev.FromDevice(output_host.data());

// 7. Compute reference on CPU
compute_reference(input_host, reference_host, params);

// 8. Verify on CPU
bool pass = check_err(output_host, reference_host, "Test failed");

Bottleneck: Steps 4 and 6 transfer entire tensors over PCIe.

Supporting Utilities

Tensor Management

host_tensor.hpp: CPU-side tensor container with multi-dimensional support
- HostTensorDescriptor: Dimension, stride, and layout management
- Tensor<T>: Host tensor with generation and conversion utilities
device_memory.hpp: GPU memory management with RAII semantics
- DeviceMem: Device allocation, transfer, and initialization
- Device-side random value generation
- SetZero(): Zero-initialize device memory (required for backward passes)
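RAII here means the allocation is tied to the lifetime of the owning object. The sketch below shows the pattern with a hypothetical `ScopedBuffer` class; `std::malloc` and `std::free` stand in for the device allocator so that it runs on any host, and it is not the `DeviceMem` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <utility>

// Hypothetical RAII buffer wrapper. Host allocation functions take the place
// of the device allocator; the ownership rules are the point of the sketch.
class ScopedBuffer
{
public:
    explicit ScopedBuffer(std::size_t bytes)
        : ptr_(bytes != 0 ? std::malloc(bytes) : nullptr), bytes_(bytes)
    {
        assert(bytes == 0 || ptr_ != nullptr);
    }

    // The buffer is released when the owner goes out of scope.
    ~ScopedBuffer() { std::free(ptr_); }

    // Copying would lead to a double free, so only moves are allowed.
    ScopedBuffer(const ScopedBuffer&)            = delete;
    ScopedBuffer& operator=(const ScopedBuffer&) = delete;
    ScopedBuffer(ScopedBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)), bytes_(std::exchange(other.bytes_, 0))
    {
    }

    // Zero-initialize the whole buffer.
    void SetZero()
    {
        if(ptr_ != nullptr)
            std::memset(ptr_, 0, bytes_);
    }

    void* data() const { return ptr_; }
    std::size_t size() const { return bytes_; }

private:
    void* ptr_;
    std::size_t bytes_;
};
```

The practical consequence is that any raw pointer obtained from such a wrapper is valid only while the wrapper is alive, so the wrapper must outlive every kernel that uses the pointer.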

Data Generation

device_tensor_generator.hpp: GPU-side tensor initialization (recommended)
- FillUniformRandFp<T>(): Fill with uniform random floating-point values
- FillUniformRandInt<T>(): Fill with uniform random integer values
host_tensor_generator.hpp: CPU-side functor-based generators (legacy)
- Various patterns: zero, constant, random, sequential, diagonal, checkerboard
fill.hpp: STL-style fill functors for containers
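An STL-style fill functor takes an iterator range and writes generated values into it. The following is a sketch of the idea using a hypothetical `UniformFill` type built on the standard `<random>` facilities; it does not reproduce the functors defined in fill.hpp.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical fill functor: writes uniformly distributed values into
// [first, last). A fixed seed keeps test data reproducible across runs.
template <typename T>
struct UniformFill
{
    T low;
    T high;
    unsigned seed;

    template <typename ForwardIt>
    void operator()(ForwardIt first, ForwardIt last) const
    {
        assert(low < high);
        std::mt19937 gen(seed);
        std::uniform_real_distribution<T> dist(low, high);
        for(; first != last; ++first)
        {
            *first = dist(gen);
        }
    }
};
```

Usage follows the standard algorithm convention, for example `UniformFill<float>{-1.0f, 1.0f, 42u}(v.begin(), v.end());` for a `std::vector<float> v`.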

Convolution Utilities

convolution_parameter.hpp: Convolution parameter management
- ConvParam: Encapsulates dimensions, strides, padding, dilations
- Output dimension calculation and FLOP estimation
convolution_host_tensor_descriptor_helper.hpp: Tensor descriptor creation helpers
conv_common.hpp: Common convolution utilities

See: test/convnd_fwd/convnd_fwd_naive.cpp for convolution parameter usage
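The output dimension and FLOP calculations follow the standard convolution formulas. The helpers below are a sketch with hypothetical names, written for one spatial dimension at a time; they are not the `ConvParam` member functions, and the FLOP formula assumes K and C are per-group channel counts.

```cpp
#include <cassert>
#include <cstdint>

// Output length along one spatial dimension:
//   (input + left_pad + right_pad - dilation * (filter - 1) - 1) / stride + 1
inline std::int64_t conv_output_length(std::int64_t input,
                                       std::int64_t filter,
                                       std::int64_t stride,
                                       std::int64_t dilation,
                                       std::int64_t left_pad,
                                       std::int64_t right_pad)
{
    assert(filter > 0 && stride > 0 && dilation > 0);
    const std::int64_t effective_filter = dilation * (filter - 1) + 1;
    const std::int64_t padded_input     = input + left_pad + right_pad;
    assert(padded_input >= effective_filter);
    return (padded_input - effective_filter) / stride + 1;
}

// Forward-convolution FLOP estimate, counting one multiply and one add per MAC:
//   2 * G * N * K * C * (product of output lengths) * (product of filter lengths)
// Assumption: K and C are output and input channels per group.
inline std::int64_t conv_flops(std::int64_t G,
                               std::int64_t N,
                               std::int64_t K,
                               std::int64_t C,
                               std::int64_t output_spatial_size,
                               std::int64_t filter_spatial_size)
{
    return 2 * G * N * K * C * output_spatial_size * filter_spatial_size;
}
```

For a 224-wide input with a 7-wide filter, stride 2, dilation 1, and padding 3 on each side, the output length is (224 + 6 - 6 - 1) / 2 + 1 = 112.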

Workspace Management

Some operations require temporary GPU memory for intermediate computations:

// Check if workspace is needed
const std::size_t workspace_sz = op_ptr->GetWorkSpaceSize(argument_ptr.get());

// Allocate outside the `if` so the buffer outlives the kernel launch:
// DeviceMem releases its memory when it goes out of scope
DeviceMem workspace_dev(workspace_sz);

// Set the workspace pointer if needed
if (workspace_sz > 0) {
    op_ptr->SetWorkSpacePointer(argument_ptr.get(), workspace_dev.GetDeviceBuffer());
}

Algorithmic Utilities

algorithm.hpp: Generic algorithms
ranges.hpp: Range-based utilities and concepts
iterator.hpp: Custom iterator implementations
numeric.hpp: Numeric operations

Miscellaneous

host_common_util.hpp: Common host-side utilities
host_gemm.hpp: CPU reference GEMM implementation
literals.hpp: User-defined literals
thread.hpp: Threading utilities
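A CPU reference GEMM is typically a plain triple loop. The sketch below shows the idea for row-major float matrices; the function name and signature are hypothetical and do not reproduce the interface in host_gemm.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive row-major reference GEMM: C[M x N] = A[M x K] * B[K x N].
// The inner loop runs K times per output element, which is the
// accumulation count that tolerance scaling has to account for.
inline std::vector<float> reference_gemm(const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         std::size_t M,
                                         std::size_t N,
                                         std::size_t K)
{
    assert(a.size() == M * K);
    assert(b.size() == K * N);
    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += a[m * K + k] * b[k * N + n];
            }
            c[m * N + n] = acc;
        }
    }
    return c;
}
```

Such a reference favors clarity over speed, which suits small debugging cases where individual elements need to be inspected.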

Best Practices

Choosing Tolerances

Prefer automatic computation: Use gpu_verify() with automatic tolerance calculation
Consider accumulation: Pass number_of_accumulations for matrix operations
Respect data type limits: Don't expect FP16 to match FP32 precision
Account for algorithm: Different operations have different error characteristics

Performance Optimization

Use GPU-first validation for all new tests
Avoid CPU transfers unless debugging specific values
Generate data on GPU when possible
Batch verification to amortize kernel launch overhead