[CK_BUILDER] Integrate CKB validation with CK verification (#3649)

* ck-builder: tensor copy function

This function copies one tensor to another, converting between their
memory layouts in the process.

* ck-builder: fix ck::bhalf literals

Brace-initializing these types from integer literals stores the raw bits
rather than the converted value, so use type_convert instead.

* ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it

This reduces the amount of duplicated code a bit.

* ck-builder: add flat tensor iterator

This "iterator" type pretends to be a pointer, useful for passing
tensors to functions expecting pointer-like types.

* ck-builder: integrate validation with ck gpu verification

By templating the gpu_verify function over iterators, we can use
the new FlatTensorIterator to adapt the function to multi-
dimensional tensors without changing either implementation
too much.

* ck-builder: add check_by_accumulations

This changes the gpu_verification.hpp code to also accept "iterator"
types for the relevant gpu_verify and gpu_reduce_max functions.

* ck: fix test_gpu_verification GenerateRandomData for bhalf

is_integer_it<bhalf_t> yields true, but it is not actually
an integer.

* ck: make gpu_verification kernels be proper persistent kernels

Previously these were using a hardcoded value for the grid size. This
commit changes that so that the grid size is automatically derived
from the kernel's occupancy and the number of multiprocessors on
the GPU.

* ck: clean up gpu_verification.hpp using block_reduce

This implements a small generic block reduce function, and rewrites
the rest of gpu_verification.hpp using that function to clean it up
a bit.

* ck-builder: doc typos

* ck-builder: update testing readme with validation interface.

* ck-builder: rebase fixes + review comments

* ck-builder: fix device integer generation with float types

Passing bfloat here causes NaNs, due to type_convert performing
a bitcast.

* ck: another bhalf_t bug

CK expects that int-generation with ck::bhalf_t yields bhalf integers,
not unsigned integers. This makes the logic of FillUniformRandInteger
compatible with GeneratorTensor_2<InDataType>, however idiotic that
may be.

This commit is contained in:
Robin Voetter, 2026-01-28 17:41:02 +01:00, committed by GitHub
parent d6cccf6093 · commit 42048bdb7d
11 changed files with 636 additions and 291 deletions


@@ -12,7 +12,8 @@ The core components are:
- **`Args`**: A struct template that holds runtime parameters for a specific test case.
- **`Input`** and **`Output`**: Helper classes that group operation inputs and outputs.
- **`Validator`**: A utility that performs on-GPU validation and integrates with GoogleTest/GoogleMock.
- **`run()`**: Invokes an algorithm on the GPU.
- **`validate()`**: A utility that performs on-GPU validation and integrates with GoogleTest/GoogleMock.
Together, these components enable a structured approach to kernel testing that mirrors the Given-When-Then pattern commonly used in behavior-driven development.
@@ -200,26 +201,27 @@ auto reference_outputs = ck_tile::builder::test::allocate_outputs(args);
ck_tile::builder::test::run(conv, args, inputs.get(), reference_outputs.get());
```
#### `Validator<SIGNATURE>`
#### Validating Results
The `Validator` class encapsulates the validation logic. It performs on-GPU correctness checks by comparing two instances of the `Outputs` structure.
In order to verify that the results of the executed device operation are correct, they are compared against the reference output obtained in the previous step. This is done by calling `validate()` with the runtime arguments of the operation and both the actual and reference outputs. The call yields a *`ValidationReport`*, a type that records which output tensors are considered equivalent and which differ. The element-wise comparison itself runs on the GPU to keep the tests fast.
```cpp
ck_tile::builder::test::Validator<SIGNATURE> validator(outputs.get(), reference_outputs.get());
const auto report = ck_tile::builder::test::validate(args, outputs.get(), reference_outputs.get());
```
The `Validator` provides methods that return GoogleMock matchers, enabling clean integration with GoogleTest:
`ValidationReport::get_errors()` returns a vector of tensors from both outputs which are considered to be incorrect; each error case exposes information about what went wrong.
```cpp
EXPECT_THAT(validator.result(), validator.matches_reference_output());
for (const auto& e : report.get_errors()) {
std::cout << "error: " << e.tensor_name << " was incorrect!" << std::endl;
}
```
The `matches_reference_output()` matcher checks that the output is numerically correct within acceptable tolerances. The `Validator` can also provide more detailed diagnostics, such as:
GoogleTest/GoogleMock integration is provided via the `MatchesReference` matcher, which invokes `validate()` internally and turns the result into a GoogleMock-compatible value. Note that this matcher is closely tied to GoogleMock and the test setup that CK-Builder uses internally, and is therefore not exposed through the CK-Builder public interface.
- Maximum absolute error
- Maximum relative error
- Number of mismatched elements
- Specific locations of errors
```cpp
EXPECT_THAT(outputs.get(), MatchesReference(args, reference_outputs.get()));
```
## Complete Example
@@ -232,6 +234,7 @@ Here's a complete test that demonstrates the Given-When-Then pattern:
#include "ck_tile/builder/conv_builder.hpp"
#include "ck_tile/testing/tensor_memory_manager.hpp"
#include "ck_tile/testing/validator.hpp"
#include "testing_utils.hpp"
// Define the convolution signature
struct ConvSignature {
@@ -318,8 +321,7 @@ TEST(ConvolutionTest, Forward2D_FP16) {
ck_tile::builder::test::run(conv, args, inputs.get(), reference_outputs.get());
// Check the results
ck_tile::builder::test::Validator<SIGNATURE> validator(outputs.get(), reference_outputs.get());
EXPECT_THAT(validator.result(), validator.is_ok());
EXPECT_THAT(outputs.get(), ck_tile::test::MatchesReference(args, reference_outputs.get()));
}
```
@@ -333,7 +335,7 @@ TEST(ConvolutionTest, Forward2D_FP16) {
4. **Flexibility**: The `Args` struct can be easily extended to support different test scenarios, `Inputs` and `Outputs` can be modified to support additional tensors where necessary, and alternatives to `init_inputs()` can be provided to support additional testing strategies.
5. **Integration**: The `Validator` integrates seamlessly with GoogleTest/GoogleMock, providing familiar assertion syntax.
5. **Integration**: `validate()` integrates seamlessly with GoogleTest/GoogleMock through `MatchesReference`, providing familiar assertion syntax.
6. **Maintainability**: Changes to the testing infrastructure are localized to the utility classes, not scattered across individual tests.


@@ -6,6 +6,7 @@
#include "ck_tile/builder/testing/tensor_descriptor.hpp"
#include "ck_tile/builder/factory/helpers/ck/conv_tensor_type.hpp"
#include <cstdint>
#include <cassert>
#include <concepts>
#include <array>
@@ -348,4 +349,115 @@ void clear_tensor_buffer(const TensorDescriptor<DT, RANK>& desc,
fill_tensor_buffer(desc, buffer, [value]([[maybe_unused]] size_t i) { return value; });
}
/// @brief Utility for copying a tensor from one layout to another
///
/// This function copies tensor data from `src_buffer` to `dst_buffer`,
/// changing the layout from `src_desc` to `dst_desc`. Note that the src and
/// dst tensor lengths must match; otherwise this function may write
/// out of bounds.
///
/// @tparam DT The element datatype of both tensors.
/// @tparam RANK The rank (number of spatial dimensions) of the tensors.
///
/// @param src_desc The descriptor of the source tensor to copy from.
/// @param src_buffer The memory of the source tensor.
/// @param dst_desc The descriptor of the destination tensor to copy to.
/// @param dst_buffer The memory of the destination tensor.
template <DataType DT, size_t RANK>
void copy_tensor(const TensorDescriptor<DT, RANK>& src_desc,
const void* src_buffer,
const TensorDescriptor<DT, RANK>& dst_desc,
void* dst_buffer)
{
assert(src_desc.get_lengths() == dst_desc.get_lengths());
const auto src_strides = src_desc.get_strides();
const auto dst_strides = dst_desc.get_strides();
tensor_foreach(dst_desc.get_lengths(),
[src_buffer, dst_buffer, src_strides, dst_strides](const auto& index) {
using T = detail::cpp_type_t<DT>;
const auto* src = static_cast<const T*>(src_buffer);
auto* dst = static_cast<T*>(dst_buffer);
const auto src_off = calculate_offset(index, src_strides);
const auto dst_off = calculate_offset(index, dst_strides);
dst[dst_off] = src[src_off];
});
}
/// @brief Simple iterator implementation over tensors.
///
/// This type implements a simple "iterator" over tensors, essentially
/// exposing operator[] for flat indices. It is useful for providing a
/// "pointer-like" object to APIs that expect linear pointers rather
/// than higher-dimensional tensor types. Ideally, replacing the
/// `T* ptr` parameter with `Iterator it` is all that is needed to make
/// such an API compatible with this type.
///
/// @note This is not intended to be a full implementation of the C++
/// iterator concept. For example, it holds no iteration state, since
/// none is needed for this use case.
///
/// @tparam DT The datatype of the tensor to iterate over. Note that this
/// is only here for reference purposes, the actual data type of the backing
/// memory is provided via the backing iterator type.
/// @tparam RANK The rank (number of spatial dimensions) of the tensors.
/// @tparam Iterator The backing iterator type. This can be a (non-void)
/// pointer type.
template <DataType DT, size_t RANK, typename Iterator>
struct FlatTensorIterator
{
/// @brief Construct a FlatTensorIterator.
///
/// Construct a FlatTensorIterator from a tensor descriptor and a backing
/// iterator. The backing iterator can simply be a non-void pointer type;
/// note that the result of FlatTensorIterator::operator[] is the same as
/// that of Iterator::operator[].
///
/// @param desc The descriptor of the tensor to iterate.
/// @param inner The inner iterator, for example a (non-void) pointer.
FlatTensorIterator(const TensorDescriptor<DT, RANK>& desc, Iterator inner)
: iter_(desc.get_lengths()), strides_(desc.get_strides()), inner_(inner)
{
}
/// @brief Return the value at a particular flat index.
///
/// This function returns the value of the tensor at flat coordinate
/// `flat_index`. The index is unflattened into a multi-dimensional
/// index in the manner described by `NdIter`, and a tensor offset
/// is computed from that according to `calculate_offset`. The value at
/// that offset in the inner iterator is then the return value of this
/// function.
///
/// @note NdIter iterates such that the inner dimension (right-most value
/// in the tensor shape) changes fastest.
///
/// @note This function performs no bounds checking.
///
/// @param flat_index The flat index into this tensor.
///
/// @pre flat_index < numel()
///
/// @see NdIter
__host__ __device__ auto& operator[](size_t flat_index) const
{
const auto index = iter_(flat_index);
const auto offset = calculate_offset(index, strides_);
return inner_[offset];
}
/// @brief Return the total number of elements to iterate over.
///
/// @see NdIter::numel()
__host__ __device__ size_t numel() const { return iter_.numel(); }
private:
NdIter<RANK> iter_;
Extent<RANK> strides_;
Iterator inner_;
};
template <DataType DT, size_t RANK, typename Iterator>
FlatTensorIterator(const TensorDescriptor<DT, RANK>&,
Iterator) -> FlatTensorIterator<DT, RANK, Iterator>;
} // namespace ck_tile::builder::test


@@ -8,6 +8,7 @@
#include "ck_tile/builder/testing/tensor_foreach.hpp"
#include "ck_tile/builder/factory/helpers/ck/conv_tensor_type.hpp"
#include "ck/utility/type_convert.hpp"
#include "ck/library/utility/gpu_verification.hpp"
#include <string_view>
#include <vector>
#include <algorithm>
@@ -48,8 +49,8 @@ struct ValidationReport
/// The total number of elements in each tensor.
uint64_t total_elements;
/// The number of elements which were bitwise 0.
uint64_t zero_elements;
/// Set to true if both tensors have all their elements be 0.
bool both_all_zero;
// Max error.
double max_error;
@@ -59,7 +60,7 @@ struct ValidationReport
/// If both tensors are all zero, it indicates either an incorrect testing setup
/// or an issue with the testing framework. For that reason we also consider that
/// a failure.
bool is_all_zero() const { return zero_elements == total_elements; }
bool is_all_zero() const { return both_all_zero; }
/// @brief Return whether the check associated to this case was successful.
///
@@ -86,7 +87,7 @@ struct ValidationReport
/// @brief Compare two tensors and record the results in the report.
///
/// This is the main function used to compare two tensors. The results of this
/// This is one of the main functions used to compare two tensors. The results of this
/// comparison, including any supplemental information, is recorded into the report.
///
/// @returns `false` if the comparison failed. If so, the details can be found via
@@ -111,8 +112,45 @@ struct ValidationReport
const TensorDescriptor<DT, RANK>& descriptor,
const void* actual,
const void* expected,
double rtol = 1e-3,
double atol = 1e-3);
float rtol = 1e-3f,
float atol = 1e-3f);
/// @brief Compare two tensors and record the results in the report, with automatic
/// computation of tolerances.
///
/// This variant computes the tolerances automatically based on the compute
/// (accumulation) type, and the number of accumulations required per result value.
/// This is one of the main functions used to compare two tensors. The results of this
/// comparison, including any supplemental information, is recorded into the report.
/// @returns `false` if the comparison failed. If so, the details can be found via
/// `get_errors()`.
///
/// @tparam OutDataType The data type of the tensors to check. This is the type of the
/// values in tensor memory.
/// @tparam ComputeType The data type that tensor operations are computed with internally.
/// @tparam AccType The data type that tensor values are accumulated with internally.
/// @tparam RANK The rank (number of spatial dimensions) of the tensor to check.
///
/// @param tensor_name The name of the tensors to check. This should be a value by which
/// whoever is debugging the associated test later can easily find out which of the
/// outputs of a device operation was incorrect.
/// @param descriptor The descriptor (memory layout) of the tensor.
/// @param actual The device buffer with the values of the tensor to-be-tested, ie, the
/// results of the device operation.
/// @param expected The device buffer with the values of the reference tensor. These are
/// treated as a "golden standard", and should usually be generated by a reference
/// implementation.
/// @param number_of_accumulations The maximum number of accumulations required to compute
/// a value of the result tensor.
template <DataType OutDataType,
DataType ComputeType = OutDataType,
DataType AccType = ComputeType,
size_t RANK>
bool check_by_accumulations(std::string_view tensor_name,
const TensorDescriptor<OutDataType, RANK>& descriptor,
const void* actual,
const void* expected,
const size_t number_of_accumulations);
private:
std::vector<Case> reports_;
@@ -121,89 +159,58 @@ struct ValidationReport
template <DataType DT, size_t RANK>
bool ValidationReport::check(std::string_view tensor_name,
const TensorDescriptor<DT, RANK>& descriptor,
const void* actual_data,
const void* expected_data,
double rtol,
double atol)
const void* actual,
const void* expected,
float rtol,
float atol)
{
const auto strides = descriptor.get_strides();
using CKType = detail::cpp_type_t<DT>;
// During development and CI, only the kernels that were changed would fail, and so we can
// assume that the average case does not have errors. Therefore, split out testing into a
// quick test which just counts the incorrect elements, and a more in-depth test that also
// returns the indices of the incorrect items.
const auto a_it = FlatTensorIterator(descriptor, static_cast<const CKType*>(actual));
const auto e_it = FlatTensorIterator(descriptor, static_cast<const CKType*>(expected));
const auto numel = a_it.numel();
// Initial pass: count errors
// Allocate and reset counter
auto d_counters = alloc_buffer(sizeof(uint64_t) * 3);
check_hip(hipMemset(d_counters.get(), 0, sizeof(uint64_t) * 3));
auto d_error_count = &reinterpret_cast<uint64_t*>(d_counters.get())[0];
auto d_zero_count = &reinterpret_cast<uint64_t*>(d_counters.get())[1];
auto d_max_error = &reinterpret_cast<double*>(d_counters.get())[2];
tensor_foreach(descriptor.get_lengths(), [=](auto index) {
using CKType = typename factory::internal::DataTypeToCK<DT>::type;
const auto* actual = static_cast<const CKType*>(actual_data);
const auto* expected = static_cast<const CKType*>(expected_data);
static_assert(!std::is_same_v<CKType, double>,
"TODO implement compare_kernel() for double");
const auto offset = calculate_offset(index, strides);
const auto a = actual[offset];
const auto b = expected[offset];
const auto o = static_cast<double>(type_convert<float>(a));
const auto r = static_cast<double>(type_convert<float>(b));
const auto err = std::abs(o - r);
atomicMax(d_max_error, err);
if(err > atol + rtol * std::abs(r) || !std::isfinite(o) || !std::isfinite(r))
{
// We expect the number of errors to be very low, so just use an atomic
// for now.
atomicAdd(d_error_count, 1);
}
// Now compare the numbers as bitwise too.
// Update the counter if they're both zero.
using Bytes = std::array<std::byte, sizeof(CKType)>;
bool all_zero = true;
for(auto x : std::bit_cast<Bytes>(a))
{
if(x != std::byte{0})
all_zero = false;
}
for(auto x : std::bit_cast<Bytes>(b))
{
if(x != std::byte{0})
all_zero = false;
}
if(all_zero)
{
atomicAdd(d_zero_count, 1);
}
});
uint64_t error_count = 0;
check_hip(hipMemcpy(&error_count, d_error_count, sizeof(uint64_t), hipMemcpyDeviceToHost));
uint64_t zero_count = 0;
check_hip(hipMemcpy(&zero_count, d_zero_count, sizeof(uint64_t), hipMemcpyDeviceToHost));
double max_error = 0;
check_hip(hipMemcpy(&max_error, d_max_error, sizeof(double), hipMemcpyDeviceToHost));
const auto result = ck::profiler::gpu_verify<CKType>(a_it, e_it, rtol, atol, numel);
// TODO: Gather detailed coordinates.
reports_.push_back(Case{
.tensor_name = std::string(tensor_name),
.wrong_elements = error_count,
.wrong_elements = result.error_count,
.total_elements = descriptor.get_element_size(),
.zero_elements = zero_count,
.max_error = max_error,
.both_all_zero = result.all_zero,
.max_error = result.max_error,
});
return reports_.back().is_ok();
}
template <DataType OutDataType, DataType ComputeType, DataType AccType, size_t RANK>
bool ValidationReport::check_by_accumulations(std::string_view tensor_name,
const TensorDescriptor<OutDataType, RANK>& descriptor,
const void* actual,
const void* expected,
const size_t number_of_accumulations)
{
using CKComputeType = detail::cpp_type_t<ComputeType>;
using CKAccType = detail::cpp_type_t<AccType>;
using CKOutDataType = detail::cpp_type_t<OutDataType>;
const auto a_it = FlatTensorIterator(descriptor, static_cast<const CKOutDataType*>(actual));
const auto e_it = FlatTensorIterator(descriptor, static_cast<const CKOutDataType*>(expected));
const auto numel = a_it.numel();
const auto result = ck::profiler::gpu_verify<CKOutDataType, CKComputeType, CKAccType>(
a_it, e_it, static_cast<int>(number_of_accumulations), numel);
// TODO: Gather detailed coordinates.
reports_.push_back(Case{
.tensor_name = std::string(tensor_name),
.wrong_elements = result.error_count,
.total_elements = descriptor.get_element_size(),
.both_all_zero = result.all_zero,
.max_error = result.max_error,
});
return reports_.back().is_ok();


@@ -209,7 +209,8 @@ struct ReferenceOutputMatcher
// Round to 2 digits
const float percentage = e.wrong_elements * 10000 / e.total_elements / 100.f;
*listener << e.wrong_elements << "/" << e.total_elements
<< " incorrect elements (~" << percentage << "%)";
<< " incorrect elements (~" << percentage << "%)," << " max error "
<< e.max_error;
}
}
}


@@ -98,8 +98,10 @@ TEST(ConvFwdTesting, Validate)
[&]([[maybe_unused]] std::string_view name,
const auto& desc,
void* ckt::Outputs<SIGNATURE>::*ptr) {
ckt::clear_tensor_buffer(desc, a.get().*ptr, ck::bhalf_t{123});
ckt::clear_tensor_buffer(desc, b.get().*ptr, ck::bhalf_t{123});
ckt::clear_tensor_buffer(
desc, a.get().*ptr, ck::type_convert<ck::bhalf_t, float>(123));
ckt::clear_tensor_buffer(
desc, b.get().*ptr, ck::type_convert<ck::bhalf_t, float>(123));
});
const auto report = ckt::validate(ARGS, a.get(), b.get());
@@ -115,8 +117,10 @@ TEST(ConvFwdTesting, Validate)
const auto& desc,
void* ckt::Outputs<SIGNATURE>::*ptr) {
++field_count;
ckt::clear_tensor_buffer(desc, a.get().*ptr, ck::bhalf_t{2});
ckt::clear_tensor_buffer(desc, b.get().*ptr, ck::bhalf_t{1});
ckt::clear_tensor_buffer(
desc, a.get().*ptr, ck::type_convert<ck::bhalf_t, float>(2));
ckt::clear_tensor_buffer(
desc, b.get().*ptr, ck::type_convert<ck::bhalf_t, float>(1));
});
const auto report = ckt::validate(ARGS, a.get(), b.get());


@@ -225,3 +225,99 @@ TEST(TensorForeach, ClearTensorZeros)
EXPECT_THAT(actual, Eq(0));
}
TEST(TensorForeach, CopyTensor)
{
constexpr auto dt = ckb::DataType::I32;
const ckt::Extent shape = {10, 3, 45, 23, 6};
using Counter = uint32_t;
const auto src_desc = ckt::make_descriptor<dt>(shape, ckt::PackedRightLayout{});
const auto dst_desc = ckt::make_descriptor<dt>(shape, ckt::PackedLeftLayout{});
auto src_buffer = ckt::alloc_tensor_buffer(src_desc);
auto dst_buffer = ckt::alloc_tensor_buffer(dst_desc);
const auto gen = [](const auto& index, const auto& lengths) {
// Simple incrementing counter
return static_cast<Counter>(ckt::calculate_offset(index, lengths));
};
ckt::fill_tensor(
src_desc, src_buffer.get(), [lengths = src_desc.get_lengths(), gen](const auto& index) {
return gen(index, lengths);
});
ckt::clear_tensor_buffer(dst_desc, dst_buffer.get());
// Perform the actual test
ckt::copy_tensor(src_desc, src_buffer.get(), dst_desc, dst_buffer.get());
// Check that the dst tensor has the same data
auto d_invalid = ckt::alloc_buffer(sizeof(Counter));
ckt::check_hip(hipMemset(d_invalid.get(), 0, sizeof(Counter)));
ckt::tensor_foreach(shape,
[lengths = dst_desc.get_lengths(),
gen,
dst = dst_buffer.get(),
invalid = reinterpret_cast<Counter*>(d_invalid.get()),
strides = dst_desc.get_strides()](const auto& index) {
const auto offset = ckt::calculate_offset(index, strides);
const auto expected = gen(index, lengths);
const auto actual = reinterpret_cast<const Counter*>(dst)[offset];
if(expected != actual)
atomicAdd(invalid, 1);
});
Counter invalid = 0;
ckt::check_hip(hipMemcpy(&invalid, d_invalid.get(), sizeof(Counter), hipMemcpyDeviceToHost));
EXPECT_THAT(invalid, Eq(0));
}
TEST(TensorForeach, FlatTensorIterator)
{
using Counter = uint32_t;
constexpr auto dt = ckb::DataType::I32;
const ckt::Extent shape = {10, 9, 8, 7, 6, 5, 4, 3, 2, 1};
const ckt::Extent packed_strides = ckt::PackedRightLayout{}(shape);
const auto desc = ckt::make_descriptor<dt>(shape, ckt::PackedLeftLayout{});
auto buffer = ckt::alloc_tensor_buffer(desc);
// Fill the tensor with values derived from the *flat* index. The
// FlatTensorIterator iterates over flat values even if the strides are not
// packed, so indexing these elements according to the flat index in the
// iterator should yield again this value.
ckt::fill_tensor(desc, buffer.get(), [packed_strides](const auto& index) {
const auto flat_index = ckt::calculate_offset(index, packed_strides);
return static_cast<int32_t>(flat_index * 10001 % 1001);
});
auto iterator = ckt::FlatTensorIterator(desc, reinterpret_cast<const int32_t*>(buffer.get()));
auto d_invalid = ckt::alloc_buffer(sizeof(Counter));
ckt::check_hip(hipMemset(d_invalid.get(), 0, sizeof(Counter)));
ckt::tensor_foreach(shape,
[iterator,
packed_strides,
strides = desc.get_strides(),
data = reinterpret_cast<const int32_t*>(buffer.get()),
invalid = reinterpret_cast<Counter*>(d_invalid.get())](const auto& index) {
const auto flat_index = ckt::calculate_offset(index, packed_strides);
const auto offset = ckt::calculate_offset(index, strides);
if(iterator[flat_index] != data[offset])
atomicAdd(invalid, 1);
});
Counter invalid = 0;
ckt::check_hip(hipMemcpy(&invalid, d_invalid.get(), sizeof(Counter), hipMemcpyDeviceToHost));
EXPECT_THAT(invalid, Eq(0));
}


@@ -74,7 +74,8 @@ TYPED_TEST(ValidationReportTests, SingleCorrect)
ckt::fill_tensor(desc, b.get(), generator);
ckt::ValidationReport report;
report.check("correct", desc, b.get(), a.get());
report.check("correct - explicit tolerance", desc, b.get(), a.get());
report.check_by_accumulations("correct - implicit tolerance", desc, b.get(), a.get(), 0);
EXPECT_THAT(report.get_errors().size(), Eq(0));
}
@@ -97,17 +98,22 @@ TYPED_TEST(ValidationReportTests, SingleIncorrect)
});
ckt::ValidationReport report;
report.check("incorrect", desc, b.get(), a.get());
report.check("incorrect - explicit tolerance", desc, b.get(), a.get());
report.check_by_accumulations("incorrect - implicit tolerance", desc, b.get(), a.get(), 0);
const auto errors = report.get_errors();
const auto flat_size = desc.get_element_size();
const auto expected_errors = flat_size >= 999999 ? 3 : flat_size >= 12345 ? 2 : 1;
ASSERT_THAT(errors.size(), Eq(1));
EXPECT_THAT(errors[0].tensor_name, StrEq("incorrect"));
EXPECT_THAT(errors[0].wrong_elements, Eq(expected_errors));
EXPECT_THAT(errors[0].total_elements, Eq(desc.get_element_size()));
ASSERT_THAT(errors.size(), Eq(2));
EXPECT_THAT(errors[0].tensor_name, StrEq("incorrect - explicit tolerance"));
EXPECT_THAT(errors[1].tensor_name, StrEq("incorrect - implicit tolerance"));
for(int i = 0; i < 2; ++i)
{
EXPECT_THAT(errors[i].wrong_elements, Eq(expected_errors));
EXPECT_THAT(errors[i].total_elements, Eq(desc.get_element_size()));
}
}
TYPED_TEST(ValidationReportTests, ZeroIsIncorrect)
@@ -121,14 +127,20 @@ TYPED_TEST(ValidationReportTests, ZeroIsIncorrect)
ckt::clear_tensor_buffer(desc, b.get());
ckt::ValidationReport report;
report.check("zero_is_incorrect", desc, b.get(), a.get());
report.check("zero_is_incorrect - explicit tolerance", desc, b.get(), a.get());
report.check_by_accumulations(
"zero_is_incorrect - implicit tolerance", desc, b.get(), a.get(), 0);
const auto errors = report.get_errors();
ASSERT_THAT(errors.size(), Eq(1));
EXPECT_THAT(errors[0].tensor_name, StrEq("zero_is_incorrect"));
EXPECT_THAT(errors[0].wrong_elements, Eq(0));
EXPECT_THAT(errors[0].total_elements, Eq(desc.get_element_size()));
EXPECT_THAT(errors[0].zero_elements, Eq(desc.get_element_size()));
ASSERT_THAT(errors.size(), Eq(2));
EXPECT_THAT(errors[0].tensor_name, StrEq("zero_is_incorrect - explicit tolerance"));
EXPECT_THAT(errors[1].tensor_name, StrEq("zero_is_incorrect - implicit tolerance"));
for(int i = 0; i < 2; ++i)
{
EXPECT_THAT(errors[i].wrong_elements, Eq(0));
EXPECT_THAT(errors[i].total_elements, Eq(desc.get_element_size()));
EXPECT_THAT(errors[i].both_all_zero, Eq(true));
}
}
TEST(ValidationReportTests, MultipleSomeIncorrect)
@@ -143,11 +155,12 @@ TEST(ValidationReportTests, MultipleSomeIncorrect)
auto b = ckt::alloc_tensor_buffer(desc);
ckt::fill_tensor_buffer(
desc, a.get(), [](size_t i) { return ck::type_convert<ck::bhalf_t>(i % 100); });
desc, a.get(), [](size_t i) { return ck::type_convert<ck::bhalf_t>(float(i % 100)); });
ckt::fill_tensor_buffer(
desc, b.get(), [](size_t i) { return ck::type_convert<ck::bhalf_t>(i % 101); });
desc, b.get(), [](size_t i) { return ck::type_convert<ck::bhalf_t>(float(i % 101)); });
report.check("incorrect 1", desc, b.get(), a.get());
report.check("incorrect 1 - explicit tolerance", desc, b.get(), a.get());
report.check_by_accumulations("incorrect 1 - implicit tolerance", desc, b.get(), a.get(), 0);
}
{
@@ -169,7 +182,8 @@ TEST(ValidationReportTests, MultipleSomeIncorrect)
}
});
report.check("correct", desc, b.get(), a.get());
report.check("correct - explicit tolerance", desc, b.get(), a.get());
report.check_by_accumulations("correct - implicit tolerance", desc, b.get(), a.get(), 0);
}
{
@@ -182,16 +196,21 @@ TEST(ValidationReportTests, MultipleSomeIncorrect)
ckt::fill_tensor_buffer(desc, a.get(), []([[maybe_unused]] size_t i) { return 1; });
ckt::fill_tensor_buffer(desc, b.get(), []([[maybe_unused]] size_t i) { return 555; });
report.check("incorrect 2", desc, b.get(), a.get());
report.check("incorrect 2 - explicit tolerance", desc, b.get(), a.get());
report.check_by_accumulations("incorrect 2 - implicit tolerance", desc, b.get(), a.get(), 0);
}
const auto errors = report.get_errors();
ASSERT_THAT(errors.size(), Eq(2));
EXPECT_THAT(errors[0].tensor_name, StrEq("incorrect 1"));
ASSERT_THAT(errors.size(), Eq(4));
EXPECT_THAT(errors[0].tensor_name, StrEq("incorrect 1 - explicit tolerance"));
EXPECT_THAT(errors[0].wrong_elements, Eq(46840334));
EXPECT_THAT(errors[1].tensor_name, StrEq("incorrect 2"));
EXPECT_THAT(errors[1].wrong_elements, Eq(482800));
EXPECT_THAT(errors[1].tensor_name, StrEq("incorrect 1 - implicit tolerance"));
EXPECT_THAT(errors[1].wrong_elements, Eq(46840334));
EXPECT_THAT(errors[2].tensor_name, StrEq("incorrect 2 - explicit tolerance"));
EXPECT_THAT(errors[2].wrong_elements, Eq(482800));
EXPECT_THAT(errors[3].tensor_name, StrEq("incorrect 2 - implicit tolerance"));
EXPECT_THAT(errors[3].wrong_elements, Eq(482800));
}
// MatchesReference operates on the types defined in testing.hpp, so just
@@ -234,7 +253,7 @@ ValidationReport validate<DUMMY_SIGNATURE>(const Args<DUMMY_SIGNATURE>& args,
{
ValidationReport report;
report.check("a", args.make_a_descriptor(), actual.a, expected.a);
report.check("b", args.make_b_descriptor(), actual.b, expected.b);
report.check_by_accumulations("b", args.make_b_descriptor(), actual.b, expected.b, 0);
return report;
}
@@ -299,5 +318,5 @@ TEST(MatchesReference, Incorrect)
EXPECT_THAT(listener.str(),
StringEqWithDiff( //
"1 tensors failed to validate\n"
" - a: 625/625 incorrect elements (~100%)"));
" - a: 625/625 incorrect elements (~100%), max error 1"));
}