[CK_TILE] Support for elementwise kernel (#2246)

* Elementwise kernel implementation

Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>

* Elementwise with generalized nDims

* Adding the n-ary input tensor feature

* Generalize dimensions on top of inputs

* Add TFLOPS + remove std usage for tuples

* 1D basecase optimization

* Cleanup code + refactoring to a common interface

* Generalize to unary and add an example

* Cleanup, refactoring and commenting

* Suggestions for LWPCK-3170: elementwise kernel improvements

* Clang-format: remod.py

* Replace InputTensorType with XDataType as the type of input_tensors

* Add Tuple::apply and use it in ElementWiseKernel::operator() to call the operation with the exact number of arguments in xs (a minimal sketch of the unpacking idea follows this list)

* Move examples to folder 19_elementwise

* Add missing copyright headers and fix some existing ones

* Replace an assert with throw std::runtime_error in elementwise example

* Avoid reading the output by using make_static_distributed_tensor for y_tile

* Removed two unused includes

* No need to move windows to the next block when each workgroup processes a single tile

* Only copy input tensors to the device

* Use get_warp_size to obtain the warp size, and use ceiling division for the grid size in the unary example as well

* Add output strides to the kernel, add a transposition example, and update the other examples

* Changes made by remod.py

* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp

* Reuse the generic reference binary/unary operations in the examples + refactor the transpose reference

* Fix comments in elementwise_example.cpp

- Use AMD terminology, giving NVIDIA alternatives only in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors

* Simplify CMakeLists.txt and remove the unused variables this uncovers

* Rename a file and fix some copyright statements

* Changes made by script/clang-format-overwrite.sh

* Add basic unit test for ElementWiseKernel

* Remove left-over uninformative comment in apply unit test

* Changes made by clang-format-overwrite.sh

* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp

* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array

* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally

* Move examples to folder 20_elementwise

* Reduce register pressure in the CK tile elementwise kernel + add a 4D input example to allow benchmarking against old CK

* Fix clang formatting

* Bump up the elementwise example folder number

* Elementwise: add padding + minor cleanup

* Add vector size inference + fix wrong vectorization caused by a missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view (a rough sketch of the idea follows this list)

* Add IsSupportedArgument to the elementwise kernel + adapt the example and unit tests

* Fix clang-format on the unit test file
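
One of the items above adds Tuple::apply so that ElementWiseKernel::operator() can invoke the operation with exactly as many arguments as there are inputs in xs. A minimal sketch of that unpacking idea, written here with std::tuple and std::index_sequence rather than ck_tile's own Tuple (the helper name call_with_tuple is illustrative, not part of the library):

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Illustrative only: unpack a tuple of loaded input values and call `op`
// with exactly that many arguments (two for Add, one for Relu, ...).
template <typename Op, typename Y, typename Tuple, std::size_t... Is>
void call_with_tuple_impl(Op& op, Y& y, const Tuple& xs, std::index_sequence<Is...>)
{
    op(y, std::get<Is>(xs)...);
}

template <typename Op, typename Y, typename Tuple>
void call_with_tuple(Op& op, Y& y, const Tuple& xs)
{
    call_with_tuple_impl(op, y, xs, std::make_index_sequence<std::tuple_size_v<Tuple>>{});
}
// Usage: call_with_tuple(add_op, y, std::make_tuple(x0, x1));
```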
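
The vector size inference item is, at its core, about choosing the widest contiguous load that still divides the innermost dimension. The sketch below is a rough illustration under simplifying assumptions, not ck_tile's actual policy; in this PR, sizes that do not match the chosen vector width are instead rejected through IsSupportedArgument, which is what the 513-element unit test in the new test file exercises:

```cpp
// Rough illustration only (not ck_tile's actual policy): pick the largest
// power-of-two vector size, up to max_vec, that evenly divides the length
// of the contiguous last dimension.
constexpr int infer_vector_size(int last_dim_len, int max_vec = 8)
{
    int vec = max_vec;
    while(vec > 1 && last_dim_len % vec != 0)
        vec /= 2;
    return vec; // infer_vector_size(1024) == 8, infer_vector_size(516) == 4
}
```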

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Authored by Yashvardhan Agarwal on 2025-07-24 12:21:45 +03:00, committed by GitHub
commit 606b0cc947 (parent 6681593864)
23 changed files with 1509 additions and 6 deletions


@@ -0,0 +1,216 @@
// SPDX-License-Identifier: MIT
// Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
#include <gtest/gtest.h>
#include <vector>
#include <cmath> // For std::abs
#include <tuple>
#include <type_traits> // For std::is_same_v, std::is_floating_point_v
#include <utility> // For std::index_sequence, std::forward
#include "ck_tile/core.hpp"
#include "ck_tile/host.hpp"
#include "ck_tile/host/kernel_launch.hpp"
#include "ck_tile/ops/elementwise/kernel/elementwise_kernel.hpp"
#include "ck_tile/ops/elementwise/pipeline/elementwise_pipeline_problem.hpp"
#include "ck_tile/ops/elementwise/pipeline/elementwise_pipeline_default_policy.hpp"
#include "ck_tile/ops/elementwise/pipeline/elementwise_shape.hpp"
#include "ck_tile/ops/elementwise/binary_elementwise_operation.hpp"
#include "ck_tile/ops/elementwise/unary_element_wise_operation.hpp"
// Traits to get number of inputs for an elementwise operation
template <typename Op>
struct elementwise_op_traits;
template <>
struct elementwise_op_traits<ck_tile::element_wise::Add>
{
static constexpr int num_inputs = 2;
};
template <>
struct elementwise_op_traits<ck_tile::element_wise::Relu>
{
static constexpr int num_inputs = 1;
};
template <std::size_t D, typename F>
auto make_uniform_array_with_factory(F&& factory)
{
return [&]<std::size_t... Is>(std::index_sequence<Is...>)
{
return std::array<std::invoke_result_t<F, std::size_t>, D>{factory(Is)...};
}
(std::make_index_sequence<D>{});
}
template <typename Tuple>
class TestCkTileElementwise : public ::testing::Test
{
protected:
using XDataType = std::tuple_element_t<0, Tuple>;
using YDataType = std::tuple_element_t<1, Tuple>;
using ComputeDataType = std::tuple_element_t<2, Tuple>;
using ElementwiseOpType = std::tuple_element_t<3, Tuple>;
using BlockWarps_ = std::tuple_element_t<4, Tuple>;
using BlockTile_ = std::tuple_element_t<5, Tuple>;
using WarpTile_ = std::tuple_element_t<6, Tuple>;
using TestElementWiseShape =
ck_tile::ElementWiseShape<BlockWarps_, BlockTile_, WarpTile_, ComputeDataType>;
static constexpr int NumInputs = elementwise_op_traits<ElementwiseOpType>::num_inputs;
void RunTest(ck_tile::index_t total_m_elements)
{
// Dims and Strides (1D example)
auto lens = ck_tile::make_tuple(total_m_elements);
auto strides = ck_tile::make_tuple(
static_cast<ck_tile::index_t>(1)); // Strides for the single dimension
// Host Tensors
auto h_xs = make_uniform_array_with_factory<NumInputs>([&](std::size_t) {
auto ret = ck_tile::HostTensor<XDataType>({total_m_elements});
ck_tile::FillUniformDistribution<XDataType>{0.f, 5.f}(ret);
return ret;
});
ck_tile::HostTensor<YDataType> h_y({total_m_elements});
h_y.SetZero();
ck_tile::HostTensor<YDataType> h_y_ref({total_m_elements});
h_y_ref.SetZero();
// Device Buffers
auto d_xs_mems_owner = make_uniform_array_with_factory<NumInputs>(
[&](std::size_t i) { return ck_tile::DeviceMem(h_xs[i]); });
for(int i = 0; i < NumInputs; ++i)
{
d_xs_mems_owner[i].ToDevice(h_xs[i].data());
}
ck_tile::DeviceMem d_y_mem(h_y);
d_y_mem.SetZero();
auto d_x_ptrs_tuple = [&]<std::size_t... Is>(std::index_sequence<Is...>)
{
return ck_tile::make_tuple(
static_cast<const XDataType*>(d_xs_mems_owner[Is].GetDeviceBuffer())...);
}
(std::make_index_sequence<NumInputs>{});
YDataType* p_y_device = static_cast<YDataType*>(d_y_mem.GetDeviceBuffer());
// Problem and Policy
using Problem = ck_tile::ElementWisePipelineProblem<XDataType,
ComputeDataType,
YDataType,
TestElementWiseShape,
ElementwiseOpType>;
using Policy = ck_tile::ElementWiseDefaultPolicy;
ck_tile::ElementWiseKernel<Problem, Policy> ew_kernel;
// Launch configuration
ck_tile::index_t grid_size =
(total_m_elements + TestElementWiseShape::kBlockM - 1) / TestElementWiseShape::kBlockM;
dim3 grid(grid_size, 1, 1);
dim3 block(TestElementWiseShape::kBlockSize, 1, 1);
constexpr ck_tile::index_t kBlockPerCu = 1;
ck_tile::stream_config s{nullptr, false, 0}; // Default stream, no timing, no log
// Check if the kernel configuration is supported
if(!ew_kernel.IsSupportedArgument(lens))
{
throw std::runtime_error(
"The kernel configuration is not supported for the given input size.");
}
ck_tile::launch_kernel(
s,
ck_tile::make_kernel<TestElementWiseShape::kBlockSize, // MaxThreadPerBlock
kBlockPerCu> // MinBlockPerCu
(ew_kernel,
grid,
block,
0, // actual shared memory
lens,
strides, // input strides
strides, // output strides
d_x_ptrs_tuple,
p_y_device));
d_y_mem.FromDevice(h_y.data());
// Reference computation on host
ElementwiseOpType op_host;
for(ck_tile::index_t i = 0; i < total_m_elements; ++i)
{
auto get_host_op_args = [&]<std::size_t... Is>(std::index_sequence<Is...>)
{
return ck_tile::make_tuple(static_cast<ComputeDataType>(h_xs[Is](i))...);
}
(std::make_index_sequence<NumInputs>{});
YDataType temp_y_val;
ck_tile::apply(
[&](auto&&... host_input_args) {
op_host(temp_y_val,
std::forward<decltype(host_input_args)>(host_input_args)...);
},
get_host_op_args);
h_y_ref(i) = temp_y_val;
}
// Check results: check_err returns false on mismatch, so assert it for gtest
EXPECT_TRUE(ck_tile::check_err(h_y, h_y_ref, "Error: Incorrect results!", 1e-5, 1e-5));
}
};
// Shape parameters (can be shared or varied per test type)
using Shape1_BlockWarps = ck_tile::sequence<1>; // 1D warp arrangement in M
using Shape1_BlockTile = ck_tile::sequence<256>; // M-dimension of block tile
using Shape1_WarpTile = ck_tile::sequence<64>; // M-dimension of warp tile
// Test configurations
using TestConfig_F32_Add = std::tuple<float,
float,
float,
ck_tile::element_wise::Add,
Shape1_BlockWarps,
Shape1_BlockTile,
Shape1_WarpTile>;
using TestConfig_F32_Relu = std::tuple<float,
float,
float,
ck_tile::element_wise::Relu,
Shape1_BlockWarps,
Shape1_BlockTile,
Shape1_WarpTile>;
using TestConfig_F16_Add = std::tuple<ck_tile::half_t,
ck_tile::half_t,
float, // Compute in float for half
ck_tile::element_wise::Add,
Shape1_BlockWarps,
Shape1_BlockTile,
Shape1_WarpTile>;
using TestTypes = ::testing::Types<TestConfig_F32_Add, TestConfig_F32_Relu, TestConfig_F16_Add>;
TYPED_TEST_SUITE(TestCkTileElementwise, TestTypes);
TYPED_TEST(TestCkTileElementwise, RunElementwise_1024) { this->RunTest(1024); }
TYPED_TEST(TestCkTileElementwise, RunElementwise_513)
{
EXPECT_THROW((this->RunTest(513)),
std::runtime_error); // Test with an input size that's not a multiple of kVectorM
}
TYPED_TEST(TestCkTileElementwise, RunElementwise_516)
{
this->RunTest(516); // Test with an input size that's not a multiple of blockM
}
TYPED_TEST(TestCkTileElementwise, RunElementwise_Small_32)
{
this->RunTest(32); // Test with a very small size
}
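
The typed-test pattern above extends directly to more sizes. The case below is a suggested extra case, not part of the committed file; it simply reuses RunTest with another multiple of the 256-element block tile shared by all configurations above:

```cpp
// Suggested extra size case following the same pattern; 1 << 20 is a
// multiple of the 256-element block tile used by every configuration above.
TYPED_TEST(TestCkTileElementwise, RunElementwise_Large_1M)
{
    this->RunTest(1 << 20);
}
```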