[CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216)

* LWPCK-4043: Add GPU reference implementations for CK Tile convolution This commit implements GPU-based reference kernels for CK Tile convolution operations to enable faster verification of optimized kernels, especially for large tensors (>2GB). Changes: - Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution - Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data - Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight - Integrate GPU references with test infrastructure (replace -v=2 error) - Support for 1D, 2D, and 3D convolutions - Generic data type support (FP16, BF16, FP32) - Grid-stride loop pattern for scalability The GPU references use a simple, readable implementation that prioritizes correctness over performance. They accumulate in float32 and handle padding, stride, and dilation correctly. * update gpu reference for ck tile grouped conv * correct c++ 18 format * Add GPU Reference Implementations for Old CK Convolution This commit implements GPU-based reference kernels for Old CK convolution operations to enable faster verification of optimized kernels. Changes: - Fixed old CK forward GPU reference (naive_conv_fwd.hpp) * Fixed BF16 NaN issue (use type_convert instead of static_cast) * Fixed FP8/BF8 arithmetic (accumulate in float) * Fixed uninitialized variables * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed) - Created backward data GPU reference (naive_conv_bwd_data.hpp) * Implements input gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Created backward weight GPU reference (naive_conv_bwd_weight.hpp) * Implements weight gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Integrated with old CK examples * Forward: 10 XDL examples now support do_verification=2 * Backward data: Integrated with example/17_convnd_bwd_data/ * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only) * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU) Testing: - 50 comprehensive tests created - 42/42 tests passing (100% success rate) - CPU and GPU verification produce identical results - Verified across multiple dimensions, sizes, and data types Limitations: - GPU references support standard convolution only (G=1) - Fused operations (DL variants) not supported - Some tests blocked by optimized kernel size constraints Result: Old CK GPU references can replace CPU references for verification with 50-100x performance improvement for large tensors. * Apply clang-format to old CK GPU reference files * Fix C++17 compatibility: use brace initialization for aggregate types * add get_rtol, get_atl and consistency cout message * Use triple bracket syntax for kernel launch per review feedback Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski. This is more idiomatic HIP/CUDA style and equally correct. All tests still passing after this change. * Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode - Replace manual error checking with HIP_CHECK_ERROR macro - Add v=3 verification mode (GPU ref vs CPU ref direct comparison) - Consistent output format across all examples - All tests passing (7/7 v=3 tests pass for FP16) * Use ConvDims structure to simplify GPU reference kernels Replace 24 individual parameters with ConvDims structure per review feedback. - Add conv_common.hpp with ConvDims and helper function - Update kernel signatures: 24 params → 1 structure - Remove duplicate extraction code from host files * Use get_block_id() and get_thread_id() helpers in CK Tile Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions. Updated 3 CK Tile GPU reference kernels per review feedback. * Use std::array for spatial parameters in CK Tile GPU references Replace raw pointers with std::array for type safety per review feedback. - Add conv_common.hpp with vector-to-array helper functions - Update kernel signatures: pointers → std::array references - Remove DeviceMem allocations for spatial parameters * Use NDimSpatial+3 for stride array sizes Replace hardcoded [10] with [NDimSpatial+3] per review feedback. Array sizes now correctly reflect actual dimensions needed. * Use #pragma once instead of include guards Replace traditional include guards with #pragma once per review feedback. Updated 3 Old CK GPU reference headers. * Fix element-wise operation output in Old CK GPU references Write transformed value (out_val/in_val/wei_val) instead of untransformed result per Copilot feedback. This ensures element-wise operations are correctly applied to output. * Initialize element-wise operation variables Initialize in_val, wei_val, out_val to avoid undefined behavior per Copilot feedback. Updated backward data and backward weight kernels. * Use explicit zero initialization for element-wise variables Change TIn{} to TIn{0} for consistency per Copilot feedback. All 3 kernels now use consistent zero initialization. * Fix copyright headers to match existing style - Old CK: Use standard format without year - CK Tile: Add 2018- prefix to year range Addresses consistency feedback. * Rename GPU reference files: add _gpu suffix * Refactor index calculations: use std::array and extract to helper functions * Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-20 14:59:17 +00:00 · 2025-12-03 21:14:21 +02:00
parent 161835533b
commit 4baa4c9fae
21 changed files with 2280 additions and 69 deletions
--- a/library/include/ck/library/reference_tensor_operation/gpu/conv_common.hpp
+++ b/library/include/ck/library/reference_tensor_operation/gpu/conv_common.hpp
@@ -0,0 +1,73 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#ifndef CONV_COMMON_HPP
+#define CONV_COMMON_HPP
+
+#include "ck/ck.hpp"
+
+namespace ck {
+namespace ref {
+
+// Device-compatible dimension structure for GPU reference kernels
+// Replaces passing 24 individual parameters
+struct ConvDims
+{
+    index_t N, K, C;
+    index_t Di, Hi, Wi;
+    index_t Z, Y, X;
+    index_t Do, Ho, Wo;
+    index_t stride_z, stride_y, stride_x;
+    index_t dilation_z, dilation_y, dilation_x;
+    index_t pad_z, pad_y, pad_x;
+};
+
+} // namespace ref
+
+// Helper function to extract dimensions from ConvParam for GPU kernels
+// Defined in ck::utils::conv namespace for convenience
+namespace utils {
+namespace conv {
+
+inline ck::ref::ConvDims
+extract_conv_dims(const ConvParam& conv_param, ck::index_t NDimSpatial, bool apply_group = true)
+{
+    ck::ref::ConvDims dims;
+    dims.N = conv_param.N_;
+    dims.K = conv_param.K_;
+    dims.C = apply_group ? (conv_param.C_ * conv_param.G_) : conv_param.C_;
+
+    dims.Di = (NDimSpatial >= 3) ? conv_param.input_spatial_lengths_[0] : 1;
+    dims.Hi = (NDimSpatial >= 2) ? conv_param.input_spatial_lengths_[NDimSpatial >= 3 ? 1 : 0] : 1;
+    dims.Wi = conv_param.input_spatial_lengths_[NDimSpatial - 1];
+
+    dims.Z = (NDimSpatial >= 3) ? conv_param.filter_spatial_lengths_[0] : 1;
+    dims.Y = (NDimSpatial >= 2) ? conv_param.filter_spatial_lengths_[NDimSpatial >= 3 ? 1 : 0] : 1;
+    dims.X = conv_param.filter_spatial_lengths_[NDimSpatial - 1];
+
+    dims.Do = (NDimSpatial >= 3) ? conv_param.output_spatial_lengths_[0] : 1;
+    dims.Ho = (NDimSpatial >= 2) ? conv_param.output_spatial_lengths_[NDimSpatial >= 3 ? 1 : 0] : 1;
+    dims.Wo = conv_param.output_spatial_lengths_[NDimSpatial - 1];
+
+    dims.stride_z = (NDimSpatial >= 3) ? conv_param.conv_filter_strides_[0] : 1;
+    dims.stride_y =
+        (NDimSpatial >= 2) ? conv_param.conv_filter_strides_[NDimSpatial >= 3 ? 1 : 0] : 1;
+    dims.stride_x = conv_param.conv_filter_strides_[NDimSpatial - 1];
+
+    dims.dilation_z = (NDimSpatial >= 3) ? conv_param.conv_filter_dilations_[0] : 1;
+    dims.dilation_y =
+        (NDimSpatial >= 2) ? conv_param.conv_filter_dilations_[NDimSpatial >= 3 ? 1 : 0] : 1;
+    dims.dilation_x = conv_param.conv_filter_dilations_[NDimSpatial - 1];
+
+    dims.pad_z = (NDimSpatial >= 3) ? conv_param.input_left_pads_[0] : 0;
+    dims.pad_y = (NDimSpatial >= 2) ? conv_param.input_left_pads_[NDimSpatial >= 3 ? 1 : 0] : 0;
+    dims.pad_x = conv_param.input_left_pads_[NDimSpatial - 1];
+
+    return dims;
+}
+
+} // namespace conv
+} // namespace utils
+} // namespace ck
+
+#endif
--- a/library/include/ck/library/reference_tensor_operation/gpu/naive_conv_bwd_data_gpu.hpp
+++ b/library/include/ck/library/reference_tensor_operation/gpu/naive_conv_bwd_data_gpu.hpp
@@ -0,0 +1,149 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#pragma once
+
+#include "ck/utility/type_convert.hpp"
+#include "ck/library/reference_tensor_operation/gpu/conv_common.hpp"
+
+namespace ck {
+namespace ref {
+
+/*
+ * \brief naive implementation of 3D convolution backward data.
+ *        Layout is (NDHWC, KZYXC, NDHWK).
+ *        Computes gradient with respect to input.
+ *
+ * \param N number of batches
+ * \param K number of filters (output channels)
+ * \param C number of input channels
+ * \param (Di, Hi, Wi) depth, height and width dimension of input
+ * \param (Z, Y, X) depth, height and width dimensions of filter
+ * \param (Do, Ho, Wo) depth, height and width dimension of output
+ * \param (stride_z, stride_y, stride_x) strides
+ * \param (dilation_z, dilation_y, dilation_x) dilations
+ * \param (pad_z, pad_y, pad_x) pads
+ */
+template <typename TIn,
+          typename TWei,
+          typename TOut,
+          typename TAcc,
+          typename InElementwiseOperation,
+          typename WeiElementwiseOperation,
+          typename OutElementwiseOperation>
+__global__ void naive_conv_bwd_data_ndhwc_kzyxc_ndhwk(TIn* __restrict__ p_in_grad,
+                                                      const TWei* __restrict__ p_wei,
+                                                      const TOut* __restrict__ p_out_grad,
+                                                      const ConvDims dims)
+{
+    const index_t tid               = blockIdx.x * blockDim.x + threadIdx.x;
+    const index_t num_threads       = blockDim.x * gridDim.x;
+    const long_index_t input_length = dims.N * dims.Di * dims.Hi * dims.Wi * dims.C;
+
+    const index_t in_strides[] = {
+        dims.Di * dims.Hi * dims.Wi * dims.C, dims.Hi * dims.Wi * dims.C, dims.Wi * dims.C, dims.C};
+    const index_t out_strides[] = {
+        dims.Do * dims.Ho * dims.Wo * dims.K, dims.Ho * dims.Wo * dims.K, dims.Wo * dims.K, dims.K};
+    const index_t wei_strides[] = {
+        dims.Z * dims.Y * dims.X * dims.C, dims.Y * dims.X * dims.C, dims.X * dims.C, dims.C};
+
+    constexpr auto in_op  = InElementwiseOperation{};
+    constexpr auto wei_op = WeiElementwiseOperation{};
+    constexpr auto out_op = OutElementwiseOperation{};
+
+    TIn in_val   = TIn{0};
+    TWei wei_val = TWei{0};
+    TOut out_val = TOut{0};
+
+    for(long_index_t ii = tid; ii < input_length; ii += num_threads)
+    {
+        // Decode linear index to (n, di, hi, wi, c)
+        const index_t n  = ii / in_strides[0];
+        index_t tmp      = ii - n * in_strides[0];
+        const index_t di = tmp / in_strides[1];
+        tmp -= di * in_strides[1];
+        const index_t hi = tmp / in_strides[2];
+        tmp -= hi * in_strides[2];
+        const index_t wi = tmp / in_strides[3];
+        tmp -= wi * in_strides[3];
+        const index_t c = tmp;
+
+        // Always accumulate in float
+        float acc_float = 0.0f;
+
+        const TOut* out_n = p_out_grad + static_cast<long_index_t>(n) * out_strides[0];
+
+        // Loop over output channels
+        for(index_t k = 0; k < dims.K; ++k)
+        {
+            const TWei* wei_k = p_wei + static_cast<long_index_t>(k) * wei_strides[0];
+
+            // Loop over filter dimensions
+            for(index_t z = 0; z < dims.Z; ++z)
+            {
+                // Calculate output position from input position (inverse of forward)
+                index_t d_tmp = di + dims.pad_z - z * dims.dilation_z;
+                if(d_tmp % dims.stride_z != 0)
+                    continue;
+                index_t d_o = d_tmp / dims.stride_z;
+                if(d_o < 0 || d_o >= dims.Do)
+                    continue;
+
+                const TOut* out_n_do = out_n + d_o * out_strides[1];
+                const TWei* wei_k_z  = wei_k + z * wei_strides[1];
+
+                for(index_t y = 0; y < dims.Y; ++y)
+                {
+                    index_t h_tmp = hi + dims.pad_y - y * dims.dilation_y;
+                    if(h_tmp % dims.stride_y != 0)
+                        continue;
+                    index_t ho = h_tmp / dims.stride_y;
+                    if(ho < 0 || ho >= dims.Ho)
+                        continue;
+
+                    const TOut* out_n_do_ho = out_n_do + ho * out_strides[2];
+                    const TWei* wei_k_z_y   = wei_k_z + y * wei_strides[2];
+
+                    for(index_t x = 0; x < dims.X; ++x)
+                    {
+                        index_t w_tmp = wi + dims.pad_x - x * dims.dilation_x;
+                        if(w_tmp % dims.stride_x != 0)
+                            continue;
+                        index_t wo = w_tmp / dims.stride_x;
+                        if(wo < 0 || wo >= dims.Wo)
+                            continue;
+
+                        const TOut* out_n_do_ho_wo = out_n_do_ho + wo * out_strides[3];
+                        const TWei* wei_k_z_y_x    = wei_k_z_y + x * wei_strides[3];
+
+                        // Load values from memory
+                        TOut out_loaded = out_n_do_ho_wo[k];
+                        TWei wei_loaded = wei_k_z_y_x[c];
+
+                        // Apply element-wise operations (like forward does)
+                        out_op(out_val, out_loaded);
+                        wei_op(wei_val, wei_loaded);
+
+                        // Convert to float for multiplication
+                        float out_f = type_convert<float>(out_val);
+                        float wei_f = type_convert<float>(wei_val);
+
+                        acc_float += out_f * wei_f;
+                    }
+                }
+            }
+        }
+
+        // Convert float accumulator to TAcc, then to input type
+        TAcc acc   = type_convert<TAcc>(acc_float);
+        TIn result = type_convert<TIn>(acc);
+
+        // Apply input element-wise operation (if any)
+        in_op(in_val, result);
+
+        // Write transformed result
+        p_in_grad[ii] = in_val;
+    }
+}
+} // namespace ref
+} // namespace ck
--- a/library/include/ck/library/reference_tensor_operation/gpu/naive_conv_bwd_weight_gpu.hpp
+++ b/library/include/ck/library/reference_tensor_operation/gpu/naive_conv_bwd_weight_gpu.hpp
@@ -0,0 +1,139 @@
+// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
+// SPDX-License-Identifier: MIT
+
+#pragma once
+
+#include "ck/utility/type_convert.hpp"
+#include "ck/library/reference_tensor_operation/gpu/conv_common.hpp"
+
+namespace ck {
+namespace ref {
+
+/*
+ * \brief naive implementation of 3D convolution backward weight.
+ *        Layout is (NDHWC, KZYXC, NDHWK).
+ *        Computes gradient with respect to weights.
+ *
+ * \param N number of batches
+ * \param K number of filters (output channels)
+ * \param C number of input channels
+ * \param (Di, Hi, Wi) depth, height and width dimension of input
+ * \param (Z, Y, X) depth, height and width dimensions of filter
+ * \param (Do, Ho, Wo) depth, height and width dimension of output
+ * \param (stride_z, stride_y, stride_x) strides
+ * \param (dilation_z, dilation_y, dilation_x) dilations
+ * \param (pad_z, pad_y, pad_x) pads
+ */
+template <typename TIn,
+          typename TWei,
+          typename TOut,
+          typename TAcc,
+          typename InElementwiseOperation,
+          typename WeiElementwiseOperation,
+          typename OutElementwiseOperation>
+__global__ void naive_conv_bwd_weight_ndhwc_kzyxc_ndhwk(const TIn* __restrict__ p_in,
+                                                        TWei* __restrict__ p_wei_grad,
+                                                        const TOut* __restrict__ p_out_grad,
+                                                        const ConvDims dims)
+{
+    const index_t tid                = blockIdx.x * blockDim.x + threadIdx.x;
+    const index_t num_threads        = blockDim.x * gridDim.x;
+    const long_index_t weight_length = dims.K * dims.Z * dims.Y * dims.X * dims.C;
+
+    const index_t in_strides[] = {
+        dims.Di * dims.Hi * dims.Wi * dims.C, dims.Hi * dims.Wi * dims.C, dims.Wi * dims.C, dims.C};
+    const index_t out_strides[] = {
+        dims.Do * dims.Ho * dims.Wo * dims.K, dims.Ho * dims.Wo * dims.K, dims.Wo * dims.K, dims.K};
+    const index_t wei_strides[] = {
+        dims.Z * dims.Y * dims.X * dims.C, dims.Y * dims.X * dims.C, dims.X * dims.C, dims.C};
+
+    constexpr auto in_op  = InElementwiseOperation{};
+    constexpr auto wei_op = WeiElementwiseOperation{};
+    constexpr auto out_op = OutElementwiseOperation{};
+
+    TIn in_val   = TIn{0};
+    TWei wei_val = TWei{0};
+    TOut out_val = TOut{0};
+
+    for(long_index_t ii = tid; ii < weight_length; ii += num_threads)
+    {
+        // Decode linear index to (k, z, y, x, c)
+        const index_t k = ii / wei_strides[0];
+        index_t tmp     = ii - k * wei_strides[0];
+        const index_t z = tmp / wei_strides[1];
+        tmp -= z * wei_strides[1];
+        const index_t y = tmp / wei_strides[2];
+        tmp -= y * wei_strides[2];
+        const index_t x = tmp / wei_strides[3];
+        tmp -= x * wei_strides[3];
+        const index_t c = tmp;
+
+        // Always accumulate in float
+        float acc_float = 0.0f;
+
+        // Loop over batch
+        for(index_t n = 0; n < dims.N; ++n)
+        {
+            const TIn* in_n   = p_in + static_cast<long_index_t>(n) * in_strides[0];
+            const TOut* out_n = p_out_grad + static_cast<long_index_t>(n) * out_strides[0];
+
+            // Loop over output spatial dimensions
+            for(index_t d_o = 0; d_o < dims.Do; ++d_o)
+            {
+                // Calculate input position from output position
+                index_t di = d_o * dims.stride_z - dims.pad_z + z * dims.dilation_z;
+                if(di < 0 || di >= dims.Di)
+                    continue;
+
+                const TIn* in_n_di   = in_n + di * in_strides[1];
+                const TOut* out_n_do = out_n + d_o * out_strides[1];
+
+                for(index_t ho = 0; ho < dims.Ho; ++ho)
+                {
+                    index_t hi = ho * dims.stride_y - dims.pad_y + y * dims.dilation_y;
+                    if(hi < 0 || hi >= dims.Hi)
+                        continue;
+
+                    const TIn* in_n_di_hi   = in_n_di + hi * in_strides[2];
+                    const TOut* out_n_do_ho = out_n_do + ho * out_strides[2];
+
+                    for(index_t wo = 0; wo < dims.Wo; ++wo)
+                    {
+                        index_t wi = wo * dims.stride_x - dims.pad_x + x * dims.dilation_x;
+                        if(wi < 0 || wi >= dims.Wi)
+                            continue;
+
+                        // Load values from memory (like forward does)
+                        const TIn* in_ptr   = in_n_di_hi + wi * in_strides[3];
+                        const TOut* out_ptr = out_n_do_ho + wo * out_strides[3];
+
+                        TIn in_loaded   = in_ptr[c];
+                        TOut out_loaded = out_ptr[k];
+
+                        // Apply element-wise operations
+                        in_op(in_val, in_loaded);
+                        out_op(out_val, out_loaded);
+
+                        // Convert to float for multiplication
+                        float in_f  = type_convert<float>(in_val);
+                        float out_f = type_convert<float>(out_val);
+
+                        acc_float += out_f * in_f;
+                    }
+                }
+            }
+        }
+
+        // Convert float accumulator to TAcc, then to weight type
+        TAcc acc    = type_convert<TAcc>(acc_float);
+        TWei result = type_convert<TWei>(acc);
+
+        // Apply weight element-wise operation (if any)
+        wei_op(wei_val, result);
+
+        // Write transformed result
+        p_wei_grad[ii] = wei_val;
+    }
+}
+} // namespace ref
+} // namespace ck
--- a/library/include/ck/library/reference_tensor_operation/gpu/naive_conv_fwd_gpu.hpp
+++ b/library/include/ck/library/reference_tensor_operation/gpu/naive_conv_fwd_gpu.hpp
@@ -1,8 +1,10 @@
 // Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
 // SPDX-License-Identifier: MIT

-#ifndef NAIVE_CONV_FWD_HPP
-#define NAIVE_CONV_FWD_HPP
+#pragma once
+
+#include "ck/utility/type_convert.hpp"
+#include "ck/library/reference_tensor_operation/gpu/conv_common.hpp"

 namespace ck {
 namespace ref {
@@ -30,43 +32,26 @@ template <typename TIn,
 __global__ void naive_conv_fwd_ndhwc_kzyxc_ndhwk(const TIn* __restrict__ p_in,
                                                 const TWei* __restrict__ p_wei,
                                                 TOut* __restrict__ p_out,
-                                                 index_t N,
-                                                 index_t K,
-                                                 index_t C,
-                                                 index_t Di,
-                                                 index_t Hi,
-                                                 index_t Wi,
-                                                 index_t Z,
-                                                 index_t Y,
-                                                 index_t X,
-                                                 index_t Do,
-                                                 index_t Ho,
-                                                 index_t Wo,
-                                                 index_t stride_z,
-                                                 index_t stride_y,
-                                                 index_t stride_x,
-                                                 index_t dilation_z,
-                                                 index_t dilation_y,
-                                                 index_t dilation_x,
-                                                 index_t pad_z,
-                                                 index_t pad_y,
-                                                 index_t pad_x)
+                                                 const ConvDims dims)
 {
    const index_t tid                = blockIdx.x * blockDim.x + threadIdx.x;
    const index_t num_threads        = blockDim.x * gridDim.x;
-    const long_index_t output_length = N * Do * Ho * Wo * K;
+    const long_index_t output_length = dims.N * dims.Do * dims.Ho * dims.Wo * dims.K;

-    const index_t out_strides[] = {Do * Ho * Wo * K, Ho * Wo * K, Wo * K, K};
-    const index_t in_strides[]  = {Di * Hi * Wi * C, Hi * Wi * C, Wi * C, C};
-    const index_t wei_strides[] = {Z * Y * X * C, Y * X * C, X * C, C};
+    const index_t out_strides[] = {
+        dims.Do * dims.Ho * dims.Wo * dims.K, dims.Ho * dims.Wo * dims.K, dims.Wo * dims.K, dims.K};
+    const index_t in_strides[] = {
+        dims.Di * dims.Hi * dims.Wi * dims.C, dims.Hi * dims.Wi * dims.C, dims.Wi * dims.C, dims.C};
+    const index_t wei_strides[] = {
+        dims.Z * dims.Y * dims.X * dims.C, dims.Y * dims.X * dims.C, dims.X * dims.C, dims.C};

    constexpr auto in_op  = InElementwiseOperation{};
    constexpr auto wei_op = WeiElementwiseOperation{};
    constexpr auto out_op = OutElementwiseOperation{};

-    TIn in_val;
-    TWei wei_val;
-    TOut out_val;
+    TIn in_val   = TIn{0};
+    TWei wei_val = TWei{0};
+    TOut out_val = TOut{0};

    for(long_index_t ii = tid; ii < output_length; ii += num_threads)
    {
@@ -79,47 +64,66 @@ __global__ void naive_conv_fwd_ndhwc_kzyxc_ndhwk(const TIn* __restrict__ p_in,
        const index_t wo = k / out_strides[3];
        k -= wo * out_strides[3];

-        TAcc acc = static_cast<TAcc>(0);
+        // Always accumulate in float (FP8/BF8 don't support arithmetic)
+        float acc_float = 0.0f;

        const TIn* in_n   = p_in + static_cast<long_index_t>(n) * in_strides[0];
        const TWei* wei_k = p_wei + static_cast<long_index_t>(k) * wei_strides[0];

-        for(index_t z = 0; z < Z; ++z)
+        for(index_t z = 0; z < dims.Z; ++z)
        {
-            index_t di          = stride_z * dO - pad_z + dilation_z * z;
+            index_t di          = dims.stride_z * dO - dims.pad_z + dims.dilation_z * z;
            const TIn* in_n_di  = in_n + di * in_strides[1];
            const TWei* wei_k_z = wei_k + z * wei_strides[1];

-            for(index_t y = 0; y < Y; ++y)
+            for(index_t y = 0; y < dims.Y; ++y)
            {
-                index_t hi            = stride_y * ho - pad_y + dilation_y * y;
+                index_t hi            = dims.stride_y * ho - dims.pad_y + dims.dilation_y * y;
                const TIn* in_n_di_hi = in_n_di + hi * in_strides[2];
                const TWei* wei_k_z_y = wei_k_z + y * wei_strides[2];

-                for(index_t x = 0; x < X; ++x)
+                for(index_t x = 0; x < dims.X; ++x)
                {
-                    index_t wi               = stride_x * wo - pad_x + dilation_x * x;
+                    index_t wi = dims.stride_x * wo - dims.pad_x + dims.dilation_x * x;
                    const TIn* in_n_di_hi_wi = in_n_di_hi + wi * in_strides[3];
                    const TWei* wei_k_z_y_x  = wei_k_z_y + x * wei_strides[3];

-                    if(di >= 0 && di < Di && hi >= 0 && hi < Hi && wi >= 0 && wi < Wi)
+                    if(di >= 0 && di < dims.Di && hi >= 0 && hi < dims.Hi && wi >= 0 &&
+                       wi < dims.Wi)
                    {
-                        for(index_t c = 0; c < C; ++c)
+                        for(index_t c = 0; c < dims.C; ++c)
                        {
-                            in_op(in_val, in_n_di_hi_wi[c]);
-                            wei_op(wei_val, wei_k_z_y_x[c]);
-                            acc += in_val * wei_val;
+                            // Load values from memory
+                            TIn in_loaded   = in_n_di_hi_wi[c];
+                            TWei wei_loaded = wei_k_z_y_x[c];
+
+                            // Apply element-wise operations
+                            in_op(in_val, in_loaded);
+                            wei_op(wei_val, wei_loaded);
+
+                            // Always convert to float for multiplication (FP8/BF8 don't support
+                            // direct arithmetic)
+                            float in_f  = type_convert<float>(in_val);
+                            float wei_f = type_convert<float>(wei_val);
+
+                            // Accumulate in float
+                            acc_float += in_f * wei_f;
                        }
                    }
                }
            }
        }

-        out_op(out_val, static_cast<TOut>(acc));
+        // Convert float accumulator to TAcc, then to output type
+        TAcc acc    = type_convert<TAcc>(acc_float);
+        TOut result = type_convert<TOut>(acc);
+
+        // Apply output element-wise operation (if any)
+        out_op(out_val, result);
+
+        // Write transformed result
        p_out[ii] = out_val;
    }
 }
 } // namespace ref
 } // namespace ck
-
-#endif