CK Instance Gen (#1145)

* Format
* Format
* Format
* Remove const
* Use the right template
* Format
* Format
* add row/col instances
* Add missing file
* fixed
* fixing block to etile error
* Format
* Updates
* Format
* fixed rrr layout
* generating a sample JSON file: currently contains includes, prologue/epilogue and instances
* version where the json is passed into the instances to generate a key
* updated run function to just launch kernel
* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer
* adding in testing files
* cleaned up comments, still need to work on including header files
* removed unneeded files
* removed/commented out JSON implementation
* added fusion(prologue/epilogue) into instance generation
* working on instance selection
* added instance selection, need to fix instance validation
* removed block2etile map validity check for testing purposes
* test running: failing due to incorrect files/input
* all grid descs/ptrs completed, but device file not found
* Update test and embed modules
* Restore older version
* added convolution operation, written test, debugging generated code for compilation
* attempting to include CK in host directory: _Float16 error
* CK header file issues
* slight fix
* don't crash when hip can't report total memory
* dump generated code to a file
* changing sizes
* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed
* some fixes to call the device code
* separating test files for conv and gemm
* completed arg ptr, now have linking errors
* clang format fix
* resolved linker issues in conv test
* remove dependency on libutility from ck
* resolved num dim error
* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration
* undo the commenting of device function
* hand created kernel code to find rtc issues
* dump the full src to file
* resolved redeclaration errors, cleaned up errors for Amber's kernel code
* debugging purposes: redeclaration error
* config files
* resolved errors for NumTensor and redeclaration, formatted version.h
* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type
* WIP: close to getting kernel compiled
* WIP: fixing rtc errors
* fixed sequence errors, formatting, still one error with run fcn
* yay: kernel compiles and runs
* updated templated/generated version to run and compile
* minor fixes
* working generated example, resolved memory access error due to padding
* adding in reference kernel, validation failing against reference
* debugging: printing kernel argsz
* reduced error in results
* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues
* working validation (using reference convolution) with prologue function for both hard-coded and generated version
* WIP: create an alt version that creates Argument on the device
* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments
* wip: making necessary methods device code
* added grid descs, working on grid pointers, errors with stl numerics
* wip: updating kernel args - issue, replacing some std functions
* replaced std::accumulate call with temp hardcoded version
* wip: args causing memory issue
* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes
* adding object file dump
* temporary hardcoding of grid size, can remove device op inst + arg ptr
* minor fix for grid size
* added modified example where arg ptr is created on the device for generated version as well
* removed device op instance and arg ptr from modified examples
* moving device op file for testing purposes and to properly build CK
* commenting out print-outs
* adjust compiler args to produce a valid ELF file
* temporary removal of validation
* reverting compiler args back for working example
* retrieve necessary arguments from generated template parameters in correct format
* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly
* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example
* temporary change to generate ELF format binary object file
* removed unnecessary code, added comments
* formatting fix
* cleaned up code, added new tests, restructured library: move helper into CK
* refactored launch parameter calculation to be more concise
* renamed files and variables for more clarity/uniformity
* more code cleaning, removed debug statements
* moved majority of my files into codegen directory, running properly
* updated Embed.cmake(string_view) in codegen directory
* updated host directory to match Embed.cmake as well
* added old tests in
* updated instance generation methods to be more concise
* removed layout from launch parameter calculation
* working test
* fixed issue with verification, all instances working
* updated verification in other tests
* removed duplicate matrix padder file, removed code dumps
* removed old hard-coded tests
* removed old host directory, all files in codegen directory now
* fixed copyright in files
* commenting out validation
* renamed files
* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code
* updated headers
* removing duplicate file for fwd conv to gemm, merging with original file
* fix building codegen with clang++ directly
* resolving build error from conv_fwd_to_gemm
* fix for previous error
* renaming tests
* created common test file
* cleaned up code, added comments
* renamed device op
* fixed typos in comments
* removed extra space
* code cleanup: resolving Amber's comments
* removed wrapper struct for matrix padder, fixed template
* cleaned up if statements for better readability

---------

Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-19 22:39:03 +00:00 · 2024-06-25 14:37:35 -07:00
parent cb13839425
commit 3e9711f0cb
33 changed files with 3417 additions and 47 deletions
--- a/codegen/src/device_gemm_multiple_d.cpp
+++ b/codegen/src/device_gemm_multiple_d.cpp
@@ -1,6 +1,6 @@

 // SPDX-License-Identifier: MIT
-// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.

 #include "ck/host/device_gemm_multiple_d/problem.hpp"
 #include "ck/host/device_gemm_multiple_d/operation.hpp"
@@ -11,23 +11,28 @@ namespace ck {
 namespace host {
 namespace device_gemm_multiple_d {

+// return the relevant device op file based on the operation
 std::string Problem::GetIncludeHeader() const
 {
    return "ck/tensor_operation/gpu/device/impl/device_gemm_multiple_d_xdl_cshuffle.hpp";
 }

-std::vector<Solution> Problem::GetSolutions(const std::string& arch) const
+// returns templated instances when provided with a problem specification
+std::vector<Solution> Problem::GetSolutions(const std::string& arch,
+                                            const std::string& prologue,
+                                            const std::string& epilogue) const
 {
    if(get_xdlop_archs().count(arch) == 0)
        return {};
-    auto ops = ck::host::device_gemm_multiple_d::Operation_Xdl_CShuffle::CreateOperations(*this);
+    auto ops = ck::host::device_gemm_multiple_d::Operation_Xdl_CShuffle::CreateOperations(
+        *this, prologue, epilogue); // obtains vector of instances
    std::vector<Solution> result;
    std::transform(ops.begin(), ops.end(), std::back_inserter(result), [&](const auto& op) {
-        return op.ToSolution();
+        return op.ToSolution(); // template instance with correct values
    });
    return result;
 }

 } // namespace device_gemm_multiple_d
 } // namespace host
-} // namespace ck
+} // namespace ck
--- a/codegen/src/device_gemm_multiple_d_operation_xdl_cshuffle.cpp
+++ b/codegen/src/device_gemm_multiple_d_operation_xdl_cshuffle.cpp
@@ -10,6 +10,7 @@ namespace ck {
 namespace host {
 namespace device_gemm_multiple_d {

+// calculate appropriate Gemm Specification based on input tensor dimensions
 static std::string GetGemmSpec(const std::size_t m,
                               const std::size_t n,
                               const std::size_t k,
@@ -30,9 +31,40 @@ static std::string GetGemmSpec(const std::size_t m,
    return "ck::tensor_operation::device::GemmSpecialization::" + spec + "Padding";
 }

+// store the user-provided prologue/epilogue; a non-empty string also sets cde_elem_op to CDEElementOp
+void Operation_Xdl_CShuffle::update_prologue(const std::string& prologue)
+{
+    if(!prologue.empty())
+    {
+        this->prologue    = prologue;
+        this->cde_elem_op = "CDEElementOp";
+    }
+    else
+    {
+        this->prologue = "";
+    }
+}
+
+void Operation_Xdl_CShuffle::update_epilogue(const std::string& epilogue)
+{
+    if(!epilogue.empty())
+    {
+        this->epilogue    = epilogue;
+        this->cde_elem_op = "CDEElementOp";
+    }
+    else
+    {
+        this->epilogue = "";
+    }
+}
+
+// map a transpose flag to a layout: Column when transposed, Row otherwise
 static Layout ToLayout(bool Trans) { return Trans ? Layout::Column : Layout::Row; }

-std::vector<Operation_Xdl_CShuffle> Operation_Xdl_CShuffle::CreateOperations(const Problem& prob)
+// Hard-code tuning parameters in modularized fashion, string them together into a vector of
+// instances
+std::vector<Operation_Xdl_CShuffle> Operation_Xdl_CShuffle::CreateOperations(
+    const Problem& prob, const std::string& prologue, const std::string& epilogue)
 {
    std::vector<Operation_Xdl_CShuffle> result;

@@ -155,6 +187,7 @@ std::vector<Operation_Xdl_CShuffle> Operation_Xdl_CShuffle::CreateOperations(con
        // clang-format on
    };

+    // choose correct arrangement of tuning parameters based on the layout of each tensor
    const auto a_block_descriptions =
        prob.TransA ? a_block_descriptions_colmajor : a_block_descriptions_rowmajor;
    const auto b_block_descriptions =
@@ -165,6 +198,7 @@ std::vector<Operation_Xdl_CShuffle> Operation_Xdl_CShuffle::CreateOperations(con
    assert(tile_descriptions.size() == cshuffle_descriptions.size());
    assert(tile_descriptions.size() == c_block_descriptions.size());

+    // combine the i-th entry of each description table into one operation and append it to result
    for(std::size_t i = 0; i < tile_descriptions.size(); i++)
    {
        Operation_Xdl_CShuffle x;
@@ -188,12 +222,17 @@ std::vector<Operation_Xdl_CShuffle> Operation_Xdl_CShuffle::CreateOperations(con
                                            x.tile_desc.m_per_block,
                                            x.tile_desc.n_per_block,
                                            x.tile_desc.k_per_block);
+        x.update_prologue(prologue);
+        x.update_epilogue(epilogue);
        result.push_back(x);
    }
    return result;
 }

-std::vector<std::vector<Operation_Xdl_CShuffle>> Operation_Xdl_CShuffle::CreateOperations()
+// set up instances when not provided with a problem specification, use default operation values and
+// all possible layout combinations
+std::vector<std::vector<Operation_Xdl_CShuffle>>
+Operation_Xdl_CShuffle::CreateOperations(const std::string& prologue, const std::string& epilogue)
 {
    std::vector<Problem> problems;
    for(bool TransA : {true, false})
@@ -204,7 +243,8 @@ std::vector<std::vector<Operation_Xdl_CShuffle>> Operation_Xdl_CShuffle::CreateO
            prob.TransB = TransB;
            problems.push_back(prob);
        }
-    return Transform(problems, [](const Problem& p) { return CreateOperations(p); });
+    return Transform(problems,
+                     [&](const Problem& p) { return CreateOperations(p, prologue, epilogue); });
 }

 static const char* const DeviceGemmMultipleD_Xdl_CShuffleTemplate =
@@ -224,9 +264,20 @@ static const char* const DeviceGemmMultipleD_Xdl_CShuffleTemplate =
    "${CDEBlockTransferClusterLengths_MBlock_MPerBlock_NBlock_NPerBlock}, "
    "${CDEBlockTransferScalarPerVector_NPerBlock}>";

+// use hardcoded instances from vector of operations to substitute values into instance template
 Solution Operation_Xdl_CShuffle::ToSolution() const
 {
    std::unordered_map<std::string, std::string> values = {
+        {"name",
+         std::to_string(this->tile_desc.block_size) + "_" +
+             std::to_string(this->tile_desc.m_per_block) + "_" +
+             std::to_string(this->tile_desc.n_per_block) + "_" +
+             std::to_string(this->tile_desc.k_per_block) + "_" +
+             std::to_string(this->tile_desc.ak1) + "_" + std::to_string(this->tile_desc.bk1) + "_" +
+             std::to_string(this->tile_desc.m_per_XDL) + "_" +
+             std::to_string(this->tile_desc.n_per_XDL) + "_" +
+             std::to_string(this->tile_desc.m_Xdl_per_wave) + "_" +
+             std::to_string(this->tile_desc.n_Xdl_per_wave)},
        {"LayoutA", ToString(this->A.layout)},
        {"LayoutB", ToString(this->B.layout)},
        {"LayoutDs",
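The `name` key added to `ToSolution()` joins the tile parameters with underscores, and the resulting map is then substituted into a `${Key}` template. The following is a minimal sketch of that idea, not the library's code: `join_name` and `interpolate` are hypothetical helpers (the real `InterpolateString` implementation is not shown in this diff), and leaving unknown placeholders untouched is an assumption made here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: joins numeric tuning parameters with '_' to form an instance name.
inline std::string join_name(const std::vector<std::size_t>& parts)
{
    std::string name;
    for(std::size_t i = 0; i < parts.size(); i++)
    {
        if(i != 0)
            name += "_";
        name += std::to_string(parts[i]);
    }
    return name;
}

// Hypothetical stand-in for InterpolateString: replaces each ${Key} with values[Key].
// Assumption: placeholders without a matching key are copied through unchanged.
inline std::string interpolate(const std::string& tmpl,
                               const std::unordered_map<std::string, std::string>& values)
{
    std::string out;
    std::size_t pos = 0;
    while(pos < tmpl.size())
    {
        const std::size_t start = tmpl.find("${", pos);
        const std::size_t end =
            start == std::string::npos ? std::string::npos : tmpl.find('}', start + 2);
        if(end == std::string::npos)
        {
            // no further complete placeholder: copy the remainder verbatim
            out.append(tmpl, pos, std::string::npos);
            break;
        }
        out.append(tmpl, pos, start - pos);
        const std::string key = tmpl.substr(start + 2, end - start - 2);
        const auto it         = values.find(key);
        if(it != values.end())
            out += it->second;
        else
            out.append(tmpl, start, end - start + 1);
        pos = end + 1;
    }
    return out;
}
```

With example values `{64, 64, 32, 32, 8, 8, 32, 32, 2, 1}`, `join_name` yields a name that is unique per tile configuration, which is what lets a template such as `run_${name}` produce one distinct kernel symbol per instance.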
--- a/codegen/src/device_grouped_conv_fwd_multiple_abd.cpp
+++ b/codegen/src/device_grouped_conv_fwd_multiple_abd.cpp
@@ -0,0 +1,42 @@
+
+// SPDX-License-Identifier: MIT
+// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
+
+#include "ck/host/device_grouped_conv_fwd_multiple_d/conv_fwd_problem.hpp"
+#include "ck/host/device_grouped_conv_fwd_multiple_d/conv_fwd_op.hpp"
+#include "ck/host/utils.hpp"
+#include <algorithm>
+#include <iostream>
+
+namespace ck {
+namespace host {
+namespace conv {
+
+// return the relevant device op file based on the operation
+// NOTE: this is a modified version of the original CK file that calls the kernel from a device
+// function and makes the Argument class accessible on the device
+std::string Problem_Conv_Fwd::GetIncludeHeader() const
+{
+    return "ck/tensor_operation/gpu/device/impl/"
+           "codegen_device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp";
+}
+
+// return vector of forward convolution instances when provided with a problem instance
+std::vector<Solution> Problem_Conv_Fwd::GetSolutions(const std::string& arch,
+                                                     const std::string& prologue,
+                                                     const std::string& epilogue) const
+{
+    if(get_xdlop_archs().count(arch) == 0)
+        return {};
+    auto ops = ck::host::conv::Operation_Conv_Fwd_Xdl_Cshuffle::CreateOperations(
+        *this, prologue, epilogue);
+    std::vector<Solution> result;
+    std::transform(ops.begin(), ops.end(), std::back_inserter(result), [&](const auto& op) {
+        return op.ToSolution();
+    });
+    return result;
+}
+
+} // namespace conv
+} // namespace host
+} // namespace ck
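The flow of `GetSolutions` (reject unsupported architectures, build the operation instances, apply the fused epilogue, render each instance) can be sketched in isolation. This is an illustrative sketch only: `SketchOperation`, `sketch_get_solutions`, the `PassThrough` default, and the architecture strings passed in by the caller are all assumptions and not part of the CK API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an operation instance; not the real CK class.
struct SketchOperation
{
    std::string cde_elem_op = "PassThrough"; // placeholder default, an assumption
    std::string epilogue;

    // Mirrors update_epilogue in the diff: a non-empty epilogue replaces the element op name.
    void update_epilogue(const std::string& e)
    {
        epilogue = e;
        if(!e.empty())
            cde_elem_op = "CDEElementOp";
    }

    // Stand-in for ToSolution(): renders the instance as a string.
    std::string to_solution() const { return "Op<" + cde_elem_op + ">"; }
};

// Returns no solutions for an unsupported arch, otherwise one rendered string per instance.
inline std::vector<std::string> sketch_get_solutions(const std::set<std::string>& supported_archs,
                                                     const std::string& arch,
                                                     const std::string& epilogue,
                                                     std::size_t num_instances)
{
    if(supported_archs.count(arch) == 0)
        return {};
    std::vector<SketchOperation> ops(num_instances);
    for(auto& op : ops)
        op.update_epilogue(epilogue);
    std::vector<std::string> result;
    std::transform(ops.begin(),
                   ops.end(),
                   std::back_inserter(result),
                   [](const SketchOperation& op) { return op.to_solution(); });
    return result;
}
```

Returning an empty vector for an unknown architecture, rather than throwing, lets a caller treat "no instances available" uniformly with "no instance matched".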
--- a/codegen/src/device_grouped_conv_fwd_multiple_abd_operation_xdl_cshuffle.cpp
+++ b/codegen/src/device_grouped_conv_fwd_multiple_abd_operation_xdl_cshuffle.cpp
@@ -0,0 +1,364 @@
+// SPDX-License-Identifier: MIT
+// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
+
+#include "ck/host/device_grouped_conv_fwd_multiple_d/conv_fwd_op.hpp"
+#include <iostream>
+#include "ck/host/stringutils.hpp"
+#include "ck/host/utils.hpp"
+#include <cassert>
+
+namespace ck {
+namespace host {
+namespace conv {
+
+// calculate appropriate Gemm Specification based on input tensor dimensions
+// NOTE: in CK, MNKPadding is always used for forward convolution
+static std::string GetGemmSpec(const std::size_t m,
+                               const std::size_t n,
+                               const std::size_t k,
+                               const std::size_t m_per_block,
+                               const std::size_t n_per_block,
+                               const std::size_t k_per_block)
+{
+    std::string spec = "";
+    if(integer_divide_ceil(m, m_per_block) * m_per_block - m != 0)
+        spec += "M";
+    if(integer_divide_ceil(n, n_per_block) * n_per_block - n != 0)
+        spec += "N";
+    if(integer_divide_ceil(k, k_per_block) * k_per_block - k != 0)
+        spec += "K";
+    if(spec == "")
+        return "ck::tensor_operation::device::GemmSpecialization::Default";
+
+    return "ck::tensor_operation::device::GemmSpecialization::" + spec + "Padding";
+}
+
+// store the user-provided prologue/epilogue; a non-empty string also sets cde_elem_op to CDEElementOp
+void Operation_Conv_Fwd_Xdl_Cshuffle::update_prologue(const std::string& prologue)
+{
+    if(!prologue.empty())
+    {
+        this->prologue    = prologue;
+        this->cde_elem_op = "CDEElementOp";
+    }
+    else
+    {
+        this->prologue = "";
+    }
+}
+
+void Operation_Conv_Fwd_Xdl_Cshuffle::update_epilogue(const std::string& epilogue)
+{
+    if(!epilogue.empty())
+    {
+        this->epilogue    = epilogue;
+        this->cde_elem_op = "CDEElementOp";
+    }
+    else
+    {
+        this->epilogue = "";
+    }
+}
+
+// Hard-code tuning parameters in modularized fashion, string them together into a vector of
+// instances
+std::vector<Operation_Conv_Fwd_Xdl_Cshuffle> Operation_Conv_Fwd_Xdl_Cshuffle::CreateOperations(
+    const Problem_Conv_Fwd& prob, const std::string& prologue, const std::string& epilogue)
+{
+    std::vector<Operation_Conv_Fwd_Xdl_Cshuffle> result;
+
+    std::vector<operation::TileDesc> tile_descriptions = {
+        // clang-format off
+//  Block|  MPer|  NPer|  KPer| AK1| BK1| MPer| NPer| MXdl| NXdl| NumGemmK|
+//   Size| Block| Block| Block|    |    |  XDL|  XDL|  Per|  Per| Prefetch|
+//       |      |      |      |    |    |     |     | Wave| Wave|    Stage|
+//       |      |      |      |    |    |     |     |     |     |         |
+  {    64,    64,    32,    32,   8,   8,   32,   32,    2,    1,        1},
+  {   256,   128,   256,    32,   8,   8,   32,   32,    4,    2,        1},
+  {   256,   128,   128,    32,   8,   8,   32,   32,    2,    2,        1},
+  {    64,    64,    64,    32,   8,   8,   32,   32,    2,    2,        1},
+  {   256,   256,   128,    32,   8,   8,   32,   32,    4,    2,        1},
+  {   128,   128,   128,    32,   8,   8,   32,   32,    4,    2,        1}
+        // clang-format on
+    };
+
+    std::vector<operation::BlockTransferDesc> a_block_descriptions = {
+        // clang-format off
+//  ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockLds|
+//   ThreadCluster|  ThreadCluster| SrcAccessOrder|   SrcVectorDim|      SrcScalar|      DstScalar| AddExtraM|
+// Lengths_K0_M_K1|   ArrangeOrder|               |               |      PerVector|   PerVector_K1|          |
+//                |               |               |               |               |               |          |
+  {    S<4, 16, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1},
+  {    S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1},
+  {    S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              1,              8,         1},
+  {    S<4, 16, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              1,              8,         1},
+  {    S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1},
+  {    S<4, 32, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1}
+        // clang-format on
+    };
+
+    std::vector<operation::BlockTransferDesc> b_block_descriptions = {
+        // clang-format off
+//  BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockLds|
+//   ThreadCluster|  ThreadCluster| SrcAccessOrder|   SrcVectorDim|      SrcScalar|      DstScalar| AddExtraN|
+// Lengths_K0_N_K1|   ArrangeOrder|               |               |      PerVector|   PerVector_K1|          |
+//                |               |               |               |               |               |          |
+  {    S<4, 16, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1},
+  {    S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1},
+  {    S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              1,              8,         1},
+  {    S<4, 16, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              1,              8,         1},
+  {    S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1},
+  {    S<4, 32, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1}
+        // clang-format on
+    };
+
+    std::vector<operation::CShuffleDesc> cshuffle_descriptions = {
+        // clang-format off
+//    CShuffle|    CShuffle|
+// MXdlPerWave| NXdlPerWave|
+//  PerShuffle|  PerShuffle|
+//            |            |
+  {          1,           1},
+  {          1,           1},
+  {          1,           1},
+  {          1,           1},
+  {          1,           1},
+  {          1,           1}
+        // clang-format on
+    };
+
+    std::vector<operation::CBlockTransferDesc> c_block_descriptions = {
+        // clang-format off
+// CBlockTransferClusterLengths|  CBlockTransfer
+//         _MBlock_MWaveMPerXdl| ScalarPerVector
+//         _NBlock_NWaveNPerXdl|   _NWaveNPerXdl
+//                             |                
+  {              S<1, 16, 1, 4>,               1},
+  {              S<1, 32, 1, 8>,               8},
+  {              S<1, 32, 1, 8>,               8},
+  {              S<1, 16, 1, 4>,               1},
+  {              S<1, 32, 1, 8>,               8},
+  {              S<1, 16, 1, 8>,               8}
+        // clang-format on
+    };
+
+    assert(tile_descriptions.size() == a_block_descriptions.size());
+    assert(tile_descriptions.size() == b_block_descriptions.size());
+    assert(tile_descriptions.size() == cshuffle_descriptions.size());
+    assert(tile_descriptions.size() == c_block_descriptions.size());
+
+    // combine the i-th entry of each description table into one operation and append it to result
+    for(std::size_t i = 0; i < tile_descriptions.size(); i++)
+    {
+        Operation_Conv_Fwd_Xdl_Cshuffle x;
+        x.NumDim           = prob.NumDim;
+        x.tile_desc        = tile_descriptions[i];
+        x.a_block_transfer = a_block_descriptions[i];
+        x.b_block_transfer = b_block_descriptions[i];
+        x.cshuffle         = cshuffle_descriptions[i];
+        x.c_block_transfer = c_block_descriptions[i];
+        x.A                = TensorDesc{prob.ADataType, prob.ALayout};
+        x.B                = TensorDesc{prob.BDataType, prob.BLayout};
+        x.E                = TensorDesc{prob.EDataType, prob.ELayout};
+        x.Ds               = Transform(prob.DsLayout, prob.DsDataType, [](auto lo, auto dt) {
+            return TensorDesc{dt, lo};
+        });
+        x.a_elem_op        = prob.AElementOp;
+        x.b_elem_op        = prob.BElementOp;
+        x.cde_elem_op      = prob.CDEElementOp;
+        x.update_prologue(prologue);
+        x.update_epilogue(epilogue);
+        result.push_back(x);
+    }
+    return result;
+}
+
+// set up instances when not provided with a problem specification, use default operation values
+std::vector<Operation_Conv_Fwd_Xdl_Cshuffle>
+Operation_Conv_Fwd_Xdl_Cshuffle::CreateOperations(const std::string& prologue,
+                                                  const std::string& epilogue)
+{
+    Problem_Conv_Fwd prob;
+    return CreateOperations(prob, prologue, epilogue);
+}
+
+static const char* const CopyDevice_ConvTemplate =
+    R"(
+${Prologue}
+${Epilogue}
+
+using CDEElementOp = Epilogue;
+using DeviceConv = ck::tensor_operation::device::CodegenDeviceGroupedConvFwdMultipleABD_Xdl_CShuffle<${NumDim}, ${LayoutA}, ${LayoutB}, ${LayoutDs}, ${LayoutE}, ${ADataType}, ${BDataType}, ${AccDataType}, ${CShuffleDataType}, ${DsDataType}, ${EDataType}, ${AElementwiseOperation}, ${BElementwiseOperation}, ${CDEElementwiseOperation}, ${ConvSpecialization}, ${GemmSpecialization}, ${NumGemmkPrefetchStage}, ${BlockSize}, ${MPerBlock}, ${NPerBlock}, ${KPerBlock}, ${AK1}, ${BK1}, ${MPerXDL}, ${NPerXDL}, ${MXdlPerWave}, ${NXdlPerWave}, ${ABlockTransferThreadClusterLengths_AK0_M_AK1}, ${ABlockTransferThreadClusterArrangeOrder}, ${ABlockTransferSrcAccessOrder}, ${ABlockTransferSrcVectorDim}, ${ABlockTransferSrcScalarPerVector}, ${ABlockTransferDstScalarPerVector_AK1}, ${ABlockLdsExtraM}, ${BBlockTransferThreadClusterLengths_BK0_N_BK1}, ${BBlockTransferThreadClusterArrangeOrder}, ${BBlockTransferSrcAccessOrder}, ${BBlockTransferSrcVectorDim}, ${BBlockTransferSrcScalarPerVector}, ${BBlockTransferDstScalarPerVector_BK1}, ${BBlockLdsExtraN}, ${CShuffleMXdlPerWavePerShuffle}, ${CShuffleNXdlPerWavePerShuffle}, ${CDEBlockTransferClusterLengths_MBlock_MPerBlock_NBlock_NPerBlock}, ${CDEBlockTransferScalarPerVector_NPerBlock}>;
+
+constexpr ck::index_t NumATensor = ck::tensor_operation::device::GetNumABTensors<false, ${ADataType}>();
+constexpr ck::index_t NumBTensor = ck::tensor_operation::device::GetNumABTensors<false, ${BDataType}>();
+
+extern "C" __global__ void run_${name}(
+    const ${ADataType}* in_dev,
+    const ${BDataType}* wei_dev,
+    ${EDataType}* __restrict__ out_dev,
+    ck::Array<ck::index_t, ${NumDim} + 3> in_lengths,
+    ck::Array<ck::index_t, ${NumDim} + 3> in_strides,
+    ck::Array<ck::index_t, ${NumDim} + 3> wei_lengths,
+    ck::Array<ck::index_t, ${NumDim} + 3> wei_strides,
+    ck::Array<ck::index_t, ${NumDim} + 3> out_lengths,
+    ck::Array<ck::index_t, ${NumDim} + 3> out_strides,
+    ck::Array<ck::index_t, ${NumDim}> conv_filter_strides,
+    ck::Array<ck::index_t, ${NumDim}> conv_filter_dilations,
+    ck::Array<ck::index_t, ${NumDim}> input_left_pads,
+    ck::Array<ck::index_t, ${NumDim}> input_right_pads,
+    const ${AElementwiseOperation} a_element_op,
+    const ${BElementwiseOperation} b_element_op,
+    const ${CDEElementwiseOperation} cde_element_op
+){
+    
+
+    auto arg = DeviceConv::Argument(in_dev,
+                                    wei_dev,
+                                    ck::Array<const void*, 0>{},
+                                    out_dev,
+                                    in_lengths,
+                                    in_strides,
+                                    wei_lengths,
+                                    wei_strides,
+                                    ck::Array<ck::Array<ck::index_t, ${NumDim} + 3>, 0>{},
+                                    ck::Array<ck::Array<ck::index_t, ${NumDim} + 3>, 0>{},
+                                    out_lengths,
+                                    out_strides,
+                                    conv_filter_strides,
+                                    conv_filter_dilations,
+                                    input_left_pads,
+                                    input_right_pads,
+                                    ${AElementwiseOperation}{},
+                                    ${BElementwiseOperation}{},
+                                    ${CDEElementwiseOperation}{1.0f, 1.0f});
+
+    constexpr ck::LoopScheduler LoopSched = ck::make_default_loop_scheduler();
+
+    // GridwiseGemm
+    using GridwiseGemm = DeviceConv::GridwiseGemm;
+
+    static constexpr auto I0 = ck::Number<0>{};
+
+    ck::tensor_operation::device::device_grouped_conv_fwd_multiple_abd_xdl_cshuffle<
+                    GridwiseGemm,
+                    const ${ADataType}*,
+                    const ${BDataType}*,
+                    typename GridwiseGemm::DsGridPointer,
+                    ${EDataType},
+                    ${AElementwiseOperation},
+                    ${BElementwiseOperation},
+                    ${CDEElementwiseOperation},
+                    DeviceConv::AGridDesc_AK0_M_AK1,
+                    DeviceConv::BGridDesc_BK0_N_BK1,
+                    DeviceConv::DsGridDesc_MBlock_MPerBlock_NBlock_NPerBlock,
+                    DeviceConv::EGridDesc_MBlock_MPerBlock_NBlock_NPerBlock,
+                    DeviceConv::Block2ETileMap,
+                    ck::tensor_operation::device::ComputePtrOffsetOfStridedBatch<NumATensor, NumBTensor, 0>,
+                    ck::integral_constant<bool, true>{},
+                    false,
+                    false>
+                    (
+                     arg.p_as_grid_.At(I0),
+                     arg.p_bs_grid_.At(I0),
+                     arg.p_ds_grid_,
+                     arg.p_e_grid_,
+                     arg.a_element_op_,
+                     arg.b_element_op_,
+                     arg.cde_element_op_,
+                     arg.a_g_n_c_wis_lengths_[0], // Group count
+                     arg.a_grid_desc_ak0_m_ak1_,
+                     arg.b_grid_desc_bk0_n_bk1_,
+                     arg.ds_grid_desc_mblock_mperblock_nblock_nperblock_,
+                     arg.e_grid_desc_mblock_mperblock_nblock_nperblock_,
+                     arg.block_2_etile_map_,
+                     arg.compute_ptr_offset_of_batch_
+                    );
+
+}
+)";
+
+// use hardcoded instances from vector of operations to substitute values into instance template
+Solution Operation_Conv_Fwd_Xdl_Cshuffle::ToSolution() const
+{
+    std::unordered_map<std::string, std::string> values = {
+        {"name",
+         std::to_string(this->tile_desc.block_size) + "_" +
+             std::to_string(this->tile_desc.m_per_block) + "_" +
+             std::to_string(this->tile_desc.n_per_block) + "_" +
+             std::to_string(this->tile_desc.k_per_block) + "_" +
+             std::to_string(this->tile_desc.ak1) + "_" + std::to_string(this->tile_desc.bk1) + "_" +
+             std::to_string(this->tile_desc.m_per_XDL) + "_" +
+             std::to_string(this->tile_desc.n_per_XDL) + "_" +
+             std::to_string(this->tile_desc.m_Xdl_per_wave) + "_" +
+             std::to_string(this->tile_desc.n_Xdl_per_wave)},
+        {"NumDim", std::to_string(this->NumDim)},
+        {"LayoutA", ToString(this->A.layout)},
+        {"LayoutB", ToString(this->B.layout)},
+        {"LayoutDs",
+         MakeTuple(Transform(this->Ds, [](auto tensor) { return ToString(tensor.layout); }))},
+        {"LayoutE", ToString(this->E.layout)},
+        {"ADataType", ToString(this->A.element)},
+        {"BDataType", ToString(this->B.element)},
+        {"AccDataType", ToString(this->acc)},
+        {"ComputeDataType", ToString(this->A.element)},
+        {"CShuffleDataType", ToString(this->cs_type)},
+        {"DsDataType",
+         MakeTuple(Transform(this->Ds, [](auto tensor) { return ToString(tensor.element); }))},
+        {"EDataType", ToString(this->E.element)},
+        {"AElementwiseOperation", this->a_elem_op},
+        {"BElementwiseOperation", this->b_elem_op},
+        {"CDEElementwiseOperation", this->cde_elem_op},
+        {"Prologue", this->prologue},
+        {"Epilogue", this->epilogue},
+        {"ConvSpecialization", this->conv_specialization},
+        {"GemmSpecialization", this->gemm_specialization},
+        {"NumGemmkPrefetchStage", std::to_string(this->tile_desc.num_gemmk_prefetch_stage)},
+        {"BlockSize", std::to_string(this->tile_desc.block_size)},
+        {"MPerBlock", std::to_string(this->tile_desc.m_per_block)},
+        {"NPerBlock", std::to_string(this->tile_desc.n_per_block)},
+        {"KPerBlock", std::to_string(this->tile_desc.k_per_block)},
+        {"AK1", std::to_string(this->tile_desc.ak1)},
+        {"BK1", std::to_string(this->tile_desc.bk1)},
+        {"MPerXDL", std::to_string(this->tile_desc.m_per_XDL)},
+        {"NPerXDL", std::to_string(this->tile_desc.n_per_XDL)},
+        {"MXdlPerWave", std::to_string(this->tile_desc.m_Xdl_per_wave)},
+        {"NXdlPerWave", std::to_string(this->tile_desc.n_Xdl_per_wave)},
+        {"ABlockTransferThreadClusterLengths_AK0_M_AK1",
+         this->a_block_transfer.thread_cluster_length},
+        {"ABlockTransferThreadClusterArrangeOrder",
+         this->a_block_transfer.thread_cluster_arrange_order},
+        {"ABlockTransferSrcAccessOrder", this->a_block_transfer.src_access_order},
+        {"ABlockTransferSrcVectorDim", std::to_string(this->a_block_transfer.src_vec_dim)},
+        {"ABlockTransferSrcScalarPerVector",
+         std::to_string(this->a_block_transfer.src_scalar_per_vector)},
+        {"ABlockTransferDstScalarPerVector_AK1",
+         std::to_string(this->a_block_transfer.dst_scalar_per_vector_k1)},
+        {"ABlockLdsExtraM", std::to_string(this->a_block_transfer.lds_add_extra_dim)},
+        {"BBlockTransferThreadClusterLengths_BK0_N_BK1",
+         this->b_block_transfer.thread_cluster_length},
+        {"BBlockTransferThreadClusterArrangeOrder",
+         this->b_block_transfer.thread_cluster_arrange_order},
+        {"BBlockTransferSrcAccessOrder", this->b_block_transfer.src_access_order},
+        {"BBlockTransferSrcVectorDim", std::to_string(this->b_block_transfer.src_vec_dim)},
+        {"BBlockTransferSrcScalarPerVector",
+         std::to_string(this->b_block_transfer.src_scalar_per_vector)},
+        {"BBlockTransferDstScalarPerVector_BK1",
+         std::to_string(this->b_block_transfer.dst_scalar_per_vector_k1)},
+        {"BBlockLdsExtraN", std::to_string(this->b_block_transfer.lds_add_extra_dim)},
+        {"CShuffleMXdlPerWavePerShuffle",
+         std::to_string(this->cshuffle.m_Xdl_per_wave_per_shuffle)},
+        {"CShuffleNXdlPerWavePerShuffle",
+         std::to_string(this->cshuffle.n_Xdl_per_wave_per_shuffle)},
+        {"CDEBlockTransferClusterLengths_MBlock_MPerBlock_NBlock_NPerBlock",
+         this->c_block_transfer.cluster_lengths_m_block_m_wave_m_per_Xdl_n_block_n_wave_n_per_Xdl},
+        {"CDEBlockTransferScalarPerVector_NPerBlock",
+         std::to_string(this->c_block_transfer.scalar_per_vector_n_wave_n_per_Xdl)},
+    };
+
+    return Solution{InterpolateString(CopyDevice_ConvTemplate, values), std::move(values)};
+}
+
+} // namespace conv
+} // namespace host
+} // namespace ck
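The padding rule in `GetGemmSpec` is easiest to check with concrete numbers: a dimension needs padding exactly when rounding it up to a multiple of its per-block tile size changes it. Below is a self-contained sketch of that rule under stated assumptions: `ceil_div` is a hypothetical stand-in for the `integer_divide_ceil` helper used in the source, and `padded_dims` and `gemm_spec_name` are illustrative names, not CK functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for integer_divide_ceil: smallest q with q * y >= x.
inline std::size_t ceil_div(std::size_t x, std::size_t y)
{
    assert(y > 0);
    return (x + y - 1) / y;
}

// Returns the letters of the dimensions (in M, N, K order) that are not tile-aligned.
inline std::string padded_dims(std::size_t m,
                               std::size_t n,
                               std::size_t k,
                               std::size_t m_per_block,
                               std::size_t n_per_block,
                               std::size_t k_per_block)
{
    std::string spec;
    if(ceil_div(m, m_per_block) * m_per_block != m)
        spec += "M";
    if(ceil_div(n, n_per_block) * n_per_block != n)
        spec += "N";
    if(ceil_div(k, k_per_block) * k_per_block != k)
        spec += "K";
    return spec;
}

// Builds the specialization name the same way the diff does: Default, or <dims>Padding.
inline std::string gemm_spec_name(std::size_t m,
                                  std::size_t n,
                                  std::size_t k,
                                  std::size_t m_per_block,
                                  std::size_t n_per_block,
                                  std::size_t k_per_block)
{
    const std::string prefix = "ck::tensor_operation::device::GemmSpecialization::";
    const std::string spec   = padded_dims(m, n, k, m_per_block, n_per_block, k_per_block);
    return spec.empty() ? prefix + "Default" : prefix + spec + "Padding";
}
```

For a 128x128x32 tile, a 256x256x64 problem is fully aligned and maps to `Default`, while 100x256x64 needs only `M` padded because 100 rounds up to 128.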
--- a/codegen/src/headers.cpp
+++ b/codegen/src/headers.cpp
@@ -14,4 +14,4 @@ std::unordered_map<std::string_view, std::string_view> GetHeaders()
 }

 } // namespace host
-} // namespace ck
+} // namespace ck
--- a/codegen/src/types.cpp
+++ b/codegen/src/types.cpp
@@ -29,12 +29,20 @@ std::string ToString(DataType dt)
    throw std::runtime_error("Incorrect data type");
 }

+Layout ToLayout(bool Trans) { return Trans ? Layout::Column : Layout::Row; }
+
 std::string ToString(Layout dl)
 {
    switch(dl)
    {
    case Layout::Row: return "ck::tensor_layout::gemm::RowMajor";
    case Layout::Column: return "ck::tensor_layout::gemm::ColumnMajor";
+    case Layout::GKCYX: return "ck::tensor_layout::convolution::GKCYX";
+    case Layout::GKYXC: return "ck::tensor_layout::convolution::GKYXC";
+    case Layout::GNHWK: return "ck::tensor_layout::convolution::GNHWK";
+    case Layout::GNHWC: return "ck::tensor_layout::convolution::GNHWC";
+    case Layout::NHWGC: return "ck::tensor_layout::convolution::NHWGC";
+    case Layout::NHWGK: return "ck::tensor_layout::convolution::NHWGK";
    }
    throw std::runtime_error("Incorrect layout");
 }
--- a/codegen/src/utils.cpp
+++ b/codegen/src/utils.cpp
@@ -1,5 +1,5 @@
 // SPDX-License-Identifier: MIT
-// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.

 #include "ck/host/utils.hpp"