Grouped GEMM
Grouped General Matrix Multiplication (Grouped GEMM) is a technique used in GPU computing and high-performance computing to batch together multiple independent GEMM operations (matrix multiplications) into a single kernel launch in order to improve performance and efficiency. This folder contains Grouped GEMM examples that use the ck_tile tile-programming implementation.
Quick Tour for New Users
The Grouped GEMM operators are versions of GEMM that run multiple GEMM operations within a single kernel call. Each GEMM operation performs a matrix multiplication. Unlike regular batched GEMM operations where both matrices must be of the same size and have the same configuration, Grouped GEMM operations can take matrices with different sizes and configurations, making them more flexible for diverse workloads.
Let's now break the example into the following parts: parsing arguments, preparing host and device buffers, preparing data, invoking GEMM, and building the example, while explaining each function.
Parsing Arguments
The example takes three arguments: group_count, repeat, and warmup:
- `group_count`: the number of GEMM operations in the group,
- `repeat`: the number of times to repeat the kernel for benchmarking,
- `warmup`: the number of warmup iterations before the actual kernel run time is measured.
// Example
const int group_count = arg_parser.get_int("group_count");
const int repeat = arg_parser.get_int("repeat");
const int warmup = arg_parser.get_int("warmup");
In the next step, the input parameters Ms, Ns, Ks, as well as the corresponding stride_As, stride_Bs, and stride_Cs, are either provided from the command line or generated by default. Since one or more input data sets are expected for A and B, each parameter is stored in a std::vector. The size of each vector is defined by group_count.
// Example
std::vector<ck_tile::index_t> Ms = arg_parser.get_int_vec("Ms");
std::vector<ck_tile::index_t> Ns = arg_parser.get_int_vec("Ns");
std::vector<ck_tile::index_t> Ks = arg_parser.get_int_vec("Ks");
std::vector<ck_tile::index_t> stride_As = arg_parser.get_int_vec("stride_As");
std::vector<ck_tile::index_t> stride_Bs = arg_parser.get_int_vec("stride_Bs");
std::vector<ck_tile::index_t> stride_Cs = arg_parser.get_int_vec("stride_Cs");
Where:
- `Ms` is the M dimension of each GEMM.
- `Ns` is the N dimension of each GEMM.
- `Ks` is the K dimension of each GEMM.
- `stride_As` holds the stride values for matrix A.
- `stride_Bs` holds the stride values for matrix B.
- `stride_Cs` holds the stride values for matrix C.
HostTensor and Device Memory Buffers (for CPU and GPU)
Each parameter Ms, Ns, Ks, stride_As, stride_Bs and stride_Cs contains values for more than one matrix, meaning different matrix sizes and strides can be used for different grouped GEMM computations.
The next step is to properly load the input values. For each input matrix, A and B, and for each output matrix, C, you need to create both HostTensor and DeviceMemory, where:
- `HostTensor` represents the matrix data on the host (CPU). It stores the data before it is transferred to the device for computation.
- `DeviceMemory` represents the matrix data on the device (GPU). This stores the data on the GPU for computation during the Grouped GEMM operation.
HostTensor Buffers (for CPU)
In the first step, create a HostTensor for A, B, and C. HostTensor allocates memory on the host (CPU) to store the matrices, initializing the memory with the appropriate dimensions and values. Below is example code showing how to create HostTensors for those tensors:
// Example
std::vector<ck_tile::HostTensor<ADataType>> a_m_k_tensors;
std::vector<ck_tile::HostTensor<BDataType>> b_k_n_tensors;
std::vector<ck_tile::HostTensor<CDataType>> c_m_n_tensors;
Where:
- `a_m_k_tensors` is the vector of `HostTensor` objects for matrix `A` (with dimensions M × K). Each tensor stores the data for a single GEMM operation.
- `b_k_n_tensors` is the vector of `HostTensor` objects for matrix `B` (with dimensions K × N).
- `c_m_n_tensors` is the vector of `HostTensor` objects for matrix `C` (the output matrix, with dimensions M × N).
The std::vector container is used for this purpose throughout. As mentioned above, the number of HostTensors is equal to group_count.
Device Memory Buffers (for GPU)
Now it's time to allocate memory on the device (GPU) and transfer the data from HostTensor to DeviceMemory for the actual computation.
// Example
std::vector<std::unique_ptr<ck_tile::DeviceMem>> a_m_k_dev_buf;
std::vector<std::unique_ptr<ck_tile::DeviceMem>> b_k_n_dev_buf;
std::vector<std::unique_ptr<ck_tile::DeviceMem>> c_m_n_dev_buf;
Where:
- `a_m_k_dev_buf` is the buffer used for storing matrix A on the GPU.
- `b_k_n_dev_buf` is the buffer used for storing matrix B on the GPU.
- `c_m_n_dev_buf` is the buffer used for storing the result matrix C on the GPU.
Prepare data
In the next step, the input tensors are populated. A pseudorandom number generator, an existing distribution (e.g., FillUniformDistribution), or user data can be used to populate the tensors. Descriptors also need to be created for each input tensor.
Use get_default_stride to get the strides for A, B, and C. get_default_stride is a template function that calculates the default stride for a 2D array based on whether it is row-major or column-major. The template parameter determines whether the storage order is row-major (true) or column-major (false). The function takes four parameters: row, col, stride, and bool_constant<is_row_major>. If the stride is explicitly provided (stride != 0), it is returned as-is. If the stride is not provided (stride == 0), the function computes the default: for row-major order (is_row_major == true), the stride is set to the number of columns (col); for column-major order (is_row_major == false), it is set to the number of rows (row). This function is useful when working with dynamically allocated 2D arrays, where the user may not specify the stride explicitly. It ensures a natural default stride based on the chosen storage order.
// Example, API
template <bool is_row_major>
auto get_default_stride(std::size_t row, std::size_t col, std::size_t stride, bool_constant<is_row_major>)
{
    if(stride != 0)
        return stride; // an explicitly provided stride is returned as-is
    // Natural default: row-major rows are `col` elements apart; column-major columns are `row` apart.
    return is_row_major ? col : row;
}
Where:
- `is_row_major` is a bool template parameter that determines whether the storage order is row-major (true) or column-major (false).
- `row` is the number of rows in the matrix.
- `col` is the number of columns in the matrix.
- `stride` is the current stride (the distance between consecutive elements in memory).
- `bool_constant<is_row_major>` is a tag type that helps differentiate behavior at compile time.
Next, host descriptors are created for each of the tensors A, B, and C. Use the f_host_tensor_descriptor function defined below. This function takes four parameters, row, col, stride, and layout, and returns a HostTensorDescriptor based on the specified layout.
// Example for tensor A
ck_tile::HostTensor<ADataType>(f_host_tensor_descriptor(M, K, stride_As[i], a_layout))
After creating the host tensors, create a DeviceMem for each tensor A, B, and C, and then transfer the data to the device. The get_element_space_size_in_bytes() function is used to get the buffer size in bytes. Use ToDevice() to transfer data from the host to the device. The data that was previously generated (a_m_k_tensors[i].data()) is passed as a parameter to ToDevice().
The final step before running the GEMM operation is to retrieve the pointers to the buffers of A, B, and C stored on the device using ->GetDeviceBuffer() and pack them into a shared container, declared as std::vector<grouped_gemm_kargs> gemm_descs. For example: gemm_descs.push_back({p_a, p_b, p_c, M, N, K, stride_As[i], stride_Bs[i], stride_Cs[i]}). The container should include values such as:
struct GroupedGemmHostArgs
{
const void* a_ptr;
const void* b_ptr;
void* c_ptr;
index_t M;
index_t N;
index_t K;
index_t stride_A;
index_t stride_B;
index_t stride_C;
};
The data prepared in this way can be passed to the invoke_gemm function. This is a templated function that also takes three template parameters: ALayout, BLayout, and CLayout:
// Example, API
template <typename ALayout, typename BLayout, typename CLayout, bool Persistent>
float invoke_gemm(int n_warmup,
int n_repeat,
int group_count,
const std::vector<grouped_gemm_kargs>& args)
invoke_gemm returns the run time in milliseconds. Inside it, the workspace memory required for computation is allocated. Workspace memory on the GPU refers to temporary buffers allocated while certain operations run; this extra space is needed to hold the GEMM descriptions. The following structure can be used to allocate the workspace:
// Example
ck_tile::DeviceMem gemm_workspace;
gemm_workspace.Realloc(GetWorkspaceSize(args));
Finally, the arguments are passed to grouped_gemm and the kernel is launched.
// API
template <typename ALayout, typename BLayout, typename CLayout>
float grouped_gemm(const std::vector<grouped_gemm_kargs>& gemm_descs,
const ck_tile::stream_config& s,
void* kargs_ptr)
All the necessary parameters are set, the tiling is computed, the GEMM pipeline and epilogue are prepared, and the GroupedGemmKernel is launched.
Build
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
# The basic pipeline method on the gemm calculation
make tile_example_grouped_gemm -j
This will result in an executable `build/bin/tile_example_grouped_gemm`.
Example
args:
-Ms M dimensions - (Default: empty).
-Ns N dimensions - (Default: empty).
-Ks K dimensions - (Default: empty).
-stride_As Tensor A strides - (Default: empty).
-stride_Bs Tensor B strides - (Default: empty).
-stride_Cs Tensor C strides - (Default: empty).
-a_layout A tensor data layout - (Default: Row).
-b_layout B tensor data layout - (Default: Col).
-c_layout C tensor data layout - (Default: Row).
-validate 0: no validation, 1: validation on CPU. (Default: 1).
-warmup Number of iterations before benchmarking the kernel. (Default: 10).
-repeat Number of iterations to benchmark the kernel. (Default: 100).
-group_count Group count. (Default: 16).
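An invocation with two groups and explicit shapes might look like the following; the flag names are the ones listed above, but the `=`/comma value syntax is an assumption about the example's argument parser, so check your build's `--help`-style output if it differs:

```shell
# run two grouped GEMMs with different shapes, validating on CPU
./build/bin/tile_example_grouped_gemm -group_count=2 -Ms=128,64 -Ns=256,128 -Ks=64,32 -validate=1
```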