mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-04-19 22:39:03 +00:00
* chore(copyright) update library wide CMakeLists.txt files copyright header template * Fix build --------- Co-authored-by: Sami Remes <samremes@amd.com>
Client Example: Grouped GEMM with FastGELU Activation
Theory
This client example demonstrates grouped GEMM fused with FastGELU activation. Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, and FastGELU is a fast approximation of the GELU activation used in transformers and MLPs.
Mathematical Formulation:
For G groups, each with its own A_g, B_g:
- GEMM:
Y_g = A_g \times B_g - FastGELU:
E_g = \text{FastGELU}(Y_g)
FastGELU is defined as:
\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
where \sigma is the sigmoid function.
Algorithmic Background:
- Each group can have different matrix sizes and strides.
- The kernel launches a grid covering all groups, with each block assigned to a group.
- FastGELU is applied in the epilogue for each group.
How to Run
Prerequisites
Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.
Build and run
cd composable_kernel/client_example/17_grouped_gemm_fastgelu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
# Example run
./grouped_gemm_fastgelu
Source Code Structure
Directory Layout
client_example/17_grouped_gemm_fastgelu/
├── grouped_gemm_fastgelu.cpp # Main client example: grouped GEMM + FastGELU
├── CMakeLists.txt # Build configuration for the example
Key Functions
- main() (in
grouped_gemm_fastgelu.cpp):
Sets up input matrices for each group, configures GEMM and epilogue parameters, launches the grouped kernel, and verifies the result. - Grouped GEMM kernel invocation:
Uses the Composable Kernel device API to launch grouped GEMM with FastGELU activation.
Additional Details
- Supports multiple groups with different matrix shapes.
- Example parameters can be adjusted in the source for different workloads.
Related Examples
- 15_grouped_gemm: Grouped GEMM in the main example directory
- 04_gemm_add_add_fastgelu: GEMM with FastGELU fusion