# GEMM with Add, Add, and FastGELU Activation

## Theory

This example demonstrates a **GEMM operation fused with two addition operations and FastGELU activation**. This pattern appears in transformer feed-forward networks and other neural architectures where a linear transformation is followed by a bias addition, a residual addition, and a non-linear activation.

**Mathematical Formulation:**

$$
E = \text{FastGELU}((A \times B) + D_0 + D_1)
$$

- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $D_0$: [N] bias vector (broadcast across the M rows)
- $D_1$: [M, N] residual tensor
- $E$: [M, N] output matrix

FastGELU is an efficient approximation of GELU:

$$
\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
$$

where $\sigma$ is the sigmoid function.

**Algorithmic Background:**

- The GEMM result is kept in registers; the bias and residual are added and FastGELU is applied before the result is written to global memory.
- No intermediate results are written to global memory.

Code sketches of this fused computation are given at the end of this document.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/04_gemm_add_add_fastgelu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_add_add_fastgelu_xdl -M 2048 -N 8192 -K 2048 --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/04_gemm_add_add_fastgelu/
└── gemm_add_add_fastgelu_xdl.cpp      # Main example: sets up, runs, and verifies GEMM+Add+Add+FastGELU
include/ck/tensor_operation/gpu/device/
└── device_gemm_multiple_d.hpp         # Device-level API for multi-tensor GEMM
include/ck/tensor_operation/gpu/device/impl/
├── device_gemm_xdl_cshuffle_v3.hpp    # XDL implementation with C-Shuffle epilogue
└── device_gemm_fastgelu_impl.hpp      # FastGELU-specific implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_gemm_multiple_d_xdl.hpp   # Grid-level multi-stage GEMM
include/ck/tensor_operation/gpu/element/
└── element_wise_operation.hpp         # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceGemmMultipleD** (in `device_gemm_multiple_d.hpp`): Device API for GEMM with multiple auxiliary tensors and fused epilogues.
- **gridwise_gemm_multiple_d_xdl** (in `gridwise_gemm_multiple_d_xdl.hpp`): Implements the tiled/blocking GEMM kernel with a multi-stage epilogue.
- **element_wise_operation** (in `element_wise_operation.hpp`): Defines FastGELU and other elementwise operations.

This example demonstrates how Composable Kernel supports complex multi-stage epilogue fusion for advanced neural network architectures.
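For a concrete view of the fused computation, the following host-side reference expresses $E = \text{FastGELU}((A \times B) + D_0 + D_1)$ as plain loops. This is a minimal sketch for illustration, assuming row-major layouts and `float` data; it is not the verification code used by the example.

```cpp
#include <cmath>
#include <vector>

// FastGELU approximation: x * sigmoid(1.702 * x) = x / (1 + exp(-1.702 * x))
float fast_gelu(float x) { return x / (1.0f + std::exp(-1.702f * x)); }

// Host reference for E = FastGELU(A*B + D0 + D1); a sketch, assuming
// row-major A [M,K], B [K,N], D1/E [M,N], and D0 [N] broadcast over rows.
void reference_gemm_add_add_fastgelu(const std::vector<float>& A,
                                     const std::vector<float>& B,
                                     const std::vector<float>& D0,
                                     const std::vector<float>& D1,
                                     std::vector<float>& E,
                                     int M, int N, int K)
{
    for(int m = 0; m < M; ++m)
    {
        for(int n = 0; n < N; ++n)
        {
            // GEMM accumulation
            float acc = 0.0f;
            for(int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];

            // Fused epilogue: bias add, residual add, then activation
            E[m * N + n] = fast_gelu(acc + D0[n] + D1[m * N + n]);
        }
    }
}
```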
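On the device side, the fused epilogue amounts to a small per-element functor applied to each GEMM accumulator value before it is stored. The sketch below is modeled loosely on the Add+Add+FastGELU elementwise operation defined in `element_wise_operation.hpp`; the struct name and signature here are illustrative assumptions, not CK's exact implementation.

```cpp
#include <cmath>

// Illustrative sketch of a fused Add+Add+FastGELU epilogue functor.
// Name and signature are assumptions for exposition; the real CK functor
// lives in include/ck/tensor_operation/gpu/element/element_wise_operation.hpp
// and is callable from device code.
struct AddAddFastGeluSketch
{
    // e = FastGELU(c + d0 + d1), applied once per output element:
    // c is the GEMM accumulator, d0 the bias, d1 the residual.
    void operator()(float& e, const float& c, const float& d0, const float& d1) const
    {
        const float x = c + d0 + d1;                 // bias + residual add
        e = x / (1.0f + std::exp(-1.702f * x));      // x * sigmoid(1.702 * x)
    }
};
```

Because the functor consumes the accumulator while it is still in registers, the bias add, residual add, and activation cost no extra round trips to global memory, which is the point of the multi-stage epilogue fusion this example demonstrates.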