GEMM with Add, Add, and FastGELU Activation
Theory
This example demonstrates a GEMM operation fused with two addition operations and FastGELU activation. This pattern is used in transformer feed-forward networks and other neural architectures where a linear transformation is followed by bias addition, residual addition, and a non-linear activation.
Mathematical Formulation:
E = \text{FastGELU}((A \times B) + D_0 + D_1)
- A: [M, K] input matrix
- B: [K, N] weight matrix
- D_0: [N] bias vector (broadcasted)
- D_1: [M, N] residual tensor
- E: [M, N] output
FastGELU is an efficient approximation of GELU:
\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
where \sigma is the sigmoid function.
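As a concrete reference, the sketch below evaluates the fused formula on the host with a naive triple loop. The row-major layouts, float data type, and function name are assumptions made purely for illustration and do not reflect the kernel's internal data layout.

```cpp
#include <cmath>
#include <vector>

// Sigmoid-based FastGELU approximation: FastGELU(x) = x * sigmoid(1.702 * x).
inline float fast_gelu(float x) { return x / (1.0f + std::exp(-1.702f * x)); }

// Naive host reference for E = FastGELU(A * B + D0 + D1).
// A: [M, K], B: [K, N], D0: [N] (broadcast over rows), D1: [M, N], E: [M, N].
void reference_gemm_add_add_fastgelu(const std::vector<float>& A,
                                     const std::vector<float>& B,
                                     const std::vector<float>& D0,
                                     const std::vector<float>& D1,
                                     std::vector<float>& E,
                                     int M, int N, int K)
{
    for(int m = 0; m < M; ++m)
    {
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(int k = 0; k < K; ++k)
            {
                acc += A[m * K + k] * B[k * N + n];
            }
            // Fused epilogue: bias add, residual add, then activation.
            E[m * N + n] = fast_gelu(acc + D0[n] + D1[m * N + n]);
        }
    }
}
```

A CPU reference of this shape is the kind of baseline that a GPU result can be validated against.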
Algorithmic Background:
- The GEMM result is kept in registers, bias and residual are added, and FastGELU is applied before writing to global memory.
- No intermediate results are written to global memory.
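Conceptually, the fused epilogue can be viewed as a small functor applied to every accumulator element while it still resides in registers. The struct below is an illustrative stand-in, not the library's actual element-wise operation class; the name and signature are hypothetical.

```cpp
#include <cmath>

// Hypothetical per-element epilogue functor (illustration only).
// An equivalent operation is applied to each GEMM accumulator value c,
// together with the bias element d0 and the residual element d1,
// before the result e is written to global memory.
struct AddAddFastGeluSketch
{
    void operator()(float& e, const float& c, const float& d0, const float& d1) const
    {
        const float x = c + d0 + d1;
        e = x / (1.0f + std::exp(-1.702f * x)); // x * sigmoid(1.702 * x)
    }
};
```

Because the functor only needs the accumulator and the matching D elements, no intermediate [M, N] tensor has to be materialized in global memory.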
How to Run
Prerequisites
Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.
Build and run
```bash
cd composable_kernel/example/04_gemm_add_add_fastgelu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_add_add_fastgelu_xdl -M 2048 -N 8192 -K 2048 --verify=1 --time=1
```
Source Code Structure
Directory Layout
```
example/04_gemm_add_add_fastgelu/
├── gemm_add_add_fastgelu_xdl.cpp        # Main example: sets up, runs, and verifies GEMM+Add+Add+FastGELU
include/ck/tensor_operation/gpu/device/
│   └── device_gemm_multiple_d.hpp       # Device-level API for multi-tensor GEMM
include/ck/tensor_operation/gpu/device/impl/
│   ├── device_gemm_xdl_cshuffle_v3.hpp  # XDL with C-Shuffle epilogue
│   └── device_gemm_fastgelu_impl.hpp    # FastGELU-specific implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_gemm_multiple_d_xdl.hpp # Grid-level multi-stage GEMM
include/ck/tensor_operation/gpu/element/
    └── element_wise_operation.hpp       # Elementwise operation definitions
```
Key Classes and Functions
- DeviceGemmMultipleD (in device_gemm_multiple_d.hpp): Device API for GEMM with multiple auxiliary tensors and fused epilogues.
- gridwise_gemm_multiple_d_xdl (in gridwise_gemm_multiple_d_xdl.hpp): Implements the tiled/blocking GEMM kernel with multi-stage epilogue.
- element_wise_operation (in element_wise_operation.hpp): Defines FastGELU and other elementwise operations.
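To convey how a "GEMM with multiple D tensors plus fused epilogue" abstraction fits together, here is a generic host-side sketch that threads an arbitrary number of D tensors and an element-wise functor through a naive GEMM. All names are hypothetical and the code only illustrates the structure of the abstraction, not the device implementation.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a "GEMM with multiple D tensors" operation:
// E[m, n] = cde_op(acc[m, n], D0[m, n], ..., Dk[m, n]) with acc = A * B.
// Broadcasting of lower-rank D tensors (e.g. a [N] bias) is omitted for brevity.
template <typename CDEOp, std::size_t NumD>
void gemm_multiple_d_reference(const std::vector<float>& A,
                               const std::vector<float>& B,
                               const std::array<const std::vector<float>*, NumD>& Ds,
                               std::vector<float>& E,
                               int M, int N, int K,
                               CDEOp cde_op)
{
    for(int m = 0; m < M; ++m)
    {
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(int k = 0; k < K; ++k)
            {
                acc += A[m * K + k] * B[k * N + n];
            }

            // Gather the D elements for this output position and apply the
            // fused epilogue operation.
            std::array<float, NumD> d{};
            for(std::size_t i = 0; i < NumD; ++i)
            {
                d[i] = (*Ds[i])[m * N + n];
            }
            E[m * N + n] = cde_op(acc, d);
        }
    }
}

// Example epilogue for this pattern: e = FastGELU(c + d0 + d1).
struct AddAddFastGeluRef
{
    float operator()(float c, const std::array<float, 2>& d) const
    {
        const float x = c + d[0] + d[1];
        return x / (1.0f + std::exp(-1.702f * x));
    }
};
```

Swapping in a different epilogue functor or a different number of D tensors changes the fusion without touching the GEMM loop itself, which is the idea behind the multi-D device API.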
This example demonstrates how Composable Kernel supports complex multi-stage epilogue fusion for advanced neural network architectures.