# GEMM with Bias, Elementwise, and Permute Fusion

## Theory

This example demonstrates **GEMM fused with bias addition, elementwise operation, and permutation**. This pattern is used in transformer models and other neural architectures where a linear transformation is followed by bias, activation, and layout transformation.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
- Bias: $Z = Y + \text{bias}$
- Elementwise: $E = f(Z)$ (e.g., activation)
- Permute: $O = \text{permute}(E, \text{axes})$

**Algorithmic Background:**
- The GEMM result is kept in registers, bias and elementwise ops are fused in the epilogue, and permutation is applied before writing to global memory.
- Permutation changes the layout/order of tensor axes (e.g., NCHW to NHWC).
- This fusion reduces memory traffic and is common in transformer and CNN pipelines.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/25_gemm_bias_e_permute
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_bias_e_permute_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/25_gemm_bias_e_permute/
└── gemm_bias_e_permute_xdl.cpp              # Main example: sets up, runs, and verifies GEMM+Bias+Elementwise+Permute

include/ck/tensor_operation/gpu/device/
└── device_gemm_bias_e_permute.hpp           # Device-level API for fused GEMM
include/ck/tensor_operation/gpu/device/impl/
└── device_gemm_bias_e_permute_impl.hpp      # Implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_gemm_bias_e_permute.hpp         # Grid-level kernel
```

### Key Classes and Functions

- **DeviceGemmBiasEPermute** (in `device_gemm_bias_e_permute.hpp`):
  Device API for GEMM fused with bias, elementwise, and permutation.
- **gridwise_gemm_bias_e_permute** (in `gridwise_gemm_bias_e_permute.hpp`):
  Implements the tiled/blocking GEMM kernel with fused epilogue and permutation.

This example demonstrates how Composable Kernel supports efficient fusion of linear, bias, activation, and layout operations for deep learning models.
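To make the four steps from the Mathematical Formulation concrete, here is a minimal CPU reference sketch in plain C++. It is not part of Composable Kernel and does not use its API. The function name `gemm_bias_relu_permute_ref`, the choice of ReLU for $f$, the per-column bias broadcast, and the specific permutation (viewing the $M$ dimension as $M_0 \times M_1$ and storing the result as $[M_1, M_0, N]$) are all assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical elementwise op f; ReLU is assumed here purely as an example.
inline float relu(float x) { return std::max(x, 0.0f); }

// Hypothetical CPU reference (not a Composable Kernel function).
// a:    M x K, row-major, with M = M0 * M1
// b:    K x N, row-major
// bias: N entries, broadcast over rows (assumed bias shape)
// The logical result E[m0][m1][n] is written as O[m1][m0][n],
// i.e. permute(E, axes = (1, 0, 2)).
std::vector<float> gemm_bias_relu_permute_ref(const std::vector<float>& a,
                                              const std::vector<float>& b,
                                              const std::vector<float>& bias,
                                              std::size_t M0, std::size_t M1,
                                              std::size_t N, std::size_t K)
{
    const std::size_t M = M0 * M1;
    assert(a.size() == M * K && b.size() == K * N && bias.size() == N);

    std::vector<float> out(M * N, 0.0f);
    for (std::size_t m0 = 0; m0 < M0; ++m0) {
        for (std::size_t m1 = 0; m1 < M1; ++m1) {
            const std::size_t m = m0 * M1 + m1; // row index in the unpermuted result
            for (std::size_t n = 0; n < N; ++n) {
                // GEMM: accumulate in a local variable, loosely mirroring
                // a register accumulator on the GPU.
                float acc = 0.0f;
                for (std::size_t k = 0; k < K; ++k)
                    acc += a[m * K + k] * b[k * N + n];

                // Fused epilogue: bias add, then elementwise op.
                const float e = relu(acc + bias[n]);

                // Permute on store: the intermediate Y, Z, E tensors are
                // never materialized; only the permuted output is written.
                out[(m1 * M0 + m0) * N + n] = e;
            }
        }
    }
    return out;
}
```

The point of the sketch is the single store at the end of the inner loop: the bias, the activation, and the index remapping all happen on the accumulated value before it is written, which is the memory-traffic saving described under Algorithmic Background. A reference of this kind could also serve as a host-side check against device results, although the layout and elementwise op used by the actual example may differ from the ones assumed here.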