# GEMM with Add and Multiply Fusion

## Theory

This example demonstrates **GEMM fused with addition and multiplication operations**. This pattern is used in neural networks for bias addition, scaling, gating, and other elementwise transformations after a linear layer.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
- Add: $Z = Y + D_0$
- Multiply: $E = Z \odot D_1$
- $D_0$, $D_1$: auxiliary tensors (e.g., bias, scale, gate)

**Algorithmic Background:**
- The GEMM result is kept in registers, and the addition and multiplication are fused into the epilogue.
- No intermediate results ($Y$, $Z$) are written to global memory.
- Used for bias+scale, gating, and other fused epilogue patterns.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/46_gemm_add_multiply
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

### Run ```example_gemm_add_multiply_dl_fp16```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: time kernel (0=no, 1=yes)
#arg4 to 11: M (256x), N (128x), K (32x), StrideA, StrideB, StrideD0, StrideD1, StrideE
./bin/example_gemm_add_multiply_dl_fp16 1 1 1
```

## Source Code Structure

### Directory Layout

```
example/46_gemm_add_multiply/
├── gemm_add_multiply_xdl.cpp         # Main example: sets up, runs, and verifies GEMM+Add+Multiply

include/ck/tensor_operation/gpu/device/
│   └── device_gemm_multiple_d.hpp        # Device-level API for multi-tensor GEMM
include/ck/tensor_operation/gpu/device/impl/
│   └── device_gemm_add_multiply_impl.hpp # Add+Multiply implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_gemm_multiple_d_xdl.hpp  # Grid-level multi-stage GEMM
include/ck/tensor_operation/gpu/element/
    └── element_wise_operation.hpp        # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceGemmMultipleD** (in `device_gemm_multiple_d.hpp`): Device API for GEMM with multiple auxiliary tensors and fused epilogues.
- **gridwise_gemm_multiple_d_xdl** (in `gridwise_gemm_multiple_d_xdl.hpp`): Implements the tiled (blocked) GEMM kernel with a multi-stage epilogue.
- **element_wise_operation** (in `element_wise_operation.hpp`): Defines addition, multiplication, and other elementwise operations.

This example demonstrates how Composable Kernel supports efficient fusion of addition and multiplication with GEMM for deep learning and scientific computing.
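
For readers who want to check the arithmetic by hand, the formulation in the Theory section, $E = (A \times B + D_0) \odot D_1$, can be sketched as a plain host-side loop. This is only an illustrative reference, not code from Composable Kernel: the function name `reference_gemm_add_multiply` is hypothetical, and the sketch assumes `float` data in dense, row-major, tightly packed matrices (so the stride arguments of the real example are not modeled).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of Composable Kernel).
// Assumed layout: dense, row-major, tightly packed.
//   A is M x K, B is K x N, and D0, D1 and the result E are M x N.
std::vector<float> reference_gemm_add_multiply(const std::vector<float>& A,
                                               const std::vector<float>& B,
                                               const std::vector<float>& D0,
                                               const std::vector<float>& D1,
                                               std::size_t M,
                                               std::size_t N,
                                               std::size_t K)
{
    assert(A.size() == M * K && B.size() == K * N);
    assert(D0.size() == M * N && D1.size() == M * N);

    std::vector<float> E(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            // GEMM: Y[m][n] = sum_k A[m][k] * B[k][n].
            // The local accumulator stands in for the value a GPU kernel
            // would keep in registers.
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += A[m * K + k] * B[k * N + n];
            }

            // Fused epilogue: add D0, then multiply by D1, with a single
            // store of E and no stored intermediate Y or Z.
            E[m * N + n] = (acc + D0[m * N + n]) * D1[m * N + n];
        }
    }
    return E;
}
```

A loop of this shape is also a convenient way to reason about what the verification option (`arg1 = 1`) has to compare against, although the example's actual verification path may be implemented differently.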