
GEMM with Add, Add, and FastGELU Activation

Theory

This example demonstrates a GEMM operation fused with two addition operations and FastGELU activation. This pattern is used in transformer feed-forward networks and other neural architectures where a linear transformation is followed by bias addition, residual addition, and a non-linear activation.

Mathematical Formulation:

E = \text{FastGELU}((A \times B) + D_0 + D_1)
  • A: [M, K] input matrix
  • B: [K, N] weight matrix
  • D_0: [N] bias vector (broadcasted)
  • D_1: [M, N] residual tensor
  • E: [M, N] output
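
Written out per element, with the bias D_0 broadcast along the M dimension, this is:

E_{m,n} = \text{FastGELU}\left(\sum_{k=0}^{K-1} A_{m,k} \, B_{k,n} + D_{0,n} + D_{1,m,n}\right)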

FastGELU is an efficient approximation of GELU:

\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)

where \sigma is the sigmoid function.
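
As a concrete reference, the approximation is a one-liner on the host. The following is a scalar C++ sketch of the formula above, not necessarily CK's exact device implementation (the in-kernel functor lives in element_wise_operation.hpp and is templated over data types):

#include <cmath>

// Sigmoid-based FastGELU approximation: x * sigma(1.702 * x).
inline float fast_gelu(float x)
{
    const float sig = 1.0f / (1.0f + std::exp(-1.702f * x));
    return x * sig;
}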

Algorithmic Background:

  • The GEMM result is kept in registers; the bias and residual are added and FastGELU is applied before the result is written to global memory.
  • No intermediate results are written to global memory (a naive reference illustrating this fusion follows below).
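As a sketch of the fusion described above, here is a naive host-side reference that reuses the fast_gelu helper from the previous section; gemm_add_add_fastgelu_ref is a hypothetical name, and row-major A, B, D_1, and E are assumed. Each output element is computed in a local accumulator and written exactly once; the real kernel tiles this loop nest across workgroups and applies the same epilogue per register tile.

#include <vector>

// Naive reference for E = FastGELU(A*B + D0 + D1). The accumulator never
// round-trips through global memory, mirroring the fused epilogue.
void gemm_add_add_fastgelu_ref(const std::vector<float>& a,  // [M, K], row-major
                               const std::vector<float>& b,  // [K, N], row-major
                               const std::vector<float>& d0, // [N], broadcast over M
                               const std::vector<float>& d1, // [M, N]
                               std::vector<float>& e,        // [M, N]
                               int M, int N, int K)
{
    for(int m = 0; m < M; ++m)
    {
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.0f; // register-resident in the actual kernel
            for(int k = 0; k < K; ++k)
                acc += a[m * K + k] * b[k * N + n];

            // Fused epilogue: bias + residual + activation, then a single store.
            e[m * N + n] = fast_gelu(acc + d0[n] + d1[m * N + n]);
        }
    }
}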

How to Run

Prerequisites

Follow the instructions in the main Build Guide before building and running this example.

Build and run

cd composable_kernel/example/04_gemm_add_add_fastgelu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_add_add_fastgelu_xdl -M 2048 -N 8192 -K 2048 --verify=1 --time=1

Source Code Structure

Directory Layout

example/04_gemm_add_add_fastgelu/
└── gemm_add_add_fastgelu_xdl.cpp          # Main example: sets up, runs, and verifies GEMM+Add+Add+FastGELU

include/ck/tensor_operation/gpu/device/
└── device_gemm_multiple_d.hpp             # Device-level API for multi-tensor GEMM

include/ck/tensor_operation/gpu/device/impl/
├── device_gemm_xdl_cshuffle_v3.hpp        # XDL with C-Shuffle epilogue
└── device_gemm_fastgelu_impl.hpp          # FastGELU-specific implementation

include/ck/tensor_operation/gpu/grid/
└── gridwise_gemm_multiple_d_xdl.hpp       # Grid-level multi-stage GEMM

include/ck/tensor_operation/gpu/element/
└── element_wise_operation.hpp             # Elementwise operation definitions

Key Classes and Functions

  • DeviceGemmMultipleD (in device_gemm_multiple_d.hpp):
    Device API for GEMM with multiple auxiliary tensors and fused epilogues; a condensed usage sketch follows this list.
  • gridwise_gemm_multiple_d_xdl (in gridwise_gemm_multiple_d_xdl.hpp):
    Implements the tiled/blocking GEMM kernel with multi-stage epilogue.
  • element_wise_operation (in element_wise_operation.hpp):
    Defines FastGELU and other elementwise operations.
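To show how these pieces compose, the following is a condensed outline of the host-side pattern in gemm_add_add_fastgelu_xdl.cpp. The AddAddFastGelu functor is what binds the two adds and the activation into the GEMM epilogue. The device op's tuning template parameters (tile sizes, wave counts, vector widths) are elided here and the argument order is paraphrased from the example source, so treat this as a sketch rather than compilable code:

using AElementOp   = ck::tensor_operation::element_wise::PassThrough;
using BElementOp   = ck::tensor_operation::element_wise::PassThrough;
using CDEElementOp = ck::tensor_operation::element_wise::AddAddFastGelu; // e = FastGELU(c + d0 + d1)

// The full instantiation supplies layouts, data types, and many tuning
// parameters; see the example source for a complete, working instance.
using DeviceOp = ck::tensor_operation::device::DeviceGemmMultipleD_Xdl_CShuffle<
    /* layouts..., data types..., element ops..., tuning parameters... */>;

auto op       = DeviceOp{};
auto argument = op.MakeArgument(p_a, p_b, {p_d0, p_d1}, p_e,
                                M, N, K,
                                stride_a, stride_b,
                                {0 /* D0 bias is broadcast */, stride_d1},
                                stride_e,
                                AElementOp{}, BElementOp{}, CDEElementOp{});
auto invoker = op.MakeInvoker();
invoker.Run(argument, StreamConfig{nullptr, /* time_kernel = */ true});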

This example demonstrates how Composable Kernel supports complex multi-stage epilogue fusion for advanced neural network architectures.