Client Example: GEMM with LayerNorm Fusion

Theory

This client example demonstrates GEMM fused with layer normalization and additional elementwise operations. This pattern is common in transformer feed-forward networks and other architectures where a linear transformation is followed by elementwise operations (bias, residual, activation) and normalization.

Mathematical Formulation:

  • GEMM: Y = A \times B
  • Additions: Z = Y + D_0 + D_1 (bias, residual, etc.)
  • Activation: H = \text{ReLU}(Z) (or another activation)
  • LayerNorm: \text{LayerNorm}(H) = \gamma \cdot \frac{H - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

Here \mu and \sigma^2 are the mean and variance computed over the normalization axis, and \gamma and \beta are learnable scale and shift parameters. (H replaces the A of the activation step above to avoid clashing with the GEMM input matrix A.)
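
As a concrete reference, the sketch below computes one output row of the fused pipeline on the host, using the naive two-pass LayerNorm. It is a minimal illustration only; the function reference_row and its signature are hypothetical and are not part of the Composable Kernel API.

#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical host-side reference for one row of the M x N output.
// a_row: one row of A (length K); b: B in K x N row-major layout;
// d0_row, d1_row, gamma, beta, out_row: length N; eps: LayerNorm epsilon.
void reference_row(const std::vector<float>& a_row,
                   const std::vector<float>& b,
                   const std::vector<float>& d0_row,
                   const std::vector<float>& d1_row,
                   const std::vector<float>& gamma,
                   const std::vector<float>& beta,
                   std::vector<float>& out_row,
                   int K, int N, float eps)
{
    std::vector<float> h(N);
    for(int n = 0; n < N; ++n)
    {
        // GEMM: Y = A x B
        float y = 0.f;
        for(int k = 0; k < K; ++k)
            y += a_row[k] * b[k * N + n];
        // Additions and activation: H = ReLU(Y + D0 + D1)
        h[n] = std::max(y + d0_row[n] + d1_row[n], 0.f);
    }
    // Naive two-pass LayerNorm over the feature axis
    float mean = 0.f;
    for(int n = 0; n < N; ++n)
        mean += h[n];
    mean /= N;
    float var = 0.f;
    for(int n = 0; n < N; ++n)
        var += (h[n] - mean) * (h[n] - mean);
    var /= N;
    const float inv_std = 1.f / std::sqrt(var + eps);
    for(int n = 0; n < N; ++n)
        out_row[n] = gamma[n] * (h[n] - mean) * inv_std + beta[n];
}

The fused device kernel produces the same result but never materializes Y, Z, or H in global memory.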

Algorithmic Background:

  • The GEMM result is kept in registers; the elementwise operations and layer normalization are fused into the kernel epilogue.
  • LayerNorm is applied over the last dimension (the feature axis).
  • This fusion avoids global-memory round trips for the intermediate result (illustrated below) and is common in transformer MLP blocks.
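
As a rough, hypothetical illustration of the savings: for an M \times N = 4096 \times 4096 half-precision output, the intermediate Y alone occupies 4096 \times 4096 \times 2 bytes = 32 MiB. An unfused pipeline writes this tensor to global memory and reads it back for each subsequent elementwise and normalization pass, while the fused kernel holds the per-tile results in registers and writes only the final normalized output.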

How to Run

Prerequisites

Follow the instructions in the main Build Guide section before building and running this example.

Build and Run

cd composable_kernel/client_example/03_gemm_layernorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (naive)
./gemm_add_add_layernorm_naive

# Example run (with ReLU and Welford)
./gemm_add_relu_add_layernorm_welford

Source Code Structure

Directory Layout

client_example/03_gemm_layernorm/
├── gemm_add_add_layernorm_naive.cpp         # GEMM + Add + Add + LayerNorm (naive)
├── gemm_add_relu_add_layernorm_welford.cpp  # GEMM + Add + ReLU + Add + LayerNorm (Welford)
└── CMakeLists.txt                           # Build configuration for the example

Key Functions

  • main() (in each .cpp):
    Sets up input matrices, configures GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
  • LayerNorm implementation:
    Demonstrates both the naive two-pass algorithm and the numerically stable single-pass Welford algorithm for computing the mean and variance (see the sketch after this list).
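
The sketch below illustrates Welford's online update; the WelfordState struct is hypothetical and is not taken from the example sources. Unlike the naive formulation \frac{1}{n}\sum x_i^2 - \left(\frac{1}{n}\sum x_i\right)^2, which can suffer catastrophic cancellation in low precision, Welford's algorithm accumulates the sum of squared deviations from a running mean in a single pass.

#include <cstddef>

// Hypothetical single-pass mean/variance accumulator (Welford's algorithm).
struct WelfordState
{
    std::size_t count = 0;
    float mean = 0.f;
    float m2   = 0.f; // running sum of squared deviations from the mean

    void update(float x)
    {
        ++count;
        const float delta = x - mean;
        mean += delta / count;    // update the running mean
        m2 += delta * (x - mean); // uses the *updated* mean
    }

    float variance() const { return count > 0 ? m2 / count : 0.f; }
};

On the GPU, each thread typically accumulates a partial state over its elements, and the partial states are then merged across the workgroup (Chan et al.'s parallel combination), which is the usual way Welford-style normalization is mapped onto a fused epilogue.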

This client example provides two variants that demonstrate different levels of epilogue fusion and different normalization algorithms for transformer-style MLP layers.