# Client Example: GEMM with LayerNorm Fusion

## Theory

This client example demonstrates **GEMM fused with layer normalization** and additional elementwise operations. This pattern is common in transformer feed-forward networks and other architectures where a linear transformation is followed by normalization and an activation.

**Mathematical Formulation:**

- GEMM: $Y = A \times B$
- Additions: $Z = Y + D_0 + D_1$ (bias, residual, etc.)
- Activation: $H = \text{ReLU}(Z)$ (or another activation)
- LayerNorm: $\text{LayerNorm}(H) = \gamma \cdot \frac{H - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

$\mu$ and $\sigma^2$ are the mean and variance over the normalization axis; $\gamma$ and $\beta$ are learnable scale and shift parameters.

**Algorithmic Background:**

- The GEMM result is kept in registers; the elementwise operations and layer normalization are fused into the epilogue.
- LayerNorm is typically applied over the last dimension (features).
- This fusion reduces memory traffic and is common in transformer MLP blocks.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/03_gemm_layernorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (naive)
./gemm_add_add_layernorm_naive

# Example run (with ReLU and Welford)
./gemm_add_relu_add_layernorm_welford
```

## Source Code Structure

### Directory Layout

```
client_example/03_gemm_layernorm/
├── gemm_add_add_layernorm_naive.cpp         # GEMM + Add + Add + LayerNorm (naive)
├── gemm_add_relu_add_layernorm_welford.cpp  # GEMM + Add + ReLU + Add + LayerNorm (Welford)
└── CMakeLists.txt                           # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`): Sets up the input matrices, configures the GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
- **LayerNorm implementation**: Demonstrates both the naive and the numerically stable (Welford) algorithms for computing mean and variance; see the reference sketches below.

This client example provides variants to demonstrate different levels of fusion and normalization for transformer-style MLP layers.
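To make the fused computation concrete, here is a host-side reference sketch of the full `GEMM + Add + ReLU + Add + LayerNorm` chain using the naive two-pass mean/variance. This is illustrative verification code, not the CK device implementation; the function name, row-major layout, pre-broadcast `D0`/`D1` tensors, and the `epsilon` default are all assumptions for the sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: computes LayerNorm(ReLU(A*B + D0 + D1)) row by row,
// normalizing over the last dimension (features), matching the formulas in
// the Theory section above.
void gemm_add_relu_add_layernorm_ref(
    const std::vector<float>& A,     // M x K
    const std::vector<float>& B,     // K x N
    const std::vector<float>& D0,    // M x N (e.g. bias, broadcast ahead of time)
    const std::vector<float>& D1,    // M x N (e.g. residual)
    const std::vector<float>& gamma, // N, learnable scale
    const std::vector<float>& beta,  // N, learnable shift
    std::vector<float>& out,         // M x N
    std::size_t M, std::size_t N, std::size_t K,
    float epsilon = 1e-5f)
{
    std::vector<float> row(N);
    for (std::size_t m = 0; m < M; ++m)
    {
        // GEMM + elementwise adds + ReLU for one output row.
        for (std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            const float z = acc + D0[m * N + n] + D1[m * N + n];
            row[n] = z > 0.f ? z : 0.f; // ReLU
        }
        // Naive two-pass mean/variance over the feature dimension.
        float mean = 0.f;
        for (std::size_t n = 0; n < N; ++n) mean += row[n];
        mean /= static_cast<float>(N);
        float var = 0.f;
        for (std::size_t n = 0; n < N; ++n)
        {
            const float d = row[n] - mean;
            var += d * d;
        }
        var /= static_cast<float>(N);
        // Normalize, then apply learnable scale and shift.
        const float inv_std = 1.f / std::sqrt(var + epsilon);
        for (std::size_t n = 0; n < N; ++n)
            out[m * N + n] = gamma[n] * (row[n] - mean) * inv_std + beta[n];
    }
}
```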
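The Welford variant replaces the two-pass statistics with a numerically stable single-pass update. Below is a minimal sketch of the standard Welford algorithm; the actual kernel uses CK's device-side reduction, so the struct and function names here are illustrative only.

```cpp
#include <cstddef>

// Running statistics for Welford's single-pass mean/variance.
struct WelfordState
{
    std::size_t count = 0;
    float mean = 0.f; // running mean
    float m2   = 0.f; // running sum of squared deviations from the mean
};

// Incorporate one new value into the running statistics.
inline void welford_update(WelfordState& s, float x)
{
    ++s.count;
    const float delta = x - s.mean;
    s.mean += delta / static_cast<float>(s.count);
    s.m2   += delta * (x - s.mean); // note: uses the *updated* mean
}

// Population variance (divide by N, matching the LayerNorm formula).
inline float welford_variance(const WelfordState& s)
{
    return s.count > 0 ? s.m2 / static_cast<float>(s.count) : 0.f;
}
```

Unlike the naive formulation $\sigma^2 = E[x^2] - E[x]^2$, which can suffer catastrophic cancellation when the mean is large relative to the variance, Welford's update accumulates squared deviations directly and stays accurate in a single pass over the data.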