# GEMM with Bias and ReLU Activation Fusion

## Theory

This example demonstrates **GEMM fused with bias addition and ReLU activation**. This is the core pattern for fully connected (dense) neural network layers and the feed-forward blocks in transformers.

**Mathematical Formulation:**

$$
E = \text{ReLU}(A \times B + \text{bias})
$$

- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $\text{bias}$: [N] bias vector (broadcast along the M dimension)
- $E$: [M, N] output matrix

**Algorithmic Background:**

- The GEMM result is kept in registers; the bias is added and ReLU is applied there, before the result is written to global memory.
- This fusion eliminates the global-memory round trip for the intermediate (pre-activation) GEMM output and is a standard optimization in deep learning frameworks.

A host-side reference for this computation is sketched at the end of this document.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) before building and running this example.

### Build and run

```bash
cd composable_kernel/example/03_gemm_bias_relu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_bias_relu_xdl -M 2048 -N 8192 -K 2048 --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/03_gemm_bias_relu/
└── gemm_bias_relu_xdl.cpp            # Main example: sets up, runs, and verifies GEMM+Bias+ReLU

include/ck/tensor_operation/gpu/device/
└── device_gemm_multiple_d.hpp        # Device-level API for multi-tensor GEMM

include/ck/tensor_operation/gpu/device/impl/
├── device_gemm_xdl_cshuffle_v3.hpp   # XDL GEMM with C-Shuffle epilogue
└── device_gemm_bias_relu_impl.hpp    # Specialized bias+ReLU implementation

include/ck/tensor_operation/gpu/grid/
└── gridwise_gemm_xdl_cshuffle.hpp    # Grid-level GEMM with fused epilogue

include/ck/tensor_operation/gpu/element/
└── element_wise_operation.hpp        # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceGemmMultipleD** (in `device_gemm_multiple_d.hpp`): Device API for GEMM with auxiliary input tensors and fused epilogues.
- **gridwise_gemm_xdl_cshuffle** (in `gridwise_gemm_xdl_cshuffle.hpp`): Implements the tiled/blocked GEMM kernel with the fused epilogue.
- **element_wise_operation** (in `element_wise_operation.hpp`): Defines the bias-addition and ReLU activation functors.

This example demonstrates the standard epilogue fusion concept that enables efficient neural network layers in modern deep learning.
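To make the fused computation concrete, here is a minimal host-side reference for $E = \text{ReLU}(A \times B + \text{bias})$, the kind of baseline a verification pass (e.g. `--verify=1`) would compare against. This is a plain C++ sketch with row-major layouts assumed for illustration; it is not the CK API itself, which configures layouts via template parameters.

```cpp
// Minimal host-side reference for E = ReLU(A * B + bias).
// Row-major layouts assumed for illustration only.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<float> gemm_bias_relu_ref(const std::vector<float>& A,    // [M, K]
                                      const std::vector<float>& B,    // [K, N]
                                      const std::vector<float>& bias, // [N]
                                      std::size_t M, std::size_t N, std::size_t K)
{
    std::vector<float> E(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.f;
            for(std::size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];

            // Fused epilogue: bias add, then ReLU, applied before the store.
            E[m * N + n] = std::max(acc + bias[n], 0.f);
        }
    }
    return E;
}
```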
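On the device side, the epilogue is expressed as an elementwise functor in the style of those defined in `element_wise_operation.hpp`. The sketch below is a hypothetical illustration of that concept; the functor name and exact `operator()` signature are assumptions, so consult the header for the definitions this example actually instantiates.

```cpp
#include <hip/hip_runtime.h> // provides __host__/__device__ when compiled with hipcc

// Hypothetical sketch of a fused bias+ReLU elementwise functor.
struct AddRelu
{
    // e: output element, c: GEMM accumulator element,
    // d: bias element (broadcast along M by the surrounding kernel).
    template <typename E, typename C, typename D>
    __host__ __device__ void operator()(E& e, const C& c, const D& d) const
    {
        const C x = c + static_cast<C>(d);       // bias add on the accumulator
        e = x > C{0} ? static_cast<E>(x) : E{0}; // ReLU before the global-memory store
    }
};
```

Such a functor is supplied to the device instance as its CDE elementwise operation, which is how the bias add and activation end up fused into the GEMM kernel's store path rather than running as a separate kernel.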