# Client Example: GEMM with bf16A/int8B and Fused Epilogues

## Theory

This client example demonstrates **GEMM with mixed-precision input types (bf16 for A, int8 for B)** and various fused epilogue operations (bias, GELU, FastGELU, multiply). Mixed-precision GEMM is widely used for efficient inference and training in deep learning, especially in transformer and MLP layers.

**Mathematical Formulation:**

- GEMM: $Y = A \times B$
  - $A$: bf16 (brain floating point)
  - $B$: int8 (8-bit integer)
- Fused epilogues:
  - Bias: $Z = Y + \text{bias}$ (a bias vector broadcast over the rows of $Y$)
  - GELU: $E = \text{GELU}(Z)$
  - FastGELU: $E = \text{FastGELU}(Z)$, a faster approximation of GELU
  - Multiply: $E = Z \odot D_1$, where $D_1$ is an auxiliary input matrix applied elementwise

**Algorithmic Background:**

- Mixed-precision computation reduces memory and compute requirements: storing $B$ in int8 halves its footprint and bandwidth cost relative to bf16.
- Fused epilogues improve efficiency by applying bias, activation, and scaling in the same kernel as the GEMM, avoiding extra kernel launches and round trips to global memory for the intermediate result.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) before building and running this example.

```bash
cd composable_kernel/build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="bf16;int8" ..
make -j
make install
```

### Build and run

```bash
cd composable_kernel/client_example/30_gemm_bf16Aint8B
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic GEMM)
./gemm_xdl_bf16_i8

# Example run (GEMM + bias)
./gemm_bias_xdl_bf16_i8

# Example run (GEMM + bias + GELU)
./gemm_xdl_gelu_bf16_i8

# Example run (GEMM + bias + FastGELU)
./gemm_bias_fastgelu_xdl_bf16_i8

# Example run (GEMM + multiply)
./gemm_xdl_multiply_bf16_i8
```

## Source Code Structure

### Directory Layout

```
client_example/30_gemm_bf16Aint8B/
├── gemm_xdl_bf16_i8.cpp                # GEMM (bf16A, int8B)
├── gemm_bias_xdl_bf16_i8.cpp           # GEMM + bias
├── gemm_xdl_gelu_bf16_i8.cpp           # GEMM + bias + GELU
├── gemm_bias_fastgelu_xdl_bf16_i8.cpp  # GEMM + bias + FastGELU
├── gemm_xdl_multiply_bf16_i8.cpp       # GEMM + multiply
└── CMakeLists.txt                      # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`): sets up the input matrices, configures the GEMM and epilogue parameters, launches the kernel, and verifies the result.
- **Fused kernel invocation**: uses the Composable Kernel device API to launch GEMM with the various fused epilogues; a hedged sketch of this pattern is given in the appendix at the end of this page.

---

## Additional Details

- Supports bf16 and int8 input types for efficient mixed-precision computation.
- Example parameters such as the matrix dimensions can be adjusted in the source for different workloads.

---

## Related Examples

- [14_gemm_quantization](../../example/14_gemm_quantization/README.md): GEMM quantization in the main example directory
- [46_gemm_add_multiply](../../example/46_gemm_add_multiply/README.md): GEMM with add and multiply in the main example directory

---
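## Appendix: Invocation Sketch

The sketch below illustrates the client-side pattern these `.cpp` files follow: declare a device-operation interface, ask the instance factory for the pre-built kernels that match it, and run the first one whose argument check passes. This is a minimal sketch under assumptions, not code copied from this example: it mirrors CK's plain-GEMM client pattern, and the include paths, `DeviceGemm` template parameters, layouts, and problem sizes are illustrative. The fused variants in this directory may use a different (e.g. multiple-D) interface; consult `gemm_xdl_bf16_i8.cpp` for the authoritative version.

```cpp
// Hedged sketch, not a copy of this example's sources: it follows the generic
// pattern of CK's plain-GEMM client example. The bf16A/int8B files in this
// directory may use a different device interface for their fused epilogues,
// so treat the headers and types below as assumptions.
#include <cstdint>
#include <cstdio>

#include <hip/hip_runtime.h>

#include "ck/ck.hpp"
#include "ck/stream_config.hpp"
#include "ck/tensor_operation/gpu/device/device_gemm.hpp"
#include "ck/tensor_operation/gpu/device/tensor_layout.hpp"
#include "ck/tensor_operation/gpu/element/element_wise_operation.hpp"
#include "ck/library/tensor_operation_instance/gpu/gemm.hpp"

using ADataType = ck::bhalf_t; // bf16 activations
using BDataType = std::int8_t; // int8 weights
using CDataType = ck::bhalf_t; // bf16 output

using RowMajor    = ck::tensor_layout::gemm::RowMajor;
using ColumnMajor = ck::tensor_layout::gemm::ColumnMajor;
using PassThrough = ck::tensor_operation::element_wise::PassThrough;

// Plain GEMM interface; the bias/GELU/multiply examples fuse their epilogue
// through a non-trivial output element-wise operation instead of PassThrough.
using DeviceOp = ck::tensor_operation::device::
    DeviceGemm<RowMajor, ColumnMajor, RowMajor,
               ADataType, BDataType, CDataType,
               PassThrough, PassThrough, PassThrough>;

int main()
{
    const ck::index_t M = 3840, N = 4096, K = 4096; // illustrative sizes
    const ck::index_t StrideA = K, StrideB = K, StrideC = N;

    // Device buffers; data initialization and error checking omitted.
    void *a_dev, *b_dev, *c_dev;
    hipMalloc(&a_dev, sizeof(ADataType) * M * K);
    hipMalloc(&b_dev, sizeof(BDataType) * K * N);
    hipMalloc(&c_dev, sizeof(CDataType) * M * N);

    // Ask the instance factory for all pre-built kernels matching DeviceOp.
    // Whether any bf16A/int8B instance is available depends on the DTYPES
    // the library was built with (see the cmake line above).
    const auto op_ptrs = ck::tensor_operation::device::instance::
        DeviceOperationInstanceFactory<DeviceOp>::GetInstances();

    for(const auto& op_ptr : op_ptrs)
    {
        auto argument_ptr = op_ptr->MakeArgumentPointer(
            a_dev, b_dev, c_dev, M, N, K, StrideA, StrideB, StrideC,
            PassThrough{}, PassThrough{}, PassThrough{});
        auto invoker_ptr = op_ptr->MakeInvokerPointer();

        // Run the first instance that supports this problem shape.
        if(op_ptr->IsSupportedArgument(argument_ptr.get()))
        {
            // StreamConfig{nullptr, true}: null stream, time the kernel.
            float ms = invoker_ptr->Run(argument_ptr.get(), StreamConfig{nullptr, true});
            std::printf("%s: %.3f ms\n", op_ptr->GetTypeString().c_str(), ms);
            break;
        }
    }

    hipFree(a_dev);
    hipFree(b_dev);
    hipFree(c_dev);
    return 0;
}
```

Looping over factory instances and filtering with `IsSupportedArgument` is the standard CK client pattern: each instance is a differently tiled and pipelined kernel, and the check rejects those that cannot handle the given shapes and strides.

---

[Back to Client Examples](../README.md)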