Client Example: GEMM with bf16A/int8B and Fused Epilogues

Theory

This client example demonstrates GEMM with mixed-precision input types (bf16 for A, int8 for B) and various fused epilogue operations (bias, GELU, FastGELU, multiply). Mixed-precision GEMM is used for efficient inference and training in deep learning, especially for transformer and MLP layers.

Mathematical Formulation:

  • GEMM: Y = A × B
    • A: bf16 (brain floating point)
    • B: int8 (8-bit integer)
  • Fused epilogues:
    • Bias: Z = Y + bias
    • GELU: E = GELU(Z)
    • FastGELU: E = FastGELU(Z)
    • Multiply: E = Z ⊙ D1 (elementwise)

Algorithmic Background:

  • Mixed-precision computation reduces memory and compute requirements.
  • Fused epilogues improve efficiency by combining bias, activation, and scaling in a single kernel.

How to Run

Prerequisites

Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.

cd composable_kernel/build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="bf16;int8" ..
make -j
make install

Build and run

cd composable_kernel/client_example/30_gemm_bf16Aint8B
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic GEMM)
./gemm_xdl_bf16_i8

# Example run (GEMM + bias)
./gemm_bias_xdl_bf16_i8

# Example run (GEMM + bias + GELU)
./gemm_xdl_gelu_bf16_i8

# Example run (GEMM + bias + FastGELU)
./gemm_bias_fastgelu_xdl_bf16_i8

# Example run (GEMM + multiply)
./gemm_xdl_multiply_bf16_i8

Source Code Structure

Directory Layout

client_example/30_gemm_bf16Aint8B/
├── gemm_xdl_bf16_i8.cpp                # GEMM (bf16A, int8B)
├── gemm_bias_xdl_bf16_i8.cpp           # GEMM + bias
├── gemm_xdl_gelu_bf16_i8.cpp           # GEMM + bias + GELU
├── gemm_bias_fastgelu_xdl_bf16_i8.cpp  # GEMM + bias + FastGELU
├── gemm_xdl_multiply_bf16_i8.cpp       # GEMM + multiply
└── CMakeLists.txt                      # Build configuration for the example

Key Functions

  • main() (in each .cpp):
    Sets up input matrices, configures GEMM and epilogue parameters, launches the kernel, and verifies the result.
  • Fused kernel invocation:
    Uses the Composable Kernel device API to launch GEMM with various fused epilogues.

Additional Details

  • Supports bf16 and int8 input types for efficient mixed-precision computation.
  • Example parameters can be adjusted in the source for different workloads.
