Client Example: GEMM with bf16A/int8B and Fused Epilogues

Theory

This client example demonstrates GEMM with mixed-precision input types (bf16 for A, int8 for B) and various fused epilogue operations (bias, GELU, FastGELU, multiply). Mixed-precision GEMM is used for efficient inference and training in deep learning, especially for transformer and MLP layers.

Mathematical Formulation:

  • GEMM: Y = A × B
    • A: bf16 (brain floating point)
    • B: int8 (8-bit integer)
  • Fused epilogues:
    • Bias: Z = Y + bias
    • GELU: E = GELU(Z)
    • FastGELU: E = FastGELU(Z)
    • Multiply: E = Z ⊙ D1 (elementwise)

Algorithmic Background:

  • Mixed-precision computation reduces memory and compute requirements.
  • Fused epilogues improve efficiency by combining bias, activation, and scaling in a single kernel.

How to Run

Prerequisites

Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.

cd composable_kernel/build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="bf16;int8" ..
make -j
make install

Build and run

cd composable_kernel/client_example/30_gemm_bf16Aint8B
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic GEMM)
./gemm_xdl_bf16_i8

# Example run (GEMM + bias)
./gemm_bias_xdl_bf16_i8

# Example run (GEMM + bias + GELU)
./gemm_xdl_gelu_bf16_i8

# Example run (GEMM + bias + FastGELU)
./gemm_bias_fastgelu_xdl_bf16_i8

# Example run (GEMM + multiply)
./gemm_xdl_multiply_bf16_i8

Source Code Structure

Directory Layout

client_example/30_gemm_bf16Aint8B/
├── gemm_xdl_bf16_i8.cpp                # GEMM (bf16A, int8B)
├── gemm_bias_xdl_bf16_i8.cpp           # GEMM + bias
├── gemm_xdl_gelu_bf16_i8.cpp           # GEMM + bias + GELU
├── gemm_bias_fastgelu_xdl_bf16_i8.cpp  # GEMM + bias + FastGELU
├── gemm_xdl_multiply_bf16_i8.cpp       # GEMM + multiply
└── CMakeLists.txt                      # Build configuration for the example

Key Functions

  • main() (in each .cpp):
    Sets up input matrices, configures GEMM and epilogue parameters, launches the kernel, and verifies the result.
  • Fused kernel invocation:
    Uses the Composable Kernel device API to launch GEMM with various fused epilogues.

Additional Details

  • Supports bf16 and int8 input types for efficient mixed-precision computation.
  • Example parameters can be adjusted in the source for different workloads.
