Fused Attention Examples
This directory contains client examples demonstrating Composable Kernel's (CK's) high-performance fused attention implementations, a core building block of modern transformer architectures and large language models.
Theory
Fused Multi-Head Attention Operation: The fused attention mechanism performs the core transformer operation in a single, optimized kernel:
\text{Attention}(Q, K, V) = \text{Softmax}(Q K^T / \sqrt{d_k}) V
Detailed Mathematical Steps (a CPU reference sketch follows this list):
- Query-Key Attention Scores: S = Q K^T
- Scale: S_{\text{scaled}} = S / \sqrt{d_k}
- Softmax: A = \text{Softmax}(S_{\text{scaled}})
- Weighted Value Sum: \text{Output} = A V
- Multi-head extension: Each head computes attention independently, then results are concatenated and projected.
- Tensor shapes: Q, K, V, Output are typically [Batch, Seq_len, Num_heads, Head_dim].
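To make these steps concrete, below is a minimal CPU reference for a single attention head in FP32. It is an unfused illustration of the math above, not the CK device implementation; the function name `attention_ref` and the tiny sizes in `main` are placeholders. Subtracting the row maximum before the exponentials is a standard numerical-stability trick and does not change the result.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Reference (unfused) scaled dot-product attention for one head, row-major:
// Q: [seq_q x d], K: [seq_k x d], V: [seq_k x d], Out: [seq_q x d].
void attention_ref(const std::vector<float>& Q, const std::vector<float>& K,
                   const std::vector<float>& V, std::vector<float>& Out,
                   int seq_q, int seq_k, int d)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> s(seq_k);
    for(int i = 0; i < seq_q; ++i)
    {
        // S = Q K^T, then S_scaled = S / sqrt(d_k)
        float row_max = -INFINITY;
        for(int j = 0; j < seq_k; ++j)
        {
            float dot = 0.f;
            for(int k = 0; k < d; ++k)
                dot += Q[i * d + k] * K[j * d + k];
            s[j]    = dot * scale;
            row_max = std::max(row_max, s[j]);
        }
        // A = Softmax(S_scaled), stabilized by subtracting the row max
        float sum = 0.f;
        for(int j = 0; j < seq_k; ++j)
        {
            s[j] = std::exp(s[j] - row_max);
            sum += s[j];
        }
        // Output = A V
        for(int k = 0; k < d; ++k)
        {
            float o = 0.f;
            for(int j = 0; j < seq_k; ++j)
                o += s[j] * V[j * d + k];
            Out[i * d + k] = o / sum;
        }
    }
}

int main()
{
    const int seq_q = 4, seq_k = 4, d = 8;
    std::vector<float> Q(seq_q * d, 0.1f), K(seq_k * d, 0.2f), V(seq_k * d, 0.3f);
    std::vector<float> Out(seq_q * d);
    attention_ref(Q, K, V, Out, seq_q, seq_k, d);
    std::printf("Out[0] = %f\n", Out[0]);
    return 0;
}
```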
Algorithmic Background:
- Fused attention combines two GEMMs and a softmax in a single kernel, minimizing memory traffic (see the streaming-softmax sketch after this list).
- Supports bias, masking, and permutation for transformer and LLM workloads.
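The memory-traffic saving comes from never writing the score matrix S or the attention matrix A to global memory. One way to see how that is possible is a streaming ("online") softmax, sketched below for a single query row in plain C++. This is an illustration of the general technique under the assumption of row-major FP32 buffers; it is not CK's actual tiling or data layout.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Streaming ("online") softmax for one query row: each key/value pair is
// visited once, with a running max m and running sum l, so the full score
// row is never stored. Illustrative sketch only, not CK's tiling scheme.
void attention_row_streaming(const float* q, const float* K, const float* V,
                             float* out, int seq_k, int d)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float m = -INFINITY; // running max of the scores seen so far
    float l = 0.f;       // running sum of exp(score - m)
    std::vector<float> acc(d, 0.f);
    for(int j = 0; j < seq_k; ++j)
    {
        float s = 0.f;
        for(int k = 0; k < d; ++k)
            s += q[k] * K[j * d + k];
        s *= scale;
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new); // rescale earlier accumulators
        const float p     = std::exp(s - m_new);
        l = l * corr + p;
        for(int k = 0; k < d; ++k)
            acc[k] = acc[k] * corr + p * V[j * d + k];
        m = m_new;
    }
    for(int k = 0; k < d; ++k)
        out[k] = acc[k] / l;
}
```

Calling this once per query row gives the same result as the unfused reference above while only ever holding one score at a time.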
How to Run
Prerequisites
Follow the instructions in the main Build Guide before building and running these examples.
Build and run
cd composable_kernel/client_example/08_fused_attention
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
# Example run (basic fused attention)
./fused_attention
# Example run (fused attention with bias)
./fused_attention_bias
Source Code Structure
Directory Layout
client_example/08_fused_attention/
├── fused_attention.cpp # Main client example: fused attention (Q, K, V)
├── fused_attention_bias.cpp # Fused attention with bias
├── CMakeLists.txt # Build configuration for the example
Key Functions
- main() (in each .cpp): Sets up the Q, K, V tensors, configures attention parameters, launches the fused kernel, and verifies the result (a verification sketch follows this list).
- Fused attention kernel invocation: Uses the Composable Kernel device API to launch the fused attention operation, optionally with bias.
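As a sketch of the verification step, the helper below compares a device result (copied back to the host) against a CPU reference using a combined absolute/relative tolerance. The function name `check_output` and the tolerance values are illustrative assumptions, not the exact checks used in the shipped examples; tolerances should be loosened for FP16/BF16 accumulation.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative host-side check: compare GPU output against a CPU reference
// element by element with a combined absolute/relative tolerance.
bool check_output(const std::vector<float>& gpu, const std::vector<float>& ref,
                  float rtol = 1e-2f, float atol = 1e-2f)
{
    for(std::size_t i = 0; i < ref.size(); ++i)
    {
        const float err = std::fabs(gpu[i] - ref[i]);
        const float tol = atol + rtol * std::fabs(ref[i]);
        if(err > tol)
        {
            std::printf("mismatch at %zu: got %f, expected %f\n", i, gpu[i], ref[i]);
            return false;
        }
    }
    return true;
}
```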
Additional Details
- Supports FP16, BF16, FP32, and mixed precision.
- Handles causal and generic masking for autoregressive and variable-length models (causal masking is illustrated after this list).
- Optimized for memory efficiency (no intermediate attention matrix in global memory).
- Example parameters can be adjusted in the source for different transformer workloads.
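For reference, causal masking sets the score S[i][j] to -inf whenever key position j comes after query position i, so the softmax assigns those positions zero weight. The host-side sketch below shows the effect as a separate pass; the fused kernel applies the mask on the fly rather than over a stored score matrix. `apply_causal_mask` is an illustrative name, not part of the CK API.

```cpp
#include <limits>
#include <vector>

// Apply a causal mask to a [seq_q x seq_k] score matrix (row-major):
// key positions after the query position are set to -inf so that the
// subsequent softmax assigns them zero attention weight.
void apply_causal_mask(std::vector<float>& scores, int seq_q, int seq_k)
{
    for(int i = 0; i < seq_q; ++i)
        for(int j = i + 1; j < seq_k; ++j)
            scores[i * seq_k + j] = -std::numeric_limits<float>::infinity();
}
```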
Related Examples
- 01_gemm: GEMM for Q×K^T and Attn×V
- 06_softmax: Softmax client API usage
- 03_gemm_layernorm: Fused GEMM + layer normalization
- 07_grouped_convnd_fwd: Grouped convolution for vision transformers