Fused Attention Examples

This directory contains client examples demonstrating CK's high-performance fused attention implementations, which are critical for modern transformer architectures and large language models.


Theory

Fused Multi-Head Attention Operation: The fused attention mechanism performs the core transformer operation in a single, optimized kernel:


\text{Attention}(Q, K, V) = \text{Softmax}(Q K^T / \sqrt{d_k}) V

Detailed Mathematical Steps (see the reference sketch after this list):

  1. Query-Key Attention Scores: S = Q K^T
  2. Scale: S_{\text{scaled}} = S / \sqrt{d_k}
  3. Softmax: A = \text{Softmax}(S_{\text{scaled}})
  4. Weighted Value Sum: \text{Output} = A V
  • Multi-head extension: Each head computes attention independently, then results are concatenated and projected.
  • Tensor shapes: Q, K, V, Output are typically [Batch, Seq_len, Num_heads, Head_dim].
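For concreteness, the four steps map directly onto a host-side loop. Below is a minimal single-head CPU reference sketch (row-major float buffers); the function name attention_ref and its signature are illustrative, not part of the CK API.

#include <algorithm>
#include <cmath>
#include <vector>

// Minimal single-head reference: Q is [seq_q, d_k], K is [seq_k, d_k],
// V is [seq_k, d_v], out is [seq_q, d_v]. Row-major, float.
void attention_ref(const std::vector<float>& Q, const std::vector<float>& K,
                   const std::vector<float>& V, std::vector<float>& out,
                   int seq_q, int seq_k, int d_k, int d_v)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));
    std::vector<float> S(seq_k);
    for (int i = 0; i < seq_q; ++i) {
        // Steps 1 + 2: scaled scores S = Q K^T / sqrt(d_k)
        float row_max = -INFINITY;
        for (int j = 0; j < seq_k; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d_k; ++k)
                s += Q[i * d_k + k] * K[j * d_k + k];
            S[j] = s * scale;
            row_max = std::max(row_max, S[j]);
        }
        // Step 3: numerically stable softmax over the row
        float sum = 0.0f;
        for (int j = 0; j < seq_k; ++j) {
            S[j] = std::exp(S[j] - row_max);
            sum += S[j];
        }
        // Step 4: weighted sum of V rows
        for (int v = 0; v < d_v; ++v) {
            float acc = 0.0f;
            for (int j = 0; j < seq_k; ++j)
                acc += S[j] * V[j * d_v + v];
            out[i * d_v + v] = acc / sum;
        }
    }
}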

Algorithmic Background:

  • Fused attention combines two GEMMs and a softmax in a single kernel, minimizing memory traffic (see the streaming sketch after this list).
  • Supports bias, masking, and permutation for transformer and LLM workloads.
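The key to avoiding an intermediate attention matrix is an online (streaming) softmax: running softmax statistics are corrected on the fly as new scores arrive, so no seq_k-sized score buffer is ever written out. The CPU sketch below illustrates the idea for a single query row; the actual GPU kernel tiles and parallelizes this across the grid, and attention_row_fused is an illustrative name only.

#include <algorithm>
#include <cmath>
#include <vector>

// Streams over K/V once per query row, keeping only a running max (m),
// a running normalizer (l), and the unnormalized output accumulator.
// No seq_k-sized score buffer is ever stored.
void attention_row_fused(const float* q, const float* K, const float* V,
                         float* out, int seq_k, int d_k, int d_v)
{
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));
    float m = -INFINITY; // running row maximum
    float l = 0.0f;      // running softmax denominator
    std::vector<float> acc(d_v, 0.0f);
    for (int j = 0; j < seq_k; ++j) {
        float s = 0.0f;
        for (int k = 0; k < d_k; ++k)
            s += q[k] * K[j * d_k + k];
        s *= scale;
        // Rescale previous partial results when a new maximum appears.
        const float m_new      = std::max(m, s);
        const float correction = std::exp(m - m_new);
        const float p          = std::exp(s - m_new);
        l = l * correction + p;
        for (int v = 0; v < d_v; ++v)
            acc[v] = acc[v] * correction + p * V[j * d_v + v];
        m = m_new;
    }
    for (int v = 0; v < d_v; ++v)
        out[v] = acc[v] / l;
}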

How to Run

Prerequisites

Follow the instructions in the main Build Guide section before building and running this example.

Build and run

cd composable_kernel/client_example/08_fused_attention
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic fused attention)
./fused_attention

# Example run (fused attention with bias)
./fused_attention_bias

Source Code Structure

Directory Layout

client_example/08_fused_attention/
├── fused_attention.cpp         # Main client example: fused attention (Q, K, V)
├── fused_attention_bias.cpp    # Fused attention with bias
└── CMakeLists.txt              # Build configuration for the example

Key Functions

  • main() (in each .cpp):
    Sets up Q, K, V tensors, configures attention parameters, launches the fused kernel, and verifies the result against a host reference (see the tolerance-check sketch after this list).
  • Fused attention kernel invocation:
    Uses the Composable Kernel device API to launch the fused attention operation, optionally with bias.
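
A typical verification step recomputes the output with a host reference (such as the sketch in the Theory section) and compares element-wise under a tolerance suited to the chosen precision. A minimal sketch, assuming a combined absolute/relative tolerance; all_close is an illustrative helper, not a CK utility:

#include <cmath>
#include <cstdio>
#include <vector>

// Element-wise comparison with combined absolute/relative tolerance.
// FP16 accumulation typically needs looser bounds than FP32.
bool all_close(const std::vector<float>& gpu, const std::vector<float>& ref,
               float atol = 1e-2f, float rtol = 1e-2f)
{
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float err = std::fabs(gpu[i] - ref[i]);
        if (err > atol + rtol * std::fabs(ref[i])) {
            std::printf("mismatch at %zu: got %f, expected %f\n", i, gpu[i], ref[i]);
            return false;
        }
    }
    return true;
}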

Additional Details

  • Supports FP16, BF16, FP32, and mixed precision.
  • Handles causal and generic masking for autoregressive and variable-length models (see the masking sketch after this list).
  • Optimized for memory efficiency (no intermediate attention matrix in global memory).
  • Example parameters can be adjusted in the source for different transformer workloads.
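
For illustration, causal masking sets every score whose key position lies ahead of the query position to negative infinity before the softmax, so those positions receive exactly zero attention weight. A minimal sketch of that masking step, applied to a score row before the softmax of the reference above (apply_causal_mask is an illustrative name):

#include <cmath>
#include <vector>

// Causal mask for query row query_pos: keys at positions j > query_pos
// may not be attended. Setting the score to -inf makes exp(score) == 0
// after the softmax.
void apply_causal_mask(std::vector<float>& score_row, int query_pos)
{
    for (std::size_t j = static_cast<std::size_t>(query_pos) + 1; j < score_row.size(); ++j)
        score_row[j] = -INFINITY;
}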


Back to Client Examples