# Acronyms in Composable Kernel

The following acronyms are used in the Composable Kernel codebase:

| Acronym | Expansion | Explanation |
|---------|-----------|-------------|
| BF16 | Brain Floating Point 16 | 1 Sign bit, 8 Exponent bits, 7 Significand bits (see bit-layout sketch below the table) |
| BF8 | 8-bit Brain Floating Point | 1 Sign bit, 5 Exponent bits, 2 Significand bits (E5M2) |
| DLA | Deep Learning Accelerator | Specialized hardware for deep learning workloads |
| DRAM | Dynamic Random-Access Memory | Main memory; global memory on the GPU |
| E2E | End-to-End | Complete pipeline or process from input to output |
| ELU | Exponential Linear Unit | Activation function: $x$ if $x>0$, else $\alpha(e^x-1)$ (see activation sketch below) |
| FMHA | Fused Multi-Head Attention | Efficient transformer attention kernel, fusing softmax, masking, and matmul |
| FP16 | Half-Precision Floating Point | 16-bit IEEE floating point format |
| FP32 | Single-Precision Floating Point | 32-bit IEEE floating point format |
| FP64 | Double-Precision Floating Point | 64-bit IEEE floating point format |
| FP8 | 8-bit Floating Point | 8-bit floating point format for inference: 1 Sign bit, 4 Exponent bits, 3 Significand bits (E4M3) |
| GEMM | General Matrix Multiply | Matrix multiplication operation: $C = A \times B$ |
| GELU | Gaussian Error Linear Unit | Activation function: $x \cdot \Phi(x)$, with $\Phi$ the standard normal CDF (see activation sketch below) |
| GQA | Grouped Query Attention | Variant of multi-head attention with grouped queries/keys/values |
| HBM | High Bandwidth Memory | Fast memory used in modern GPUs |
| HIP | Heterogeneous-Compute Interface for Portability | AMD's CUDA-like GPU programming API |
| INT8 | 8-bit Integer | Quantized integer format for inference |
| KVS | Key-Value Store | Data structure for storing key-value pairs (context: the KV cache used with QKV in transformers) |
| L2/L1 | Level 2/Level 1 Cache | On-chip memory hierarchy in CPUs/GPUs |
| LDS | Local Data Share | Shared memory on AMD GPUs (equivalent to CUDA's shared memory) |
| LLM | Large Language Model | Transformer-based model for NLP tasks |
| LSE | Log-Sum-Exp | Numerically stable softmax building block: $\log\left(\sum_i e^{x_i}\right)$ (see softmax sketch below) |
| MFMA | Matrix Fused Multiply-Add | AMD GPU hardware instruction for matrix-matrix multiplication |
| MHA | Multi-Head Attention | Attention mechanism with multiple heads in transformers |
| MoE | Mixture of Experts | Neural network architecture with multiple expert subnetworks |
| MQA | Multi-Query Attention | Variant of multi-head attention with shared keys/values across heads |
| NCHW | Batch, Channel, Height, Width | Tensor layout: batch-major, channels-first (see layout sketch below) |
| NHWC | Batch, Height, Width, Channel | Tensor layout: batch-major, channels-last (see layout sketch below) |
| OOM | Out Of Memory | Error raised when a memory allocation fails |
| QAT | Quantization-Aware Training | Training technique for quantized inference |
| QKV | Query, Key, Value | Components of the transformer attention mechanism |
| RCCL | ROCm Collective Communications Library | AMD library for multi-GPU communication |
| RDMA | Remote Direct Memory Access | High-speed network memory access |
| RDQuant | Rowwise Dynamic Quantization | Quantization technique with per-row scaling for INT8 inference |
| ReLU | Rectified Linear Unit | Activation function: $\max(0, x)$ (see activation sketch below) |
| ROCm | Radeon Open Compute | AMD's open GPU computing stack |
| SGD | Stochastic Gradient Descent | Optimization algorithm for training neural networks |
| SM | Streaming Multiprocessor | GPU compute unit (NVIDIA terminology; the AMD analog is the Compute Unit) |
| SWA | Sliding Window Attention | Attention mechanism with a limited window for each token |
| TLB | Translation Lookaside Buffer | Memory-management unit cache for virtual-to-physical address translation |
| VGPR | Vector General-Purpose Register | GPU register for vector operations |
| Warp | Group of threads | Smallest scheduling unit on NVIDIA GPUs (32 threads); the AMD analog is the wavefront |
| WMMA | Wave Matrix Multiply-Accumulate | Matrix-multiply hardware instruction on AMD RDNA GPUs; NVIDIA's `wmma` API is the analogous concept |
| XLA | Accelerated Linear Algebra | Google's compiler for optimizing ML computations |
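
The bit layouts in the BF16/BF8/FP8 rows are easiest to see in code. Below is a minimal illustrative sketch (the function names are invented for this example, and production conversions round rather than truncate): BF16 is exactly the top 16 bits of an FP32 value, because both formats share a sign bit and an 8-bit exponent.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Illustrative only: truncating FP32 -> BF16 conversion (round-toward-zero).
// BF16 = 1 sign bit + 8 exponent bits + 7 significand bits, i.e. exactly the
// top 16 bits of an FP32 value (1 + 8 + 23 bits).
uint16_t float_to_bf16_truncate(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // type-pun via memcpy (no UB)
    return static_cast<uint16_t>(bits >> 16);
}

float bf16_to_float(uint16_t h)
{
    uint32_t bits = static_cast<uint32_t>(h) << 16; // restore as high bits
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main()
{
    float x = 3.14159f;
    float r = bf16_to_float(float_to_bf16_truncate(x));
    std::printf("%f -> %f after BF16 round trip\n", x, r); // ~3.140625
    return 0;
}
```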
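
The activation formulas from the table (ReLU, ELU, GELU), written out as plain scalar C++ for reference. This is only a sketch of the math, not the vectorized device code the library ships:

```cpp
#include <cmath>
#include <cstdio>

// ReLU: max(0, x)
float relu(float x) { return x > 0.0f ? x : 0.0f; }

// ELU: x if x > 0, else alpha * (exp(x) - 1)
float elu(float x, float alpha = 1.0f)
{
    return x > 0.0f ? x : alpha * (std::exp(x) - 1.0f);
}

// GELU (exact form): x * Phi(x), where Phi is the standard normal CDF,
// Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
float gelu(float x)
{
    return x * 0.5f * (1.0f + std::erf(x * 0.70710678f)); // 1/sqrt(2)
}

int main()
{
    for (float x : {-2.0f, -0.5f, 0.0f, 1.0f})
        std::printf("x=%5.2f  relu=%7.4f  elu=%7.4f  gelu=%7.4f\n",
                    x, relu(x), elu(x), gelu(x));
    return 0;
}
```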
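
LSE matters because a naive softmax overflows for large inputs. The standard trick, sketched below as simple host code, subtracts the row maximum first: for any constant $m$, $\log\sum_i e^{x_i} = m + \log\sum_i e^{x_i - m}$, and choosing $m = \max_i x_i$ keeps every exponent non-positive. FMHA-style kernels rely on the same identity; the function names here are illustrative only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable log-sum-exp: log(sum_i exp(x_i)).
// Subtracting the max keeps every exp() argument <= 0, avoiding overflow.
float log_sum_exp(const std::vector<float>& x)
{
    float m = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    for (float v : x)
        sum += std::exp(v - m);
    return m + std::log(sum);
}

// Stable softmax expressed through LSE: softmax(x)_i = exp(x_i - LSE(x)).
std::vector<float> softmax(const std::vector<float>& x)
{
    float lse = log_sum_exp(x);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = std::exp(x[i] - lse);
    return y;
}

int main()
{
    // Values this large would overflow a naive exp()-then-normalize softmax.
    std::vector<float> x = {1000.0f, 1001.0f, 1002.0f};
    for (float p : softmax(x))
        std::printf("%f ", p); // ~0.0900 0.2447 0.6652
    std::printf("\n");
    return 0;
}
```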
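
The NCHW/NHWC rows describe how a 4-D tensor is linearized in memory; the only difference is which dimension varies fastest. A small sketch of the two offset computations, assuming a dense, unpadded tensor (the function names are hypothetical):

```cpp
#include <cstddef>
#include <cstdio>

// Offset of element (n, c, h, w) in a dense NCHW tensor:
// w varies fastest, then h, then c ("channels-first").
size_t offset_nchw(size_t n, size_t c, size_t h, size_t w,
                   size_t C, size_t H, size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}

// Offset of the same element in a dense NHWC tensor:
// c varies fastest ("channels-last").
size_t offset_nhwc(size_t n, size_t c, size_t h, size_t w,
                   size_t C, size_t H, size_t W)
{
    return ((n * H + h) * W + w) * C + c;
}

int main()
{
    // Element (n=0, c=1, h=2, w=3) of a dense 1x4x8x8 tensor:
    std::printf("NCHW offset: %zu\n", offset_nchw(0, 1, 2, 3, 4, 8, 8)); // 83
    std::printf("NHWC offset: %zu\n", offset_nhwc(0, 1, 2, 3, 4, 8, 8)); // 77
    return 0;
}
```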
### Common Variable Acronyms in Code

| Symbol | Meaning | Context |
|--------|---------|---------|
| M, N, K | Matrix dimensions | GEMM: $A[M,K] \times B[K,N] = C[M,N]$ (see sketch below) |
| Q, K, V | Query, Key, Value | Transformer attention |
| S | Sequence length | NLP, transformers |
| D | Dimension | Hidden size, feature dimension |
| B | Batch size | ML batch processing |
| H | Head count | Multi-head attention |
| C | Channel | CNNs, tensor layouts |
| T | Token | NLP, sequence models |
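
To tie the symbols together, here is a plain triple-loop GEMM using exactly the M, N, K naming from the table. This is a readability sketch (row-major, no tiling or parallelism), not the optimized kernels the library generates:

```cpp
#include <cstdio>
#include <vector>

// C[M,N] = A[M,K] * B[K,N], all row-major and densely packed.
void gemm_naive(const std::vector<float>& A, // M*K elements
                const std::vector<float>& B, // K*N elements
                std::vector<float>& C,       // M*N elements
                int M, int N, int K)
{
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

int main()
{
    // 2x2 example: A = [[1,2],[3,4]], B = identity  =>  C == A
    int M = 2, N = 2, K = 2;
    std::vector<float> A = {1, 2, 3, 4};
    std::vector<float> B = {1, 0, 0, 1};
    std::vector<float> C(M * N);
    gemm_naive(A, B, C, M, N, K);
    std::printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]); // 1 2 / 3 4
    return 0;
}
```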
---
If you find an acronym not listed here, please submit a pull request or issue!