Files
composable_kernel/example/22_cgemm/README.md
2025-10-16 10:13:27 +00:00

84 lines
5.3 KiB
Markdown

# Complex General Matrix-Matrix Multiplication (CGEMM)
This example demonstrates a General Matrix-Matrix Multiplication for complex-valued tensors (CGEMM). This operation is a fundamental building block in many scientific and engineering domains, including signal processing, quantum computing, and electromagnetics, where computations are naturally expressed using complex numbers.
## Mathematical Formulation
A complex number `z` can be represented as `z = a + bi`, where `a` is the real part and `b` is the imaginary part. The multiplication of two complex numbers `z1 = a + bi` and `z2 = c + di` is:
$z_1 \cdot z_2 = (a+bi)(c+di) = (ac - bd) + (ad + bc)i$
A CGEMM operation, $D = \alpha \cdot (A \times B) + \beta \cdot C$, involves matrices where each element is a complex number. The core matrix multiplication $A \times B$ is defined as:
$C_{ik} = \sum_j A_{ij} \cdot B_{jk}$
Where each multiplication and addition is a complex operation. This can be broken down into four real-valued GEMM operations:
Let $A = A_r + iA_i$ and $B = B_r + iB_i$. Then the product $C = A \times B$ is:
$C = (A_r + iA_i) \times (B_r + iB_i) = (A_r B_r - A_i B_i) + i(A_r B_i + A_i B_r)$
This shows that one CGEMM can be decomposed into four real GEMMs and two real matrix additions/subtractions.
## Algorithmic Strategy: Fused Complex Arithmetic
A naive implementation would launch six separate real-valued kernels (4 GEMMs, 2 additions). A much more efficient approach, and the one used by Composable Kernel, is to implement CGEMM in a single, fused kernel.
1. **Data Layout**: Complex numbers are typically stored in an interleaved format, where the real and imaginary parts of an element are adjacent in memory (e.g., `[r1, i1, r2, i2, ...]`). The kernel is designed to work efficiently with this layout.
2. **Tiled CGEMM**: The kernel uses a standard tiled GEMM algorithm, but the fundamental operations are adapted for complex numbers.
- **Loading**: A thread block loads tiles of the complex-valued matrices A and B from global memory into shared memory.
- **Complex Multiply-Accumulate**: The core of the algorithm is the multiply-accumulate (MAC) operation. Instead of a single `fma` instruction, each complex MAC involves multiple real-valued `fma` instructions to compute the real and imaginary parts of the product, as shown in the mathematical formulation.
- `real_part = (a_r * b_r) - (a_i * b_i)`
- `imag_part = (a_r * b_i) + (a_i * b_r)`
- These operations are carefully scheduled to maximize instruction-level parallelism and hide latency. The accumulators for both the real and imaginary parts are held in private registers.
3. **Storing**: After the tile is fully computed, the complex-valued result is written from registers back to the output matrix D in global memory.
By fusing the complex arithmetic directly into the GEMM kernel, we avoid launching multiple kernels and storing large intermediate real-valued matrices, which dramatically reduces kernel launch overhead and memory bandwidth requirements.
## Source Code Organization
- [`cgemm_xdl.cpp`](./cgemm_xdl.cpp): The main example file. It defines complex-valued input matrices and instantiates the `DeviceGemm` operation, specialized for complex data types.
- The standard `DeviceGemm` interface from [`../../include/ck/tensor_operation/gpu/device/device_gemm.hpp`](../../include/ck/tensor_operation/gpu/device/device_gemm.hpp) is used. Composable Kernel overloads this interface for complex types (`ck::complex<T>`).
- The grid-wise GEMM kernel is specialized to handle complex types. When the template arguments for data types are `ck::complex`, the compiler instantiates a version of the kernel where the MAC operations are replaced with the sequence of real-valued operations required for complex multiplication.
## Build and Run
### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example
```bash
cd /path/to/composable_kernel/example/22_cgemm
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
```
### Run the Example
```bash
# Run the example with default settings
./cgemm_xdl
# Run with verification, data initialization, and timing
./cgemm_xdl 1 2 1
```
## Applications
CGEMM is a critical kernel in many high-performance computing applications:
- **Digital Signal Processing (DSP)**: The Fast Fourier Transform (FFT), a cornerstone of DSP, can be implemented using complex matrix multiplications. Filtering and convolution in the frequency domain also rely on complex arithmetic.
- **Quantum Computing Simulation**: The state of a quantum system is described by a vector of complex numbers, and quantum gates are represented by unitary matrices (a special type of complex matrix). Simulating a quantum circuit involves a sequence of CGEMM operations.
- **Electromagnetics and Wave Physics**: Simulating the propagation of electromagnetic or acoustic waves often involves solving systems of equations with complex numbers to represent the phase and amplitude of the waves.
- **Communications**: Modern communication systems (like 5G and Wi-Fi) use complex modulation schemes (like QAM) where signals are represented by complex numbers.