* GH-2368 Adding a basic glossary GH-2368 Minor edits GH-2368 Adding missing READMEs and standardization. resolving readme updates GH-2368 Minor improvements to documentation. Improving some readmes. Further improvement for readmes. Cleaned up the documentation in 'client_example' (#2468) Update for PR Update ACRONYMS.md to remove trivial terms Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine. revise 37_transpose readme revise 36_copy readme Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity. Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity. Remove references to the Tile Engine in README files across multiple examples * GH-2368 Adding a basic glossary GH-2368 Minor edits GH-2368 Adding missing READMEs and standardization. resolving readme updates GH-2368 Minor improvements to documentation. Improving some readmes. Further improvement for readmes. Cleaned up the documentation in 'client_example' (#2468) Update for PR Update ACRONYMS.md to remove trivial terms Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine. revise 37_transpose readme revise 36_copy readme Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity. Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity. Remove references to the Tile Engine in README files across multiple examples Refine README files by removing outdated references to the Tile Engine * Updates based on PR feedback 1 * Updates based on PR feedback 2 * Updates based on PR feedback 3 * Updates based on PR feedback 4 * Updates based on PR feedback 5 * Updates based on PR feedback 6 * Updates based on PR feedback 7 * Updates based on PR feedback 8 * Content Modification of CK Tile Example * Modify the ck_tile gemm config --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>
5.3 KiB
Complex General Matrix-Matrix Multiplication (CGEMM)
This example demonstrates a General Matrix-Matrix Multiplication for complex-valued tensors (CGEMM). This operation is a fundamental building block in many scientific and engineering domains, including signal processing, quantum computing, and electromagnetics, where computations are naturally expressed using complex numbers.
Mathematical Formulation
A complex number z can be represented as z = a + bi, where a is the real part and b is the imaginary part. The multiplication of two complex numbers z1 = a + bi and z2 = c + di is:
z_1 \cdot z_2 = (a+bi)(c+di) = (ac - bd) + (ad + bc)i
A CGEMM operation, D = \alpha \cdot (A \times B) + \beta \cdot C, involves matrices where each element is a complex number. The core matrix multiplication A \times B is defined as:
C_{ik} = \sum_j A_{ij} \cdot B_{jk}
Where each multiplication and addition is a complex operation. This can be broken down into four real-valued GEMM operations:
Let A = A_r + iA_i and B = B_r + iB_i. Then the product C = A \times B is:
C = (A_r + iA_i) \times (B_r + iB_i) = (A_r B_r - A_i B_i) + i(A_r B_i + A_i B_r)
This shows that one CGEMM can be decomposed into four real GEMMs and two real matrix additions/subtractions.
Algorithmic Strategy: Fused Complex Arithmetic
A naive implementation would launch six separate real-valued kernels (4 GEMMs, 2 additions). A much more efficient approach, and the one used by Composable Kernel, is to implement CGEMM in a single, fused kernel.
-
Data Layout: Complex numbers are typically stored in an interleaved format, where the real and imaginary parts of an element are adjacent in memory (e.g.,
[r1, i1, r2, i2, ...]). The kernel is designed to work efficiently with this layout. -
Tiled CGEMM: The kernel uses a standard tiled GEMM algorithm, but the fundamental operations are adapted for complex numbers.
- Loading: A thread block loads tiles of the complex-valued matrices A and B from global memory into shared memory.
- Complex Multiply-Accumulate: The core of the algorithm is the multiply-accumulate (MAC) operation. Instead of a single
fmainstruction, each complex MAC involves multiple real-valuedfmainstructions to compute the real and imaginary parts of the product, as shown in the mathematical formulation.real_part = (a_r * b_r) - (a_i * b_i)imag_part = (a_r * b_i) + (a_i * b_r)
- These operations are carefully scheduled to maximize instruction-level parallelism and hide latency. The accumulators for both the real and imaginary parts are held in private registers.
-
Storing: After the tile is fully computed, the complex-valued result is written from registers back to the output matrix D in global memory.
By fusing the complex arithmetic directly into the GEMM kernel, we avoid launching multiple kernels and storing large intermediate real-valued matrices, which dramatically reduces kernel launch overhead and memory bandwidth requirements.
Source Code Organization
cgemm_xdl.cpp: The main example file. It defines complex-valued input matrices and instantiates theDeviceGemmoperation, specialized for complex data types.- The standard
DeviceGemminterface from../../include/ck/tensor_operation/gpu/device/device_gemm.hppis used. Composable Kernel overloads this interface for complex types (ck::complex<T>). - The grid-wise GEMM kernel is specialized to handle complex types. When the template arguments for data types are
ck::complex, the compiler instantiates a version of the kernel where the MAC operations are replaced with the sequence of real-valued operations required for complex multiplication.
Build and Run
Prerequisites
Ensure the Composable Kernel library is built and installed.
cd /path/to/composable_kernel/build
make -j install
Build the Example
cd /path/to/composable_kernel/example/22_cgemm
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
Run the Example
# Run the example with default settings
./cgemm_xdl
# Run with verification, data initialization, and timing
./cgemm_xdl 1 2 1
Applications
CGEMM is a critical kernel in many high-performance computing applications:
- Digital Signal Processing (DSP): The Fast Fourier Transform (FFT), a cornerstone of DSP, can be implemented using complex matrix multiplications. Filtering and convolution in the frequency domain also rely on complex arithmetic.
- Quantum Computing Simulation: The state of a quantum system is described by a vector of complex numbers, and quantum gates are represented by unitary matrices (a special type of complex matrix). Simulating a quantum circuit involves a sequence of CGEMM operations.
- Electromagnetics and Wave Physics: Simulating the propagation of electromagnetic or acoustic waves often involves solving systems of equations with complex numbers to represent the phase and amplitude of the waves.
- Communications: Modern communication systems (like 5G and Wi-Fi) use complex modulation schemes (like QAM) where signals are represented by complex numbers.