# GEMM with Double Multiply Operations

This example demonstrates a **GEMM followed by two sequential elementwise multiplication operations**. This fusion pattern is useful for implementing layers that require matrix multiplication followed by multiple scaling or masking operations, such as certain attention mechanisms or gated neural network architectures.

## Mathematical Formulation

The operation performs a matrix multiplication followed by two sequential elementwise multiplications.

1. **GEMM Stage**: A standard matrix multiplication.
    $C_{temp1} = A \times B$

2. **First Multiplication**: Elementwise multiplication with tensor `D`.
    $C_{temp2} = C_{temp1} \odot D$

3. **Second Multiplication**: Elementwise multiplication with tensor `E`.
    $F = C_{temp2} \odot E$

The key optimization is that the intermediate tensors `C_temp1` and `C_temp2` are **never written to global memory**. All operations are fused into the GEMM's epilogue, operating on data held in registers.

## Algorithmic Strategy: GEMM with Dual-Multiply Epilogue

The implementation uses a tiled GEMM algorithm with a multi-stage fused epilogue that performs two sequential multiplications.

1. **Tiled GEMM Core**: The kernel begins with a standard tiled GEMM. A thread block computes a tile of the product $A \times B$, accumulating the result in registers.

2. **Dual-Multiply Epilogue**: Before any data is written to global memory, the following sequence occurs for the tile of data held in registers:
    - **Load First Multiplicand**: Threads load the corresponding elements of tensor `D`.
    - **First Multiplication**: The elementwise multiplication is performed in registers: `result *= D`.
    - **Load Second Multiplicand**: Threads load the corresponding elements of tensor `E`.
    - **Second Multiplication**: The second elementwise multiplication is performed in registers: `result *= E`.
    - **Store Final Result**: The final result `F` is written to global memory.

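
The epilogue sequence above can be illustrated with a small host-side reference. This is a minimal sketch, not the Composable Kernel implementation: the function name `gemm_multiply_multiply_ref` is hypothetical, all matrices are assumed to be row-major `float`, and `D` and `E` are assumed to have the same `M x N` shape as the output. A local accumulator stands in for the registers that hold the tile on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not part of Composable Kernel.
// Assumed layout: row-major A (M x K), B (K x N), D and E (M x N).
// Returns F (M x N) = ((A * B) .* D) .* E.
std::vector<float> gemm_multiply_multiply_ref(const std::vector<float>& a,
                                              const std::vector<float>& b,
                                              const std::vector<float>& d,
                                              const std::vector<float>& e,
                                              std::size_t M,
                                              std::size_t N,
                                              std::size_t K)
{
    assert(a.size() == M * K && b.size() == K * N);
    assert(d.size() == M * N && e.size() == M * N);

    std::vector<float> f(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            // GEMM stage: accumulate in a local variable (stands in for a register).
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += a[m * K + k] * b[k * N + n];
            }

            // Fused epilogue: both multiplications happen before the only store.
            acc *= d[m * N + n]; // first multiplication  (C_temp2)
            acc *= e[m * N + n]; // second multiplication (F)
            f[m * N + n] = acc;  // single write of the final result
        }
    }
    return f;
}
```

Neither `C_temp1` nor `C_temp2` is ever materialized as a full array in this sketch; each exists only as the current value of `acc`, which mirrors how the fused kernel avoids intermediate global-memory traffic.
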

This deep fusion eliminates multiple kernel launches and the memory bandwidth required to write and re-read intermediate tensors.

## Source Code Organization

- [`gemm_multiply_multiply_xdl.cpp`](./gemm_multiply_multiply_xdl.cpp): The main example file. It sets up the input matrices (A, B) and auxiliary tensors (D, E), and instantiates the `DeviceGemmMultiplyMultiply` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_gemm_multiply_multiply.hpp`](../../include/ck/tensor_operation/gpu/device/device_gemm_multiply_multiply.hpp): The high-level device interface for this fused operation.
- The underlying kernel implements the dual-multiply epilogue that performs both multiplication operations on register data before storing.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/65_gemm_multiply_multiply
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./gemm_multiply_multiply_xdl

# Run with verification, data initialization, and timing
./gemm_multiply_multiply_xdl 1 2 1
```

## Applications

This fusion pattern is useful for several types of neural network operations and advanced computational patterns.

- **Multi-Scale Attention**: Some attention mechanisms apply multiple scaling factors sequentially, such as learned attention scales followed by positional scaling.
- **Gated Mechanisms**: Advanced gating architectures that use multiple multiplicative gates in sequence, such as in some RNN variants or transformer modifications.
- **Feature Modulation**: Computer vision models that apply multiple feature modulation operations, such as style-based generators or attention-based feature refinement.
- **Masking Operations**: Applying multiple types of masks (e.g., attention mask followed by a dropout mask) in sequence.
- **Custom Activations**: Implementing complex activation functions that involve multiple multiplicative terms.
- **Mixture of Experts**: Some MoE architectures use multiple routing or gating multiplications in sequence.

## Performance Considerations

The performance benefits of this fusion depend on several factors:

- **Memory Bandwidth Savings**: Eliminates the global-memory write and re-read of both intermediate tensors (`C_temp1` and `C_temp2`)
- **Data Locality**: Keeps the result tile in registers throughout the computation pipeline
- **Instruction Scheduling**: Allows better interleaving of compute and memory operations within a single kernel
- **Kernel Launch Overhead**: Replaces three separate kernel launches with one

## Comparison with Sequential Operations

| Approach | Kernel Launches | Intermediate Memory Traffic | Register Pressure | Implementation Complexity |
|----------|----------------|------------------|-------------------|---------------------------|
| **Sequential** | 3 kernels | 2 intermediate tensors written and re-read | Low | Simple |
| **Fused** | 1 kernel | No intermediate tensors in global memory | Medium | Moderate |

## Extension Possibilities

This pattern can be extended in several ways:

- **More Multiplications**: Additional sequential multiplications can be added to the epilogue
- **Mixed Operations**: Combine multiplications with additions or other elementwise operations
- **Conditional Operations**: Apply multiplications conditionally based on masks or thresholds
- **Broadcasting**: Handle different broadcasting patterns for the multiplicand tensors

This example demonstrates the flexibility of the epilogue fusion approach, showing how multiple sequential operations can be efficiently combined with matrix multiplication.
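
As an illustration of the extension ideas listed above, the following host-side sketch parameterizes the epilogue as a callable, so that broadcasting, mixed operations, or conditional multiplications become a change to one lambda. It is a sketch under stated assumptions, not the Composable Kernel API: `gemm_with_epilogue_ref` and `gemm_row_col_scale_ref` are hypothetical names, the matrices are assumed to be row-major `float`, and the broadcasting variant assumes one scale per output row and one per output column.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU sketch, not the Composable Kernel API.
// `epilogue(acc, m, n)` receives the accumulated GEMM value together with its
// output coordinates and returns the value that gets stored.
template <typename Epilogue>
std::vector<float> gemm_with_epilogue_ref(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          std::size_t M,
                                          std::size_t N,
                                          std::size_t K,
                                          Epilogue epilogue)
{
    assert(a.size() == M * K && b.size() == K * N);

    std::vector<float> out(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += a[m * K + k] * b[k * N + n];
            }
            // The epilogue runs on the accumulator before the only store.
            out[m * N + n] = epilogue(acc, m, n);
        }
    }
    return out;
}

// Broadcasting variant (assumed shapes): row_scale has M entries and is
// broadcast along columns; col_scale has N entries and is broadcast along rows.
std::vector<float> gemm_row_col_scale_ref(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          const std::vector<float>& row_scale,
                                          const std::vector<float>& col_scale,
                                          std::size_t M,
                                          std::size_t N,
                                          std::size_t K)
{
    assert(row_scale.size() == M && col_scale.size() == N);
    return gemm_with_epilogue_ref(
        a, b, M, N, K, [&](float acc, std::size_t m, std::size_t n) {
            // Two sequential multiplications, each with a broadcast operand.
            return acc * row_scale[m] * col_scale[n];
        });
}
```

Passing a different callable, for example one that adds a bias or multiplies only where a mask is set, covers the mixed and conditional cases without touching the GEMM loop. In a GPU kernel the same idea is typically expressed through an elementwise operation type supplied to the epilogue rather than a runtime lambda.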