# Im2Col and Col2Im Operations This example demonstrates **Im2Col (image to column) and Col2Im (column to image)** operations. These are fundamental data layout transformations used in implementing convolution operations, particularly in frameworks that convert convolutions into matrix multiplications for efficient computation on GPUs. ## Mathematical Formulation ### Im2Col (Image to Column) Im2Col transforms a 4D image tensor into a 2D matrix where each column represents the input values for one convolution window. Given: - Input tensor `X` with shape `[N, C, H, W]` - Convolution parameters: kernel size `(KH, KW)`, stride `(SH, SW)`, padding `(PH, PW)`, dilation `(DH, DW)` The output matrix has shape `[C × KH × KW, N × OH × OW]` where: - `OH = (H + 2×PH - DH×(KH-1) - 1) / SH + 1` - `OW = (W + 2×PW - DW×(KW-1) - 1) / SW + 1` Each column `j` contains the flattened values from the convolution window at output position `j`: $\text{Col}[:, j] = \text{flatten}(\text{Window}_j(X))$ ### Col2Im (Column to Image) Col2Im is the inverse operation that reconstructs an image tensor from the column representation. Given: - Column matrix `Col` with shape `[C × KH × KW, N × OH × OW]` - Target image dimensions and convolution parameters The operation accumulates values from overlapping windows: $X[n, c, h, w] = \sum_{\text{windows covering } (h,w)} \text{Col}[\text{offset}, \text{window\_id}]$ Where multiple windows may contribute to the same image position, requiring accumulation. ## Algorithmic Strategy: Parallel Data Reshaping Both operations involve complex memory access patterns that require careful optimization. ### Im2Col Implementation 1. **Grid Scheduling**: Parallelize over output columns (convolution windows). 2. **Window Extraction**: For each output column: - Calculate the corresponding input window position - Handle padding by inserting zeros for out-of-bounds positions - Apply dilation by skipping elements in the kernel - Copy window values to the appropriate column 3. **Memory Optimization**: - Coalesced reads from input image - Coalesced writes to output matrix - Efficient padding handling ### Col2Im Implementation 1. **Grid Scheduling**: Parallelize over input image positions or column elements. 2. **Accumulation**: For each column element: - Calculate which image position it corresponds to - Accumulate the value using atomic operations (for overlapping windows) - Handle boundary conditions and padding 3. **Conflict Resolution**: Use atomic operations for thread-safe accumulation when multiple columns contribute to the same image position. ## Source Code Organization - [`im2col_col2im_xdl.cpp`](./im2col_col2im_xdl.cpp): The main example file. It demonstrates both Im2Col and Col2Im operations with verification that they are inverse operations. - [`../../include/ck/tensor_operation/gpu/device/device_im2col.hpp`](../../include/ck/tensor_operation/gpu/device/device_im2col.hpp): The high-level device interface for Im2Col operations. - [`../../include/ck/tensor_operation/gpu/device/device_col2im.hpp`](../../include/ck/tensor_operation/gpu/device/device_col2im.hpp): The high-level device interface for Col2Im operations. - The underlying kernels implement the complex address calculations and memory access patterns required for these transformations. ## Build and Run ### Prerequisites Ensure the Composable Kernel library is built and installed. ```bash cd /path/to/composable_kernel/build make -j install ``` ### Build the Example ```bash cd /path/to/composable_kernel/example/52_im2col_col2im mkdir build && cd build cmake \ -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \ -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \ .. make -j ``` ### Run the Example ```bash # Run the example with default settings ./im2col_col2im_xdl # Run with verification, data initialization, and timing ./im2col_col2im_xdl 1 2 1 ``` ## Applications in Deep Learning Im2Col and Col2Im are fundamental operations in convolution implementations: ### Im2Col Applications - **Convolution via GEMM**: Transform convolution into matrix multiplication, allowing use of highly optimized BLAS libraries - **Explicit Convolution**: Some frameworks prefer explicit Im2Col for better control over memory layouts - **Winograd Convolution**: Used in Winograd-based fast convolution algorithms - **Debugging and Visualization**: Understanding the convolution process by examining the column representation ### Col2Im Applications - **Transpose Convolution**: The backward pass of convolution (gradient w.r.t. input) uses Col2Im - **Deconvolution**: Upsampling operations that are the inverse of convolution - **Gradient Computation**: Computing gradients for convolution operations - **Memory Layout Restoration**: Converting back from optimized layouts to standard image formats ## Performance Characteristics - **Memory Bound**: Both operations are typically memory-bound rather than compute-bound - **Access Patterns**: Performance heavily depends on memory access patterns and coalescing - **Memory Overhead**: Im2Col can significantly increase memory usage due to data duplication - **Cache Behavior**: Complex strided access patterns can lead to poor cache utilization