
# Im2Col and Col2Im Operations

This example demonstrates **Im2Col (image to column) and Col2Im (column to image)** operations. These are fundamental data layout transformations used in implementing convolution operations, particularly in frameworks that convert convolutions into matrix multiplications for efficient computation on GPUs.

## Mathematical Formulation

### Im2Col (Image to Column)
Im2Col transforms a 4D image tensor into a 2D matrix where each column represents the input values for one convolution window.

Given:
- Input tensor `X` with shape `[N, C, H, W]`
- Convolution parameters: kernel size `(KH, KW)`, stride `(SH, SW)`, padding `(PH, PW)`, dilation `(DH, DW)`

The output matrix has shape `[C × KH × KW, N × OH × OW]`, where `/` denotes integer (floor) division:
- `OH = (H + 2×PH - DH×(KH-1) - 1) / SH + 1`
- `OW = (W + 2×PW - DW×(KW-1) - 1) / SW + 1`
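
As a quick check of the formulas above, the output extent along one axis can be computed with a small helper. This is a minimal sketch; `conv_out_size` is a hypothetical name used for illustration and is not part of the CK API.

```cpp
#include <cassert>

// Hypothetical helper (not part of CK): output extent along one spatial axis.
// C++ integer division truncates, which equals floor division as long as the
// numerator is non-negative, i.e. the dilated kernel fits in the padded input.
int conv_out_size(int in, int pad, int dilation, int kernel, int stride)
{
    const int numerator = in + 2 * pad - dilation * (kernel - 1) - 1;
    assert(numerator >= 0 && stride > 0);
    return numerator / stride + 1;
}
```

For example, `H = 7`, `KH = 3`, `SH = 2` with no padding and no dilation gives `OH = (7 - 2 - 1) / 2 + 1 = 3`.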

Each column `j` contains the flattened values from the convolution window at output position `j`:
$\text{Col}[:, j] = \text{flatten}(\text{Window}_j(X))$
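
The definition above can be written as a plain CPU reference. The sketch below assumes an NCHW input and a row-major column matrix with rows ordered as `(c, kh, kw)` and columns ordered as `(n, oh, ow)`, following the shapes in this section; the layouts used by the actual CK kernels may differ. `ConvParams` and `im2col_ref` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle; field names mirror the symbols used above.
struct ConvParams
{
    int N, C, H, W; // input shape
    int KH, KW;     // kernel size
    int SH, SW;     // stride
    int PH, PW;     // padding
    int DH, DW;     // dilation
    int OH() const { return (H + 2 * PH - DH * (KH - 1) - 1) / SH + 1; }
    int OW() const { return (W + 2 * PW - DW * (KW - 1) - 1) / SW + 1; }
};

// Reference Im2Col: x is NCHW, result is row-major [C*KH*KW, N*OH*OW].
std::vector<float> im2col_ref(const std::vector<float>& x, const ConvParams& p)
{
    const int OH = p.OH(), OW = p.OW();
    const std::size_t rows = static_cast<std::size_t>(p.C) * p.KH * p.KW;
    const std::size_t cols = static_cast<std::size_t>(p.N) * OH * OW;
    assert(x.size() == static_cast<std::size_t>(p.N) * p.C * p.H * p.W);
    std::vector<float> col(rows * cols, 0.0f); // zero-filled, so padding needs no writes

    for(int n = 0; n < p.N; ++n)
        for(int oh = 0; oh < OH; ++oh)
            for(int ow = 0; ow < OW; ++ow)
            {
                const std::size_t j = (static_cast<std::size_t>(n) * OH + oh) * OW + ow;
                std::size_t row     = 0; // runs over (c, kh, kw) in row-major order
                for(int c = 0; c < p.C; ++c)
                    for(int kh = 0; kh < p.KH; ++kh)
                        for(int kw = 0; kw < p.KW; ++kw, ++row)
                        {
                            const int h = oh * p.SH - p.PH + kh * p.DH;
                            const int w = ow * p.SW - p.PW + kw * p.DW;
                            if(h < 0 || h >= p.H || w < 0 || w >= p.W)
                                continue; // out of bounds: keep the zero
                            col[row * cols + j] =
                                x[((static_cast<std::size_t>(n) * p.C + c) * p.H + h) * p.W + w];
                        }
            }
    return col;
}
```

For a single-channel 3×3 image holding the values 1 to 9 and a 2×2 kernel with stride 1, the first column is `[1, 2, 4, 5]`, which is the flattened top-left window.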

### Col2Im (Column to Image)
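Col2Im maps the column representation back to an image tensor. It reverses the data movement of Im2Col, but it is an exact inverse only when the convolution windows do not overlap; where they overlap, the contributions are summed.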

Given:
- Column matrix `Col` with shape `[C × KH × KW, N × OH × OW]`
- Target image dimensions and convolution parameters

The operation accumulates values from overlapping windows:
$X[n, c, h, w] = \sum_{\text{windows covering } (h,w)} \text{Col}[\text{offset}, \text{window\_id}]$

Multiple windows may contribute to the same image position, so their values must be accumulated.
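
A serial CPU reference makes the accumulation explicit. The sketch below assumes the same layouts as the formulation above (row-major column matrix, NCHW image); `ConvParams` and `col2im_ref` are hypothetical names, not CK API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle; field names mirror the symbols used above.
struct ConvParams
{
    int N, C, H, W; // image shape
    int KH, KW;     // kernel size
    int SH, SW;     // stride
    int PH, PW;     // padding
    int DH, DW;     // dilation
    int OH() const { return (H + 2 * PH - DH * (KH - 1) - 1) / SH + 1; }
    int OW() const { return (W + 2 * PW - DW * (KW - 1) - 1) / SW + 1; }
};

// Reference Col2Im: col is row-major [C*KH*KW, N*OH*OW], result is NCHW.
std::vector<float> col2im_ref(const std::vector<float>& col, const ConvParams& p)
{
    const int OH = p.OH(), OW = p.OW();
    const std::size_t rows = static_cast<std::size_t>(p.C) * p.KH * p.KW;
    const std::size_t cols = static_cast<std::size_t>(p.N) * OH * OW;
    assert(col.size() == rows * cols);
    std::vector<float> x(static_cast<std::size_t>(p.N) * p.C * p.H * p.W, 0.0f);

    for(int n = 0; n < p.N; ++n)
        for(int oh = 0; oh < OH; ++oh)
            for(int ow = 0; ow < OW; ++ow)
            {
                const std::size_t j = (static_cast<std::size_t>(n) * OH + oh) * OW + ow;
                std::size_t row     = 0; // runs over (c, kh, kw) in row-major order
                for(int c = 0; c < p.C; ++c)
                    for(int kh = 0; kh < p.KH; ++kh)
                        for(int kw = 0; kw < p.KW; ++kw, ++row)
                        {
                            const int h = oh * p.SH - p.PH + kh * p.DH;
                            const int w = ow * p.SW - p.PW + kw * p.DW;
                            if(h < 0 || h >= p.H || w < 0 || w >= p.W)
                                continue; // this entry came from padding: drop it
                            // "+=" is the accumulation over overlapping windows.
                            x[((static_cast<std::size_t>(n) * p.C + c) * p.H + h) * p.W + w] +=
                                col[row * cols + j];
                        }
            }
    return x;
}
```

With a 3×3 image, a 2×2 kernel, and stride 1, the centre pixel is covered by four windows and the edge pixels by two, so applying this function to the Im2Col output of an image returns each pixel multiplied by its window count rather than the original image.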

## Algorithmic Strategy: Parallel Data Reshaping

Neither operation performs much arithmetic; the cost lies in index calculation and memory traffic, so the memory access pattern largely determines performance.

### Im2Col Implementation
1.  **Grid Scheduling**: Parallelize over output columns (convolution windows).

2.  **Window Extraction**: For each output column:
    -   Calculate the corresponding input window position
    -   Handle padding by inserting zeros for out-of-bounds positions
    -   Apply dilation by spacing the kernel taps `DH` rows and `DW` columns apart in the input
    -   Copy window values to the appropriate column

3.  **Memory Optimization**:
    -   Coalesced reads from input image
    -   Coalesced writes to output matrix
    -   Efficient padding handling
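
The steps above can be sketched as the work of a single thread that owns one output column `j`. This is a CPU illustration under the layouts assumed in the formulation section, not the CK kernel; `ConvParams` and `im2col_column` are hypothetical names. With this mapping, threads with adjacent `j` write adjacent addresses in every row of the column matrix, which is what makes the writes coalesced.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle; field names mirror the symbols used above.
struct ConvParams
{
    int N, C, H, W; // input shape
    int KH, KW;     // kernel size
    int SH, SW;     // stride
    int PH, PW;     // padding
    int DH, DW;     // dilation
    int OH() const { return (H + 2 * PH - DH * (KH - 1) - 1) / SH + 1; }
    int OW() const { return (W + 2 * PW - DW * (KW - 1) - 1) / SW + 1; }
};

// Work of one hypothetical thread: fill column j of col (row-major [C*KH*KW, N*OH*OW]).
void im2col_column(const std::vector<float>& x, const ConvParams& p, std::size_t j,
                   std::vector<float>& col)
{
    const int OH = p.OH(), OW = p.OW();
    const std::size_t cols = static_cast<std::size_t>(p.N) * OH * OW;
    assert(j < cols);
    assert(col.size() == static_cast<std::size_t>(p.C) * p.KH * p.KW * cols);

    // Decompose the column index into (n, oh, ow), then locate the window origin.
    const int ow = static_cast<int>(j % OW);
    const int oh = static_cast<int>((j / OW) % OH);
    const int n  = static_cast<int>(j / (static_cast<std::size_t>(OW) * OH));
    const int h0 = oh * p.SH - p.PH;
    const int w0 = ow * p.SW - p.PW;

    std::size_t row = 0; // runs over (c, kh, kw) in row-major order
    for(int c = 0; c < p.C; ++c)
        for(int kh = 0; kh < p.KH; ++kh)
            for(int kw = 0; kw < p.KW; ++kw, ++row)
            {
                const int h       = h0 + kh * p.DH; // dilation spaces the taps apart
                const int w       = w0 + kw * p.DW;
                const bool inside = h >= 0 && h < p.H && w >= 0 && w < p.W;
                // Out-of-bounds taps read as zero (padding).
                col[row * cols + j] =
                    inside ? x[((static_cast<std::size_t>(n) * p.C + c) * p.H + h) * p.W + w]
                           : 0.0f;
            }
}
```

Calling `im2col_column` for every `j` in `[0, N × OH × OW)` fills the whole matrix; on a GPU each call would correspond to one thread of the grid.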

### Col2Im Implementation
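1.  **Grid Scheduling**: Parallelize over image positions (each thread gathers every column element that maps to its position) or over column elements (each thread scatters its value into the image).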

2.  **Accumulation**: For each column element:
    -   Calculate which image position it corresponds to
    -   Accumulate the value using atomic operations (for overlapping windows)
    -   Handle boundary conditions and padding

3.  **Conflict Resolution**: When parallelizing over column elements, several threads may update the same image position, so the accumulation must use atomic operations to stay thread-safe.
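
The scheduling choice in step 1 decides whether atomics are needed at all. If each thread owns one image position and sums every column element that maps to it, no two threads write the same address. The sketch below shows this gather form on the CPU, under the layouts assumed in the formulation section; it is an illustration and not necessarily how the CK kernel is organized. `ConvParams` and `col2im_gather` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle; field names mirror the symbols used above.
struct ConvParams
{
    int N, C, H, W; // image shape
    int KH, KW;     // kernel size
    int SH, SW;     // stride
    int PH, PW;     // padding
    int DH, DW;     // dilation
    int OH() const { return (H + 2 * PH - DH * (KH - 1) - 1) / SH + 1; }
    int OW() const { return (W + 2 * PW - DW * (KW - 1) - 1) / SW + 1; }
};

// Gather-form Col2Im: the body of the (n, c, h, w) loop is one thread's work.
std::vector<float> col2im_gather(const std::vector<float>& col, const ConvParams& p)
{
    const int OH = p.OH(), OW = p.OW();
    const std::size_t cols = static_cast<std::size_t>(p.N) * OH * OW;
    assert(col.size() == static_cast<std::size_t>(p.C) * p.KH * p.KW * cols);
    std::vector<float> x(static_cast<std::size_t>(p.N) * p.C * p.H * p.W, 0.0f);

    for(int n = 0; n < p.N; ++n)
        for(int c = 0; c < p.C; ++c)
            for(int h = 0; h < p.H; ++h)
                for(int w = 0; w < p.W; ++w)
                {
                    float acc = 0.0f;
                    for(int kh = 0; kh < p.KH; ++kh)
                        for(int kw = 0; kw < p.KW; ++kw)
                        {
                            // Solve h = oh*SH - PH + kh*DH for oh (same for w).
                            const int th = h + p.PH - kh * p.DH;
                            const int tw = w + p.PW - kw * p.DW;
                            if(th < 0 || tw < 0 || th % p.SH != 0 || tw % p.SW != 0)
                                continue; // no window places this tap on (h, w)
                            const int oh = th / p.SH, ow = tw / p.SW;
                            if(oh >= OH || ow >= OW)
                                continue;
                            const std::size_t row =
                                (static_cast<std::size_t>(c) * p.KH + kh) * p.KW + kw;
                            const std::size_t j =
                                (static_cast<std::size_t>(n) * OH + oh) * OW + ow;
                            acc += col[row * cols + j];
                        }
                    // Single plain store per position: no atomics required.
                    x[((static_cast<std::size_t>(n) * p.C + c) * p.H + h) * p.W + w] = acc;
                }
    return x;
}
```

The trade-off is extra index arithmetic and strided reads from the column matrix in exchange for conflict-free writes.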

## Source Code Organization

-   [`im2col_col2im_xdl.cpp`](./im2col_col2im_xdl.cpp): The main example file. It demonstrates both Im2Col and Col2Im operations with verification that they are inverse operations.
-   [`../../include/ck/tensor_operation/gpu/device/device_im2col.hpp`](../../include/ck/tensor_operation/gpu/device/device_im2col.hpp): The high-level device interface for Im2Col operations.
-   [`../../include/ck/tensor_operation/gpu/device/device_col2im.hpp`](../../include/ck/tensor_operation/gpu/device/device_col2im.hpp): The high-level device interface for Col2Im operations.
-   The underlying kernels implement the complex address calculations and memory access patterns required for these transformations.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/52_im2col_col2im
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./im2col_col2im_xdl

# Run with verification, data initialization, and timing
./im2col_col2im_xdl 1 2 1
```

## Applications in Deep Learning

Im2Col and Col2Im are fundamental operations in convolution implementations:

### Im2Col Applications
-   **Convolution via GEMM**: Transform convolution into matrix multiplication, allowing use of highly optimized BLAS libraries
-   **Explicit Convolution**: Some frameworks prefer explicit Im2Col for better control over memory layouts
-   **Winograd Convolution**: Winograd-based fast convolution algorithms extract overlapping input tiles in a similar way
-   **Debugging and Visualization**: Understanding the convolution process by examining the column representation
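
To illustrate the convolution-via-GEMM item above, the sketch below runs Im2Col followed by a naive matrix multiplication. For brevity it assumes a single image (`N = 1`), stride 1, no padding, and no dilation; `conv2d_via_im2col` is a hypothetical name, and a real implementation would hand the second step to an optimized GEMM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplifying assumptions (for brevity only): N = 1, stride 1, no padding, no dilation.
// x: [C, H, W], weight: [K, C, KH, KW], result: [K, OH, OW], all row-major.
std::vector<float> conv2d_via_im2col(const std::vector<float>& x,
                                     const std::vector<float>& weight,
                                     int C, int H, int W, int K, int KH, int KW)
{
    const int OH = H - KH + 1, OW = W - KW + 1;
    const std::size_t rows = static_cast<std::size_t>(C) * KH * KW;
    const std::size_t cols = static_cast<std::size_t>(OH) * OW;
    assert(OH > 0 && OW > 0);
    assert(x.size() == static_cast<std::size_t>(C) * H * W);
    assert(weight.size() == static_cast<std::size_t>(K) * rows);

    // Step 1: Im2Col, col[rows, cols].
    std::vector<float> col(rows * cols);
    for(int c = 0; c < C; ++c)
        for(int kh = 0; kh < KH; ++kh)
            for(int kw = 0; kw < KW; ++kw)
            {
                const std::size_t row = (static_cast<std::size_t>(c) * KH + kh) * KW + kw;
                for(int oh = 0; oh < OH; ++oh)
                    for(int ow = 0; ow < OW; ++ow)
                        col[row * cols + static_cast<std::size_t>(oh) * OW + ow] =
                            x[(static_cast<std::size_t>(c) * H + oh + kh) * W + ow + kw];
            }

    // Step 2: GEMM, out[K, cols] = weight[K, rows] * col[rows, cols].
    std::vector<float> out(static_cast<std::size_t>(K) * cols, 0.0f);
    for(int k = 0; k < K; ++k)
        for(std::size_t r = 0; r < rows; ++r)
            for(std::size_t j = 0; j < cols; ++j)
                out[k * cols + j] += weight[k * rows + r] * col[r * cols + j];
    return out;
}
```

The filter tensor needs no rearrangement here: flattening `[K, C, KH, KW]` row-major already yields the `[K, C × KH × KW]` matrix that the GEMM expects.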

### Col2Im Applications
-   **Transposed Convolution**: The backward pass of convolution (gradient w.r.t. input) uses Col2Im
-   **Deconvolution**: Upsampling operations that reverse the spatial downsampling of a convolution (not a true mathematical inverse)
-   **Gradient Computation**: Computing gradients for convolution operations
-   **Memory Layout Restoration**: Converting back from optimized layouts to standard image formats

## Performance Characteristics

-   **Memory Bound**: Both operations are typically memory-bound rather than compute-bound
-   **Access Patterns**: Performance heavily depends on memory access patterns and coalescing
-   **Memory Overhead**: Im2Col duplicates every input element that falls in more than one window, so at stride 1 the column matrix can hold roughly `KH × KW` times as many elements as the input
-   **Cache Behavior**: Complex strided access patterns can lead to poor cache utilization
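
The memory overhead can be estimated directly from the shapes given in the formulation section. In the sketch below, `im2col_expansion` is a hypothetical helper that returns the ratio of column-matrix elements to input elements; `N` and `C` cancel out, so only the spatial parameters matter.

```cpp
#include <cassert>

// Hypothetical helper: (C*KH*KW * N*OH*OW) / (N*C*H*W), with N and C cancelled.
double im2col_expansion(int H, int W, int KH, int KW, int SH, int SW,
                        int PH, int PW, int DH, int DW)
{
    const int OH = (H + 2 * PH - DH * (KH - 1) - 1) / SH + 1;
    const int OW = (W + 2 * PW - DW * (KW - 1) - 1) / SW + 1;
    assert(OH > 0 && OW > 0);
    return static_cast<double>(KH) * KW * OH * OW / (static_cast<double>(H) * W);
}
```

For a 56×56 input with a 3×3 kernel, stride 1, and padding 1, the output is also 56×56 and the ratio is exactly 9. With stride 2 and no padding on a 7×7 input, the ratio drops to 81/49, because fewer windows overlap.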