mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-18 20:09:25 +00:00
121 lines
5.2 KiB
Markdown
# Im2Col and Col2Im Operations

This example demonstrates **Im2Col (image to column) and Col2Im (column to image)** operations. These are fundamental data layout transformations used in implementing convolution operations, particularly in frameworks that convert convolutions into matrix multiplications for efficient computation on GPUs.

## Mathematical Formulation

### Im2Col (Image to Column)
Im2Col transforms a 4D image tensor into a 2D matrix where each column holds the input values for one convolution window.

Given:
- Input tensor `X` with shape `[N, C, H, W]`
- Convolution parameters: kernel size `(KH, KW)`, stride `(SH, SW)`, padding `(PH, PW)`, dilation `(DH, DW)`

The output matrix has shape `[C × KH × KW, N × OH × OW]` where (the division rounds down):
- `OH = floor((H + 2×PH - DH×(KH-1) - 1) / SH) + 1`
- `OW = floor((W + 2×PW - DW×(KW-1) - 1) / SW) + 1`

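As a quick check of the output-size formulas, the sketch below evaluates them in plain Python. The helper name `conv_out_size` is made up for this illustration and is not part of Composable Kernel; `//` is Python's floor division.

```python
def conv_out_size(in_size, kernel, stride=1, pad=0, dilation=1):
    """Number of window positions along one spatial axis."""
    effective_kernel = dilation * (kernel - 1) + 1  # extent covered by a dilated kernel
    return (in_size + 2 * pad - effective_kernel) // stride + 1


# A 3x3 kernel with stride 1 and padding 1 preserves a 32-wide axis.
print(conv_out_size(32, kernel=3, stride=1, pad=1))  # 32
# Stride 2 on a 7-wide axis without padding: floor((7 - 3) / 2) + 1 = 3.
print(conv_out_size(7, kernel=3, stride=2))  # 3
# Dilation 2 widens the 3-tap kernel to 5 inputs: (9 - 5) / 1 + 1 = 5.
print(conv_out_size(9, kernel=3, dilation=2))  # 5
```

The same function applies to both axes: call it once with `(H, KH, SH, PH, DH)` for `OH` and once with `(W, KW, SW, PW, DW)` for `OW`.
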
Each column `j` contains the flattened values from the convolution window at output position `j`:
$\text{Col}[:, j] = \text{flatten}(\text{Window}_j(X))$

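The definition can be made concrete with a small CPU reference. The following is a minimal sketch in plain Python using nested lists, assuming row order `(c, kh, kw)` and column order `(n, oh, ow)`. The function name `im2col_ref` and these orderings are illustrative assumptions and not necessarily the layout used by the Composable Kernel device operations.

```python
def im2col_ref(x, kernel, stride=(1, 1), pad=(0, 0), dilation=(1, 1)):
    """x is a nested list indexed x[n][c][h][w]; returns col[row][column]."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    (KH, KW), (SH, SW), (PH, PW), (DH, DW) = kernel, stride, pad, dilation
    OH = (H + 2 * PH - DH * (KH - 1) - 1) // SH + 1
    OW = (W + 2 * PW - DW * (KW - 1) - 1) // SW + 1

    col = [[0] * (N * OH * OW) for _ in range(C * KH * KW)]
    for n in range(N):
        for oh in range(OH):
            for ow in range(OW):
                j = (n * OH + oh) * OW + ow              # column = window index
                for c in range(C):
                    for kh in range(KH):
                        for kw in range(KW):
                            r = (c * KH + kh) * KW + kw  # row = offset inside window
                            h = oh * SH - PH + kh * DH
                            w = ow * SW - PW + kw * DW
                            if 0 <= h < H and 0 <= w < W:
                                col[r][j] = x[n][c][h][w]
                            # out-of-bounds positions keep the zero padding value
    return col


# One 3x3 single-channel image, 2x2 kernel, stride 1, no padding -> 4 windows.
image = [[[[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]]]
for row in im2col_ref(image, kernel=(2, 2)):
    print(row)
# [1, 2, 4, 5]
# [2, 3, 5, 6]
# [4, 5, 7, 8]
# [5, 6, 8, 9]
```

Reading the printed matrix column by column gives the four windows `[1, 2, 4, 5]`, `[2, 3, 5, 6]`, `[4, 5, 7, 8]`, and `[5, 6, 8, 9]`. The centre value `5` appears four times, once per window that covers it.
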
### Col2Im (Column to Image)
Col2Im is the reverse mapping: it reconstructs an image tensor from the column representation. Because values from overlapping windows are summed, applying Col2Im to the output of Im2Col returns the original image only when every image position is covered by exactly one window; otherwise each position is scaled by the number of windows that cover it.

Given:
- Column matrix `Col` with shape `[C × KH × KW, N × OH × OW]`
- Target image dimensions and convolution parameters

The operation accumulates values from overlapping windows:
$X[n, c, h, w] = \sum_{\text{windows covering } (h,w)} \text{Col}[\text{offset}, \text{window\_id}]$

Multiple windows may contribute to the same image position, which is why the values must be accumulated rather than overwritten.

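The accumulation can be illustrated with a scatter-add reference. This is a minimal plain-Python sketch under the same assumed orderings, rows as `(c, kh, kw)` and columns as `(n, oh, ow)`. The name `col2im_ref` is hypothetical and the code is not taken from the library.

```python
def col2im_ref(col, shape, kernel, stride=(1, 1), pad=(0, 0), dilation=(1, 1)):
    """Scatter-add col[row][column] back into an image of shape (N, C, H, W)."""
    N, C, H, W = shape
    (KH, KW), (SH, SW), (PH, PW), (DH, DW) = kernel, stride, pad, dilation
    OH = (H + 2 * PH - DH * (KH - 1) - 1) // SH + 1
    OW = (W + 2 * PW - DW * (KW - 1) - 1) // SW + 1

    x = [[[[0] * W for _ in range(H)] for _ in range(C)] for _ in range(N)]
    for r in range(C * KH * KW):
        c, kh, kw = r // (KH * KW), (r // KW) % KH, r % KW
        for j in range(N * OH * OW):
            n, oh, ow = j // (OH * OW), (j // OW) % OH, j % OW
            h = oh * SH - PH + kh * DH
            w = ow * SW - PW + kw * DW
            if 0 <= h < H and 0 <= w < W:   # entries that map into padding are dropped
                x[n][c][h][w] += col[r][j]  # overlapping windows accumulate
    return x


# An all-ones column matrix for a 3x3 image and 2x2 kernel (4 rows, 4 windows)
# yields, at each pixel, the number of windows that cover it.
ones = [[1] * 4 for _ in range(4)]
for row in col2im_ref(ones, shape=(1, 1, 3, 3), kernel=(2, 2))[0][0]:
    print(row)
# [1, 2, 1]
# [2, 4, 2]
# [1, 2, 1]
```

The printed counts are the per-pixel scale factors that appear when Col2Im is applied to the output of Im2Col with overlapping windows.
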
## Algorithmic Strategy: Parallel Data Reshaping

Both operations involve complex memory access patterns that require careful optimization.

### Im2Col Implementation
1. **Grid Scheduling**: Parallelize over output columns (convolution windows).

2. **Window Extraction**: For each output column:
   - Calculate the corresponding input window position
   - Handle padding by inserting zeros for out-of-bounds positions
   - Apply dilation by spacing the kernel taps `DH` / `DW` input elements apart
   - Copy window values to the appropriate column

3. **Memory Optimization**:
   - Coalesced reads from input image
   - Coalesced writes to output matrix
   - Efficient padding handling

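The window-extraction step boils down to index arithmetic per output element. The sketch below mimics what a single GPU thread might compute from a flat output index. It is plain Python, and the helper name `im2col_source_index`, the row-major output layout, and the `(c, kh, kw)` / `(n, oh, ow)` orderings are assumptions for illustration, not the Composable Kernel kernel itself.

```python
def im2col_source_index(flat, shape, out_hw, kernel, stride, pad, dilation):
    """Map a flat index into the row-major column matrix to its source element.

    Returns (n, c, h, w) in the input image, or None when the element falls in
    the padding region and should be written as zero.
    """
    N, C, H, W = shape
    (OH, OW), (KH, KW) = out_hw, kernel
    (SH, SW), (PH, PW), (DH, DW) = stride, pad, dilation

    num_cols = N * OH * OW
    r, j = divmod(flat, num_cols)   # row and column of the output matrix
    c, rem = divmod(r, KH * KW)     # row -> (c, kh, kw)
    kh, kw = divmod(rem, KW)
    n, rem = divmod(j, OH * OW)     # column -> (n, oh, ow)
    oh, ow = divmod(rem, OW)

    h = oh * SH - PH + kh * DH      # stride moves the window, dilation spreads the taps
    w = ow * SW - PW + kw * DW
    if 0 <= h < H and 0 <= w < W:
        return (n, c, h, w)
    return None


# 1x1x4x4 image, 3x3 kernel, stride 1, padding 1 -> 4x4 output positions.
args = dict(shape=(1, 1, 4, 4), out_hw=(4, 4), kernel=(3, 3),
            stride=(1, 1), pad=(1, 1), dilation=(1, 1))
print(im2col_source_index(0, **args))   # None: top-left tap of the first window is padding
print(im2col_source_index(69, **args))  # (0, 0, 1, 1)
```

With this mapping, consecutive flat indices walk along `ow`, so for `SW = 1` neighbouring threads read neighbouring `w` positions within an output row, which is the access pattern that coalescing relies on.
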
### Col2Im Implementation
1. **Grid Scheduling**: Parallelize over input image positions or column elements.

2. **Accumulation**: For each column element:
   - Calculate which image position it corresponds to
   - Accumulate the value using atomic operations (for overlapping windows)
   - Handle boundary conditions and padding

3. **Conflict Resolution**: Use atomic operations for thread-safe accumulation when multiple columns contribute to the same image position.

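Parallelizing over image positions, the first option in step 1, avoids atomics altogether: each image position gathers from every window that covers it. The following plain-Python sketch shows the index arithmetic for one pixel. The function name `col2im_gather_pixel` and the row/column orderings are assumptions for illustration and do not describe which strategy the Composable Kernel kernels use.

```python
def col2im_gather_pixel(col, n, c, h, w, out_hw, kernel, stride, pad, dilation):
    """Sum every column-matrix entry that maps onto image position (n, c, h, w)."""
    (OH, OW), (KH, KW) = out_hw, kernel
    (SH, SW), (PH, PW), (DH, DW) = stride, pad, dilation

    total = 0
    for kh in range(KH):
        # Solve h = oh*SH - PH + kh*DH for oh; skip taps with no valid integer solution.
        top = h + PH - kh * DH
        if top < 0 or top % SH != 0 or top // SH >= OH:
            continue
        oh = top // SH
        for kw in range(KW):
            left = w + PW - kw * DW
            if left < 0 or left % SW != 0 or left // SW >= OW:
                continue
            ow = left // SW
            r = (c * KH + kh) * KW + kw
            j = (n * OH + oh) * OW + ow
            total += col[r][j]  # plain add: only this pixel's owner writes it
    return total


# Column matrix of the 3x3 image 1..9 under a 2x2 kernel with stride 1.
col = [[1, 2, 4, 5],
       [2, 3, 5, 6],
       [4, 5, 7, 8],
       [5, 6, 8, 9]]
geometry = dict(out_hw=(2, 2), kernel=(2, 2), stride=(1, 1), pad=(0, 0), dilation=(1, 1))
print(col2im_gather_pixel(col, 0, 0, 1, 1, **geometry))  # 20: centre value 5 seen by 4 windows
print(col2im_gather_pixel(col, 0, 0, 0, 0, **geometry))  # 1: corner seen by one window
```

The trade-off is extra index arithmetic (a divisibility test per kernel tap) in exchange for conflict-free writes; the scatter formulation in step 2 has simpler indexing but needs atomic adds.
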
## Source Code Organization

- [`im2col_col2im_xdl.cpp`](./im2col_col2im_xdl.cpp): The main example file. It demonstrates both Im2Col and Col2Im operations with verification that they are inverse operations.
- [`../../include/ck/tensor_operation/gpu/device/device_im2col.hpp`](../../include/ck/tensor_operation/gpu/device/device_im2col.hpp): The high-level device interface for Im2Col operations.
- [`../../include/ck/tensor_operation/gpu/device/device_col2im.hpp`](../../include/ck/tensor_operation/gpu/device/device_col2im.hpp): The high-level device interface for Col2Im operations.
- The underlying kernels implement the complex address calculations and memory access patterns required for these transformations.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/52_im2col_col2im
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./im2col_col2im_xdl

# Run with verification, data initialization, and timing
./im2col_col2im_xdl 1 2 1
```

## Applications in Deep Learning

Im2Col and Col2Im are fundamental operations in convolution implementations:

### Im2Col Applications
- **Convolution via GEMM**: Transform convolution into matrix multiplication, allowing use of highly optimized BLAS libraries
- **Explicit Convolution**: Some frameworks prefer explicit Im2Col for better control over memory layouts
- **Winograd Convolution**: Used in Winograd-based fast convolution algorithms
- **Debugging and Visualization**: Understanding the convolution process by examining the column representation

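The convolution-via-GEMM idea can be checked on a tiny case. The sketch below is plain Python restricted to stride 1 with no padding or dilation; the helpers `im2col_simple`, `matmul`, and `conv2d_direct` are written for this illustration only, and the naive `matmul` merely stands in for an optimized GEMM.

```python
def im2col_simple(x, KH, KW):
    """x[c][h][w] -> col[c*KH*KW][OH*OW] for stride 1, no padding, no dilation."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    OH, OW = H - KH + 1, W - KW + 1
    return [[x[c][oh + kh][ow + kw] for oh in range(OH) for ow in range(OW)]
            for c in range(C) for kh in range(KH) for kw in range(KW)]


def matmul(a, b):
    """Naive dense matrix product standing in for an optimized GEMM."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conv2d_direct(x, weight):
    """Direct sliding-window convolution (cross-correlation), used as a reference."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    KH, KW = len(weight[0][0]), len(weight[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    return [[sum(weight[k][c][kh][kw] * x[c][oh + kh][ow + kw]
                 for c in range(C) for kh in range(KH) for kw in range(KW))
             for oh in range(OH) for ow in range(OW)]
            for k in range(len(weight))]


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]            # C=1, 3x3 image
weight = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]  # K=2 filters of shape [C=1, 2, 2]

# Flatten each filter to one row of length C*KH*KW, then a single GEMM does the convolution.
w_matrix = [[v for ch in f for row in ch for v in row] for f in weight]
out = matmul(w_matrix, im2col_simple(x, 2, 2))
print(out)                              # [[6, 8, 12, 14], [12, 16, 24, 28]]
print(out == conv2d_direct(x, weight))  # True
```

The weight matrix has shape `[K, C × KH × KW]` and the column matrix `[C × KH × KW, OH × OW]`, so the product has one row per output channel and one column per output position.
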
### Col2Im Applications
- **Transpose Convolution**: The backward pass of convolution (gradient w.r.t. input) uses Col2Im
- **Deconvolution**: Upsampling operations that are the inverse of convolution
- **Gradient Computation**: Computing gradients for convolution operations
- **Memory Layout Restoration**: Converting back from optimized layouts to standard image formats

## Performance Characteristics

- **Memory Bound**: Both operations are typically memory-bound rather than compute-bound, since they move data and do little arithmetic beyond index calculation
- **Access Patterns**: Performance heavily depends on memory access patterns and coalescing
- **Memory Overhead**: Im2Col can significantly increase memory usage because each input element is copied into every window that covers it, up to `KH × KW` times
- **Cache Behavior**: Complex strided access patterns can lead to poor cache utilization
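
The memory overhead can be estimated directly from the shapes: the column matrix holds `C × KH × KW × N × OH × OW` elements against `N × C × H × W` in the image, so the ratio is `KH × KW × OH × OW / (H × W)`. The helper below is a small plain-Python sketch of that arithmetic; the name `im2col_expansion` is made up for this example.

```python
def im2col_expansion(H, W, kernel, stride=(1, 1), pad=(0, 0), dilation=(1, 1)):
    """Ratio of column-matrix elements to input-image elements (N and C cancel out)."""
    (KH, KW), (SH, SW), (PH, PW), (DH, DW) = kernel, stride, pad, dilation
    OH = (H + 2 * PH - DH * (KH - 1) - 1) // SH + 1
    OW = (W + 2 * PW - DW * (KW - 1) - 1) // SW + 1
    return (KH * KW * OH * OW) / (H * W)


# 3x3 kernel, stride 1, "same" padding: every pixel is copied nine times.
print(im2col_expansion(56, 56, kernel=(3, 3), pad=(1, 1)))                 # 9.0
# Stride 2 quarters the number of windows.
print(im2col_expansion(56, 56, kernel=(3, 3), stride=(2, 2), pad=(1, 1)))  # 2.25
# A 1x1 kernel is a pure re-layout with no duplication.
print(im2col_expansion(56, 56, kernel=(1, 1)))                             # 1.0
```
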