Files
composable_kernel/example/49_maxpool2d_bwd/README.md
2025-10-16 10:13:27 +00:00

96 lines
4.9 KiB
Markdown

# 2D Max Pooling Backward
This example demonstrates the **backward pass of 2D max pooling**. This operation computes the gradient of the loss with respect to the input of a max pooling layer, which is essential for training convolutional neural networks that use max pooling for downsampling.
## Mathematical Formulation
The backward pass of max pooling propagates gradients only to the input positions that contributed to the maximum value in each pooling window.
Given:
- Input tensor `X` with shape `[N, C, H_in, W_in]`
- Output gradients `dL/dY` with shape `[N, C, H_out, W_out]`
- Pooling parameters: window size `(pool_h, pool_w)`, stride `(stride_h, stride_w)`, padding `(pad_h, pad_w)`
The backward pass computes input gradients `dL/dX` with the same shape as `X`.
For each pooling window, the gradient flows only to the position that had the maximum value:
$\frac{\partial L}{\partial X_{nchw}} = \sum_{\text{windows containing } (h,w)} \frac{\partial L}{\partial Y_{nch'w'}} \cdot \mathbf{1}[\text{argmax}_{(h'',w'')} X_{nch''w''} = (h,w)]$
Where the indicator function $\mathbf{1}[\cdot]$ is 1 if the position `(h,w)` was the argmax in its corresponding pooling window, and 0 otherwise.
## Algorithmic Strategy: Parallel Gradient Routing
The backward pass requires determining which input positions were selected during the forward pass and routing gradients accordingly.
1. **Grid Scheduling**: The computation can be parallelized over either the input or output tensor elements, depending on the implementation strategy.
2. **Argmax Information**: There are two main approaches to handle the argmax information:
- **Recomputation**: Recompute the argmax during the backward pass by examining each pooling window.
- **Stored Indices**: Use precomputed argmax indices from the forward pass (more memory efficient for multiple backward passes).
3. **Gradient Routing Algorithm** (using recomputation approach):
- **Initialize**: Set all input gradients to zero.
- **For each output position**: Each thread processes one output gradient position `(n, c, h_out, w_out)`.
- **Find Input Window**: Calculate the corresponding input window based on stride and padding.
- **Recompute Argmax**: Find the position with the maximum value in the input window.
- **Route Gradient**: Add the output gradient to the input position that had the maximum value (using atomic operations if necessary).
4. **Memory Access Optimization**: The kernel optimizes for:
- Coalesced access to gradient tensors
- Efficient atomic operations for gradient accumulation
- Minimal redundant computation of argmax positions
## Source Code Organization
- [`maxpool2d_bwd_xdl.cpp`](./maxpool2d_bwd_xdl.cpp): The main example file. It sets up the input tensor, output gradients, pooling parameters, and instantiates the `DeviceMaxpool2dBwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp): The high-level device interface for 2D max pooling backward operations.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp): The grid-wise kernel implementing the gradient routing algorithm.
## Build and Run
### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example
```bash
cd /path/to/composable_kernel/example/49_maxpool2d_bwd
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
```
### Run the Example
```bash
# Run the example with default settings
./maxpool2d_bwd_xdl
# Run with verification, data initialization, and timing
./maxpool2d_bwd_xdl 1 2 1
```
## Computational Characteristics
Max pooling backward has unique characteristics compared to other CNN operations:
- **Sparse Gradient Flow**: Unlike convolution or dense layers where gradients flow to all inputs, max pooling creates sparse gradient patterns where only selected input positions receive gradients.
- **Memory-bound Operation**: The operation is typically memory-bound rather than compute-bound, as it involves reading gradients and writing results with minimal arithmetic.
- **Atomic Operations**: When multiple output positions map to the same input position, atomic operations may be needed to correctly accumulate gradients.
## Relationship to Forward Pass
The backward pass must be consistent with the forward pass implementation:
- The same tie-breaking rules for equal maximum values
- Identical handling of padding and boundary conditions
- Consistent stride and window size interpretation
This ensures that the computed gradients correctly reflect the actual forward pass computation, which is essential for proper gradient-based optimization.