# 2D Max Pooling Backward

This example demonstrates the **backward pass of 2D max pooling**. This operation computes the gradient of the loss with respect to the input of a max pooling layer, which is essential for training convolutional neural networks that use max pooling for downsampling.

## Mathematical Formulation

The backward pass of max pooling propagates gradients only to the input positions that held the maximum value in each pooling window.

Given:
- Input tensor `X` with shape `[N, C, H_in, W_in]`
- Output gradients `dL/dY` with shape `[N, C, H_out, W_out]`
- Pooling parameters: window size `(pool_h, pool_w)`, stride `(stride_h, stride_w)`, padding `(pad_h, pad_w)`

The backward pass computes input gradients `dL/dX` with the same shape as `X`.

For each pooling window, the gradient flows only to the position that had the maximum value:

$\frac{\partial L}{\partial X_{nchw}} = \sum_{(h',w') \,:\, (h,w) \in \mathcal{W}(h',w')} \frac{\partial L}{\partial Y_{nch'w'}} \cdot \mathbf{1}\left[\operatorname{argmax}_{(h'',w'') \in \mathcal{W}(h',w')} X_{nch''w''} = (h,w)\right]$

Here $\mathcal{W}(h',w')$ denotes the set of input positions covered by the pooling window that produces output position `(h', w')`. The indicator function $\mathbf{1}[\cdot]$ is 1 if the position `(h,w)` was the argmax of that window, and 0 otherwise. When windows overlap (stride smaller than the window size), one input position can be the argmax of several windows, so its gradient is the sum of the corresponding output gradients.
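
To make the formula concrete, here is a minimal CPU sketch in plain Python for a single `(n, c)` plane without padding. It is an illustration only, not Composable Kernel's implementation: the function name `maxpool2d_backward_naive` is hypothetical, and keeping the first maximum on ties is an assumption.

```python
def maxpool2d_backward_naive(x, dy, pool, stride):
    """Route each dy[i][j] to the argmax of its window in x (one plane, no padding)."""
    pool_h, pool_w = pool
    stride_h, stride_w = stride
    dx = [[0.0] * len(x[0]) for _ in x]
    for i, dy_row in enumerate(dy):
        for j, g in enumerate(dy_row):
            h0, w0 = i * stride_h, j * stride_w  # top-left corner of the window
            # Row-major scan; strict ">" keeps the first maximum on ties (assumption).
            best_h, best_w = h0, w0
            for h in range(h0, h0 + pool_h):
                for w in range(w0, w0 + pool_w):
                    if x[h][w] > x[best_h][best_w]:
                        best_h, best_w = h, w
            # The indicator in the formula selects exactly this one position.
            dx[best_h][best_w] += g
    return dx


x = [[1.0, 2.0, 3.0],
     [4.0, 9.0, 5.0],
     [6.0, 7.0, 8.0]]
dy = [[1.0, 2.0],
      [3.0, 4.0]]
dx = maxpool2d_backward_naive(x, dy, pool=(2, 2), stride=(1, 1))
```

With a 2x2 window and stride 1, all four windows contain the center element 9, so `dx[1][1]` accumulates `1 + 2 + 3 + 4 = 10` and every other entry stays zero.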

## Algorithmic Strategy: Parallel Gradient Routing

The backward pass requires determining which input positions were selected during the forward pass and routing gradients accordingly.

1. **Grid Scheduling**: The computation can be parallelized over either the input or output tensor elements, depending on the implementation strategy. Parallelizing over outputs gives a scatter that may need atomic adds; parallelizing over inputs gives a gather in which each thread sums the contributions of the windows that cover its position.

2. **Argmax Information**: There are two main approaches to handle the argmax information:
   - **Recomputation**: Recompute the argmax during the backward pass by examining each pooling window.
   - **Stored Indices**: Use precomputed argmax indices from the forward pass. This costs extra memory for the index tensor but avoids re-reading the input and repeating the window search.

3. **Gradient Routing Algorithm** (using recomputation approach):
   - **Initialize**: Set all input gradients to zero.
   - **For each output position**: Each thread processes one output gradient position `(n, c, h_out, w_out)`.
   - **Find Input Window**: Calculate the corresponding input window based on stride and padding.
   - **Recompute Argmax**: Find the position with the maximum value in the input window.
   - **Route Gradient**: Add the output gradient to the input position that had the maximum value. Atomic adds are needed when windows overlap, since several threads may then target the same input position.

4. **Memory Access Optimization**: The kernel optimizes for:
   - Coalesced access to gradient tensors
   - Efficient atomic operations for gradient accumulation
   - Minimal redundant computation of argmax positions
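
As a complement to the recomputation steps listed under item 3, the stored-indices alternative from item 2 can be sketched on the CPU as follows. This is plain Python over one `(n, c)` plane, not the library's kernel. The function names are hypothetical, and three conventions are assumed: floor-mode output sizes without dilation, padded cells that never win the maximum, and first-maximum tie-breaking. The serial loops stand in for GPU threads.

```python
NEG_INF = float("-inf")


def out_size(in_size, pool, stride, pad):
    # Floor-mode output length without dilation (assumed convention).
    return (in_size + 2 * pad - pool) // stride + 1


def maxpool2d_forward_with_indices(x, pool, stride, pad):
    """Forward pass over one plane; also records the flat argmax of each window."""
    h_in, w_in = len(x), len(x[0])
    h_out = out_size(h_in, pool[0], stride[0], pad[0])
    w_out = out_size(w_in, pool[1], stride[1], pad[1])
    y = [[NEG_INF] * w_out for _ in range(h_out)]
    idx = [[-1] * w_out for _ in range(h_out)]
    for i in range(h_out):
        for j in range(w_out):
            h0 = i * stride[0] - pad[0]
            w0 = j * stride[1] - pad[1]
            # Clamp the window to the tensor so padded cells are never selected.
            for h in range(max(h0, 0), min(h0 + pool[0], h_in)):
                for w in range(max(w0, 0), min(w0 + pool[1], w_in)):
                    if x[h][w] > y[i][j]:
                        y[i][j] = x[h][w]
                        idx[i][j] = h * w_in + w
    return y, idx


def maxpool2d_backward_from_indices(dy, idx, h_in, w_in):
    """Scatter each output gradient to the input position recorded in idx."""
    dx = [[0.0] * w_in for _ in range(h_in)]       # Initialize to zero
    for i, dy_row in enumerate(dy):                # one "thread" per output position
        for j, g in enumerate(dy_row):
            if idx[i][j] < 0:                      # window lay entirely in padding
                continue
            h, w = divmod(idx[i][j], w_in)
            dx[h][w] += g                          # would be an atomic add on a GPU
    return dx


x = [[float(4 * h + w + 1) for w in range(4)] for h in range(4)]
y, idx = maxpool2d_forward_with_indices(x, pool=(3, 3), stride=(2, 2), pad=(1, 1))
dx = maxpool2d_backward_from_indices([[1.0, 2.0], [3.0, 4.0]], idx, 4, 4)
```

The backward function never touches `x`: all it needs is `dy` and the index tensor, which is the trade-off against recomputation.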

## Source Code Organization

- [`maxpool2d_bwd_xdl.cpp`](./maxpool2d_bwd_xdl.cpp): The main example file. It sets up the input tensor, output gradients, pooling parameters, and instantiates the `DeviceMaxpool2dBwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp): The high-level device interface for 2D max pooling backward operations.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp): The grid-wise kernel implementing the gradient routing algorithm.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/49_maxpool2d_bwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./maxpool2d_bwd_xdl

# Run with verification, data initialization, and timing
./maxpool2d_bwd_xdl 1 2 1
```

## Computational Characteristics

Max pooling backward has unique characteristics compared to other CNN operations:

- **Sparse Gradient Flow**: Unlike convolution or dense layers where gradients flow to all inputs, max pooling creates sparse gradient patterns where only selected input positions receive gradients.
- **Memory-bound Operation**: The operation is typically memory-bound rather than compute-bound, as it involves reading gradients and writing results with minimal arithmetic.
- **Atomic Operations**: When multiple output positions map to the same input position, atomic operations may be needed to correctly accumulate gradients.
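
Two of these properties can be estimated from the pooling parameters alone. The sketch below uses hypothetical helper names and assumes floor-mode output sizes without dilation. It checks whether neighbouring windows overlap, which is the case where a scatter-style kernel must accumulate, and bounds the fraction of `dL/dX` entries that can be non-zero. The bound holds because each output position routes its gradient to exactly one input position.

```python
def out_size(in_size, pool, stride, pad):
    # Floor-mode output length without dilation (assumed convention).
    return (in_size + 2 * pad - pool) // stride + 1


def needs_accumulation(pool, stride):
    """True if neighbouring windows overlap along either axis (stride < window)."""
    return stride[0] < pool[0] or stride[1] < pool[1]


def max_gradient_density(in_hw, pool, stride, pad):
    """Upper bound on the fraction of dX entries per plane that can be non-zero."""
    h_out = out_size(in_hw[0], pool[0], stride[0], pad[0])
    w_out = out_size(in_hw[1], pool[1], stride[1], pad[1])
    return min(1.0, (h_out * w_out) / (in_hw[0] * in_hw[1]))


# Non-overlapping 2x2/stride-2 pooling: plain stores suffice in a scatter kernel.
no_overlap = needs_accumulation(pool=(2, 2), stride=(2, 2))
# Overlapping 3x3/stride-2 pooling: several outputs can hit the same input.
overlap = needs_accumulation(pool=(3, 3), stride=(2, 2))
density = max_gradient_density((224, 224), pool=(2, 2), stride=(2, 2), pad=(0, 0))
```

For a 224x224 plane with 2x2 windows and stride 2, at most a quarter of the input gradient entries can be non-zero.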

## Relationship to Forward Pass

The backward pass must be consistent with the forward pass implementation:

- The same tie-breaking rules for equal maximum values
- Identical handling of padding and boundary conditions
- Consistent stride and window size interpretation

This ensures that the computed gradients correctly reflect the actual forward pass computation, which is essential for proper gradient-based optimization.
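
The tie-breaking point is easy to see on a single window. In the sketch below the two argmax helpers are hypothetical and stand for two possible conventions; which one a given forward kernel uses is not specified here. Both pick a valid maximum, yet they send the gradient to different input positions.

```python
def argmax_first(values):
    """Index of the first maximum (strict comparison)."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:
            best = i
    return best


def argmax_last(values):
    """Index of the last maximum (non-strict comparison)."""
    best = 0
    for i, v in enumerate(values):
        if v >= values[best]:
            best = i
    return best


def route(window, grad, argmax_fn):
    """Send grad to the position chosen by argmax_fn; all other entries stay zero."""
    dx = [0.0] * len(window)
    dx[argmax_fn(window)] += grad
    return dx


window = [5.0, 7.0, 7.0, 1.0]  # flattened 2x2 window with a tied maximum
dx_first = route(window, 1.0, argmax_first)
dx_last = route(window, 1.0, argmax_last)
```

Here `dx_first` is `[0.0, 1.0, 0.0, 0.0]` and `dx_last` is `[0.0, 0.0, 1.0, 0.0]`. If the forward pass recorded one convention and the backward pass recomputed with the other, the gradient would land on an element the forward pass never selected.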