composable_kernel/example/49_maxpool2d_bwd/README.md

# 2D Max Pooling Backward

This example demonstrates the **backward pass of 2D max pooling**. This operation computes the gradient of the loss with respect to the input of a max pooling layer, which is essential for training convolutional neural networks that use max pooling for downsampling.

## Mathematical Formulation

The backward pass of max pooling propagates gradients only to the input positions that contributed to the maximum value in each pooling window.

Given:
- Input tensor `X` with shape `[N, C, H_in, W_in]`
- Output gradients `dL/dY` with shape `[N, C, H_out, W_out]`
- Pooling parameters: window size `(pool_h, pool_w)`, stride `(stride_h, stride_w)`, padding `(pad_h, pad_w)`

The backward pass computes input gradients `dL/dX` with the same shape as `X`.

For each pooling window, the gradient flows only to the position that had the maximum value:
$\frac{\partial L}{\partial X_{nchw}} = \sum_{\text{windows containing } (h,w)} \frac{\partial L}{\partial Y_{nch'w'}} \cdot \mathbf{1}[\text{argmax}_{(h'',w'')} X_{nch''w''} = (h,w)]$

Where the indicator function $\mathbf{1}[\cdot]$ is 1 if the position `(h,w)` was the argmax in its corresponding pooling window, and 0 otherwise.

## Algorithmic Strategy: Parallel Gradient Routing

The backward pass requires determining which input positions were selected during the forward pass and routing gradients accordingly.

1.  **Grid Scheduling**: The computation can be parallelized over either the input or output tensor elements, depending on the implementation strategy.

2.  **Argmax Information**: There are two main approaches to handle the argmax information:
    -   **Recomputation**: Recompute the argmax during the backward pass by examining each pooling window.
    -   **Stored Indices**: Use precomputed argmax indices from the forward pass (more memory efficient for multiple backward passes).

3.  **Gradient Routing Algorithm** (using recomputation approach):
    -   **Initialize**: Set all input gradients to zero.
    -   **For each output position**: Each thread processes one output gradient position `(n, c, h_out, w_out)`.
    -   **Find Input Window**: Calculate the corresponding input window based on stride and padding.
    -   **Recompute Argmax**: Find the position with the maximum value in the input window.
    -   **Route Gradient**: Add the output gradient to the input position that had the maximum value (using atomic operations if necessary).

4.  **Memory Access Optimization**: The kernel optimizes for:
    -   Coalesced access to gradient tensors
    -   Efficient atomic operations for gradient accumulation
    -   Minimal redundant computation of argmax positions

## Source Code Organization

-   [`maxpool2d_bwd_xdl.cpp`](./maxpool2d_bwd_xdl.cpp): The main example file. It sets up the input tensor, output gradients, pooling parameters, and instantiates the `DeviceMaxpool2dBwd` operation.
-   [`../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp): The high-level device interface for 2D max pooling backward operations.
-   [`../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp): The grid-wise kernel implementing the gradient routing algorithm.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/49_maxpool2d_bwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./maxpool2d_bwd_xdl

# Run with verification, data initialization, and timing
./maxpool2d_bwd_xdl 1 2 1
```

## Computational Characteristics

Max pooling backward has unique characteristics compared to other CNN operations:

-   **Sparse Gradient Flow**: Unlike convolution or dense layers where gradients flow to all inputs, max pooling creates sparse gradient patterns where only selected input positions receive gradients.
-   **Memory-bound Operation**: The operation is typically memory-bound rather than compute-bound, as it involves reading gradients and writing results with minimal arithmetic.
-   **Atomic Operations**: When multiple output positions map to the same input position, atomic operations may be needed to correctly accumulate gradients.

## Relationship to Forward Pass

The backward pass must be consistent with the forward pass implementation:
- The same tie-breaking rules for equal maximum values
- Identical handling of padding and boundary conditions
- Consistent stride and window size interpretation

This ensures that the computed gradients correctly reflect the actual forward pass computation, which is essential for proper gradient-based optimization.