# 3D Average Pooling Backward

This example demonstrates the **backward pass of 3D average pooling**. This operation computes the gradient of the loss with respect to the input of a 3D average pooling layer, which is essential for training 3D convolutional neural networks used in video analysis, medical imaging, and volumetric data processing.

## Mathematical Formulation

The backward pass of 3D average pooling distributes the output gradients uniformly across all input positions that contributed to each pooling window.

Given:
- Input tensor `X` with shape `[N, C, D_in, H_in, W_in]`
- Output gradients `dL/dY` with shape `[N, C, D_out, H_out, W_out]`
- Pooling parameters: window size `(pool_d, pool_h, pool_w)`, stride `(stride_d, stride_h, stride_w)`, padding `(pad_d, pad_h, pad_w)`

The backward pass computes input gradients `dL/dX` with the same shape as `X`.
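
The relation between `D_in` and `D_out` (and likewise for height and width) is not spelled out above. A minimal sketch, assuming the common floor-rounding convention with no dilation; `pooled_length` is a hypothetical helper name, not part of Composable Kernel, and the example's own shape computation may differ:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (assumption: floor rounding, no dilation).
// Returns the output extent along one spatial axis.
inline std::int64_t pooled_length(std::int64_t in,
                                  std::int64_t pool,
                                  std::int64_t stride,
                                  std::int64_t pad)
{
    // The padded input must hold at least one full window.
    assert(pool > 0 && stride > 0 && pad >= 0 && in + 2 * pad >= pool);
    // All operands are non-negative, so integer division rounds down.
    return (in + 2 * pad - pool) / stride + 1;
}
```

Applying the helper once per axis gives `D_out`, `H_out`, and `W_out`, which together fix the shape of `dL/dY`.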

Each input element accumulates an equal share of the gradient from every pooling window that covers it:
$\frac{\partial L}{\partial X_{ncdhw}} = \sum_{(d',h',w') \,:\, (d,h,w) \in W_{d'h'w'}} \frac{1}{|W_{d'h'w'}|} \cdot \frac{\partial L}{\partial Y_{ncd'h'w'}}$

Where `W_{d'h'w'}` is the pooling window of output position `(d',h',w')`, `|W_{d'h'w'}|` is its effective size (accounting for padding and boundaries), and the sum runs over all output positions whose pooling windows include the input position `(d,h,w)`.
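
As a worked example of the formula, consider a hypothetical 1D analogue with three inputs, window size 2, stride 1, and no padding. There are two outputs, every window contains two elements, window 0 covers $X_0, X_1$, and window 1 covers $X_1, X_2$:

$\frac{\partial L}{\partial X_0} = \frac{1}{2}\frac{\partial L}{\partial Y_0}, \quad \frac{\partial L}{\partial X_1} = \frac{1}{2}\left(\frac{\partial L}{\partial Y_0} + \frac{\partial L}{\partial Y_1}\right), \quad \frac{\partial L}{\partial X_2} = \frac{1}{2}\frac{\partial L}{\partial Y_1}$

The middle element belongs to both windows, so it receives two contributions. The 3D case works the same way along each of the three spatial axes.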

## Algorithmic Strategy: Parallel Gradient Distribution

The backward pass distributes gradients from output positions to all input positions that contributed to each pooling window.

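1.  **Grid Scheduling**: The computation can be parallelized over input elements (each thread gathers from every window that covers its element and writes one result, so no atomics are needed) or over output elements (each thread scatters its gradient into one window, as described next).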

2.  **Gradient Distribution Algorithm** (output-centric approach):
    -   **Initialize**: Set all input gradients to zero.
    -   **For each output position**: Each thread processes one output gradient position `(n, c, d_out, h_out, w_out)`.
    -   **Calculate Input Window**: Determine the 3D input window that contributed to this output position.
    -   **Effective Window Size**: Calculate the actual number of input elements in the window (accounting for padding and boundaries).
    -   **Distribute Gradient**: Add `grad_output / window_size` to each input position in the window. When windows overlap (stride smaller than the window size), several threads update the same input element, so these additions must be atomic.

3.  **Boundary Handling**: Careful handling of:
    -   **Padding**: Input positions outside the valid range should not receive gradients
    -   **Partial Windows**: Windows at boundaries may have fewer than `pool_d × pool_h × pool_w` elements
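    -   **Edge Cases**: Windows with no valid input elements (for example, lying entirely in padding) must be skipped to avoid dividing by zero; invalid configurations such as a zero stride should be rejected up front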

4.  **Memory Access Optimization**:
    -   Coalesced reading from output gradients
    -   Efficient atomic operations for gradient accumulation
    -   Minimized redundant boundary checks
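
The output-centric steps above can be sketched as a sequential CPU reference for a single `(n, c)` volume. This is an illustration under stated assumptions, not the Composable Kernel kernel: `Dim3`, `Pool3dParams`, `flat`, and `avgpool3d_bwd_ref` are hypothetical names, and the divisor is assumed to count only in-bounds elements (padding excluded).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Dim3 = std::array<int, 3>; // (d, h, w)

// Hypothetical parameter bundle for this sketch (not a Composable Kernel type).
struct Pool3dParams
{
    Dim3 pool;   // window size
    Dim3 stride; // stride
    Dim3 pad;    // padding on the low side of each axis
};

// Row-major offset into a [D, H, W] volume.
inline std::size_t flat(const Dim3& shape, int d, int h, int w)
{
    return (static_cast<std::size_t>(d) * shape[1] + h) * shape[2] + w;
}

// dy has shape `out`; the returned dx has shape `in`.
std::vector<float> avgpool3d_bwd_ref(const std::vector<float>& dy,
                                     const Dim3& in,
                                     const Dim3& out,
                                     const Pool3dParams& p)
{
    assert(dy.size() == static_cast<std::size_t>(out[0]) * out[1] * out[2]);
    // Initialize: all input gradients start at zero.
    std::vector<float> dx(static_cast<std::size_t>(in[0]) * in[1] * in[2], 0.0f);

    for(int od = 0; od < out[0]; ++od)
        for(int oh = 0; oh < out[1]; ++oh)
            for(int ow = 0; ow < out[2]; ++ow)
            {
                const Dim3 o = {od, oh, ow};
                Dim3 lo{}, hi{};
                long count = 1;
                for(int a = 0; a < 3; ++a)
                {
                    // Clamp the window [start, start + pool) to the valid input range.
                    const int start = o[a] * p.stride[a] - p.pad[a];
                    lo[a]           = std::max(start, 0);
                    hi[a]           = std::min(start + p.pool[a], in[a]);
                    count *= std::max(hi[a] - lo[a], 0);
                }
                if(count == 0)
                    continue; // window lies entirely in padding

                const float g = dy[flat(out, od, oh, ow)] / static_cast<float>(count);
                for(int d = lo[0]; d < hi[0]; ++d)
                    for(int h = lo[1]; h < hi[1]; ++h)
                        for(int w = lo[2]; w < hi[2]; ++w)
                            dx[flat(in, d, h, w)] += g; // a parallel version needs an atomic add here
            }
    return dx;
}
```

In a GPU version, the three outer loops would map to threads and the `+=` would become an atomic add whenever windows can overlap. If the divisor is meant to include padded positions, `count` would instead be the full product of the window sizes.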

## Source Code Organization

-   [`avgpool3d_bwd_xdl.cpp`](./avgpool3d_bwd_xdl.cpp): The main example file. It sets up the input tensor, output gradients, pooling parameters, and instantiates the `DeviceAvgpool3dBwd` operation.
-   [`../../include/ck/tensor_operation/gpu/device/device_avgpool3d_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_avgpool3d_bwd.hpp): The high-level device interface for 3D average pooling backward operations.
-   [`../../include/ck/tensor_operation/gpu/grid/gridwise_avgpool3d_bwd.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_avgpool3d_bwd.hpp): The grid-wise kernel implementing the gradient distribution algorithm.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/51_avgpool3d_bwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./avgpool3d_bwd_xdl

# Run with verification (1), data initialization method (2), and timing (1)
./avgpool3d_bwd_xdl 1 2 1
```

## Comparison with Max Pooling Backward

3D average pooling backward differs from max pooling backward mainly in how each output gradient is routed back to the input:

| Aspect | Max Pooling | Average Pooling |
|--------|-------------|-----------------|
| **Gradient Flow** | Sparse (only to argmax positions) | Dense (to all window positions) |
| **Distribution** | Single position per window | Uniform across window |
| **Computation** | Requires argmax information | Simple arithmetic division |
| **Memory Pattern** | Irregular write pattern | Regular, predictable pattern |
| **Atomic Operations** | Needed when overlapping windows can share an argmax | Needed for accumulation when windows overlap |
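
The first three rows of the table can be illustrated with a small 1D sketch, assuming non-overlapping windows (stride equal to the window size) and no padding. The function names are hypothetical. A real max pooling backward would normally reuse indices saved by the forward pass; here the argmax is recomputed from `x` to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Max pooling backward: needs the forward input to locate the argmax,
// then routes the whole gradient to that single position (sparse).
// Requires x.size() == dy.size() * pool.
std::vector<float> maxpool1d_bwd(const std::vector<float>& x,
                                 const std::vector<float>& dy,
                                 std::size_t pool)
{
    assert(pool > 0 && x.size() == dy.size() * pool);
    std::vector<float> dx(x.size(), 0.0f);
    for(std::size_t o = 0; o < dy.size(); ++o)
    {
        std::size_t arg = o * pool;
        for(std::size_t i = o * pool + 1; i < (o + 1) * pool; ++i)
            if(x[i] > x[arg])
                arg = i;
        dx[arg] += dy[o];
    }
    return dx;
}

// Average pooling backward: needs only the input length, and gives every
// position in the window an equal share (dense).
std::vector<float> avgpool1d_bwd(std::size_t in_len,
                                 const std::vector<float>& dy,
                                 std::size_t pool)
{
    assert(pool > 0 && in_len == dy.size() * pool);
    std::vector<float> dx(in_len, 0.0f);
    for(std::size_t o = 0; o < dy.size(); ++o)
        for(std::size_t i = o * pool; i < (o + 1) * pool; ++i)
            dx[i] += dy[o] / static_cast<float>(pool);
    return dx;
}
```

For input `{1, 3, 2, 0}`, output gradients `{4, 6}`, and a window of 2, the max variant produces `{0, 4, 6, 0}` while the average variant produces `{2, 2, 3, 3}`.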

## Applications in 3D Deep Learning

3D average pooling backward is essential for training models that process volumetric data:

-   **Video Understanding**: 3D CNNs for action recognition, video classification, and temporal modeling
-   **Medical Imaging**: 3D segmentation and classification of CT scans, MRI, and other volumetric medical data
-   **3D Object Recognition**: Processing 3D point clouds, voxel grids, and depth data
-   **Scientific Computing**: Climate modeling, fluid dynamics, and other physics simulations
-   **Augmented Reality**: 3D scene understanding and object tracking in real-time applications