# 4D Layer Normalization Forward

This example demonstrates the forward pass of **4D Layer Normalization**. This extends the layer normalization operation to 4-dimensional tensors, which is commonly used in computer vision applications where tensors have shape `[N, C, H, W]` and normalization is applied across the channel and spatial dimensions.

## Mathematical Formulation

Given a 4D input tensor `X` with shape `[N, C, H, W]`, 4D layer normalization computes an output tensor `Y` of the same shape. The normalization is performed independently for each batch item across the channel and spatial dimensions.

For each batch item `n` from `0` to `N-1`:
1.  **Compute Mean**: The mean is calculated across the channel (`C`) and spatial (`H`, `W`) dimensions.
    $\mu_n = \frac{1}{C \cdot H \cdot W} \sum_{c=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}$

2.  **Compute Variance**: The variance is calculated across the same dimensions.
    $\sigma_n^2 = \frac{1}{C \cdot H \cdot W} \sum_{c=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_n)^2$

3.  **Normalize**: The input is normalized using the computed mean and variance.
    $\hat{X}_{nchw} = \frac{X_{nchw} - \mu_n}{\sqrt{\sigma_n^2 + \epsilon}}$
    Where `epsilon` is a small constant for numerical stability.

4.  **Scale and Shift**: The normalized output is scaled by learnable parameter `gamma` and shifted by learnable parameter `beta`.
    $Y_{nchw} = \gamma_{chw} \cdot \hat{X}_{nchw} + \beta_{chw}$
    
    Note: The scale and shift parameters can have different granularities:
    - **Per-element**: `gamma` and `beta` have shape `[C, H, W]`
    - **Per-channel**: `gamma` and `beta` have shape `[C]` (broadcast over H, W)
    - **Global**: `gamma` and `beta` are scalars (broadcast over C, H, W)

## Algorithmic Strategy: Batch-Parallel Reduction with Spatial Aggregation

The implementation treats this as a parallel reduction problem with spatial aggregation for each batch item.

1.  **Grid Scheduling**: The `N` batch items are distributed among the GPU's thread blocks. Each block is assigned one or more batch items to normalize.

2.  **Spatial-Channel Reduction**: For each assigned batch item:
    -   **Cooperative Loading**: Threads within a block cooperatively read the 3D slice `X[n, :, :, :]` corresponding to their batch item.
    -   **Welford's Algorithm**: Use Welford's online algorithm to compute mean and variance across all `C × H × W` elements with good numerical stability.
    -   **Intra-Block Reduction**: Threads perform parallel reduction using shared memory to compute the final statistics for each batch item.

3.  **Normalization and Scale/Shift**: 
    -   **Elementwise Processing**: Each thread processes one or more elements of the batch item.
    -   **Apply Normalization**: Use the computed mean and variance to normalize each element.
    -   **Apply Scale/Shift**: Apply the appropriate `gamma` and `beta` values based on the parameterization choice.
    -   **Store Result**: Write the final normalized result to the output tensor.

## Source Code Organization

-   [`layernorm4d_fwd_xdl.cpp`](./layernorm4d_fwd_xdl.cpp): The main example file. It sets up the 4D input tensor, `gamma` and `beta` parameters, and instantiates the `DeviceLayernorm4dFwd` operation.
-   [`../../include/ck/tensor_operation/gpu/device/device_layernorm4d_fwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_layernorm4d_fwd.hpp): The device interface for 4D layer normalization.
-   The underlying implementation uses reduction kernels optimized for the 4D tensor structure with efficient spatial dimension handling.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/63_layernorm4d_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./layernorm4d_fwd_xdl

# Run with verification, data initialization, and timing
./layernorm4d_fwd_xdl 1 2 1
```

## Applications in Computer Vision

4D layer normalization has specific applications in computer vision tasks:

-   **Vision Transformers**: Some vision transformer variants apply layer normalization to 4D feature maps instead of flattening them.
-   **Style Transfer**: Normalizing feature maps across spatial and channel dimensions for style transfer applications.
-   **Feature Normalization**: Normalizing intermediate feature maps in CNNs for improved training stability.
-   **Attention Mechanisms**: Some spatial attention mechanisms benefit from normalized 4D feature representations.
-   **Multi-Scale Processing**: When processing features at different spatial scales, 4D layer normalization can provide consistent normalization.

## Comparison with Other Normalizations for 4D Tensors

| Normalization | Reduction Dimensions | Parameter Shape | Batch Dependence |
|---------------|---------------------|-----------------|------------------|
| **BatchNorm** | `[N, H, W]` per channel | `[C]` | Yes |
| **LayerNorm (2D)** | `[C, H, W]` per sample | `[C, H, W]` or `[C]` | No |
| **LayerNorm (4D)** | `[C, H, W]` per sample | `[C, H, W]` or variants | No |
| **InstanceNorm** | `[H, W]` per channel per sample | `[C]` | No |
| **GroupNorm** | Groups of channels per sample | `[C]` | No |

4D layer normalization provides batch-independent normalization while maintaining the spatial structure of the data, making it valuable for applications where spatial relationships are important.