# Group Normalization Forward
This example demonstrates the forward pass of Group Normalization (GroupNorm). GroupNorm is a normalization technique that acts as a bridge between Layer Normalization and Instance Normalization. It divides channels into groups and computes the mean and variance for normalization within each group. This makes its performance stable across a wide range of batch sizes, unlike BatchNorm.
## Mathematical Formulation
Given an input tensor `X` with shape `[N, C, H, W]` and a specified number of groups `G`, the `C` channels are divided into `G` groups, each containing `C/G` channels. Normalization is performed independently for each group within each batch item.

For each batch item `n` and each group `g`:
1. **Identify Channels**: Identify the set of channels belonging to group `g`. Let this set be `S_g`. The size of this set is `C' = C/G`.

2. **Compute Mean**: The mean is calculated across the channels in the group and the spatial dimensions (`H`, `W`).

   $$\mu_{ng} = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}$$

3. **Compute Variance**: The variance is calculated across the same dimensions.

   $$\sigma_{ng}^2 = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_{ng})^2$$

4. **Normalize**: The input is normalized using the computed mean and variance of its corresponding group. For any channel `c` in group `g`:

   $$\hat{X}_{nchw} = \frac{X_{nchw} - \mu_{ng}}{\sqrt{\sigma_{ng}^2 + \epsilon}}$$

   where `epsilon` is a small constant for numerical stability.

5. **Scale and Shift**: The normalized output is scaled by a learnable parameter `gamma` and shifted by a learnable parameter `beta`. Note that although the statistics are computed per group, `gamma` and `beta` are applied per channel, just as in BatchNorm.

   $$Y_{nchw} = \gamma_c \cdot \hat{X}_{nchw} + \beta_c$$

   Both `gamma` and `beta` are vectors of shape `[C]`.
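As a concrete reference, here is a minimal, deliberately naive host-side sketch of these five steps. The function name `groupnorm_fwd_ref` and the flat NCHW indexing are illustrative choices, not part of the Composable Kernel API:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive host-side reference for GroupNorm forward (NCHW layout).
// x: [N, C, H, W] flattened; gamma, beta: [C]; y: same shape as x.
// Illustrative sketch only -- not the Composable Kernel API.
void groupnorm_fwd_ref(const std::vector<float>& x,
                       const std::vector<float>& gamma,
                       const std::vector<float>& beta,
                       std::vector<float>& y,
                       std::size_t N, std::size_t C, std::size_t H, std::size_t W,
                       std::size_t G, float epsilon = 1e-5f)
{
    const std::size_t Cg = C / G; // channels per group, C' = C/G
    const std::size_t HW = H * W;
    for(std::size_t n = 0; n < N; ++n)
        for(std::size_t g = 0; g < G; ++g)
        {
            // Step 2: accumulate the mean over the group's (C', H, W) elements.
            double sum = 0.0;
            for(std::size_t c = g * Cg; c < (g + 1) * Cg; ++c)
                for(std::size_t i = 0; i < HW; ++i)
                    sum += x[(n * C + c) * HW + i];
            const double mean = sum / (Cg * HW);

            // Step 3: accumulate the variance around the mean.
            double var = 0.0;
            for(std::size_t c = g * Cg; c < (g + 1) * Cg; ++c)
                for(std::size_t i = 0; i < HW; ++i)
                {
                    const double d = x[(n * C + c) * HW + i] - mean;
                    var += d * d;
                }
            var /= (Cg * HW);

            // Steps 4-5: normalize, then scale/shift with per-channel gamma/beta.
            const double inv_std = 1.0 / std::sqrt(var + epsilon);
            for(std::size_t c = g * Cg; c < (g + 1) * Cg; ++c)
                for(std::size_t i = 0; i < HW; ++i)
                {
                    const std::size_t idx = (n * C + c) * HW + i;
                    y[idx] = static_cast<float>(gamma[c] * ((x[idx] - mean) * inv_std) + beta[c]);
                }
        }
}
```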
## Algorithmic Strategy: Two-Pass Parallel Reduction per Group
The implementation of GroupNorm is a parallel reduction problem, similar to LayerNorm and BatchNorm, but with a different scope for the reduction.
- **Grid Scheduling**: The `N * G` independent normalization problems (one for each batch item and each group) are distributed among the GPU's thread blocks. Each block is assigned one or more `(n, g)` pairs to normalize.
- **Pass 1: Compute Moments (Mean and Variance)**
  - For an assigned `(n, g)` pair, the threads within a block cooperatively read the data for the channels in that group and the spatial dimensions.
  - **Welford's Algorithm**: To compute the mean and variance in a single pass with good numerical stability, Welford's online algorithm is used.
  - **Intra-Block Reduction**: The threads perform a parallel reduction using shared memory to compute the final mean and variance for the `(n, g)` pair.
  - The final mean and variance for each `(n, g)` pair are written to temporary arrays in global memory.
- **Pass 2: Normalize, Scale, and Shift**
  - A second kernel (or a second stage in the same kernel after a grid-wide sync) is launched.
  - Threads read the input data `X` again.
  - For each element `X_nchw`, the thread identifies its group `g`, reads the corresponding mean `mu_ng` and variance `sigma_ng`, and applies the normalization formula.
  - It then reads the per-channel `gamma_c` and `beta_c` values and applies the scale and shift.
  - The final result `Y` is written to global memory.
Composable Kernel encapsulates this two-pass logic into a single, efficient `DeviceGroupnormFwd` operation.
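To make the Pass 1 strategy concrete, the sketch below shows Welford's single-pass update together with the pairwise merge rule that an intra-block tree reduction would apply to per-thread partial results. The struct and function names here are hypothetical; CK's actual device kernels implement equivalent recurrences:

```cpp
#include <cstdint>

// Welford accumulator: running count, mean, and sum of squared deviations (M2).
// Hypothetical names for illustration only.
struct WelfordState
{
    int64_t count = 0;
    double mean   = 0.0;
    double m2     = 0.0; // sum of (x - mean)^2 seen so far
};

// Single-pass update: fold one element into the running statistics.
inline void welford_update(WelfordState& s, double x)
{
    s.count += 1;
    const double delta = x - s.mean;
    s.mean += delta / s.count;
    s.m2 += delta * (x - s.mean); // uses the *updated* mean
}

// Merge two partial accumulators (Chan et al.'s parallel variant). This is
// the operation a shared-memory tree reduction applies to combine the
// per-thread partial results into one mean/variance per (n, g) pair.
inline WelfordState welford_merge(const WelfordState& a, const WelfordState& b)
{
    WelfordState out;
    out.count = a.count + b.count;
    if(out.count == 0)
        return out;
    const double delta = b.mean - a.mean;
    out.mean = a.mean + delta * (static_cast<double>(b.count) / out.count);
    out.m2   = a.m2 + b.m2 +
             delta * delta * (static_cast<double>(a.count) * b.count / out.count);
    return out;
}

// Population variance (as in the formulas above) is then m2 / count.
```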
## Source Code Organization
- `groupnorm_fwd_xdl.cpp`: The main example file. It sets up the input tensor, the `gamma` and `beta` vectors, and the number of groups, and instantiates the `DeviceGroupnormFwd` operation.
- `../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp`: The high-level device interface for the GroupNorm forward pass.
- The implementation internally uses a reduction kernel based on Welford's algorithm to compute the statistics and an elementwise kernel to apply the normalization.
## Build and Run
### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example
```bash
cd /path/to/composable_kernel/example/42_groupnorm_fwd
mkdir build && cd build
cmake \
    -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
    ..
make -j
```
### Run the Example
```bash
# Run the example with default settings
./groupnorm_fwd_xdl

# Run with verification, data initialization, and timing
./groupnorm_fwd_xdl 1 2 1
```
## Comparison of Normalization Layers
- **BatchNorm**: Normalizes over `(N, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size dependent.
- **LayerNorm**: Normalizes over `(C, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent.
- **InstanceNorm**: Normalizes over `(H, W)`. Learns `gamma` and `beta` per channel `C`. A special case of GroupNorm where `G = C`.
- **GroupNorm**: Normalizes over `(C/G, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent.
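The two special cases in this table can be checked directly with the hypothetical `groupnorm_fwd_ref` sketch from the math section above:

```cpp
#include <cstddef>
#include <vector>
// Assumes the groupnorm_fwd_ref sketch from the math section is in scope.

int main()
{
    const std::size_t N = 2, C = 8, H = 4, W = 4;
    std::vector<float> x(N * C * H * W, 1.0f), y(x.size());
    std::vector<float> gamma(C, 1.0f), beta(C, 0.0f);

    // G = 1: one group spanning all channels -> normalizes over (C, H, W),
    // matching the LayerNorm row above (with per-channel affine parameters).
    groupnorm_fwd_ref(x, gamma, beta, y, N, C, H, W, /*G=*/1);

    // G = C: one group per channel -> normalizes over (H, W), i.e. InstanceNorm.
    groupnorm_fwd_ref(x, gamma, beta, y, N, C, H, W, /*G=*/C);
    return 0;
}
```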
GroupNorm's flexibility has made it popular in GANs and in Transformer-based vision models where batch sizes can be small.