
Group Normalization Forward

This example demonstrates the forward pass of Group Normalization (GroupNorm). GroupNorm is a normalization technique that acts as a bridge between Layer Normalization and Instance Normalization. It divides channels into groups and computes the mean and variance for normalization within each group. This makes its performance stable across a wide range of batch sizes, unlike BatchNorm.

Mathematical Formulation

Given an input tensor X with shape [N, C, H, W] and a specified number of groups G: The C channels are divided into G groups, with each group containing C/G channels. The normalization is performed independently for each group within each batch item.

For each batch item n and each group g:

  1. Identify Channels: Identify the set of channels belonging to group g. Let this set be S_g. The size of this set is C' = C/G.

  2. Compute Mean: The mean is calculated across the channels in the group and the spatial dimensions (H, W). \mu_{ng} = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}

  3. Compute Variance: The variance is also calculated across the same dimensions. \sigma_{ng}^2 = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_{ng})^2

  4. Normalize: The input is normalized using the computed mean and variance of its corresponding group. For any channel c in group g: \hat{X}_{nchw} = \frac{X_{nchw} - \mu_{ng}}{\sqrt{\sigma_{ng}^2 + \epsilon}}, where \epsilon is a small constant added for numerical stability.

  5. Scale and Shift: The normalized output is scaled by a learnable parameter gamma and shifted by a learnable parameter beta. Unlike BatchNorm, gamma and beta are applied per-channel, not per-group. Y_{nchw} = \gamma_c \cdot \hat{X}_{nchw} + \beta_c Both gamma and beta are vectors of shape [C].
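The five steps above can be condensed into a short NumPy reference implementation. This is an illustrative sketch of the math only, not the CK kernel; the function name `groupnorm_fwd` and its signature are chosen for this example.

```python
import numpy as np

def groupnorm_fwd(x, gamma, beta, num_groups, eps=1e-5):
    """Reference GroupNorm forward for an NCHW tensor (illustrative only)."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "C must be divisible by G"
    # View the C channels as G groups of C/G channels each, so one (n, g)
    # slice covers the group's channels plus all spatial positions.
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)   # mu_{ng}
    var = xg.var(axis=(2, 3, 4), keepdims=True)     # sigma^2_{ng}
    x_hat = ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # gamma and beta are per-channel vectors of shape [C], broadcast over N, H, W.
    return gamma.reshape(1, c, 1, 1) * x_hat + beta.reshape(1, c, 1, 1)
```

With gamma = 1 and beta = 0, each (n, g) slice of the output has mean 0 and variance approximately 1, which is a convenient sanity check for any GroupNorm implementation.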

Algorithmic Strategy: Two-Pass Parallel Reduction per Group

Like LayerNorm and BatchNorm, GroupNorm is fundamentally a parallel reduction problem; only the scope of the reduction differs.

  1. Grid Scheduling: The N * G independent normalization problems (one for each batch item and each group) are distributed among the GPU's thread blocks. Each block is assigned one or more (n, g) pairs to normalize.

  2. Pass 1: Compute Moments (Mean and Variance)

    • For an assigned (n, g) pair, the threads within a block cooperatively read the data for the channels in that group and the spatial dimensions.
    • Welford's Algorithm: To compute mean and variance in a single pass with good numerical stability, Welford's online algorithm is used.
    • Intra-Block Reduction: The threads perform a parallel reduction using shared memory to compute the final mean and variance for the (n, g) pair.
    • The final mean and variance for each (n, g) pair are written to temporary arrays in global memory.
  3. Pass 2: Normalize, Scale, and Shift

    • A second kernel (or a second stage in the same kernel after a grid-wide sync) is launched.
    • Threads read the input data X again.
    • For each element X_nchw, the thread identifies its group g, reads the corresponding mean mu_ng and variance sigma_ng, and applies the normalization formula.
    • It then reads the per-channel gamma_c and beta_c values and applies the scale and shift.
    • The final result Y is written to global memory.
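The moment computation in Pass 1 relies on Welford's online update plus a merge rule (due to Chan et al.) that lets partial results from different threads or blocks be combined in the intra-block reduction. A minimal sequential sketch, with illustrative function names:

```python
def welford_update(count, mean, m2, new_value):
    # Fold one element into the running (count, mean, M2) state.
    # M2 accumulates the sum of squared deviations from the current mean.
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    return count, mean, m2

def welford_merge(count_a, mean_a, m2_a, count_b, mean_b, m2_b):
    # Combine two partial states, e.g. from two threads in the
    # shared-memory tree reduction. Variance is then m2 / count.
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return count, mean, m2
```

Merging two disjoint partial states yields exactly the same moments as a single sequential pass over all elements, which is what makes the parallel tree reduction valid.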

Composable Kernel encapsulates this two-pass logic into a single, efficient DeviceGroupnormFwd operation.

Source Code Organization

  • groupnorm_fwd_xdl.cpp: The main example file. It sets up the input tensor, gamma and beta vectors, the number of groups, and instantiates the DeviceGroupnormFwd operation.
  • ../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp: The high-level device interface for the GroupNorm forward pass.
  • The implementation internally uses a reduction kernel based on Welford's algorithm to compute the statistics and an elementwise kernel to apply the normalization.

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/42_groupnorm_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./groupnorm_fwd_xdl

# Run with verification, data initialization, and timing
./groupnorm_fwd_xdl 1 2 1

Comparison of Normalization Layers

  • BatchNorm: Normalizes over (N, H, W). Learns gamma and beta per channel C. Batch-size dependent.
  • LayerNorm: Normalizes over (C, H, W). Learns gamma and beta per channel C. Batch-size independent.
  • InstanceNorm: Normalizes over (H, W). Learns gamma and beta per channel C. A special case of GroupNorm where G=C.
  • GroupNorm: Normalizes over (C/G, H, W). Learns gamma and beta per channel C. Batch-size independent.
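The two special cases in the list above can be verified numerically: setting G=1 reproduces LayerNorm's statistics, and G=C reproduces InstanceNorm's. A small NumPy check (normalization only, i.e. gamma = 1 and beta = 0; the helper name is illustrative):

```python
import numpy as np

def norm_over_groups(x, num_groups, eps=1e-5):
    # Normalization-only part of GroupNorm for an NCHW input.
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, -1)
    mean = xg.mean(axis=2, keepdims=True)
    var = xg.var(axis=2, keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6, 3, 3))

# G=1: one group spans all of (C, H, W) -> LayerNorm statistics.
layer = norm_over_groups(x, num_groups=1)
# G=C: each group is a single channel's (H, W) slice -> InstanceNorm statistics.
instance = norm_over_groups(x, num_groups=6)
```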

GroupNorm's flexibility has made it popular in GANs and in Transformer-based vision models where batch sizes can be small.