# Group Normalization Forward
This example demonstrates the forward pass of Group Normalization (GroupNorm). GroupNorm is a normalization technique that acts as a bridge between Layer Normalization and Instance Normalization. It divides channels into groups and computes the mean and variance for normalization within each group. This makes its performance stable across a wide range of batch sizes, unlike BatchNorm.
## Mathematical Formulation
Given an input tensor `X` with shape `[N, C, H, W]` and a specified number of groups `G`:

The `C` channels are divided into `G` groups, with each group containing `C/G` channels. Normalization is performed independently for each group within each batch item.

For each batch item `n` and each group `g`:
- **Identify Channels:** Identify the set of channels belonging to group `g`. Let this set be `S_g`; its size is `C' = C/G`.
- **Compute Mean:** The mean is calculated across the channels in the group and the spatial dimensions (`H`, `W`):

  $$\mu_{ng} = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}$$

- **Compute Variance:** The variance is calculated across the same dimensions:

  $$\sigma_{ng}^2 = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_{ng})^2$$

- **Normalize:** The input is normalized using the computed mean and variance of its corresponding group. For any channel `c` in group `g`:

  $$\hat{X}_{nchw} = \frac{X_{nchw} - \mu_{ng}}{\sqrt{\sigma_{ng}^2 + \epsilon}}$$

  where `epsilon` is a small constant for numerical stability.
- **Scale and Shift:** The normalized output is scaled by a learnable parameter `gamma` and shifted by a learnable parameter `beta`. Note that `gamma` and `beta` are applied per channel, not per group:

  $$Y_{nchw} = \gamma_c \cdot \hat{X}_{nchw} + \beta_c$$

  Both `gamma` and `beta` are vectors of shape `[C]`.
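The formulation above can be expressed as a short NumPy reference sketch. The function name and the `eps` default are illustrative, not part of the CK API:

```python
import numpy as np

def groupnorm_fwd_ref(x, gamma, beta, num_groups, eps=1e-5):
    """Reference GroupNorm forward pass for an NCHW tensor x."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "C must be divisible by G"
    # Split the channel axis so each group's C/G channels and the spatial
    # dimensions share the reduction axes (2, 3, 4).
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)   # mu_{ng}
    var = xg.var(axis=(2, 3, 4), keepdims=True)     # sigma^2_{ng}
    x_hat = ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # gamma and beta are per-channel vectors of shape [C].
    return gamma.reshape(1, c, 1, 1) * x_hat + beta.reshape(1, c, 1, 1)
```

With `gamma = 1` and `beta = 0`, every `(n, g)` slice of the output has zero mean and (approximately) unit variance.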
## Algorithmic Strategy: Two-Pass Parallel Reduction per Group
The implementation of GroupNorm is a parallel reduction problem, similar to LayerNorm and BatchNorm, but with a different scope for the reduction.
- **Grid Scheduling:** The `N * G` independent normalization problems (one for each batch item and each group) are distributed among the GPU's thread blocks. Each block is assigned one or more `(n, g)` pairs to normalize.
- **Pass 1: Compute Moments (Mean and Variance)**
  - For an assigned `(n, g)` pair, the threads within a block cooperatively read the data for the channels in that group across the spatial dimensions.
  - **Welford's Algorithm:** To compute the mean and variance in a single pass with good numerical stability, Welford's online algorithm is used.
  - **Intra-Block Reduction:** The threads perform a parallel reduction using shared memory to compute the final mean and variance for the `(n, g)` pair.
  - The final mean and variance for each `(n, g)` pair are written to temporary arrays in global memory.
- **Pass 2: Normalize, Scale, and Shift**
  - A second kernel (or a second stage of the same kernel after a grid-wide sync) is launched.
  - Threads read the input data `X` again.
  - For each element `X_nchw`, the thread identifies its group `g`, reads the corresponding mean `mu_ng` and variance `sigma_ng`, and applies the normalization formula.
  - It then reads the per-channel `gamma_c` and `beta_c` values and applies the scale and shift.
  - The final result `Y` is written to global memory.
Composable Kernel encapsulates this two-pass logic into a single, efficient `DeviceGroupnormFwd` operation.
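The moment computation in Pass 1 can be sketched in scalar Python: one function for the per-thread online update and one for the pairwise merge used during the intra-block reduction (CK's actual kernels are templated C++; these names are illustrative):

```python
def welford_update(count, mean, m2, x):
    # Fold one new element x into a running (count, mean, M2) state,
    # where M2 is the sum of squared deviations from the current mean.
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def welford_merge(count_a, mean_a, m2_a, count_b, mean_b, m2_b):
    # Combine two partial states, as in a shared-memory tree reduction.
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return count, mean, m2

# Once the whole group has been reduced, variance = M2 / count.
```

The merge step is what makes the reduction parallelizable: two threads can each accumulate half the group's elements, then combine their partial states into the same result a single sequential pass would produce.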
## Source Code Organization
- `groupnorm_fwd_xdl.cpp`: The main example file. It sets up the input tensor, the `gamma` and `beta` vectors, and the number of groups, and instantiates the `DeviceGroupnormFwd` operation.
- `../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp`: The high-level device interface for the GroupNorm forward pass.
- The implementation internally uses a reduction kernel based on Welford's algorithm to compute the statistics and an elementwise kernel to apply the normalization.
## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed:

```shell
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example

```shell
cd /path/to/composable_kernel/example/42_groupnorm_fwd
mkdir build && cd build
cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..
make -j
```
### Run the Example

```shell
# Run the example with default settings
./groupnorm_fwd_xdl

# Run with verification, data initialization, and timing
./groupnorm_fwd_xdl 1 2 1
```
## Comparison of Normalization Layers
- **BatchNorm:** Normalizes over `(N, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size dependent.
- **LayerNorm:** Normalizes over `(C, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent.
- **InstanceNorm:** Normalizes over `(H, W)`. Learns `gamma` and `beta` per channel `C`. A special case of GroupNorm where `G = C`.
- **GroupNorm:** Normalizes over `(C/G, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent.
GroupNorm's flexibility has made it popular in GANs and in Transformer-based vision models where batch sizes can be small.
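The InstanceNorm relationship in the table can be checked numerically: with `G = C`, each group holds exactly one channel, so GroupNorm collapses to per-channel `(H, W)` normalization. A small NumPy sketch (the helper name is illustrative):

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    # Normalize x to zero mean / unit variance over the given axes.
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 6, 5, 5))
n, c, h, w = x.shape
# GroupNorm with G = C: each group contains a single channel.
gn = normalize(x.reshape(n, c, 1, h, w), axes=(2, 3, 4)).reshape(x.shape)
# InstanceNorm: normalize each (n, c) slice over (H, W).
inst = normalize(x, axes=(2, 3))
assert np.allclose(gn, inst)
```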