# Group Normalization Forward
This example demonstrates the forward pass of **Group Normalization (GroupNorm)**. GroupNorm is a normalization technique that acts as a bridge between Layer Normalization and Instance Normalization. It divides channels into groups and computes the mean and variance for normalization within each group. Because its statistics do not depend on the batch dimension, its behavior is stable across a wide range of batch sizes, unlike BatchNorm.
## Mathematical Formulation
Given an input tensor `X` with shape `[N, C, H, W]` and a specified number of groups `G`:

The `C` channels are divided into `G` groups, with each group containing `C/G` channels. The normalization is performed independently for each group within each batch item.
For each batch item `n` and each group `g`:

1. **Identify Channels**: Identify the set of channels belonging to group `g`. Let this set be $S_g$; its size is $C' = C/G$.

2. **Compute Mean**: The mean is calculated across the channels in the group and the spatial dimensions (`H`, `W`).

   $\mu_{ng} = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}$

3. **Compute Variance**: The variance is also calculated across the same dimensions.

   $\sigma_{ng}^2 = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_{ng})^2$

4. **Normalize**: The input is normalized using the computed mean and variance for its corresponding group. For any channel `c` in group `g`:

   $\hat{X}_{nchw} = \frac{X_{nchw} - \mu_{ng}}{\sqrt{\sigma_{ng}^2 + \epsilon}}$

   where $\epsilon$ is a small constant for numerical stability.

5. **Scale and Shift**: The normalized output is scaled by a learnable parameter `gamma` and shifted by a learnable parameter `beta`. Unlike the group statistics, `gamma` and `beta` are applied per channel, not per group.

   $Y_{nchw} = \gamma_c \cdot \hat{X}_{nchw} + \beta_c$

   Both `gamma` and `beta` are vectors of shape `[C]`.
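The steps above can be cross-checked with a minimal NumPy reference. This is an illustrative sketch, not CK's API: the function name, argument order, and `eps` default are assumptions made for this example.

```python
import numpy as np

def groupnorm_fwd_ref(x, gamma, beta, num_groups, eps=1e-5):
    """Reference GroupNorm forward for an NCHW tensor, following steps 1-5."""
    n, c, h, w = x.shape
    assert c % num_groups == 0
    # Step 1: view the tensor so each group's channels sit on one axis.
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    # Steps 2-3: mean and variance over (C/G, H, W) for every (n, g) pair.
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    # Step 4: normalize with the group statistics.
    xhat = ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # Step 5: per-channel scale and shift.
    return gamma.reshape(1, c, 1, 1) * xhat + beta.reshape(1, c, 1, 1)
```

With `gamma = 1` and `beta = 0`, each `(n, g)` slice of the output has mean approximately 0 and variance approximately 1, which is a convenient property for verifying a device implementation.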
## Algorithmic Strategy: Two-Pass Parallel Reduction per Group
Like LayerNorm and BatchNorm, GroupNorm is at its core a parallel reduction problem; only the scope of the reduction differs.
1. **Grid Scheduling**: The `N * G` independent normalization problems (one for each batch item and each group) are distributed among the GPU's thread blocks. Each block is assigned one or more `(n, g)` pairs to normalize.

2. **Pass 1: Compute Moments (Mean and Variance)**
   - For an assigned `(n, g)` pair, the threads within a block cooperatively read the data for the channels in that group and the spatial dimensions.
   - **Welford's Algorithm**: To compute mean and variance in a single pass with good numerical stability, Welford's online algorithm is used.
   - **Intra-Block Reduction**: The threads perform a parallel reduction through shared memory to compute the final mean and variance for the `(n, g)` pair.
   - The final mean and variance for each `(n, g)` pair are written to temporary arrays in global memory.

3. **Pass 2: Normalize, Scale, and Shift**
   - A second kernel (or a second stage in the same kernel after a grid-wide sync) is launched.
   - Threads read the input data `X` again.
   - For each element `X_nchw`, the thread identifies its group `g`, reads the corresponding mean `mu_ng` and variance `sigma_ng`, and applies the normalization formula.
   - It then reads the per-channel `gamma_c` and `beta_c` values and applies the scale and shift.
   - The final result `Y` is written to global memory.

Composable Kernel encapsulates this two-pass logic into a single, efficient `DeviceGroupnormFwd` operation.
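The core of Pass 1 is Welford's update plus a merge step for combining partial results, which is what makes the intra-block reduction possible. The following is a small Python sketch of both operations under illustrative names and a `(count, mean, M2)` state convention; it is not CK's implementation, which runs this logic per-thread and merges through shared memory.

```python
def welford_update(count, mean, m2, x):
    """Fold one new sample x into the running (count, mean, M2) statistics."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)  # note: uses the *updated* mean
    return count, mean, m2

def welford_merge(a, b):
    """Combine two partial (count, mean, M2) results, as a block-wide
    tree reduction would combine per-thread partials."""
    (na, ma, m2a), (nb, mb, m2b) = a, b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2
```

After the final merge, the mean is available directly and the variance is `M2 / count`. The merge is associative, so partial results can be combined in any reduction-tree order the hardware finds convenient.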
## Source Code Organization
- [`groupnorm_fwd_xdl.cpp`](./groupnorm_fwd_xdl.cpp): The main example file. It sets up the input tensor, `gamma` and `beta` vectors, the number of groups, and instantiates the `DeviceGroupnormFwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp): The high-level device interface for the GroupNorm forward pass.
- The implementation internally uses a reduction kernel based on Welford's algorithm to compute the statistics and an elementwise kernel to apply the normalization.
## Build and Run
### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/42_groupnorm_fwd
mkdir build && cd build

cmake \
    -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
    ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./groupnorm_fwd_xdl

# Run with verification, data initialization, and timing
./groupnorm_fwd_xdl 1 2 1
```
## Comparison of Normalization Layers
- **BatchNorm**: Normalizes over `(N, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size dependent.
- **LayerNorm**: Normalizes over `(C, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent.
- **InstanceNorm**: Normalizes over `(H, W)`. Learns `gamma` and `beta` per channel `C`. A special case of GroupNorm where `G = C`.
- **GroupNorm**: Normalizes over `(C/G, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent. LayerNorm is the special case `G = 1`.
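The special cases in this comparison follow directly from the reduction scopes, and can be checked numerically. The snippet below is a self-contained NumPy sketch (the helper name is illustrative) showing that group statistics with `G = C` reduce over `(H, W)` like InstanceNorm, and with `G = 1` over `(C, H, W)` like LayerNorm.

```python
import numpy as np

def group_stats(x, g):
    """Per-(batch, group) mean and variance for an NCHW tensor."""
    n, c, h, w = x.shape
    xg = x.reshape(n, g, c // g, h, w)
    return xg.mean(axis=(2, 3, 4)), xg.var(axis=(2, 3, 4))

x = np.random.default_rng(0).random((2, 6, 3, 3))

# G = C: each group is one channel -> InstanceNorm statistics over (H, W).
m, v = group_stats(x, 6)
assert np.allclose(m, x.mean(axis=(2, 3)))

# G = 1: one group spans all channels -> LayerNorm statistics over (C, H, W).
m, v = group_stats(x, 1)
assert np.allclose(m[:, 0], x.mean(axis=(1, 2, 3)))
```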
GroupNorm's flexibility has made it popular in GANs and in Transformer-based vision models where batch sizes can be small.