composable_kernel/example/63_layernorm4d_fwd
Vidyasagar Ananthan 92c67a824f [DOCS] Documentation Addition (Readme updates) (#2495)

4D Layer Normalization Forward

This example demonstrates the forward pass of 4D layer normalization, which extends the layer normalization operation to 4-dimensional tensors. This layout is common in computer vision, where tensors have shape [N, C, H, W] and normalization is applied across the channel and spatial dimensions.

Mathematical Formulation

Given a 4D input tensor X with shape [N, C, H, W], 4D layer normalization computes an output tensor Y of the same shape. The normalization is performed independently for each batch item across the channel and spatial dimensions.

For each batch item n from 0 to N-1:

  1. Compute Mean: The mean is calculated across the channel (C) and spatial (H, W) dimensions. \mu_n = \frac{1}{C \cdot H \cdot W} \sum_{c=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}

  2. Compute Variance: The variance is calculated across the same dimensions. \sigma_n^2 = \frac{1}{C \cdot H \cdot W} \sum_{c=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_n)^2

  3. Normalize: The input is normalized using the computed mean and variance. \hat{X}_{nchw} = \frac{X_{nchw} - \mu_n}{\sqrt{\sigma_n^2 + \epsilon}} where epsilon is a small constant for numerical stability.

  4. Scale and Shift: The normalized output is scaled by learnable parameter gamma and shifted by learnable parameter beta. Y_{nchw} = \gamma_{chw} \cdot \hat{X}_{nchw} + \beta_{chw}

    Note: The scale and shift parameters can have different granularities:

    • Per-element: gamma and beta have shape [C, H, W]
    • Per-channel: gamma and beta have shape [C] (broadcast over H, W)
    • Global: gamma and beta are scalars (broadcast over C, H, W)
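The four steps above can be sketched as a NumPy reference implementation (a minimal sketch for the per-element [C, H, W] parameterization; it is not the Composable Kernel implementation, and the function name `layernorm4d_fwd` is chosen here for illustration):

```python
import numpy as np

def layernorm4d_fwd(x, gamma, beta, eps=1e-5):
    """Reference 4D layer normalization forward pass.

    x: input of shape [N, C, H, W]
    gamma, beta: scale/shift, broadcastable to [C, H, W]
    """
    # Steps 1-2: per-sample mean and variance over the C, H, W dimensions
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    # Step 3: normalize with epsilon for numerical stability
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 4: scale and shift (broadcasting covers the coarser
    # per-channel [C, 1, 1] and scalar parameterizations as well)
    return gamma * x_hat + beta

x = np.random.randn(2, 3, 4, 4).astype(np.float32)
gamma = np.ones((3, 4, 4), dtype=np.float32)
beta = np.zeros((3, 4, 4), dtype=np.float32)
y = layernorm4d_fwd(x, gamma, beta)
```

With gamma = 1 and beta = 0, each sample of the output has mean approximately 0 and variance approximately 1, which is a convenient sanity check for any implementation.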

Algorithmic Strategy: Batch-Parallel Reduction with Spatial Aggregation

The implementation treats this as a parallel reduction problem with spatial aggregation for each batch item.

  1. Grid Scheduling: The N batch items are distributed among the GPU's thread blocks. Each block is assigned one or more batch items to normalize.

  2. Spatial-Channel Reduction: For each assigned batch item:

    • Cooperative Loading: Threads within a block cooperatively read the 3D slice X[n, :, :, :] corresponding to their batch item.
    • Welford's Algorithm: Use Welford's online algorithm to compute mean and variance across all C × H × W elements with good numerical stability.
    • Intra-Block Reduction: Threads perform parallel reduction using shared memory to compute the final statistics for each batch item.
  3. Normalization and Scale/Shift:

    • Elementwise Processing: Each thread processes one or more elements of the batch item.
    • Apply Normalization: Use the computed mean and variance to normalize each element.
    • Apply Scale/Shift: Apply the appropriate gamma and beta values based on the parameterization choice.
    • Store Result: Write the final normalized result to the output tensor.
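The reduction strategy above can be illustrated on the host: each "thread" runs Welford's online algorithm over its chunk, and the partial statistics are then merged pairwise, which is the same shape of computation as the intra-block shared-memory reduction. This is a CPU sketch of the technique, not the GPU kernel; the merge formula is the standard parallel-variance combination:

```python
import numpy as np

def welford_partial(vals):
    # Sequential Welford over one thread's chunk: returns (count, mean, M2),
    # where M2 is the running sum of squared deviations from the mean.
    count, mean, m2 = 0, 0.0, 0.0
    for v in vals:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return count, mean, m2

def welford_merge(a, b):
    # Combine two partial statistics: the operation performed at each
    # step of the parallel (tree) reduction.
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

# Simulate 4 "threads" cooperatively reducing one batch item's C*H*W elements
data = np.random.randn(4096)
parts = [welford_partial(chunk) for chunk in np.array_split(data, 4)]
total = parts[0]
for p in parts[1:]:
    total = welford_merge(total, p)
n, mean, m2 = total
var = m2 / n  # matches the biased variance used in the formulation above
```

Because the merge is associative, the partial statistics can be combined in any reduction order, which is what makes the algorithm suitable for a shared-memory tree reduction on the GPU.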

Source Code Organization

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/63_layernorm4d_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./layernorm4d_fwd_xdl

# Run with verification, data initialization, and timing
./layernorm4d_fwd_xdl 1 2 1

Applications in Computer Vision

4D layer normalization has specific applications in computer vision tasks:

  • Vision Transformers: Some vision transformer variants apply layer normalization to 4D feature maps instead of flattening them.
  • Style Transfer: Normalizing feature maps across spatial and channel dimensions for style transfer applications.
  • Feature Normalization: Normalizing intermediate feature maps in CNNs for improved training stability.
  • Attention Mechanisms: Some spatial attention mechanisms benefit from normalized 4D feature representations.
  • Multi-Scale Processing: When processing features at different spatial scales, 4D layer normalization can provide consistent normalization.

Comparison with Other Normalizations for 4D Tensors

Normalization   | Reduction Dimensions            | Parameter Shape        | Batch Dependence
BatchNorm       | [N, H, W] per channel           | [C]                    | Yes
LayerNorm (2D)  | [C, H, W] per sample            | [C, H, W] or [C]       | No
LayerNorm (4D)  | [C, H, W] per sample            | [C, H, W] or variants  | No
InstanceNorm    | [H, W] per channel per sample   | [C]                    | No
GroupNorm       | Groups of channels per sample   | [C]                    | No
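The reduction dimensions in the table map directly onto axis arguments in a reference computation (a NumPy sketch showing the mean only; the variance is reduced over the same axes):

```python
import numpy as np

x = np.random.randn(2, 3, 4, 4)  # [N, C, H, W]

# BatchNorm: reduce [N, H, W], one statistic per channel -> shape [C]
batchnorm_mean = x.mean(axis=(0, 2, 3))

# LayerNorm (4D): reduce [C, H, W], one statistic per sample -> shape [N]
layernorm_mean = x.mean(axis=(1, 2, 3))

# InstanceNorm: reduce [H, W], one statistic per (sample, channel) -> shape [N, C]
instnorm_mean = x.mean(axis=(2, 3))
```

Note that only BatchNorm's statistics mix information across the batch axis N, which is the "Batch Dependence" column above.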

4D layer normalization provides batch-independent normalization while maintaining the spatial structure of the data, making it valuable for applications where spatial relationships are important.