ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-03-19 22:57:38 +00:00

Vidyasagar Ananthan 92c67a824f [DOCS] Documentation Addition (Readme updates) (#2495)

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

2025-10-16 03:10:57 -07:00

Elementwise Normalization

This example demonstrates a fused elementwise operation followed by normalization. This pattern combines elementwise tensor arithmetic with a normalization operation in a single kernel, which is particularly useful for implementing custom normalization layers or fused activation-normalization blocks.

Mathematical Formulation

The operation performs an elementwise computation followed by a normalization operation.

Elementwise Stage: An elementwise operation is applied to one or more input tensors:
$C_{temp} = f(A, B, \dots)$
where $f$ is a user-defined elementwise function that operates on corresponding elements of the input tensors.
Normalization Stage: The result is then normalized along the specified dimensions.
- Compute Statistics: For each normalization group of $N$ elements, compute the mean and variance:
  $\mu = \frac{1}{N} \sum C_{temp}$
  $\sigma^2 = \frac{1}{N} \sum (C_{temp} - \mu)^2$
- Normalize: Apply the normalization formula, where $\epsilon$ is a small constant for numerical stability:
  $\hat{C} = \frac{C_{temp} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
- Scale and Shift: Apply the learnable parameters $\gamma$ and $\beta$:
  $D = \gamma \cdot \hat{C} + \beta$
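
The two stages above can be sketched as a host-side reference. This is a minimal illustration, not the Composable Kernel implementation: the function name `elementwise_normalize_ref`, the two-input signature, and the choice of one normalization group per row are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host reference: D = gamma * normalize(f(A, B)) + beta.
// A and B are row-major `rows` x `cols`; each row is one normalization group.
template <typename F>
std::vector<float> elementwise_normalize_ref(const std::vector<float>& a,
                                             const std::vector<float>& b,
                                             const std::vector<float>& gamma,
                                             const std::vector<float>& beta,
                                             std::size_t rows,
                                             std::size_t cols,
                                             float epsilon,
                                             F f)
{
    assert(a.size() == rows * cols && b.size() == rows * cols);
    assert(gamma.size() == cols && beta.size() == cols);

    std::vector<float> d(rows * cols);
    std::vector<float> c_temp(cols); // per-group scratch, never a full tensor

    for(std::size_t r = 0; r < rows; ++r)
    {
        // Elementwise stage, accumulating the sum for the mean on the fly.
        float mean = 0.f;
        for(std::size_t j = 0; j < cols; ++j)
        {
            c_temp[j] = f(a[r * cols + j], b[r * cols + j]);
            mean += c_temp[j];
        }
        mean /= static_cast<float>(cols);

        // Biased variance, matching the 1/N formula above.
        float var = 0.f;
        for(std::size_t j = 0; j < cols; ++j)
        {
            const float diff = c_temp[j] - mean;
            var += diff * diff;
        }
        var /= static_cast<float>(cols);

        // Normalize, then scale and shift.
        const float inv_std = 1.f / std::sqrt(var + epsilon);
        for(std::size_t j = 0; j < cols; ++j)
            d[r * cols + j] = gamma[j] * (c_temp[j] - mean) * inv_std + beta[j];
    }
    return d;
}
```

Calling it with `f` set to a lambda such as `[](float x, float y) { return x + y; }` gives an add-then-normalize reference that a device result could be compared against.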

The key optimization is that the intermediate tensor $C_{temp}$ is never written to global memory. The elementwise computation feeds directly into the normalization calculation.

Algorithmic Strategy: Fused Elementwise with Online Normalization

The implementation combines the elementwise computation with a two-pass normalization whose statistics are computed online using Welford's algorithm.

Grid Scheduling: The normalization groups are distributed among thread blocks. Each block handles one or more normalization groups.
Fused Two-Pass Algorithm:
- Pass 1 - Compute Elementwise and Moments:
  - Threads cooperatively load input tensors and apply the elementwise function f.
  - The elementwise results are kept in registers or shared memory.
  - Welford's Algorithm: Threads use Welford's online algorithm to compute the mean and variance of the elementwise results within their normalization group.
  - Intra-Block Reduction: A parallel reduction in shared memory computes the final statistics for the group.
- Pass 2 - Normalize and Store:
  - Using the computed statistics, threads apply the normalization formula to their elementwise results.
  - The final normalized result is written to the output tensor D.

This approach ensures that the elementwise computation is performed only once, and the results are immediately consumed by the normalization process without requiring additional memory bandwidth.
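
The statistics step of Pass 1 can be sketched on the host as follows. This is an illustrative sketch under stated assumptions, not the kernel's code: `Moments`, `welford_update`, `welford_merge`, and `group_moments` are hypothetical names, the threads are emulated by a loop over strided slices, and the partial results are merged sequentially where a real kernel would use a parallel reduction in shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Running statistics for Welford's algorithm (hypothetical type).
struct Moments
{
    float mean        = 0.f;
    float m2          = 0.f; // sum of squared deviations from the mean
    std::size_t count = 0;
};

// Fold one value into the running statistics.
inline void welford_update(Moments& s, float x)
{
    ++s.count;
    const float delta = x - s.mean;
    s.mean += delta / static_cast<float>(s.count);
    s.m2 += delta * (x - s.mean);
}

// Combine two partial results; this is the step a block reduction repeats.
inline Moments welford_merge(const Moments& a, const Moments& b)
{
    if(a.count == 0) return b;
    if(b.count == 0) return a;
    Moments out;
    out.count             = a.count + b.count;
    const float delta     = b.mean - a.mean;
    const float nb_over_n = static_cast<float>(b.count) / static_cast<float>(out.count);
    out.mean              = a.mean + delta * nb_over_n;
    out.m2 = a.m2 + b.m2 + delta * delta * static_cast<float>(a.count) * nb_over_n;
    return out;
}

// Emulate `num_threads` threads, each scanning a strided slice of one group.
inline Moments group_moments(const std::vector<float>& c_temp, std::size_t num_threads)
{
    assert(num_threads > 0);
    std::vector<Moments> partial(num_threads);
    for(std::size_t t = 0; t < num_threads; ++t)
        for(std::size_t i = t; i < c_temp.size(); i += num_threads)
            welford_update(partial[t], c_temp[i]);

    Moments total;
    for(const Moments& p : partial)
        total = welford_merge(total, p);
    return total;
}

// Biased variance (1/N), matching the formulation above.
inline float variance(const Moments& s)
{
    return s.count ? s.m2 / static_cast<float>(s.count) : 0.f;
}
```

Because `welford_merge` handles partial results of any size, it does not matter how the elements are divided among threads or in what order the partial results are combined, which is what makes the per-thread accumulation followed by a reduction possible.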

Source Code Organization

elementwise_normalization_xdl.cpp: The main example file. It sets up the input tensors, defines the elementwise operation and normalization parameters, and instantiates the DeviceElementwiseNormalization operation.
../../include/ck/tensor_operation/gpu/device/device_elementwise_normalization.hpp: The high-level device interface for the fused elementwise normalization operation.
The underlying grid-wise kernel implements the complex fusion of elementwise operations with the two-pass normalization algorithm.

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/45_elementwise_normalization
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./elementwise_normalization_xdl

# Run with verification, data initialization, and timing
./elementwise_normalization_xdl 1 2 1

Applications

This fused operation is valuable for implementing custom normalization layers and optimizing activation-normalization sequences.

Custom Activation-Normalization Blocks: Some architectures use non-standard activation functions followed by normalization. For example, a Swish activation followed by layer normalization can be fused into a single kernel using this pattern.
Residual Connection with Normalization: In some variants of residual networks, the residual addition is immediately followed by normalization. This can be expressed as an elementwise addition (residual) followed by normalization.
Preprocessing Pipelines: In data preprocessing, tensors might need elementwise transformations (e.g., color space conversion) followed by normalization (e.g., standardization). This kernel can fuse these operations.
Research Architectures: Novel normalization techniques often involve custom elementwise operations before the normalization step. This kernel provides a flexible foundation for implementing such research ideas efficiently.
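
As a rough illustration of the first two applications, the elementwise function $f$ could be written as small functors and composed. Everything named here (`ResidualAdd`, `Swish`, `Compose`, `apply_elementwise`) is hypothetical and written for this sketch; the functor types that the library expects may have a different shape.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical binary functor: residual addition.
struct ResidualAdd
{
    float operator()(float x, float residual) const { return x + residual; }
};

// Hypothetical unary functor: Swish, x * sigmoid(x).
struct Swish
{
    float operator()(float x) const { return x / (1.f + std::exp(-x)); }
};

// Apply a unary activation to the result of a binary operation.
template <typename Binary, typename Unary>
struct Compose
{
    Binary binary;
    Unary unary;
    float operator()(float a, float b) const { return unary(binary(a, b)); }
};

// Elementwise stage only; a fused kernel would feed each result straight
// into the normalization statistics instead of storing it.
template <typename F>
std::vector<float> apply_elementwise(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     F f)
{
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for(std::size_t i = 0; i < a.size(); ++i)
        out[i] = f(a[i], b[i]);
    return out;
}
```

Here `Compose<ResidualAdd, Swish>{}` stands for a residual addition followed by a Swish activation, and plain `ResidualAdd{}` covers the residual-then-normalize case.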
