Client Example: Parallel Reduction (NHWC)
Theory
This client example demonstrates parallel reduction operations over NHWC tensors. Reduction is a fundamental operation in deep learning for computing statistics (such as batch mean/variance), loss aggregation, and normalization.
Mathematical Formulation:
Given a tensor X[N, H, W, C] and a reduction axis (e.g., channel C):
- Sum: Y_{n,h,w} = \sum_c X_{n,h,w,c}
- Max: Y_{n,h,w} = \max_c X_{n,h,w,c}
- Mean: Y_{n,h,w} = \frac{1}{C} \sum_c X_{n,h,w,c}
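As a host-side reference for the three formulas above (independent of the Composable Kernel device API, which the example itself uses), a minimal CPU sketch of channel reduction over an NHWC tensor might look like this; the function name and signature are illustrative, not part of CK:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Reduce an NHWC tensor over the channel axis C, producing one value
// per (n, h, w) position. Illustrative only; CK performs this on the GPU.
enum class ReduceOp { Sum, Max, Mean };

std::vector<float> reduce_nhwc_c(const std::vector<float>& x,
                                 int N, int H, int W, int C, ReduceOp op)
{
    std::vector<float> y(static_cast<std::size_t>(N) * H * W);
    for(int n = 0; n < N; ++n)
        for(int h = 0; h < H; ++h)
            for(int w = 0; w < W; ++w)
            {
                // Channels are contiguous in NHWC layout.
                const float* row =
                    &x[((static_cast<std::size_t>(n) * H + h) * W + w) * C];
                float acc = (op == ReduceOp::Max)
                                ? -std::numeric_limits<float>::infinity()
                                : 0.0f;
                for(int c = 0; c < C; ++c)
                    acc = (op == ReduceOp::Max) ? std::max(acc, row[c])
                                                : acc + row[c];
                if(op == ReduceOp::Mean)
                    acc /= static_cast<float>(C);
                y[(static_cast<std::size_t>(n) * H + h) * W + w] = acc;
            }
    return y;
}
```

Such a naive loop also serves as the verification reference when checking the GPU result.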
Algorithmic Background:
- Reductions are implemented using parallel tree or segmented reduction algorithms.
- Efficient reductions require careful memory access patterns (coalesced loads), synchronization between threads combining partial results, and sometimes numerically stable accumulation schemes.
How to Run
Prerequisites
Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.
Build and run
```bash
cd composable_kernel/client_example/26_reduce
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
# Example run (reduce over channel dimension)
./reduce_nhwc_c
```
Source Code Structure
Directory Layout
```
client_example/26_reduce/
├── reduce_nhwc_c.cpp  # Main client example: reduction over NHWC tensors (channel axis)
└── CMakeLists.txt     # Build configuration for the example
```
Key Functions
- main() (in reduce_nhwc_c.cpp): Sets up input tensors, configures reduction parameters, launches the reduction kernel, and verifies the result.
- Reduction kernel invocation: Uses the Composable Kernel device API to launch the reduction operation.
Additional Details
- Supports sum, max, mean, and other reductions over NHWC tensors.
- Example parameters can be adjusted in the source for different workloads.
Related Examples
- 12_reduce: Parallel reduction in the main example directory