# Elementwise Operation with Permutation Fusion
## Theory
This example demonstrates **elementwise operations fused with tensor permutation**. This pattern is used in deep learning for applying activation functions or scaling while simultaneously reordering tensor dimensions (e.g., NCHW to NHWC).
**Mathematical Formulation:**
- Elementwise: $Z = f(X)$ (unary) or $Z = f(X_0, X_1)$ (binary)
- Permute: $Y_{i_{p_0}, i_{p_1}, \ldots, i_{p_{n-1}}} = Z_{i_0, i_1, \ldots, i_{n-1}}$
- $P = [p_0, p_1, \ldots, p_{n-1}]$ is the permutation pattern; output dimension $k$ takes its extent from input dimension $p_k$
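As a worked instance of the formulation above, take the NCHW to NHWC case with ReLU as $f$. With input indices $(i_0, i_1, i_2, i_3) = (n, c, h, w)$ and $P = [0, 2, 3, 1]$:

$$
Z_{n,c,h,w} = \max(X_{n,c,h,w}, 0), \qquad Y_{n,h,w,c} = Z_{n,c,h,w}
$$

For an input of shape $[32, 128, 56, 56]$ the output therefore has shape $[32, 56, 56, 128]$.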
**Algorithmic Background:**
- The elementwise operation and permutation are fused in a single kernel.
- Intermediate results are kept in registers, not written to global memory.
- Used for layout conversion with activation, attention head reshaping, and more.
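The fusion described above can be illustrated with a small host-side reference. This is a minimal sketch, not CK's kernel: `row_major_strides` and `fused_elementwise_permute` are hypothetical names, the tensors are assumed to be packed, row-major, 4D `float` buffers, and the loop runs on the CPU. Each element is read once, passed through `f`, and written directly to its permuted location, so no intermediate tensor is materialized.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CK): packed row-major strides for a 4D shape.
std::array<std::size_t, 4> row_major_strides(const std::array<std::size_t, 4>& shape)
{
    std::array<std::size_t, 4> strides{};
    std::size_t running = 1;
    for(int d = 3; d >= 0; --d)
    {
        strides[d] = running;
        running *= shape[d];
    }
    return strides;
}

// Hypothetical CPU reference: Y[i_{p_0}, ..., i_{p_3}] = f(X[i_0, ..., i_3]).
// Output dimension k has extent in_shape[perm[k]].
template <typename F>
std::vector<float> fused_elementwise_permute(const std::vector<float>& x,
                                             const std::array<std::size_t, 4>& in_shape,
                                             const std::array<std::size_t, 4>& perm,
                                             F f)
{
    assert(x.size() == in_shape[0] * in_shape[1] * in_shape[2] * in_shape[3]);

    std::array<std::size_t, 4> out_shape{};
    for(std::size_t k = 0; k < 4; ++k)
        out_shape[k] = in_shape[perm[k]];

    const auto in_strides  = row_major_strides(in_shape);
    const auto out_strides = row_major_strides(out_shape);

    std::vector<float> y(x.size());
    std::array<std::size_t, 4> i{}; // input multi-index (i_0, i_1, i_2, i_3)
    for(i[0] = 0; i[0] < in_shape[0]; ++i[0])
        for(i[1] = 0; i[1] < in_shape[1]; ++i[1])
            for(i[2] = 0; i[2] < in_shape[2]; ++i[2])
                for(i[3] = 0; i[3] < in_shape[3]; ++i[3])
                {
                    std::size_t in_off  = 0;
                    std::size_t out_off = 0;
                    for(std::size_t k = 0; k < 4; ++k)
                    {
                        in_off += i[k] * in_strides[k];
                        out_off += i[perm[k]] * out_strides[k]; // output index k is i_{p_k}
                    }
                    // Elementwise op and permuted store in one step.
                    y[out_off] = f(x[in_off]);
                }
    return y;
}
```

Calling it with `in_shape = {N, C, H, W}` and `perm = {0, 2, 3, 1}` yields an NHWC buffer. A GPU kernel would distribute the same index mapping across threads rather than looping.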
## How to Run
### Prerequisites
Before building and running this example, follow the instructions in the main [Build Guide](../../README.md#building-ck).
### Build and run
```bash
cd composable_kernel/example/44_elementwise_permute
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
# Example run (ReLU + NCHW to NHWC)
./elementwise_permute_xdl --input_shape=32,128,56,56 --permutation=0,2,3,1 --operation=relu --verify=1 --time=1
```
## Source Code Structure
### Directory Layout
```
example/44_elementwise_permute/
└── elementwise_permute_xdl.cpp              # Main example: sets up, runs, and verifies elementwise+permute
include/ck/tensor_operation/gpu/device/
└── device_elementwise_permute.hpp           # Device-level API for fused elementwise+permute
include/ck/tensor_operation/gpu/device/impl/
└── device_elementwise_permute_impl.hpp      # Implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_elementwise_permute.hpp         # Grid-level kernel
include/ck/tensor_operation/gpu/element/
└── element_wise_operation.hpp               # Elementwise operation definitions
```
```
### Key Classes and Functions
- **DeviceElementwisePermute** (in `device_elementwise_permute.hpp`):
Device API for fused elementwise and permutation.
- **gridwise_elementwise_permute** (in `gridwise_elementwise_permute.hpp`):
Implements the tiled/blocking elementwise+permute kernel.
- **element_wise_operation** (in `element_wise_operation.hpp`):
Defines elementwise operations (e.g., relu, scale).
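
To make the idea of a pluggable elementwise operation concrete, here is a host-only sketch. It assumes the operation is a small functor that writes its result through an output reference. `ReluOp`, `ScaleOp`, and `apply_elementwise` are hypothetical stand-ins for illustration, not the definitions in `element_wise_operation.hpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for elementwise functors; not CK's definitions.
struct ReluOp
{
    void operator()(float& y, const float& x) const { y = x > 0.0f ? x : 0.0f; }
};

struct ScaleOp
{
    float scale; // multiplier applied to every element
    void operator()(float& y, const float& x) const { y = scale * x; }
};

// Apply any functor of this shape to every element of x, writing into y.
// The operation is a template parameter, so the call can be inlined into the loop.
template <typename ElementwiseOp>
void apply_elementwise(const std::vector<float>& x, std::vector<float>& y, const ElementwiseOp& op)
{
    assert(x.size() == y.size());
    for(std::size_t i = 0; i < x.size(); ++i)
        op(y[i], x[i]);
}
```

Passing the operation as a type rather than a runtime flag lets one loop body serve any operation without a per-element branch.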
This example demonstrates how Composable Kernel supports efficient fusion of elementwise operations and tensor permutation for deep learning and data layout transformations.