mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-15 10:37:44 +00:00
60 lines
2.5 KiB
Markdown
60 lines
2.5 KiB
Markdown
# Elementwise Operation with Permutation Fusion
|
|
|
|
## Theory
|
|
|
|
This example demonstrates **elementwise operations fused with tensor permutation**. This pattern is used in deep learning for applying activation functions or scaling while simultaneously reordering tensor dimensions (e.g., NCHW to NHWC).
|
|
|
|
**Mathematical Formulation:**
|
|
- Elementwise: $Z = f(X)$ or $Z = f(X, Y)$
|
|
- Permute: $Y_{i_{p_0}, i_{p_1}, ..., i_{p_{n-1}}} = Z_{i_0, i_1, ..., i_{n-1}}$
|
|
- $P = [p_0, p_1, ..., p_{n-1}]$ is the permutation pattern
|
|
|
|
**Algorithmic Background:**
|
|
- The elementwise operation and permutation are fused in a single kernel.
|
|
- Intermediate results are kept in registers, not written to global memory.
|
|
- Used for layout conversion with activation, attention head reshaping, and more.
|
|
|
|
## How to Run
|
|
|
|
### Prerequisites
|
|
|
|
Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
|
|
|
|
### Build and run
|
|
```bash
|
|
cd composable_kernel/example/44_elementwise_permute
|
|
mkdir build && cd build
|
|
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
|
|
make -j
|
|
|
|
# Example run (ReLU + NCHW to NHWC)
|
|
./elementwise_permute_xdl --input_shape=32,128,56,56 --permutation=0,2,3,1 --operation=relu --verify=1 --time=1
|
|
```
|
|
|
|
## Source Code Structure
|
|
|
|
### Directory Layout
|
|
```
|
|
example/44_elementwise_permute/
|
|
├── elementwise_permute_xdl.cpp # Main example: sets up, runs, and verifies elementwise+permute
|
|
include/ck/tensor_operation/gpu/device/
|
|
│ └── device_elementwise_permute.hpp # Device-level API for fused elementwise+permute
|
|
include/ck/tensor_operation/gpu/device/impl/
|
|
│ └── device_elementwise_permute_impl.hpp # Implementation
|
|
include/ck/tensor_operation/gpu/grid/
|
|
│ └── gridwise_elementwise_permute.hpp # Grid-level kernel
|
|
include/ck/tensor_operation/gpu/element/
|
|
└── element_wise_operation.hpp # Elementwise operation definitions
|
|
```
|
|
|
|
### Key Classes and Functions
|
|
|
|
- **DeviceElementwisePermute** (in `device_elementwise_permute.hpp`):
|
|
Device API for fused elementwise and permutation.
|
|
- **gridwise_elementwise_permute** (in `gridwise_elementwise_permute.hpp`):
|
|
Implements the tiled/blocking elementwise+permute kernel.
|
|
- **element_wise_operation** (in `element_wise_operation.hpp`):
|
|
Defines elementwise operations (e.g., relu, scale).
|
|
|
|
This example demonstrates how Composable Kernel supports efficient fusion of elementwise operations and tensor permutation for deep learning and data layout transformations.
|