Tensor Permutation (Dimension Reordering)
Theory
This example demonstrates tensor permutation, which reorders the dimensions of a tensor according to a specified permutation pattern. Permutation underlies many machine learning operations, including tensor layout transformations and data format conversions, and serves as a building block for more complex tensor operations.
Mathematical Formulation:
Given an input tensor X with shape [D_0, D_1, ..., D_{n-1}] and a permutation pattern P = [p_0, p_1, ..., p_{n-1}], the permutation operation produces an output tensor Y with shape [D_{p_0}, D_{p_1}, ..., D_{p_{n-1}}] such that:
Y_{i_{p_0}, i_{p_1}, ..., i_{p_{n-1}}} = X_{i_0, i_1, ..., i_{n-1}}
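The formula above can be sketched as a host-side reference in plain C++. This is an illustrative sketch only, not the Composable Kernel implementation: the name permute_reference is hypothetical, and it assumes a densely packed, row-major input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of Composable Kernel.
// Assumes x is a densely packed, row-major tensor with the given shape.
// Output dimension k takes its extent and its index from input dimension perm[k].
std::vector<float> permute_reference(const std::vector<float>& x,
                                     const std::vector<std::size_t>& shape,
                                     const std::vector<std::size_t>& perm)
{
    const std::size_t n = shape.size();
    assert(perm.size() == n);

    // Output shape: [D_{p_0}, D_{p_1}, ..., D_{p_{n-1}}].
    std::vector<std::size_t> out_shape(n);
    std::size_t total = 1;
    for (std::size_t k = 0; k < n; ++k) {
        out_shape[k] = shape[perm[k]];
        total *= shape[k];
    }
    assert(x.size() == total);

    // Row-major strides of the input tensor.
    std::vector<std::size_t> in_stride(n, 1);
    for (std::size_t k = n; k-- > 1;) {
        in_stride[k - 1] = in_stride[k] * shape[k];
    }

    std::vector<float> y(total);
    std::vector<std::size_t> j(n, 0); // current output multi-index
    for (std::size_t out = 0; out < total; ++out) {
        // Output index j[k] equals input index i_{p_k}, so it is scaled by
        // the input stride of dimension perm[k].
        std::size_t in = 0;
        for (std::size_t k = 0; k < n; ++k) {
            in += j[k] * in_stride[perm[k]];
        }
        y[out] = x[in];

        // Advance the output multi-index, last dimension fastest.
        for (std::size_t k = n; k-- > 0;) {
            if (++j[k] < out_shape[k]) break;
            j[k] = 0;
        }
    }
    return y;
}
```

With shape [2, 3] and permutation [1, 0], this reduces to a matrix transpose. For the NCHW example in this README, shape [32, 256, 56, 56] with permutation [0, 2, 3, 1] gives an output of shape [32, 56, 56, 256].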
Algorithmic Background:
- Permutation is used for matrix transpose, NCHW/NHWC layout conversion, attention head reshaping, and more.
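- Permutation performs no arithmetic, so its performance is bound by memory bandwidth. An efficient kernel must keep both its reads and its writes coalesced, even though the input and output layouts differ.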
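The access-pattern concern can be illustrated with a blocked matrix transpose on the host. The sketch below is a CPU analogue under stated assumptions, not the Composable Kernel grid-level kernel: the function name transpose_tiled and the default tile size of 32 are hypothetical. Processing one small tile at a time keeps both the reads and the writes within a compact region of memory, which is the same idea a GPU kernel applies when it stages a tile in on-chip memory so that global loads and stores can both be coalesced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU analogue of a tiled transpose, not part of Composable Kernel.
// x is a row-major rows x cols matrix; the result is a row-major cols x rows matrix.
std::vector<float> transpose_tiled(const std::vector<float>& x,
                                   std::size_t rows,
                                   std::size_t cols,
                                   std::size_t tile = 32)
{
    assert(x.size() == rows * cols);
    assert(tile > 0);

    std::vector<float> y(rows * cols);
    for (std::size_t r0 = 0; r0 < rows; r0 += tile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
            // Clamp the tile at the matrix edges.
            const std::size_t r1 = std::min(r0 + tile, rows);
            const std::size_t c1 = std::min(c0 + tile, cols);
            // Within a tile, reads walk along input rows and writes stay
            // inside a tile-sized block of the output.
            for (std::size_t r = r0; r < r1; ++r) {
                for (std::size_t c = c0; c < c1; ++c) {
                    y[c * rows + r] = x[r * cols + c];
                }
            }
        }
    }
    return y;
}
```

The tile size is a tuning parameter. On a GPU it would typically be chosen to match the wavefront width and the available on-chip memory, which this host sketch does not model.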
How to Run
Prerequisites
Follow the instructions in the main Build Guide section before building and running this example.
Build and run
cd composable_kernel/example/39_permute
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
# Example run (matrix transpose)
./permute_xdl --input_shape=4096,4096 --permutation=1,0 --verify=1 --time=1
# Example run (NCHW to NHWC)
./permute_xdl --input_shape=32,256,56,56 --permutation=0,2,3,1 --verify=1 --time=1
Source Code Structure
Directory Layout
example/39_permute/
└── permute_xdl.cpp          # Main example: sets up, runs, and verifies tensor permutation
include/ck/tensor_operation/gpu/device/
└── device_permute.hpp       # Device-level permutation API
include/ck/tensor_operation/gpu/grid/
└── gridwise_permute.hpp     # Grid-level permutation kernel
Key Classes and Functions
- DevicePermute (in device_permute.hpp): Device API for tensor permutation.
- gridwise_permute (in gridwise_permute.hpp): Implements the tiled/blocking permutation kernel.
This example demonstrates how Composable Kernel implements efficient tensor dimension reordering for layout transformations and deep learning operations.