.. _ck_tile_descriptors:

Tensor Descriptors - Complete Tensor Specifications
===================================================
Overview
--------
A TensorDescriptor is the complete blueprint for a tensor. It combines a shape, stride information, and a series of :ref:`transformations <ck_tile_transforms>` into a single object that defines exactly how a tensor's data is laid out in memory. This specification enables CK Tile to create complex tensor views without any data movement.

In CK Tile, TensorDescriptors serve as the foundation for all tensor operations, providing:

- **Memory Layout Specification**: How data is arranged in physical memory
- **Logical View Definition**: How the tensor appears to the programmer
- **Transformation Pipeline**: A series of :ref:`coordinate transformations <ck_tile_coordinate_systems>`
- **Zero-Copy Views**: Different logical representations of the same data, building on :ref:`BufferViews <ck_tile_buffer_views>` and :ref:`TensorViews <ck_tile_tensor_views>`

Creating Basic Tensor Layouts
-----------------------------
CK Tile provides several ways to create tensor descriptors for common memory layouts.
Custom Strides
~~~~~~~~~~~~~~
The most fundamental way to define a tensor is with custom strides. This provides full control over how many elements to "jump" in memory to move to the next item along each dimension. This is particularly useful for creating padded layouts required by GPU algorithms.
.. code-block:: cpp

    using namespace ck_tile;

    // Create a 3x4 tensor, but make each row take up 8 elements in memory
    // (4 for data, 4 for padding)
    constexpr auto M = 3;
    constexpr auto N = 4;
    constexpr auto RowStride = 8; // Padded stride

    auto descriptor = make_naive_tensor_descriptor(
        make_tuple(M, N),         // Shape: [3, 4]
        make_tuple(RowStride, 1)  // Strides: [8, 1]
    );

    // Reserving whole padded rows takes 3 rows * 8 elements/row = 24 elements
    // (the last addressable element, [2, 3], sits at offset 2*8 + 3*1 = 19)
    constexpr auto element_space_size = M * RowStride;

    // Calculate offset of the element at [row=1, col=2]
    multi_index<2> coord{1, 2};
    auto offset = descriptor.calculate_offset(coord);
    // offset = 1*8 + 2*1 = 10

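The arithmetic behind a strided layout can be checked without CK Tile. The following is a minimal standalone sketch; ``StridedLayout2D`` and its members are hypothetical names used for illustration and are not part of the CK Tile API.

```cpp
#include <cstddef>

// Hypothetical stand-in for a 2D strided layout (not a CK Tile type).
struct StridedLayout2D {
    std::size_t rows;
    std::size_t cols;
    std::size_t row_stride;
    std::size_t col_stride;

    // offset = row * row_stride + col * col_stride
    constexpr std::size_t offset(std::size_t row, std::size_t col) const {
        return row * row_stride + col * col_stride;
    }

    // One past the largest offset any valid coordinate can reach.
    constexpr std::size_t addressable_extent() const {
        return (rows - 1) * row_stride + (cols - 1) * col_stride + 1;
    }
};

// The padded 3x4 layout from the example above: strides [8, 1].
constexpr StridedLayout2D padded_layout{3, 4, 8, 1};

static_assert(padded_layout.offset(1, 2) == 10, "1*8 + 2*1");
static_assert(padded_layout.offset(2, 3) == 19, "last element");
static_assert(padded_layout.addressable_extent() == 20, "19 + 1");
```

Under these assumptions, the padded layout touches offsets 0 through 19, while reserving three full padded rows takes 24 elements.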
Packed Row-Major Layout
~~~~~~~~~~~~~~~~~~~~~~~~~
For most cases, a tightly packed, row-major layout is sufficient. The strides are calculated automatically, leaving no unused space between elements.
.. code-block:: cpp

    using namespace ck_tile;

    // Create a packed 3x4 tensor
    auto descriptor_packed = make_naive_tensor_descriptor_packed(
        make_tuple(3, 4)
    );

    // Total memory is 3 * 4 = 12 elements
    // Strides are automatically [4, 1] for row-major layout

    // Calculate offset of the element at [row=1, col=2]
    multi_index<2> coord{1, 2};
    auto offset = descriptor_packed.calculate_offset(coord);
    // offset = 1*4 + 2*1 = 6

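As a rough illustration of how packed row-major strides follow from the shape alone, the sketch below derives each stride as the product of all later lengths. ``packed_stride`` and ``packed_offset`` are hypothetical helpers written for this page, not CK Tile functions.

```cpp
#include <cstddef>

// Hypothetical helper: in a packed row-major layout, the stride of
// dimension `dim` is the product of the lengths of all later dimensions.
template <std::size_t Rank>
constexpr std::size_t packed_stride(const std::size_t (&lengths)[Rank],
                                    std::size_t dim) {
    std::size_t stride = 1;
    for (std::size_t i = dim + 1; i < Rank; ++i) {
        stride *= lengths[i];
    }
    return stride;
}

// Hypothetical helper: offset = sum over dimensions of index * stride.
template <std::size_t Rank>
constexpr std::size_t packed_offset(const std::size_t (&lengths)[Rank],
                                    const std::size_t (&index)[Rank]) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < Rank; ++i) {
        offset += index[i] * packed_stride(lengths, i);
    }
    return offset;
}

constexpr std::size_t packed_shape[2] = {3, 4};
constexpr std::size_t packed_coord[2] = {1, 2};

static_assert(packed_stride(packed_shape, 0) == 4, "row stride");
static_assert(packed_stride(packed_shape, 1) == 1, "column stride");
static_assert(packed_offset(packed_shape, packed_coord) == 6, "1*4 + 2*1");
```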
Aligned Layout
~~~~~~~~~~~~~~
For GPU performance, memory layouts often need to be aligned. This function creates a row-major layout but ensures that each row's starting address is a multiple of a given alignment value, adding padding if necessary.
.. code-block:: cpp

    using namespace ck_tile;

    // Create a 4x5 tensor with 8-element alignment
    constexpr auto align = 8; // Align each row to 8-element boundary

    auto descriptor_aligned = make_naive_tensor_descriptor_aligned(
        make_tuple(4, 5),
        align
    );

    // Without alignment, size would be 4*5=20
    // With alignment, the row stride becomes 8 (smallest multiple of 8 >= 5)
    // Total size = 4 rows * 8 elements/row = 32

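The row stride in the comments above comes from rounding the row length up to the next multiple of the alignment. A minimal sketch of that rounding, assuming a hypothetical helper named ``round_up_to_multiple`` (not a CK Tile function):

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of `align` that is >= `value`.
constexpr std::size_t round_up_to_multiple(std::size_t value,
                                           std::size_t align) {
    return ((value + align - 1) / align) * align;
}

constexpr std::size_t aligned_rows = 4;
constexpr std::size_t aligned_cols = 5;
constexpr std::size_t alignment = 8;

// Row stride after alignment, and the space taken by all padded rows.
constexpr std::size_t aligned_row_stride =
    round_up_to_multiple(aligned_cols, alignment);
constexpr std::size_t aligned_total = aligned_rows * aligned_row_stride;

static_assert(aligned_row_stride == 8, "5 rounds up to 8");
static_assert(aligned_total == 32, "4 rows * 8 elements/row");
static_assert(round_up_to_multiple(8, 8) == 8, "aligned lengths stay as-is");
```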
The Pipeline Concept
--------------------
Every TensorDescriptor in CK Tile can be thought of as a **transformation pipeline**. The functions above create the *first stage* of this pipeline, defining the initial :ref:`transformation <ck_tile_transforms>` that takes a simple, one-dimensional block of memory and presents it as a logical, multi-dimensional tensor view.
..
   Original mermaid diagram (edit here, then run update_diagrams.py)

   .. mermaid::

      graph LR
          subgraph "Pipeline Stages"
              S1["Stage 1<br/>Base Layout<br/>[M, N]"]
              S2["Stage 2<br/>Transform<br/>Unmerge"]
              S3["Stage 3<br/>New View<br/>[M1, M2, N]"]
              S4["Stage N<br/>Final View<br/>[...]"]
          end

          subgraph "Same Data"
              D["Physical Memory<br/>No data movement"]
          end

          S1 --> S2
          S2 --> S3
          S3 --> S4

          S1 -.-> D
          S2 -.-> D
          S3 -.-> D
          S4 -.-> D

          style D fill:#ffebee,stroke:#d32f2f,stroke-width:2px
          style S1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
          style S3 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px

.. image:: diagrams/descriptors_1.svg
   :alt: Diagram
   :align: center

The Initial Pipeline Stage
~~~~~~~~~~~~~~~~~~~~~~~~~~
A simple packed descriptor sets up a pipeline with a single transform:

- **Input**: The raw, one-dimensional memory buffer (hidden dimension ID 0)
- **Output**: The logical dimensions that you interact with (hidden dimension IDs 1, 2, ...)

This initial stage converts linear memory addresses into multi-dimensional coordinates. See :ref:`ck_tile_adaptors` for how transforms chain together.

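For a packed 2D layout, the relationship between a linear address and a logical coordinate reduces to a division and a remainder. The sketch below shows both directions; ``Index2D``, ``to_coordinate``, and ``to_offset`` are hypothetical names chosen for illustration, not CK Tile identifiers.

```cpp
#include <cstddef>

// Hypothetical logical coordinate of a 2D tensor.
struct Index2D {
    std::size_t row;
    std::size_t col;
};

// Linear offset -> logical coordinate for a packed [rows, cols] layout.
constexpr Index2D to_coordinate(std::size_t offset, std::size_t cols) {
    return Index2D{offset / cols, offset % cols};
}

// Logical coordinate -> linear offset (the inverse mapping).
constexpr std::size_t to_offset(Index2D index, std::size_t cols) {
    return index.row * cols + index.col;
}

// In a packed 3x4 tensor, offset 6 is the element at [1, 2].
static_assert(to_coordinate(6, 4).row == 1, "6 / 4");
static_assert(to_coordinate(6, 4).col == 2, "6 % 4");
static_assert(to_offset(to_coordinate(11, 4), 4) == 11, "round trip");
```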
Advanced Layouts: Step-by-Step Transformation
---------------------------------------------
The ``transform_tensor_descriptor`` function adds new stages to an existing descriptor's pipeline using :ref:`transforms <ck_tile_transforms>`.
Transform a [2, 6] Tensor into a [2, 2, 3] View
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example reinterprets a 2D tensor with shape [2, 6] as a 3D tensor with shape [2, 2, 3], without changing the underlying 12-element memory buffer.
**Step 1: Define the Base Descriptor**
.. code-block:: cpp

    using namespace ck_tile;

    // Create the [2, 6] base descriptor
    auto base_descriptor = make_naive_tensor_descriptor_packed(
        make_tuple(2, 6)
    );

    // This creates an initial pipeline stage that:
    // - Takes the raw buffer (hidden ID 0) as input
    // - Produces two outputs (hidden IDs 1 and 2)
    // - These outputs become logical dimensions 0 and 1

**Step 2: Define the New Transformation Stage**
To get from [2, 6] to [2, 2, 3], we need:

- **For logical dimension 0 (length 2)**: Preserve it with PassThroughTransform
- **For logical dimension 1 (length 6)**: Split it with UnmergeTransform([2, 3])

**Step 3: Apply Transformation**
.. code-block:: cpp

    // Create the transformed descriptor
    auto transformed_descriptor = transform_tensor_descriptor(
        base_descriptor,
        make_tuple(
            make_pass_through_transform(2),           // For dim 0
            make_unmerge_transform(make_tuple(2, 3))  // For dim 1
        ),
        make_tuple(sequence<0>{}, sequence<1>{}),     // Input mapping: base dims each transform reads
        make_tuple(sequence<0>{}, sequence<1, 2>{})   // Output mapping: new dims each transform produces
    );

    // Result: A [2, 2, 3] view of the same data

Analysis of the Final Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
..
   Original mermaid diagram (edit here, then run update_diagrams.py)

   .. mermaid::

      graph TB
          subgraph "Transform Pipeline"
              T0["Transform 0<br/>Base Unmerge<br/>Input: [0]<br/>Output: [1,2]"]
              T1["Transform 1<br/>PassThrough<br/>Input: [1]<br/>Output: [3]"]
              T2["Transform 2<br/>Unmerge<br/>Input: [2]<br/>Output: [4,5]"]
          end

          subgraph "Hidden Dimensions"
              H0["Hidden ID 0<br/>Raw Buffer"]
              H1["Hidden ID 1<br/>Dim 0 (size 2)"]
              H2["Hidden ID 2<br/>Dim 1 (size 6)"]
              H3["Hidden ID 3<br/>Final Dim 0"]
              H4["Hidden ID 4<br/>Final Dim 1"]
              H5["Hidden ID 5<br/>Final Dim 2"]
          end

          H0 --> T0
          T0 --> H1
          T0 --> H2
          H1 --> T1
          H2 --> T2
          T1 --> H3
          T2 --> H4
          T2 --> H5

          style H0 fill:#ffebee,stroke:#d32f2f,stroke-width:2px
          style H3 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
          style H4 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
          style H5 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px

.. image:: diagrams/descriptors_2.svg
   :alt: Diagram
   :align: center

The pipeline now has three stages:

1. **Base UnmergeTransform**: Converts raw buffer to [2, 6] layout
2. **PassThroughTransform**: Preserves the first dimension
3. **UnmergeTransform**: Splits the second dimension into [2, 3]

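To see what the three stages compute for a concrete coordinate, the sketch below walks a [2, 2, 3] index back to a buffer offset by hand. It is plain C++ rather than CK Tile code, and the function names are hypothetical.

```cpp
#include <cstddef>

// Unmerge stage with lengths [2, 3]: view indices (j, k) combine into
// the single base column index j * 3 + k.
constexpr std::size_t unmerge_2x3(std::size_t j, std::size_t k) {
    return j * 3 + k;
}

// Base stage: packed [2, 6] layout, (row, col) -> row * 6 + col.
constexpr std::size_t base_offset_2x6(std::size_t row, std::size_t col) {
    return row * 6 + col;
}

// Whole pipeline for the [2, 2, 3] view. The first index passes through
// unchanged; the last two go through the unmerge stage.
constexpr std::size_t view_offset(std::size_t i, std::size_t j,
                                  std::size_t k) {
    return base_offset_2x6(i, unmerge_2x3(j, k));
}

static_assert(view_offset(0, 0, 0) == 0, "first element");
static_assert(view_offset(0, 1, 0) == 3, "start of the second group of 3");
static_assert(view_offset(1, 1, 2) == 11, "last of the 12 elements");
```

Every [2, 2, 3] coordinate lands on a distinct offset in 0..11, which is why the view needs no copy of the 12-element buffer.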
5D to 3D Block Transformation
-----------------------------------------------------
These concepts are critical in :ref:`GPU programming <ck_tile_gpu_basics>`. This example transforms a 5D tensor representing a GPU thread block's workload into a simpler 3D view using MergeTransform. See :ref:`ck_tile_thread_mapping` for thread distribution details.
.. code-block:: cpp

    using namespace ck_tile;

    // Define parameters (typical for a GPU block)
    constexpr auto Block_M = 256;
    constexpr auto NumWarps = 8;
    constexpr auto WarpSize = 64;
    constexpr auto KVector = 4;
    constexpr auto wavesPerK = 2;
    constexpr auto wavesPerM = NumWarps / wavesPerK;
    constexpr auto NumIssues = Block_M / wavesPerM;

    // Create the base 5D descriptor
    auto base_descriptor = make_naive_tensor_descriptor_packed(
        make_tuple(NumIssues, wavesPerM, wavesPerK, WarpSize, KVector)
    );

    // Transform to 3D by merging dimensions
    auto transformed_descriptor = transform_tensor_descriptor(
        base_descriptor,
        make_tuple(
            make_pass_through_transform(NumIssues),
            make_merge_transform(make_tuple(wavesPerM, wavesPerK)),
            make_merge_transform(make_tuple(WarpSize, KVector))
        ),
        make_tuple(sequence<0>{}, sequence<1, 2>{}, sequence<3, 4>{}),
        make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{})
    );

    // Result: [NumIssues, wavesPerM*wavesPerK, WarpSize*KVector]
    // This simplifies thread block management while preserving data layout

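A merge is the inverse of an unmerge: a single merged index is split back into its components with division and remainder before the base layout is applied. The standalone sketch below reproduces that arithmetic for the sizes used above (64, 4, 2, 64, 4); the constant and function names are hypothetical and nothing here calls CK Tile.

```cpp
#include <cstddef>

// Sizes from the example above: NumIssues = 256 / (8 / 2) = 64.
constexpr std::size_t kWavesPerM = 4;
constexpr std::size_t kWavesPerK = 2;
constexpr std::size_t kWarpSize = 64;
constexpr std::size_t kKVector = 4;

// Packed row-major offset into the 5D base layout.
constexpr std::size_t offset_5d(std::size_t issue, std::size_t wave_m,
                                std::size_t wave_k, std::size_t lane,
                                std::size_t vec) {
    return (((issue * kWavesPerM + wave_m) * kWavesPerK + wave_k)
                * kWarpSize + lane) * kKVector + vec;
}

// 3D view: each merged index is split by division and remainder.
constexpr std::size_t offset_3d(std::size_t issue, std::size_t wave,
                                std::size_t elem) {
    return offset_5d(issue,
                     wave / kWavesPerK, wave % kWavesPerK,
                     elem / kKVector, elem % kKVector);
}

static_assert(offset_3d(0, 0, 0) == 0, "first element");
static_assert(offset_3d(0, 3, 5) == 3 * 256 + 5, "wave 3, element 5");
static_assert(offset_3d(1, 0, 0) == 2048, "one issue spans 4*2*64*4");
static_assert(offset_3d(63, 7, 255) == 64 * 8 * 256 - 1, "last element");
```

Because the merged dimensions are adjacent in a packed layout, the 3D offset simplifies to ``issue * 2048 + wave * 256 + elem`` in this sketch.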
Common Descriptor Patterns
--------------------------
Matrix Transposition
~~~~~~~~~~~~~~~~~~~~
.. code-block:: cpp

    // Create a transposed view of a matrix
    // (original_matrix is assumed to be an existing [M, N] descriptor)
    auto transposed = transform_tensor_descriptor(
        original_matrix,
        make_tuple(
            make_pass_through_transform(N),
            make_pass_through_transform(M)
        ),
        make_tuple(sequence<1>{}, sequence<0>{}),  // Swap dimensions
        make_tuple(sequence<0>{}, sequence<1>{})
    );

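The effect of swapping the dimension mapping can be sketched with ordinary index arithmetic: the transposed view reads element [r, c] from position [c, r] of the original layout. The helper names below are hypothetical and the original matrix is assumed to be packed row-major.

```cpp
#include <cstddef>

// Offset into a packed row-major matrix with `cols` columns.
constexpr std::size_t original_offset(std::size_t row, std::size_t col,
                                      std::size_t cols) {
    return row * cols + col;
}

// Transposed view: swap the indices, keep the original layout.
constexpr std::size_t transposed_offset(std::size_t row, std::size_t col,
                                        std::size_t original_cols) {
    return original_offset(col, row, original_cols);
}

// Original is [2, 3]; the transposed view is [3, 2] over the same buffer.
static_assert(transposed_offset(0, 0, 3) == 0, "corner is unchanged");
static_assert(transposed_offset(2, 1, 3) == 5, "reads original [1, 2]");
static_assert(transposed_offset(1, 0, 3) == 1, "reads original [0, 1]");
```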
Padding for Convolution
~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: cpp

    // Add padding to spatial dimensions
    auto padded = transform_tensor_descriptor(
        input_tensor,
        make_tuple(
            make_pass_through_transform(N),       // Batch
            make_pass_through_transform(C),       // Channel
            make_pad_transform(H, pad_h, pad_h),  // Height
            make_pad_transform(W, pad_w, pad_w)   // Width
        ),
        make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
        make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
    );

For a complete convolution example, see :ref:`ck_tile_convolution_example`.

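Conceptually, a padded dimension shifts each index by the left padding and marks anything that falls outside the original extent as padding. The sketch below models that for one dimension. ``PaddedIndex`` and ``unpad`` are hypothetical names, and how CK Tile itself reports out-of-bounds coordinates is not shown here.

```cpp
#include <cstddef>

// Result of mapping a padded index back to the unpadded dimension.
struct PaddedIndex {
    bool in_bounds;      // false when the index lies in the padding
    std::size_t source;  // index in the original dimension when in_bounds
};

// Hypothetical helper: map an index of the padded dimension back to the
// original dimension of the given length.
constexpr PaddedIndex unpad(std::size_t padded_index, std::size_t length,
                            std::size_t left_pad) {
    return (padded_index >= left_pad && padded_index - left_pad < length)
               ? PaddedIndex{true, padded_index - left_pad}
               : PaddedIndex{false, 0};
}

// Height 4 with one element of padding on each side: padded length 6.
static_assert(!unpad(0, 4, 1).in_bounds, "left padding");
static_assert(unpad(1, 4, 1).in_bounds, "first real row");
static_assert(unpad(1, 4, 1).source == 0, "maps to row 0");
static_assert(unpad(4, 4, 1).source == 3, "maps to the last row");
static_assert(!unpad(5, 4, 1).in_bounds, "right padding");
```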
Tensor Slicing
~~~~~~~~~~~~~~
.. code-block:: cpp

    // Extract a sub-tensor
    auto slice = transform_tensor_descriptor(
        full_tensor,
        make_tuple(
            make_slice_transform(M, start_m, end_m),
            make_slice_transform(N, start_n, end_n)
        ),
        make_tuple(sequence<0>{}, sequence<1>{}),
        make_tuple(sequence<0>{}, sequence<1>{})
    );

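A slice can be pictured as adding the slice start to every index before the original layout is applied, with the slice length given by ``end - start``. The following sketch assumes a packed row-major source and half-open ranges; the helper names are hypothetical.

```cpp
#include <cstddef>

// Length of the half-open slice [begin, end).
constexpr std::size_t slice_length(std::size_t begin, std::size_t end) {
    return end - begin;
}

// Offset of slice element [row, col] in the packed full tensor.
constexpr std::size_t sliced_offset(std::size_t row, std::size_t col,
                                    std::size_t row_begin,
                                    std::size_t col_begin,
                                    std::size_t full_cols) {
    return (row + row_begin) * full_cols + (col + col_begin);
}

// Full tensor [4, 6]; slice rows [1, 3) and columns [2, 5) -> view [2, 3].
static_assert(slice_length(1, 3) == 2, "two rows");
static_assert(slice_length(2, 5) == 3, "three columns");
static_assert(sliced_offset(0, 0, 1, 2, 6) == 8, "full element [1, 2]");
static_assert(sliced_offset(1, 2, 1, 2, 6) == 16, "full element [2, 4]");
```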
Key Concepts Summary
--------------------
TensorDescriptors provide a key abstraction for tensor manipulation:

- **Pipeline Architecture**: Each descriptor is a transformation pipeline
- **Zero-Copy Views**: All transformations are logical, no data movement
- **Composability**: Complex layouts built from simple transforms
- **GPU Optimization**: Designed for efficient GPU memory access patterns

Important principles:

1. **Always Handle All Dimensions**: When transforming, provide a transform for each input dimension
2. **Hidden Dimension IDs**: Track the flow of data through the pipeline
3. **Compile-Time Resolution**: All transformations resolved at compile time
4. **Type Safety**: Template metaprogramming ensures correctness

Performance Considerations
--------------------------
When designing tensor descriptors for GPU kernels:

1. **Memory Coalescing**: Ensure contiguous threads access contiguous memory
2. **Bank Conflicts**: Avoid patterns that cause :ref:`shared memory conflicts <ck_tile_lds_bank_conflicts>`
3. **Alignment**: Use aligned layouts for better memory throughput
4. **Padding**: Strategic padding can improve access patterns. See :ref:`ck_tile_lds_index_swapping` for advanced techniques.

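The interaction between padding and bank conflicts can be illustrated with a simplified model. The sketch assumes 32 banks with one element per bank word, which is an assumption made for illustration only; actual bank counts and widths depend on the hardware. It counts how many distinct banks are touched when each of 32 threads reads the same column of a different row.

```cpp
#include <cstddef>

// Assumption for illustration: 32 banks, one element per bank word.
constexpr std::size_t kNumBanks = 32;

// Number of distinct banks touched when thread r reads element [r, 0]
// of a layout with the given row stride (hypothetical helper).
constexpr std::size_t distinct_banks_for_column(std::size_t row_stride,
                                                std::size_t rows) {
    bool used[kNumBanks] = {};
    std::size_t count = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        const std::size_t bank = (r * row_stride) % kNumBanks;
        if (!used[bank]) {
            used[bank] = true;
            ++count;
        }
    }
    return count;
}

// Row stride 32: every row starts in the same bank, so accesses serialize.
static_assert(distinct_banks_for_column(32, 32) == 1, "fully conflicting");
// Row stride 33 (one element of padding): each row starts in its own bank.
static_assert(distinct_banks_for_column(33, 32) == 32, "conflict free");
```

In this model, a single padding element per row is enough to spread a column access across all banks, at the cost of 32 unused elements.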
Next Steps
----------
- :ref:`ck_tile_tile_window` - Using descriptors for efficient data loading
- :ref:`ck_tile_tile_distribution` - How descriptors enable automatic work distribution
- :ref:`ck_tile_convolution_example` - Real-world application of complex descriptors
- :ref:`ck_tile_static_distributed_tensor` - Managing distributed tensors with descriptors
- :ref:`ck_tile_gemm_optimization` - GEMM kernels using descriptor transformations