Files
composable_kernel/docs/conceptual/ck_tile/tensor_views.rst
spolifroni-amd d9d4c9c3df [composable_kernel] initial draft of the ck tile conceptual doc (#3242)
* Adding CK Tile documentation

* Updates based on feedback

* Fix tile window API description

* Fix remaining images

* add documentation about flush_cache and rotating_buffer functionality in ck_tile

* Supplement the documentation

* light edit of the ck tile conceptual doc

* Fixes for ruff check.

* Fixes for ruff check 2.

* Fixes for ruff check 3.

---------

Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
2025-12-04 11:09:21 -08:00

483 lines
18 KiB
ReStructuredText
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
.. _ck_tile_tensor_views:
Tensor Views - Multi-Dimensional Structure
==========================================
Overview
--------
While :ref:`BufferView <ck_tile_buffer_views>` provides the foundation for raw memory access, TensorView adds multi-dimensional structure to flat memory regions. This abstraction bridges the gap between how developers conceptualize data and how that data is physically stored in linear memory. TensorView enables coordinate-based access patterns that match the natural structure of algorithms while maintaining the performance characteristics necessary for efficient GPU computation.
TensorView presents different logical views of the same underlying memory without copying data. A single memory region can be viewed as a row-major matrix, a column-major matrix, or a transposed matrix, using different TensorView configurations. This zero-copy abstraction enables flexible transformations and access patterns while maintaining optimal memory bandwidth utilization.
TensorView Architecture
-----------------------
..
Original mermaid diagram (edit here, then run update_diagrams.py)
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph TB
subgraph "Memory Foundation"
Memory["Flat Memory Array<br/>0 1 2 3 4 5 6 7 8 9 10 11"]
end
subgraph "Access Layer"
BufferView["BufferView<br/>Linear Memory Access"]
Descriptor["TensorDescriptor<br/>Shape & Stride Info"]
end
subgraph "Tensor Layer"
TensorView["TensorView<br/>Multi-dimensional Access"]
end
subgraph "Logical View"
Matrix["2D Matrix View<br/>[3×4]<br/>[[0,1,2,3]<br/>[4,5,6,7]<br/>[8,9,10,11]]"]
end
Memory --> BufferView
Memory --> Descriptor
BufferView --> TensorView
Descriptor --> TensorView
TensorView --> Matrix
style Memory fill:#d1fae5,stroke:#10b981,stroke-width:2px
style BufferView fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style Descriptor fill:#fed7aa,stroke:#f59e0b,stroke-width:2px
style TensorView fill:#fce7f3,stroke:#ec4899,stroke-width:2px
style Matrix fill:#e9d5ff,stroke:#9333ea,stroke-width:2px
.. image:: diagrams/tensor_views_1.svg
:alt: Diagram
:align: center
The Foundation: BufferView and TensorDescriptor
------------------------------------------------
TensorView builds upon two fundamental components that work in concert to provide structured access to memory. The :ref:`BufferView <ck_tile_buffer_views>` component handles the low-level memory access, providing type-safe operations with address space awareness. The :ref:`TensorDescriptor <ck_tile_descriptors>` component encodes the multi-dimensional structure, including shape information and stride patterns that determine how coordinates map to memory offsets.
This separation of concerns enables optimizations. The BufferView can optimize for the specific memory space while the TensorDescriptor can encode complex access patterns without concern for the underlying memory type. Together, they provide a complete abstraction for multi-dimensional data access.
C++ Implementation
------------------
**File**: ``include/ck_tile/core/tensor/tensor_view.hpp``
Creating TensorViews
~~~~~~~~~~~~~~~~~~~~
The creation of a TensorView involves combining a BufferView with a TensorDescriptor. This process can be done explicitly for maximum control or through convenience functions for common patterns:
.. code-block:: cpp
#include <ck_tile/core/tensor/tensor_view.hpp>
#include <ck_tile/core/tensor/tensor_descriptor.hpp>
#include <ck_tile/core/numeric/tuple.hpp>
// The actual C++ template signature from tensor_view.hpp:
// template <typename BufferView_,
// typename TensorDesc_,
// memory_operation_enum DstInMemOp_ = memory_operation_enum::set>
// struct tensor_view
__device__ void example_tensor_creation()
{
// Create a 3x4 matrix in global memory
float data[12] = {0,1,2,3,4,5,6,7,8,9,10,11};
// Method 1: Create buffer and descriptor separately
auto buffer = make_buffer_view<address_space_enum::global>(data, 12);
auto desc = make_tensor_descriptor(
make_tuple(3, 4), // shape: 3 rows, 4 columns
make_tuple(4, 1) // strides: row stride=4, col stride=1
);
// Create tensor view
auto tensor = make_tensor_view<address_space_enum::global>(buffer, desc);
// Method 2: Use convenience function for packed layout
auto tensor2 = make_naive_tensor_view_packed<address_space_enum::global>(
data, // pointer
make_tuple(3, 4) // shape (strides calculated automatically)
);
// Access element at (1, 2)
float value = tensor(make_tuple(1, 2)); // Returns 6
// Update element
tensor(make_tuple(2, 1)) = 99.0f;
}
Coordinate-Based Access
~~~~~~~~~~~~~~~~~~~~~~~
The fundamental operation of TensorView is translating multi-dimensional coordinates into memory accesses. This translation happens through an advanced pipeline that maintains efficiency while providing flexibility:
..
Original mermaid diagram (edit here, then run update_diagrams.py)
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
flowchart LR
subgraph "User Input"
Coord["Coordinate<br/>(1, 2)"]
end
subgraph "TensorView Processing"
Shape["Shape Check<br/>row < 3?<br/>col < 4?"]
Stride["Apply Strides<br/>offset = 1×4 + 2×1"]
Buffer["BufferView Access<br/>buffer[6]"]
end
subgraph "Result"
Value["Value: 6"]
end
Coord --> Shape
Shape -->|Valid| Stride
Stride --> Buffer
Buffer --> Value
style Coord fill:#e0e7ff,stroke:#4338ca,stroke-width:2px
style Shape fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style Stride fill:#dcfce7,stroke:#10b981,stroke-width:2px
style Buffer fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style Value fill:#d1fae5,stroke:#10b981,stroke-width:2px
.. image:: diagrams/tensor_views_2.svg
:alt: Diagram
:align: center
Memory Layouts and Strides
--------------------------
A key feature of TensorView is its ability to represent different memory layouts through stride manipulation. This capability enables zero-copy transformations that would otherwise require expensive memory operations:
..
Original mermaid diagram (edit here, then run update_diagrams.py)
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph TB
subgraph "Row-Major Layout (C-style)"
RM["Memory: [0,1,2,3,4,5,6,7,8,9,10,11]<br/>Shape: (3,4)<br/>Strides: (4,1)"]
RMMatrix["[[0, 1, 2, 3]<br/> [4, 5, 6, 7]<br/> [8, 9, 10, 11]]"]
RM --> RMMatrix
end
subgraph "Column-Major Layout (Fortran-style)"
CM["Memory: [0,3,6,9,1,4,7,10,2,5,8,11]<br/>Shape: (3,4)<br/>Strides: (1,3)"]
CMMatrix["[[0, 1, 2, 3]<br/> [4, 5, 6, 7]<br/> [8, 9, 10, 11]]"]
CM --> CMMatrix
end
subgraph "Custom Stride (Transposed View)"
TV["Memory: [0,1,2,3,4,5,6,7,8,9,10,11]<br/>Shape: (4,3)<br/>Strides: (1,4)"]
TVMatrix["[[0, 4, 8]<br/> [1, 5, 9]<br/> [2, 6, 10]<br/> [3, 7, 11]]"]
TV --> TVMatrix
end
style RM fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
style CM fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style TV fill:#f3e8ff,stroke:#9333ea,stroke-width:2px
.. image:: diagrams/tensor_views_3.svg
:alt: Diagram
:align: center
Row-Major vs Column-Major Layouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The choice of memory layout has profound implications for performance. Row-major layout, where consecutive elements in a row are stored contiguously, optimizes for row-wise traversal. Column-major layout optimizes for column-wise traversal. CK's TensorView abstraction allows algorithms to work with their natural access patterns regardless of the underlying storage:
.. code-block:: cpp
__device__ void example_memory_layouts()
{
float data[12] = {0,1,2,3,4,5,6,7,8,9,10,11};
// Row-major layout (default)
auto row_major = make_naive_tensor_view_packed<address_space_enum::global>(
data, make_tuple(3, 4)
);
// Strides: (4, 1) - moving one row advances by 4 elements
// Column-major layout through custom strides
auto col_major = make_tensor_view<address_space_enum::global>(
make_buffer_view<address_space_enum::global>(data, 12),
make_tensor_descriptor(
make_tuple(3, 4), // shape
make_tuple(1, 3) // strides: row stride=1, col stride=3
)
);
// Transposed view (no data copy!)
auto transposed = make_tensor_view<address_space_enum::global>(
make_buffer_view<address_space_enum::global>(data, 12),
make_tensor_descriptor(
make_tuple(4, 3), // transposed shape
make_tuple(1, 4) // transposed strides
)
);
// All three views access the same memory, just differently
// row_major(1,2) == col_major(2,1) == transposed(2,1)
}
Advanced Operations
-------------------
Slicing and Subviews
~~~~~~~~~~~~~~~~~~~~
TensorView supports advanced slicing operations that create new views of subsets of the data. These operations are essential for algorithms that process data in blocks or tiles. See :ref:`ck_tile_tile_window` for production use.
.. code-block:: cpp
__device__ void example_slicing_operations()
{
// Create a larger tensor
float data[100];
auto tensor = make_naive_tensor_view_packed<address_space_enum::global>(
data, make_tuple(10, 10)
);
// Create a subview using transforms
// This would typically be done with tile_window in production code
auto subview = make_tensor_view<address_space_enum::global>(
tensor.get_buffer_view(),
transform_tensor_descriptor(
tensor.get_tensor_descriptor(),
make_tuple(
make_pass_through_transform(number<5>{}), // 5 rows
make_pass_through_transform(number<5>{}) // 5 columns
),
make_tuple(number<2>{}, number<3>{}) // offset (2,3)
)
);
// subview now represents a 5x5 region starting at (2,3)
}
Vectorized Access
~~~~~~~~~~~~~~~~~
GPUs achieve maximum memory bandwidth through vectorized operations. TensorView provides native support for vector loads and stores. See :ref:`ck_tile_load_store_traits` for more details.
.. code-block:: cpp
__device__ void example_vectorized_access()
{
float data[256];
auto tensor = make_naive_tensor_view_packed<address_space_enum::global>(
data, make_tuple(16, 16)
);
// Create coordinate for vectorized access
auto coord = make_tensor_coordinate(
tensor.get_tensor_descriptor(),
make_tuple(4, 0) // row 4, starting at column 0
);
// Load 4 consecutive elements as float4
using float4 = vector_type<float, 4>::type;
auto vec4 = tensor.get_vectorized_elements<float4>(coord, 0);
// Process vector data
vec4.x *= 2.0f;
vec4.y *= 2.0f;
vec4.z *= 2.0f;
vec4.w *= 2.0f;
// Store back
tensor.set_vectorized_elements<float4>(coord, 0, vec4);
}
Performance Considerations
--------------------------
Memory Access Patterns
~~~~~~~~~~~~~~~~~~~~~~
The efficiency of TensorView operations depends on memory access patterns. Understanding these patterns is important for achieving optimal performance. See :ref:`ck_tile_gpu_basics` for hardware considerations.
..
Original mermaid diagram (edit here, then run update_diagrams.py)
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph LR
subgraph "Memory Access Patterns"
Seq["Sequential Access<br/>(Good cache usage)"]
Stride["Strided Access<br/>(May cause cache misses)"]
Random["Random Access<br/>(Poor cache usage)"]
end
subgraph "Optimization Strategies"
Opt1["Use row-major for row iteration"]
Opt2["Use col-major for column iteration"]
Opt3["Minimize stride between accesses"]
Opt4["Vectorize when possible"]
end
Seq --> Opt1
Stride --> Opt2
Stride --> Opt3
Random --> Opt4
style Seq fill:#d1fae5,stroke:#10b981,stroke-width:2px
style Stride fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style Random fill:#fee2e2,stroke:#ef4444,stroke-width:2px
.. image:: diagrams/tensor_views_4.svg
:alt: Diagram
:align: center
Compile-Time Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~
CK's TensorView leverages compile-time optimization to achieve zero-overhead abstraction. When tensor dimensions and strides are known at compile time, the entire coordinate-to-offset calculation can be resolved during compilation:
.. code-block:: cpp
// Compile-time known dimensions enable optimization
constexpr auto shape = make_tuple(number<256>{}, number<256>{});
constexpr auto strides = make_tuple(number<256>{}, number<1>{});
auto tensor = make_tensor_view<address_space_enum::global>(
buffer,
make_tensor_descriptor(shape, strides)
);
// This access compiles to a single memory instruction
constexpr auto coord = make_tuple(number<5>{}, number<10>{});
auto value = tensor(coord); // Offset calculated at compile time
TensorView vs BufferView
------------------------
Understanding when to use TensorView versus BufferView is crucial for writing efficient code:
..
Original mermaid diagram (edit here, then run update_diagrams.py)
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph TB
subgraph "BufferView"
BV1["Linear indexing only"]
BV2["buffer[5]"]
BV3["No shape information"]
BV4["Direct memory access"]
end
subgraph "TensorView"
TV1["Multi-dimensional indexing"]
TV2["tensor(1, 2)"]
TV3["Shape-aware operations"]
TV4["Coordinate transformations"]
end
subgraph "Use Cases"
UC1["BufferView: Low-level memory ops"]
UC2["TensorView: Matrix/tensor algorithms"]
end
BV1 --> UC1
TV1 --> UC2
style BV1 fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
style TV1 fill:#fce7f3,stroke:#ec4899,stroke-width:2px
.. image:: diagrams/tensor_views_5.svg
:alt: Diagram
:align: center
BufferView excels at raw memory operations where linear access is natural or where the overhead of coordinate calculation would be prohibitive. TensorView is best suited for algorithms that operate in terms of multi-dimensional coordinates, such as matrix operations, image processing, or tensor contractions.
Integration with Tile Distribution
----------------------------------
TensorView serves as the foundation for :ref:`tile distribution's <ck_tile_tile_distribution>` higher-level abstractions. When combined with :ref:`tile windows <ck_tile_tile_window>` and distribution patterns, TensorView enables the automatic generation of efficient access patterns:
.. code-block:: cpp
// TensorView provides the base abstraction
auto tensor_view = make_naive_tensor_view_packed<address_space_enum::global>(
global_memory, make_tuple(M, N)
);
// Tile window builds on TensorView for distributed access
auto tile_window = make_tile_window(
tensor_view,
tile_shape,
origin,
distribution
);
// The distribution automatically generates optimal access patterns
auto distributed_tensor = tile_window.load();
Summary
-------
TensorView bridges the gap between logical multi-dimensional data structures and physical memory layout. Through its advanced design, TensorView provides:
**Multi-dimensional Indexing**: Natural coordinate-based access to data, matching how algorithms conceptualize their operations. This abstraction eliminates error-prone manual index calculations while maintaining performance.
**Flexible Memory Layouts**: Support for row-major, column-major, and custom stride patterns enables algorithms to work with data in its most natural form. Zero-copy transformations like transposition become stride manipulations.
**Zero-Copy Views**: The ability to create different logical views of the same physical memory enables flexible transformations without the overhead of data movement. This capability is essential for efficient GPU programming where memory bandwidth is often the limiting factor.
**Type Safety**: Dimensions and memory spaces are encoded in the type system, catching errors at compile time rather than runtime. This safety comes without performance overhead thanks to template metaprogramming.
**Seamless Integration**: TensorView works harmoniously with :ref:`BufferView <ck_tile_buffer_views>` for low-level access and serves as the foundation for higher-level abstractions like :ref:`tile windows <ck_tile_tile_window>` and :ref:`distributed tensors <ck_tile_static_distributed_tensor>`.
The abstraction enables writing dimension-agnostic algorithms while maintaining high performance through compile-time optimizations.
Next Steps
----------
Continue to :ref:`ck_tile_coordinate_systems` to understand the mathematical foundation of coordinate transformations in CK Tile.