mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-04-19 22:39:03 +00:00
CK Tile: fix some issues (#3557)
* Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com>
This commit is contained in:
@@ -1,35 +1,13 @@
|
||||
.. meta::
|
||||
:description: Composable Kernel CK Tile buffer views
|
||||
:keywords: composable kernel, CK, CK Tile, ROCm, API, buffer view, raw memory
|
||||
|
||||
.. _ck_tile_buffer_views:
|
||||
|
||||
CK Tile buffer view
|
||||
=======================
|
||||
|
||||
Buffer view is an abstraction that provides structured access to memory. The ``buffer_view`` class is exposed in ``include/ck_tile/core/tensor/buffer_view.hpp``.
|
||||
|
||||
Buffer view serves as the foundation for :ref:`ck_tile_tensor_views`. BufferView handles memory addressing and type safety, while TensorView builds upon this to add multi-dimensional coordinates (shape and strides).
|
||||
|
||||
|
||||
Buffer view provides the following advantages:
|
||||
|
||||
* A unified interface across global, shared, and register memory
|
||||
* Address spaces encoded in types, taking advantage of compile-time type checking
|
||||
* Configurable handling of invalid values, out-of-bounds operations, and conditional access patterns
|
||||
* Atomic operations for parallel algorithms
|
||||
* AMD GPU-specific optimizations
|
||||
* Automatic application of appropriate memory ordering constraints and cache control directives based on the target address space and operation type
|
||||
|
||||
|
||||
[TO DO: do we want to say more about these items? There wasn't a lot of detail in the original text, so I put them in a list for now]
|
||||
|
||||
|
||||
Buffer Views - Raw Memory Access
|
||||
|
||||
Address Space Usage Patterns
|
||||
----------------------------
|
||||
|
||||
[TO DO: explain in words what the diagram shows]
|
||||
..
|
||||
Original mermaid diagram (edit here, then run update_diagrams.py)
|
||||
|
||||
..
|
||||
Original mermaid diagram (edit here, then run update_diagrams.py)
|
||||
|
||||
@@ -66,18 +44,26 @@ Address Space Usage Patterns
|
||||
style Compute fill:#e0e7ff,stroke:#4338ca,stroke-width:2px
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
.. image:: diagrams/buffer_views_1.svg
|
||||
:alt: Diagram
|
||||
:align: center
|
||||
C++ Implementation
|
||||
------------------
|
||||
|
||||
**File**: ``include/ck_tile/core/tensor/buffer_view.hpp``
|
||||
|
||||
Basic Creation
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
[TO DO: remove "modern C++ template metaprogramming" and "zero-overhead abstraction"]
|
||||
By encoding critical properties such as buffer size and address space as template parameters, BufferView transforms what would traditionally be runtime decisions into compile-time constants. This design philosophy enables the compiler to perform aggressive optimizations, including constant propagation, loop unrolling, and instruction selection, that would be impossible with runtime parameters.
|
||||
|
||||
[TO DO: might want to move the implementation details to a separate section under "reference"]
|
||||
The use of compile-time constants extends beyond mere optimization. When the buffer size is encoded in the type system using constructs like ``number<8>{}``, the compiler can statically verify that array accesses are within bounds, eliminate unnecessary bounds checks, and even restructure algorithms to better match the known data dimensions. This compile-time knowledge propagates through the entire computation, enabling optimizations at every level of the abstraction hierarchy.
|
||||
|
||||
The address space template parameter represents another crucial design decision. By making the memory space part of the type system, BufferView ensures that operations appropriate for one memory space cannot be accidentally applied to another. This type safety prevents common errors such as attempting atomic operations on register memory or using global memory synchronization primitives on local memory. The compiler enforces these constraints at compile time, transforming potential runtime errors into compile-time diagnostics.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
@@ -98,7 +84,6 @@ Basic Creation
|
||||
buffer_size // number of elements
|
||||
);
|
||||
|
||||
|
||||
// Implementation detail: The actual C++ template is:
|
||||
// template <address_space_enum BufferAddressSpace,
|
||||
// typename T,
|
||||
@@ -123,17 +108,14 @@ Basic Creation
|
||||
static_assert(space == address_space_enum::global, "Should be global memory");
|
||||
}
|
||||
|
||||
[TO DO: add details and remove unnecessary comments; the "implementation detail" comment can be moved out and either placed outside and explained further, or just removed, depending on what we want to do]
|
||||
Out-of-Bounds Handling
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
[TO DO: might want to put this implementation detail in the reference section]
|
||||
Traditional approaches to bounds checking often involve conditional branches that can severely impact performance on GPU architectures, where divergent execution paths within a warp lead to serialization. BufferView's approach sidesteps this problem through two carefully designed modes that maintain performance while providing predictable behavior.
|
||||
|
||||
Buffer view uses two modes, zero value mode and custom value mode, that can prevent serialization during bounds checking.
|
||||
The Zero Value Mode leverages the mathematical property that zero often serves as a neutral element in computations. When an access falls outside the valid buffer range, this mode returns numerical zero without branching. This approach proves particularly effective for algorithms like convolution, where out-of-bounds accesses naturally correspond to zero-padding. The branchless implementation ensures that all threads in a warp follow the same execution path, maintaining the SIMD efficiency that is crucial for GPU performance.
|
||||
|
||||
Zero value mode returns zero without branching when an access falls outside the valid buffer range. This is useful in convolutions where out-of-bounds accesses correspond to zero-padding.
|
||||
|
||||
Custom value mode returns a custom value without branching when an access falls outside the valid buffer range. Custom value mode accommodates algorithms that require specific values for boundary conditions.
|
||||
|
||||
[TO DO: there were two examples of custom value mode that I removed. I removed them because unlike for zero value mode where the example was convolution, the example was vague in custom value. Is there a more specific example of where custom value would be used?]
|
||||
The Custom Value Mode extends this concept by letting developers specify arbitrary sentinel values for invalid accesses. This flexibility accommodates algorithms that require specific values for boundary conditions, such as using negative infinity for maximum operations or special markers for missing data. The implementation maintains the same branchless characteristics, using conditional move instructions or predicated execution to avoid divergent control flow.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
@@ -158,92 +140,39 @@ Custom value mode returns a custom value without branching when an access falls
|
||||
data, buffer_size, custom_invalid);
|
||||
}
|
||||
|
||||
|
||||
When ``InvalidElementUseNumericalZeroValue`` is set to true, the system uses zero value mode for out of bounds checking. When ``InvalidElementUseNumericalZeroValue`` is set to false, custom value mode is used. Zero value mode is used by default.
|
||||
|
||||
.. note::
|
||||
|
||||
Zero or custom invalid value is only returned for complete invalid values or out of bound access, for example when the first address of the vector is invalid. Partial out of bounds access during vector reads will not return useful results.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// Create data array
|
||||
constexpr size_t buffer_size = 8;
|
||||
float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
|
||||
float custom_invalid = 13.0f;
|
||||
|
||||
// Create global memory buffer view with zero invalid value mode (default)
|
||||
auto buffer_view = make_buffer_view<address_space_enum::global>(data, buffer_size, custom_invalid);
|
||||
|
||||
// Invalid element access with is_valid_element=false
|
||||
// Returns custom_invalid due to custom invalid value mode
|
||||
auto invalid_value = buffer_view.template get<float>(0, 0, false);
|
||||
printf("Invalid element: %.1f\n", invalid_value.get(0));
|
||||
|
||||
// Out of bounds access - AMD buffer addressing handles bounds checking
|
||||
// Will return custom_invalid when accessing beyond buffer_size
|
||||
auto oob_value = buffer_view.template get<float>(0, 100, true);
|
||||
printf("Out of bounds: %.1f\n", oob_value.get(0));
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Get Operations
|
||||
--------------
|
||||
|
||||
[TO DO: might want to put this implementation detail in the reference section]
|
||||
Scalar Access
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
The signature for the ``buffer_view`` ``get()`` takes four parameters:
|
||||
The get operations in BufferView form the cornerstone of memory access patterns in CK Tile. These operations embody a advanced understanding of GPU memory systems and the patterns that lead to optimal performance. The scalar access interface incorporates multiple layers of optimization and safety mechanisms that work together to provide both performance and correctness.
|
||||
|
||||
``i``: the primary offset into the buffer expressed in terms of elements of type T rather than raw bytes.
|
||||
The parameter structure of scalar access operations reflects careful design choices aimed at maximizing flexibility while maintaining efficiency. The base index parameter ``i`` represents the primary offset into the buffer, expressed in terms of elements of type T rather than raw bytes. This type-aware indexing prevents common errors related to pointer arithmetic and ensures that vector types are handled correctly. The additional ``linear_offset`` parameter provides fine-grained control over the final access location, enabling complex access patterns without requiring expensive index calculations in the kernel code.
|
||||
|
||||
``linear_offset``: [TO DO: what is this?]
|
||||
The ``is_valid_element`` parameter provides a solution to conditional memory access. Rather than using traditional if-statements that would cause warp divergence, this boolean parameter enables predicated execution where the memory access occurs unconditionally but the result is conditionally used. This approach maintains uniform control flow across all threads in a warp, preserving the SIMD execution model that is fundamental to GPU performance.
|
||||
|
||||
``is_valid_element``: [TO DO: what is this?]
|
||||
The invalid value modes provide a mechanism for handling the boundary conditions that arise in parallel algorithms. When ``InvalidElementUseNumericalZeroValue`` is set to true, the system returns zero for any invalid access, whether due to the ``is_valid_element`` flag or out-of-bounds indexing. This mode is important for algorithms where zero serves as a natural extension value, such as in image processing with zero-padding or sparse matrix operations where missing elements are implicitly zero.
|
||||
|
||||
[TO DO: the last param, that's the out of bounds handling, yes?
|
||||
.. code:: cpp
|
||||
The custom invalid value mode, activated when ``InvalidElementUseNumericalZeroValue`` is false, offers additional flexibility for algorithms with specific boundary requirements. This mode returns a user-specified value for invalid accesses, accommodating use cases such as sentinel values in sorting algorithms, infinity values in optimization problems, or special markers in data processing pipelines. The implementation ensures that this flexibility comes without performance penalty, using the same branchless execution strategies as the zero mode.
|
||||
|
||||
get(index_t i,
|
||||
index_t linear_offset,
|
||||
bool is_valid_element,
|
||||
bool_constant<oob_conditional_check> = {})
|
||||
Out-of-bounds handling leverages AMD GPU hardware capabilities to provide safety with minimal impact to performance. When AMD buffer addressing is enabled, the hardware automatically clamps memory accesses to valid ranges, preventing the segmentation faults that would occur on CPU systems. This hardware-assisted bounds checking operates at wire speed, adding no overhead to the memory access path while ensuring that kernels cannot corrupt memory outside their allocated regions.
|
||||
|
||||
Vector Access
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
[TO DO: need some context around the code]
|
||||
Vector memory operations represent one of the most critical optimizations available in modern GPU programming, and BufferView's vector access interface exposes this capability. By using template parameters to specify vector types through constructs like ``ext_vector_t<float, N>``, the interface enables compile-time selection of optimal load and store instructions that can transfer multiple data elements in a single memory transaction. This vectorization is crucial for :ref:`ck_tile_load_store_traits`, which automatically selects optimal access patterns.
|
||||
|
||||
[TO DO: code chunks need to have detail and explanation so that the reader can see what they're trying to demonstrate.]
|
||||
The significance of vector operations extends beyond bandwidth improvements. GPUs are designed with wide memory buses that can transfer 128, 256, or even 512 bits per transaction. When scalar operations access only 32 bits at a time, they utilize only a fraction of this available bandwidth. Vector operations align with these wide buses, enabling full bandwidth utilization and reducing the total number of memory transactions required.
|
||||
|
||||
The implementation of vector access maintains the same parameter structure as scalar operations, providing consistency across the API while automatically handling the complexities of multi-element transfers. The system manages alignment requirements, ensures that vector loads and stores use the optimal hardware instructions, and handles cases where vector operations extend beyond buffer boundaries. This transparent handling of edge cases allows developers to use vector operations confidently without manual boundary checks or special-case code for partial vectors.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// Create buffer view
|
||||
float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
|
||||
auto buffer_view = make_buffer_view<address_space_enum::global>(data, 8);
|
||||
|
||||
// Simple get - compile-time bounds checking when possible
|
||||
auto value_buf = buffer_view.template get<float>(0,1,true); //get the buffer from the buffer view
|
||||
float value = value_buf.get(0); //get the value from the buffer
|
||||
|
||||
// Get with valid flag - branchless conditional access
|
||||
bool valid_flag = false;
|
||||
value_buf = buffer_view.template get<float>(0,1,valid_flag);
|
||||
value = value_buf.get(0);
|
||||
// Returns 0 valid_flag is false
|
||||
|
||||
// vectorized get
|
||||
using float2 = ext_vector_t<float, 2>;
|
||||
auto vector_buf = buffer_view.template get<float2>(0, 0, true);
|
||||
// Loads 2 floats in a single instruction
|
||||
float val1 = vector_buf.get(0);
|
||||
float val2 = vector_buf.get(1);
|
||||
}
|
||||
|
||||
``ext_vector_t<float, N>`` enables compile-time selection of optimal load and store instructions that can transfer multiple data elements in a single memory transaction.
|
||||
|
||||
[TO DO: what is it actually doing? When does one use scalars vs vectors? Is it application specific or are there ]
|
||||
Scalar vs Vectorized Memory Access
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
..
|
||||
Original mermaid diagram (edit here, then run update_diagrams.py)
|
||||
|
||||
..
|
||||
Original mermaid diagram (edit here, then run update_diagrams.py)
|
||||
|
||||
@@ -287,8 +216,9 @@ The signature for the ``buffer_view`` ``get()`` takes four parameters:
|
||||
Understanding BufferView Indexing
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
[TO DO: an explanation of the diagram is needed]
|
||||
|
||||
..
|
||||
Original mermaid diagram (edit here, then run update_diagrams.py)
|
||||
|
||||
..
|
||||
Original mermaid diagram (edit here, then run update_diagrams.py)
|
||||
|
||||
@@ -335,14 +265,69 @@ Understanding BufferView Indexing
|
||||
.. image:: diagrams/buffer_views_3.svg
|
||||
:alt: Diagram
|
||||
:align: center
|
||||
|
||||
|
||||
|
||||
C++ Get Operations
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
__device__ void example_get_operations()
|
||||
{
|
||||
// Create buffer view
|
||||
float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
|
||||
auto buffer_view = make_buffer_view<address_space_enum::global>(data, 8);
|
||||
|
||||
// Simple get - compile-time bounds checking when possible
|
||||
auto value_buf = buffer_view.template get<float>(0,1,true); //get the buffer from the buffer view
|
||||
float value = value_buf.get(0); //get the value from the buffer
|
||||
|
||||
// Get with valid flag - branchless conditional access
|
||||
bool valid_flag = false;
|
||||
value_buf = buffer_view.template get<float>(0,1,valid_flag);
|
||||
value = value_buf.get(0);
|
||||
// Returns 0 valid_flag is false
|
||||
|
||||
// vectorized get
|
||||
using float2 = ext_vector_t<float, 2>;
|
||||
auto vector_buf = buffer_view.template get<float2>(0, 0, true);
|
||||
// Loads 2 floats in a single instruction
|
||||
float val1 = vector_buf.get(0);
|
||||
float val2 = vector_buf.get(1);
|
||||
}
|
||||
|
||||
Custom Value Return Mode for OOB & Invalid Access
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
void scalar_get_operations_example() {
|
||||
|
||||
// Create data array
|
||||
constexpr size_t buffer_size = 8;
|
||||
float data[buffer_size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
|
||||
float custom_invalid = 13.0f;
|
||||
|
||||
// Create global memory buffer view with zero invalid value mode (default)
|
||||
auto buffer_view = make_buffer_view<address_space_enum::global>(data, buffer_size, custom_invalid);
|
||||
|
||||
// Invalid element access with is_valid_element=false
|
||||
// Returns custom_invalid due to custom invalid value mode
|
||||
auto invalid_value = buffer_view.template get<float>(0, 0, false);
|
||||
printf("Invalid element: %.1f\n", invalid_value.get(0));
|
||||
|
||||
// Out of bounds access - AMD buffer addressing handles bounds checking
|
||||
// Will return custom_invalid when accessing beyond buffer_size
|
||||
auto oob_value = buffer_view.template get<float>(0, 100, true);
|
||||
printf("Out of bounds: %.1f\n", oob_value.get(0));
|
||||
}
|
||||
|
||||
.. note::
|
||||
|
||||
Partial Out Of Bound (OOB) access during vector reads will return 'junk' values for the OOB access. Zero or custom invalid value is only returned for complete invalid/OOB access, in other words, it is only returned when the first address of the vector is invalid.
|
||||
|
||||
Update Operations
|
||||
-----------------
|
||||
|
||||
Update operations modify the buffer content. The ``set()`` method writes a value to a specific location.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
void scalar_set_operations_example() {
|
||||
@@ -373,8 +358,6 @@ Update operations modify the buffer content. The ``set()`` method writes a value
|
||||
Atomic Operations
|
||||
-----------------
|
||||
|
||||
[TO DO: this needs information]
|
||||
|
||||
Atomic vs Non-Atomic Operations
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
@@ -441,3 +424,21 @@ C++ Atomic Operations
|
||||
|
||||
__syncthreads();
|
||||
}
|
||||
|
||||
Summary
|
||||
-------
|
||||
|
||||
BufferView abstracts GPU memory hierarchies behind a concise interface. The approach is intended to keep overhead small while enabling optimizations that are otherwise awkward in low-level code.
|
||||
|
||||
BufferView offers a unified interface across global, shared, and register memory. Using the same API for each space can lower cognitive overhead, reduce certain classes of mistakes, and support code reuse via template parameters.
|
||||
|
||||
Address spaces are encoded in types so that common errors are reported at compile time. Consistent with CK Tile’s zero-overhead design aim, compile-time checks are favored over runtime guards. The C++ type system enforces memory-space constraints and can make valid cases more amenable to compiler optimization.
|
||||
|
||||
BufferView supports configurable handling of invalid values, optional runtime bounds checks, and conditional access patterns. It also provides atomic operations for thread-safe updates. These features are intended to cover common edge cases without adding unnecessary overhead.
|
||||
|
||||
By hiding the complexity of different memory spaces while exposing the operations needed for high-performance GPU computing, BufferView establishes a pattern that the rest of CK Tile follows: compile-time abstractions that enhance rather than compromise performance. The :ref:`ck_tile_tensor_views` and :ref:`ck_tile_distribution` add capability while maintaining the efficiency established at the base. For hardware-specific details about memory hierarchies, see :ref:`ck_tile_gpu_basics`.
|
||||
|
||||
Next Steps
|
||||
----------
|
||||
|
||||
Continue to :ref:`ck_tile_tensor_views` to learn how to build structured tensor views on top of buffer views.
|
||||
|
||||
Reference in New Issue
Block a user