Merge commit '0cc83cb8e8c9d9d926469f862bc1272ef0cf0dc8' into develop

This commit is contained in:
assistant-librarian[bot]
2026-01-27 16:17:03 +00:00
parent fea562ee53
commit 21657a1f32
25 changed files with 130 additions and 3160 deletions

View File

@@ -1,14 +1,13 @@
.. _ck_tile_index:
************************
CK Tile Index
************************
CK Tile documentation structure:
****************************************************
CK Tile conceptual documentation table of contents
****************************************************
.. toctree::
:maxdepth: 2
index
introduction_motivation
buffer_views
tensor_views

View File

@@ -1,156 +0,0 @@
# Mermaid Diagram Management
This document explains how to manage mermaid diagrams in the CK Tile documentation.
## Overview
All mermaid diagrams in the CK Tile documentation have been converted to SVG files for better rendering compatibility. The original mermaid source code is preserved as commented blocks in the RST files, allowing easy updates when needed.
## Directory Structure
- `docs/conceptual/ck_tile/diagrams/` - Contains all SVG diagram files
- `docs/conceptual/ck_tile/convert_mermaid_to_svg.py` - Initial conversion script (one-time use)
- `docs/conceptual/ck_tile/update_diagrams.py` - Helper script to regenerate diagrams from comments
## Diagram Format in RST Files
Each diagram follows this format:
```rst
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph TB
A --> B
B --> C
.. image:: diagrams/diagram_name.svg
:alt: Diagram
:align: center
```
The commented mermaid block won't appear in the rendered documentation but serves as the source for regenerating the SVG.
## Updating Diagrams
### When to Update
You need to regenerate SVG files when:
- Modifying the mermaid source in a commented block
- Adding new diagrams
- Updating diagram styling
### How to Update
1. **Edit the commented mermaid source** in the RST file
2. **Run the update script**:
```bash
# Update all diagrams
python docs/conceptual/ck_tile/update_diagrams.py
# Update diagrams in a specific file
python docs/conceptual/ck_tile/update_diagrams.py transforms.rst
# Force regenerate all diagrams (even if SVGs exist)
python docs/conceptual/ck_tile/update_diagrams.py --force
```
### Prerequisites
The update script requires [mermaid-cli](https://github.com/mermaid-js/mermaid-cli):
```bash
npm install -g @mermaid-js/mermaid-cli
```
## Adding New Diagrams
To add a new mermaid diagram:
1. **Create the commented block** in your RST file:
```rst
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph TB
A --> B
```
2. **Add the image reference** immediately after:
```rst
.. image:: diagrams/my_new_diagram.svg
:alt: My New Diagram
:align: center
```
3. **Generate the SVG**:
```bash
python docs/conceptual/ck_tile/update_diagrams.py your_file.rst
```
## Current Diagrams
The following RST files contain mermaid diagrams (40 total):
- `adaptors.rst` (2 diagrams)
- `convolution_example.rst` (1 diagram)
- `coordinate_movement.rst` (1 diagram)
- `descriptors.rst` (2 diagrams)
- `encoding_internals.rst` (2 diagrams)
- `lds_index_swapping.rst` (3 diagrams)
- `load_store_traits.rst` (2 diagrams)
- `space_filling_curve.rst` (1 diagram)
- `static_distributed_tensor.rst` (1 diagram)
- `sweep_tile.rst` (4 diagrams)
- `tensor_coordinates.rst` (2 diagrams)
- `thread_mapping.rst` (2 diagrams)
- `tile_window.rst` (5 diagrams)
- `transforms.rst` (12 diagrams)
## Troubleshooting
### SVG not generated
- Check that mermaid-cli is installed: `mmdc --version`
- Verify the mermaid syntax is valid
- Look for error messages in the script output
### Diagram not updating
- Use `--force` flag to regenerate: `python docs/conceptual/ck_tile/update_diagrams.py --force`
- Check that the image reference matches the generated filename
### Pattern not matching
If the update script can't find your commented diagram:
- Ensure proper indentation (3 spaces for comment block content)
- Verify the `.. mermaid::` directive is commented
- Check that the image reference immediately follows the comment block
## Script Details
### update_diagrams.py
This script:
1. Scans RST files for commented mermaid blocks
2. Extracts the mermaid source code
3. Converts to SVG using `mmdc`
4. Saves to the diagrams directory
**Usage:**
- `python docs/conceptual/ck_tile/update_diagrams.py` - Check all files, update missing SVGs
- `python docs/conceptual/ck_tile/update_diagrams.py --force` - Regenerate all SVGs
- `python docs/conceptual/ck_tile/update_diagrams.py <file.rst>` - Update specific file
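The scanning-and-extraction step the script performs can be sketched in plain Python. This is a hypothetical reimplementation for illustration only, not the actual `update_diagrams.py` source; it extracts the mermaid text from commented RST blocks and leaves the `mmdc` invocation out.

```python
def extract_commented_mermaid(rst_text):
    """Return mermaid sources found inside RST comment blocks (sketch only)."""
    blocks, current, in_comment = [], None, False
    for line in rst_text.splitlines():
        if line.strip() == "..":            # start of an RST comment block
            if current:
                blocks.append("\n".join(current))
            in_comment, current = True, None
        elif in_comment and line.startswith("   "):
            content = line[3:]
            if content.strip() == ".. mermaid::":
                current = []                # found the directive; collect its body
            elif current is not None and content.strip():
                # Note: strips inner indentation; mermaid is mostly whitespace-insensitive.
                current.append(content.strip())
        elif line.strip():                  # unindented text ends the comment block
            if current:
                blocks.append("\n".join(current))
            in_comment, current = False, None
    if current:
        blocks.append("\n".join(current))
    return blocks

rst = (
    "..\n"
    "   Original mermaid diagram\n"
    "\n"
    "   .. mermaid::\n"
    "\n"
    "      graph TB\n"
    "          A --> B\n"
    "\n"
    ".. image:: diagrams/x.svg\n"
)
print(extract_commented_mermaid(rst))   # ['graph TB\nA --> B']
```

The real script would then feed each extracted source to `mmdc` and write the SVG into the diagrams directory.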
### convert_mermaid_to_svg.py
This was the initial conversion script. It:
1. Found all active `.. mermaid::` directives
2. Converted them to SVGs
3. Replaced directives with commented source + image references
This script was used once for the initial conversion and typically doesn't need to be run again.

View File

@@ -59,8 +59,8 @@ A TensorAdaptor encapsulates a sequence of :ref:`coordinate transformations <ck_
.. image:: diagrams/adaptors_1.svg
:alt: Diagram
:align: center
Core Components
~~~~~~~~~~~~~~~
Each TensorAdaptor contains:
@@ -115,7 +115,7 @@ Custom adaptors can be created by specifying which transforms to use and how the
make_tuple(sequence<0>{}) // to single dim 0
);
// The adaptor is embedded in the :ref:`descriptor <ck_tile_descriptors>`
// The adaptor is embedded in the descriptor
// To use it:
multi_index<1> top_coord{5}; // 1D coordinate
// This internally calculates: row = 5/3 = 1, col = 5%3 = 2
@@ -309,7 +309,6 @@ A practical example showing how adaptors create efficient :ref:`GPU memory acces
// - Dimension 0,1: Thread indices
// - Dimension 2,3: Vector indices within thread
// Enables coalesced memory access on GPU
// See :ref:`ck_tile_thread_mapping` for thread mapping details
Common Transform Chains
-----------------------

View File

@@ -1,6 +1,25 @@
.. _ck_tile_buffer_views:
**********************************
Buffer Views - Raw Memory Access
**********************************
Overview
--------
At the foundation of the CK Tile system lies BufferView, a compile-time abstraction that provides structured access to raw memory regions within GPU kernels. It serves as the bridge between the hardware's physical memory model and the higher-level abstractions that enable efficient GPU programming. BufferView encapsulates the complexity of GPU memory hierarchies while exposing a unified interface that works seamlessly across different memory address spaces: global memory shared across the entire device, local data share (LDS) memory shared within a workgroup, and the register files private to each thread.
BufferView serves as the foundation for :ref:`ck_tile_tensor_views`, which add multi-dimensional structure on top of raw memory access. Understanding BufferView is essential before moving on to more complex abstractions like :ref:`ck_tile_distribution` and :ref:`ck_tile_tile_window`.
By providing compile-time knowledge of buffer properties through template metaprogramming, BufferView enables the compiler to generate optimal machine code for each specific use case. This zero-overhead abstraction ensures that the convenience of a high-level interface comes with no runtime performance penalty.
One of BufferView's most important features is its handling of out-of-bounds memory access. Unlike CPU programming, where such accesses typically result in segmentation faults or undefined behavior, GPU programming must gracefully handle cases where threads attempt to access memory beyond allocated boundaries. BufferView provides configurable strategies for these scenarios: developers can choose to return either zero or a custom sentinel value for invalid accesses. This flexibility is important for algorithms that naturally extend beyond data boundaries, such as convolutions with padding or matrix operations with non-aligned dimensions.
The abstraction extends beyond simple memory access to encompass both scalar and vector data types. GPUs achieve their highest efficiency when loading or storing multiple data elements in a single instruction. BufferView seamlessly supports these vectorized operations, automatically selecting the appropriate hardware instructions based on the data type and access pattern. This capability transforms what would be multiple memory transactions into single, efficient operations that fully utilize the available memory bandwidth.
BufferView also incorporates AMD GPU-specific optimizations that leverage unique hardware features. The AMD buffer addressing mode, for instance, provides hardware-accelerated bounds checking that ensures memory safety without the performance overhead of software-based checks. Similarly, BufferView exposes atomic operations that are crucial for parallel algorithms requiring thread-safe updates to shared data structures. These hardware-specific optimizations are abstracted behind a portable interface, ensuring that code remains maintainable while achieving optimal performance.
Memory coherence and caching policies represent another layer of complexity that BufferView manages transparently. Different GPU memory spaces have different coherence guarantees and caching behaviors. Global memory accesses can be cached in L1 and L2 caches with various coherence protocols, while LDS memory provides workgroup-level coherence with specialized banking structures (see :ref:`ck_tile_lds_bank_conflicts` for details on avoiding bank conflicts). BufferView encapsulates these details, automatically applying the appropriate memory ordering constraints and cache control directives based on the target address space and operation type.
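The configurable out-of-bounds behavior described above can be modeled in a few lines of Python. This is a conceptual sketch, not the CK Tile C++ API; the class name and interface are invented for illustration.

```python
# Conceptual model (not CK Tile) of a buffer view whose out-of-bounds reads
# return a configurable value instead of faulting.
class BufferViewModel:
    def __init__(self, data, oob_value=0):
        self.data = list(data)
        self.oob_value = oob_value   # zero by default, or a custom sentinel

    def get(self, index):
        # Mirrors hardware bounds checking: invalid indices never fault,
        # they yield the configured out-of-bounds value.
        if 0 <= index < len(self.data):
            return self.data[index]
        return self.oob_value

buf = BufferViewModel([1.0, 2.0, 3.0], oob_value=float("-inf"))
print(buf.get(1))    # 2.0
print(buf.get(10))   # -inf (sentinel, no fault)
```

Algorithms such as padded convolutions rely on exactly this guarantee: threads on the boundary read "past the end" and receive a well-defined value.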
Address Space Usage Patterns
----------------------------
@@ -51,6 +70,7 @@ Address Space Usage Patterns
.. image:: diagrams/buffer_views_1.svg
:alt: Diagram
:align: center
C++ Implementation
------------------

View File

@@ -59,10 +59,6 @@ The key insight is that convolution can be transformed from a complex nested loo
.. image:: diagrams/convolution_example.svg
:alt: Diagram
:align: center
@@ -88,7 +84,6 @@ Non-overlapping tiles:
// Original matrix: shape=(6, 6), strides=(6, 1)
// Tiled view: shape=(3, 3, 2, 2), strides=(12, 2, 6, 1)
// See :ref:`ck_tile_descriptors` for descriptor details
using TileDescriptor = TensorDescriptor<
Sequence<kNumTiles, kNumTiles, kTileSize, kTileSize>,
Sequence<12, 2, 6, 1>
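The stride arithmetic behind this tiled view can be verified outside the kernel. The following Python sketch (names are illustrative) checks that every (tile_row, tile_col, i, j) index of the tiled view addresses the same element as the original 6x6 row-major matrix:

```python
# Verify that the tiled-view strides (12, 2, 6, 1) for shape (3, 3, 2, 2)
# address the same elements as the original matrix with strides (6, 1).
def offset(idx, strides):
    return sum(i * s for i, s in zip(idx, strides))

orig_strides = (6, 1)           # original shape (6, 6)
tiled_strides = (12, 2, 6, 1)   # tiled shape (3, 3, 2, 2)

for tr in range(3):
    for tc in range(3):
        for i in range(2):
            for j in range(2):
                row, col = 2 * tr + i, 2 * tc + j
                assert offset((tr, tc, i, j), tiled_strides) == \
                       offset((row, col), orig_strides)
print("tiled view matches original layout")
```

Because 12·tr + 2·tc + 6·i + j equals 6·(2·tr + i) + (2·tc + j), the tiled descriptor is a pure re-indexing with no data movement.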
@@ -243,7 +238,6 @@ The im2col transformation converts the 4D windows tensor into a 2D matrix suitab
>;
// Step 2: Apply merge transforms to create 2D im2col layout
// See :ref:`ck_tile_transforms` for transform operations
using Im2colDescriptor = decltype(
transform_tensor_descriptor(
WindowsDescriptor{},
@@ -312,7 +306,6 @@ Combining all components into an optimized convolution implementation:
>;
// Tile distribution for matrix multiplication
// See :ref:`ck_tile_tile_distribution` for details
using ATileDist = TileDistribution<
Sequence<TileM, TileK>,
Sequence<BlockM, 1>
@@ -327,7 +320,6 @@ Combining all components into an optimized convolution implementation:
>;
// Thread-local accumulator
// See :ref:`ck_tile_static_distributed_tensor`
StaticDistributedTensor<DataType, CTileDist> c_accumulator;
// Initialize accumulator
@@ -339,7 +331,6 @@ Combining all components into an optimized convolution implementation:
// Main GEMM loop over K dimension
for (index_t k_tile = 0; k_tile < PatchSize; k_tile += TileK) {
// Create tile windows for im2col matrix and kernel
// See :ref:`ck_tile_tile_window` for window operations
auto a_window = make_tile_window<ATileDist>(
input, Im2colDesc{H, W, K},
{blockIdx.y * TileM, k_tile}
@@ -350,7 +341,7 @@ Combining all components into an optimized convolution implementation:
{k_tile, 0}
);
// Load tiles - see :ref:`ck_tile_load_store_traits` for optimization
// Load tiles
auto a_tile = a_window.load();
auto b_tile = b_window.load();
@@ -476,7 +467,6 @@ CK Tile enables several optimizations for convolution:
__shared__ float smem_b[TileK][TileN];
// Collaborative loading with proper bank conflict avoidance
// See :ref:`ck_tile_lds_bank_conflicts` for optimization
auto load_tile_to_smem = [&](auto& window, float smem[][TileK]) {
#pragma unroll
for (index_t i = threadIdx.y; i < TileM; i += blockDim.y) {
@@ -560,7 +550,7 @@ This example demonstrates how CK Tile transforms convolution from a memory-bound
- **Sliding windows** can be efficiently represented using tensor descriptors with appropriate strides
- **Im2col transformation** converts convolution to matrix multiplication without data copies
- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_tile_distribution`)
- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_distribution`)
- **Multi-channel support** extends naturally through higher-dimensional descriptors
- **Performance optimizations** like vectorization and shared memory are seamlessly integrated (see :ref:`ck_tile_gemm_optimization` for similar techniques)

View File

@@ -317,7 +317,7 @@ Movement Through Adaptors
Advanced Movement Patterns
==========================
Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_tile_distribution` concepts:
Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_distribution` concepts:
Tiled Access Pattern
--------------------

View File

@@ -315,18 +315,18 @@ Padding for Convolution
.. code-block:: cpp
// Add padding to spatial dimensions
auto padded = transform_tensor_descriptor(
input_tensor,
make_tuple(
make_pass_through_transform(N), // Batch
make_pass_through_transform(C), // Channel
make_pad_transform(H, pad_h, pad_h), // Height
make_pad_transform(W, pad_w, pad_w) // Width
),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
);
// Add padding to spatial dimensions
auto padded = transform_tensor_descriptor(
input_tensor,
make_tuple(
make_pass_through_transform(N), // Batch
make_pass_through_transform(C), // Channel
make_pad_transform(H, pad_h, pad_h), // Height
make_pad_transform(W, pad_w, pad_w) // Width
),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
);
For a complete convolution example, see :ref:`ck_tile_convolution_example`.

View File

@@ -260,7 +260,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
index_t K)
{
// Define tile distribution encoding
// See :ref:`ck_tile_encoding_internals` and :ref:`ck_tile_tile_distribution`
using Encoding = tile_distribution_encoding<
sequence<>, // No replication
tuple<sequence<4, 2, 8, 4>, // M dimension hierarchy
@@ -274,7 +273,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
constexpr auto tile_dist = make_static_tile_distribution(Encoding{});
// Create tensor views for global memory
// See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_buffer_views`
auto a_global_view = make_naive_tensor_view<address_space_enum::global>(
a_global, make_tuple(M, K), make_tuple(K, 1));
auto b_global_view = make_naive_tensor_view<address_space_enum::global>(
@@ -287,7 +285,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
const index_t block_n_id = blockIdx.x;
// Create tile windows for loading
// See :ref:`ck_tile_tile_window` for tile window details
auto a_window = make_tile_window(
a_global_view,
make_tuple(number<MPerBlock>{}, number<KPerBlock>{}),
@@ -301,7 +298,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
tile_dist);
// Allocate LDS storage
// See :ref:`ck_tile_static_distributed_tensor` for distributed tensors
auto a_lds = make_static_distributed_tensor<ADataType,
decltype(tile_dist)>();
auto b_lds = make_static_distributed_tensor<BDataType,
@@ -310,7 +306,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
// Initialize accumulator
auto c_reg = make_static_distributed_tensor<CDataType,
decltype(tile_dist)>();
// See :ref:`ck_tile_sweep_tile` for sweep operations
sweep_tile(c_reg, [](auto idx, auto& val) { val = 0; });
// Main GEMM loop with pipelining
@@ -324,7 +319,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
// Pipeline loop
for(index_t k_tile = 0; k_tile < num_k_tiles - 1; ++k_tile) {
// Move windows for next iteration
// See :ref:`ck_tile_coordinate_movement` for window movement
a_window.move_slice_window(make_tuple(0, KPerBlock));
b_window.move_slice_window(make_tuple(0, KPerBlock));

View File

@@ -172,7 +172,6 @@ Example usage in CK Tile:
a_window.load(a_lds_tensor);
// Subsequent reads from LDS are conflict-free
// See :ref:`ck_tile_sweep_tile` for sweep operations
sweep_tile(a_lds_tensor, [](auto idx, auto& val) {
// Process data...
});

View File

@@ -276,7 +276,7 @@ The foundation of the exploration begins with raw memory access through :ref:`ck
With these foundational concepts established, the documentation delves into the :ref:`ck_tile_coordinate_systems` that powers tile distribution. This engine implements the mathematical framework that has been introduced, providing compile-time transformations between P-space, Y-space, X-space, and D-space. Understanding these transformations at a deep level enables developers to reason about performance implications and design custom distribution strategies for novel algorithms. The :ref:`ck_tile_transforms` and :ref:`ck_tile_adaptors` provide the building blocks for these transformations.
The high-level :ref:`ck_tile_distribution` APIs represent the culmination of these lower-level abstractions. These APIs provide an accessible interface for common patterns while exposing enough flexibility for advanced optimizations. Through concrete examples and detailed explanations, the documentation will demonstrate how to leverage these APIs to achieve near-optimal performance across a variety of computational patterns. The :ref:`ck_tile_window` abstraction provides the gateway for efficient data access.
The high-level :ref:`ck_tile_distribution` APIs represent the culmination of these lower-level abstractions. These APIs provide an accessible interface for common patterns while exposing enough flexibility for advanced optimizations. Through concrete examples and detailed explanations, the documentation will demonstrate how to leverage these APIs to achieve near-optimal performance across a variety of computational patterns. The :ref:`ck_tile_tile_window` abstraction provides the gateway for efficient data access.
The exploration of coordinate systems goes beyond the basic P, Y, X, D framework to encompass advanced topics such as multi-level tiling, replication strategies, and specialized coordinate systems for specific algorithm classes. The :ref:`ck_tile_encoding_internals` reveals the mathematical foundations, while :ref:`ck_tile_thread_mapping` shows how these abstractions map to hardware. This comprehensive treatment ensures that developers can handle not just common cases but also novel algorithms that require custom distribution strategies.

View File

@@ -5,7 +5,7 @@
.. _ck_tile_lds_index_swapping:
********************************
Load Datat Share Index Swapping
Load Data Share Index Swapping
********************************
Overview
@@ -70,9 +70,9 @@ The original K coordinate is split into K0 and K1, where K1 represents the threa
The XOR transformation updates the K0 coordinate using the formula:
.. code-block:: cpp
.. math::
K0' = K0 ^ (M % (KPerBlock / KPack * MLdsLayer))
K0' = K0 \oplus (M \bmod (KPerBlock / KPack \cdot MLdsLayer))
This XOR operation redistributes accesses across memory banks by mixing bits from the M and K dimensions.
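A quick numeric sketch (in Python, with illustrative parameter values rather than a real kernel configuration) shows how the XOR spreads accesses that would otherwise all collide on K0 = 0:

```python
# Illustrative parameters, not a real kernel configuration.
KPerBlock, KPack, MLdsLayer = 32, 8, 4
span = KPerBlock // KPack * MLdsLayer   # 16

def swizzle_k0(k0, m):
    # K0' = K0 XOR (M mod span), as in the formula above.
    return k0 ^ (m % span)

# Without swizzling, K0 = 0 lands in the same slot for every M row.
# With swizzling, consecutive rows are spread across different slots:
print([swizzle_k0(0, m) for m in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because XOR is its own inverse, the transformation is a bijection: no two (M, K0) pairs are mapped onto the same storage location.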
@@ -132,10 +132,10 @@ The transformed K0' is split into L and K0'' components, creating an intermediat
The unmerge operation:
.. code-block:: cpp
.. math::
L = K0' / (KPerBlock/KPack)
K0'' = K0' % (KPerBlock/KPack)
K0'' = K0' \bmod (KPerBlock/KPack)
When MLdsLayer == 1, this simplifies to L=0 and K0''=K0'.
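The unmerge arithmetic can be sketched the same way, again with illustrative parameters:

```python
# Illustrative parameters, not a real kernel configuration.
KPerBlock, KPack = 32, 8
k_slots = KPerBlock // KPack   # 4

def unmerge(k0_prime):
    # Split K0' into (L, K0'') per the formulas above.
    return k0_prime // k_slots, k0_prime % k_slots

print(unmerge(9))   # (2, 1)
# With MLdsLayer == 1, K0' stays below k_slots, so L is always 0:
print(unmerge(3))   # (0, 3)
```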

View File

@@ -71,7 +71,6 @@ The LoadStoreTraits class analyzes distribution patterns at compile time:
static constexpr index_t scalars_per_access = scalar_per_vector;
// Space-filling curve for optimal traversal
// See :ref:`ck_tile_space_filling_curve` for details
using sfc_type = space_filling_curve<ndim_y>;
static constexpr sfc_type sfc_ys = make_space_filling_curve<Distribution>();
@@ -274,7 +273,7 @@ LoadStoreTraits optimizes for several performance metrics:
return Traits::num_access;
}
// Check coalescing efficiency (see :ref:`ck_tile_gpu_basics`)
// Check coalescing efficiency
static constexpr bool is_perfectly_coalesced()
{
// Perfect coalescing when adjacent threads access adjacent memory
@@ -316,7 +315,6 @@ Comparing Different Configurations
static_assert(OptimizedAnalyzer::bandwidth_utilization() == 50.0f); // 8*4/64 = 50%
// Better bandwidth utilization leads to improved performance
// See :ref:`ck_tile_gemm_optimization` for real-world examples
Integration with Space-Filling Curves
-------------------------------------

View File

@@ -254,7 +254,6 @@ For :ref:`matrix multiplication <ck_tile_gemm_optimization>`, optimal access pat
// GEMM tile: 16x32 with vector-8 loads
// Column-major for coalesced access in GEMM
// See :ref:`ck_tile_gemm_optimization` for complete example
using GemmTileCurve = space_filling_curve<
2,
sequence<16, 32>, // Tile size
@@ -336,7 +335,7 @@ Optimizing for Hardware
.. code-block:: cpp
// Optimize for GPU memory coalescing (see :ref:`ck_tile_gpu_basics`)
// Optimize for GPU memory coalescing
template <typename DataType, index_t WarpSize = 32>
struct coalesced_access_pattern
{
@@ -411,7 +410,6 @@ LoadStoreTraits Integration
struct load_store_traits
{
// Create optimized space-filling curve
// See :ref:`ck_tile_tile_distribution` for Distribution details
using sfc_type = space_filling_curve<
Distribution::ndim_y,
typename Distribution::y_lengths,
@@ -461,7 +459,6 @@ Best Practices
.. code-block:: cpp
// Match vector size to cache line for optimal bandwidth
// See :ref:`ck_tile_lds_bank_conflicts` for cache optimization
constexpr index_t optimal_vector = min(
tensor_length_fast_dim,
cache_line_size / sizeof(DataType)
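The heuristic can be checked with plain integer arithmetic; the following Python sketch uses illustrative sizes:

```python
# Model of the vector-size heuristic above: the vector width is capped by
# both the fast-dimension length and the cache line size (sizes illustrative).
def optimal_vector(tensor_length_fast_dim, cache_line_size, elem_size):
    return min(tensor_length_fast_dim, cache_line_size // elem_size)

# 64-byte cache line, 4-byte floats -> at most 16 elements per access,
# clamped by the fast-dimension length when it is shorter.
print(optimal_vector(256, 64, 4))   # 16
print(optimal_vector(8, 64, 4))     # 8
```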

View File

@@ -17,9 +17,9 @@ Each thread in a workgroup owns a portion of the overall tensor data, stored in
This design enables three critical optimizations:
* It maximizes register utilization by keeping frequently accessed data in the fastest memory hierarchy.
* It eliminates redundant memory accesses since each thread maintains its own working set.
* It provides a clean abstraction for complex algorithms like matrix multiplication where each thread accumulates partial results that eventually combine into the final output.
Thread-Local Storage Model
==========================
@@ -384,8 +384,7 @@ Static distributed tensors integrate seamlessly with other CK Tile components:
// Main GEMM loop
for(index_t k_tile = 0; k_tile < K; k_tile += kTileK) {
// Create tile windows for this iteration
// See :ref:`ck_tile_tile_window` for details
auto a_window = make_tile_window(
a_ptr, ALayout{M, K},
ATileDist{},
{blockIdx.y * kTileM, k_tile}
@@ -398,7 +397,6 @@ Static distributed tensors integrate seamlessly with other CK Tile components:
);
// Load tiles to distributed tensors
// See :ref:`ck_tile_load_store_traits` for optimized loading
auto a_tile = a_window.load();
auto b_tile = b_window.load();

View File

@@ -356,7 +356,6 @@ CK uses several techniques to optimize memory access:
float>>>;
// 2. Swizzling to avoid bank conflicts
// See :ref:`ck_tile_lds_index_swapping` and :ref:`ck_tile_swizzling_example`
template <index_t BankSize = 32>
__device__ index_t swizzle_offset(index_t tid, index_t offset)
{
@@ -434,7 +433,6 @@ The following example shows how thread mapping works in a CK kernel:
__shared__ ComputeType shared_sum[BlockSize];
// 5. Create tensor view and tile window
// See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_tile_window`
auto x_view = make_naive_tensor_view<address_space_enum::global>(
x + bid * hidden_size,
make_tuple(hidden_size),

View File

@@ -1,4 +1,4 @@
.. _ck_tile_distribution:
.. _ck_tile_tile_distribution:
Tile Distribution - The Core API
================================

View File

@@ -283,7 +283,7 @@ Creating and Using TileWindow
using namespace ck_tile;
// Create a tensor view for input data (see :ref:`ck_tile_tensor_views`)
// Create a tensor view for input data
auto tensor_view = make_naive_tensor_view(
data_ptr,
make_tuple(256, 256), // Shape
@@ -314,7 +314,7 @@ Creating and Using TileWindow
distribution
);
// Load data into distributed tensor (see :ref:`ck_tile_static_distributed_tensor`)
// Load data into distributed tensor
auto distributed_data = make_static_distributed_tensor<float>(distribution);
window.load(distributed_data);
@@ -558,7 +558,6 @@ Complete Load-Compute-Store Pipeline
c_dist);
// Create distributed tensors for register storage
// See :ref:`ck_tile_static_distributed_tensor` for details
auto a_reg = make_static_distributed_tensor<AType>(a_dist);
auto b_reg = make_static_distributed_tensor<BType>(b_dist);
auto c_reg = make_static_distributed_tensor<CType>(c_dist);
@@ -620,6 +619,8 @@ Performance Characteristics
.. image:: diagrams/tile_window_5.svg
:alt: Diagram
:align: center
Best Practices
--------------

View File

@@ -302,7 +302,7 @@ EmbedTransform expands linear indices from the lower coordinate space into multi
using namespace ck_tile;
// Create embed transform for 2x3 tensor with strides [12, 1]
// This is commonly used in :ref:`descriptors <ck_tile_descriptors>`
// This is commonly used in descriptors
auto transform = make_embed_transform(make_tuple(2, 3), make_tuple(12, 1));
// Forward: Linear → 2D (Manual calculation)
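The stride arithmetic of this embed transform can be reproduced in a few lines of Python. This models only the coordinate math for lengths (2, 3) and strides (12, 1), not the CK Tile API itself:

```python
# Coordinate math of the embed transform above: a (row, col) index in the
# 2x3 space maps to a linear offset via the strides (12, 1).
lengths, strides = (2, 3), (12, 1)

def embed_forward(i, j):
    return i * strides[0] + j * strides[1]

# Enumerate the 2x3 coordinate space and the offsets it embeds to.
offsets = [embed_forward(i, j) for i in range(lengths[0]) for j in range(lengths[1])]
print(offsets)   # [0, 1, 2, 12, 13, 14]
```

Note the gap between offsets 2 and 12: with a row stride of 12, the embedded coordinates are sparse in the underlying linear space.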

View File

@@ -30,8 +30,6 @@ release = version_number
external_toc_path = "./sphinx/_toc.yml"
docs_core = ROCmDocs(left_nav_title)
docs_core.run_doxygen(doxygen_root="doxygen", doxygen_path="doxygen/xml")
docs_core.enable_api_reference()
docs_core.setup()
external_projects_current_project = "composable_kernel"
@@ -50,4 +48,4 @@ for sphinx_var in ROCmDocs.SPHINX_VARS:
extensions += ['sphinxcontrib.bibtex']
bibtex_bibfiles = ['refs.bib']
cpp_id_attributes = ["__global__", "__device__", "__host__"]

File diff suppressed because it is too large

View File

@@ -25,7 +25,7 @@ The Composable Kernel repository is located at `https://github.com/ROCm/composab
* :doc:`Composable Kernel structure <./conceptual/Composable-Kernel-structure>`
* :doc:`Composable Kernel mathematical basis <./conceptual/Composable-Kernel-math>`
* :doc:`CK Tile conceptual documentation <./conceptual/ck_tile/index>`
* :doc:`CK Tile conceptual documentation <./conceptual/ck_tile/CK-tile-index>`
.. grid-item-card:: Tutorials
@@ -37,9 +37,6 @@ The Composable Kernel repository is located at `https://github.com/ROCm/composab
* :doc:`Composable Kernel custom types <./reference/Composable_Kernel_custom_types>`
* :doc:`Composable Kernel vector utilities <./reference/Composable_Kernel_vector_utilities>`
* :ref:`wrapper`
* :doc:`Composable Kernel API reference <./doxygen/html/namespace_c_k>`
* :doc:`CK Tile API reference <./doxygen/html/namespaceck__tile>`
* :doc:`Composable Kernel complete API class list <./doxygen/html/annotated>`
* :doc:`Composable Kernel glossary <./reference/Composable-Kernel-Glossary>`
To contribute to the documentation refer to `Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.

View File

@@ -4,7 +4,6 @@
***************************************************
Composable Kernel glossary
***************************************************
.. glossary::
@@ -14,7 +13,7 @@ Composable Kernel glossary
The arithmetic logic unit (ALU) is the GPU component responsible for arithmetic and logic operations.
compute unit
The compute unit (CU) is the parallel vector processor in an AMD GPU with multiple :term:`ALUs<arithmetic logic unit>`. Each compute unit will run all the :term:`wavefronts<wavefront>` in a :term:`work group>`. A compute unit is equivalent to NVIDIA's streaming multiprocessor.
The compute unit (CU) is the parallel vector processor in an AMD GPU with multiple :term:`ALUs<arithmetic logic unit>`. Each compute unit will run all the :term:`wavefronts<wavefront>` in a :term:`work group`. A compute unit is equivalent to NVIDIA's streaming multiprocessor.
matrix core
A matrix core is a specialized GPU unit that accelerates matrix operations for AI and deep learning tasks. A GPU contains multiple matrix cores.
@@ -32,7 +31,7 @@ Composable Kernel glossary
See :term:`scalar general purpose register`.
scalar general purpose register
A scalar general purpose register (SGPR) is a :term:`register` shared by all the :term:`work items<work item>` in a :term:`wave<wavefront>`. SGPRs are used for constants, addresses, and control flow common across the entire wave.
A scalar general purpose register (SGPR) is a :term:`register` shared by all the :term:`work-items<work-item>` in a :term:`wave<wavefront>`. SGPRs are used for constants, addresses, and control flow common across the entire wave.
LDS
See :term:`local data share`.
@@ -101,7 +100,7 @@ Composable Kernel glossary
A Composable Kernel pipeline schedules the sequence of operations for a :term:`kernel`, such as the data loading, computation, and storage phases. A pipeline consists of a :term:`problem` and a :term:`policy`.
tile partitioner
The tile partitioner defines the mapping between the :term:`problem` dimensions and GPU hierarchy. It specifies :term:`workgroup`-level :term:`tile` sizes and determines :term:`grid` dimensions by dividing the problem size by the tile sizes.
The tile partitioner defines the mapping between the :term:`problem` dimensions and GPU hierarchy. It specifies :term:`work group`-level :term:`tile` sizes and determines :term:`grid` dimensions by dividing the problem size by the tile sizes.
problem
The problem is the part of the :term:`pipeline` that defines input and output shapes, data types, and mathematical :term:`operations<operation>`.
@@ -186,10 +185,10 @@ Composable Kernel glossary
Viewport into a larger tensor that defines the current tile's position and boundaries for computation.
load tile
Load tile is an operation that transfers data from :term:`global memory` or the :term:`load data share` to :term:`vector general purpose registers<vector general purpose register>`.
Load tile is an operation that transfers data from :term:`global memory` or the :term:`local data share` to :term:`vector general purpose registers<vector general purpose register>`.
store tile
Store tile is an operation that transfers data from :term:`vector general purpose registers<vector general purpose register>` to :term:`global memory` or the :term:`load data share`.
Store tile is an operation that transfers data from :term:`vector general purpose registers<vector general purpose register>` to :term:`global memory` or the :term:`local data share`.
descriptor
Metadata structure that defines :term:`tile` properties, memory layouts, and coordinate transformations for Composable Kernel :term:`operations<operation>`.

View File

@@ -54,36 +54,3 @@ Advanced examples:
* `Image to column <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_img2col.cpp>`_
* `Basic gemm <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_basic_gemm.cpp>`_
* `Optimized gemm <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_optimized_gemm.cpp>`_
-------------------------------------
Layout
-------------------------------------
.. doxygenstruct:: Layout
-------------------------------------
Layout helpers
-------------------------------------
.. doxygenfile:: include/ck/wrapper/utils/layout_utils.hpp
-------------------------------------
Tensor
-------------------------------------
.. doxygenstruct:: Tensor
-------------------------------------
Tensor helpers
-------------------------------------
.. doxygenfile:: include/ck/wrapper/utils/tensor_utils.hpp
.. doxygenfile:: include/ck/wrapper/utils/tensor_partition.hpp
-------------------------------------
Operations
-------------------------------------
.. doxygenfile:: include/ck/wrapper/operations/copy.hpp
.. doxygenfile:: include/ck/wrapper/operations/gemm.hpp
@@ -6,42 +6,36 @@ subtrees:
- caption: Install
entries:
- file: install/Composable-Kernel-prerequisites.rst
title: Composable Kernel prerequisites
title: Prerequisites
- file: install/Composable-Kernel-install.rst
title: Build and install Composable Kernel
- file: install/Composable-Kernel-Docker.rst
title: Composable Kernel Docker images
title: Docker images
- caption: Conceptual
entries:
- file: conceptual/Composable-Kernel-structure.rst
title: Composable Kernel structure
title: Structure
- file: conceptual/Composable-Kernel-math.rst
title: Composable Kernel mathematical basis
- file: conceptual/ck_tile/index.rst
title: Mathematical basis
- file: conceptual/ck_tile/CK-tile-index.rst
title: CK Tile conceptual documentation
- caption: Tutorial
entries:
- file: tutorial/Composable-Kernel-examples.rst
title: Composable Kernel examples
title: Examples
- caption: Reference
entries:
- file: reference/Composable_Kernel_supported_scalar_types.rst
title: Composable Kernel scalar types
title: Scalar types
- file: reference/Composable_Kernel_custom_types.rst
title: Composable Kernel custom types
title: Custom types
- file: reference/Composable_Kernel_vector_utilities.rst
title: Composable Kernel vector utilities
title: Vector utilities
- file: reference/Composable-Kernel-wrapper.rst
title: Composable Kernel wrapper
- file: doxygen/html/namespace_c_k.rst
title: CK API reference
- file: doxygen/html/namespaceck__tile.rst
title: CK Tile API reference
- file: doxygen/html/annotated.rst
title: Full API class list
title: Wrapper
- file: reference/Composable-Kernel-Glossary.rst
title: Glossary
@@ -8,9 +8,9 @@ accessible-pygments==0.0.5
# via pydata-sphinx-theme
alabaster==1.0.0
# via sphinx
asttokens==3.0.0
asttokens==3.0.1
# via stack-data
attrs==25.3.0
attrs==25.4.0
# via
# jsonschema
# jupyter-cache
@@ -19,40 +19,30 @@ babel==2.17.0
# via
# pydata-sphinx-theme
# sphinx
beautifulsoup4==4.13.4
beautifulsoup4==4.14.3
# via pydata-sphinx-theme
breathe==4.36.0
# via rocm-docs-core
certifi==2025.1.31
certifi==2026.1.4
# via requests
cffi==1.17.1
cffi==2.0.0
# via
# cryptography
# pynacl
charset-normalizer==3.4.1
charset-normalizer==3.4.4
# via requests
click==8.1.8
click==8.3.1
# via
# click-log
# doxysphinx
# jupyter-cache
# sphinx-external-toc
click-log==0.4.0
# via doxysphinx
comm==0.2.2
comm==0.2.3
# via ipykernel
contourpy==1.3.2
# via matplotlib
cryptography==44.0.2
cryptography==46.0.3
# via pyjwt
cycler==0.12.1
# via matplotlib
debugpy==1.8.14
debugpy==1.8.19
# via ipykernel
decorator==5.2.1
# via ipython
deprecated==1.2.18
# via pygithub
docutils==0.21.2
# via
# myst-parser
@@ -60,35 +50,31 @@ docutils==0.21.2
# pydata-sphinx-theme
# sphinx
# sphinxcontrib-bibtex
doxysphinx==3.3.12
# via rocm-docs-core
exceptiongroup==1.2.2
exceptiongroup==1.3.1
# via ipython
executing==2.2.0
executing==2.2.1
# via stack-data
fastjsonschema==2.21.1
fastjsonschema==2.21.2
# via
# nbformat
# rocm-docs-core
fonttools==4.57.0
# via matplotlib
gitdb==4.0.12
# via gitpython
gitpython==3.1.44
gitpython==3.1.46
# via rocm-docs-core
greenlet==3.2.1
greenlet==3.3.0
# via sqlalchemy
idna==3.10
idna==3.11
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==8.6.1
importlib-metadata==8.7.1
# via
# jupyter-cache
# myst-nb
ipykernel==6.29.5
ipykernel==7.1.0
# via myst-nb
ipython==8.35.0
ipython==8.38.0
# via
# ipykernel
# myst-nb
@@ -98,53 +84,43 @@ jinja2==3.1.6
# via
# myst-parser
# sphinx
jsonschema==4.23.0
jsonschema==4.26.0
# via nbformat
jsonschema-specifications==2024.10.1
jsonschema-specifications==2025.9.1
# via jsonschema
jupyter-cache==1.0.1
# via myst-nb
jupyter-client==8.6.3
jupyter-client==8.8.0
# via
# ipykernel
# nbclient
jupyter-core==5.7.2
jupyter-core==5.9.1
# via
# ipykernel
# jupyter-client
# nbclient
# nbformat
kiwisolver==1.4.8
# via matplotlib
latexcodec==3.0.0
latexcodec==3.0.1
# via pybtex
libsass==0.22.0
# via doxysphinx
lxml==5.2.1
# via doxysphinx
markdown-it-py==3.0.0
# via
# mdit-py-plugins
# myst-parser
markupsafe==3.0.2
markupsafe==3.0.3
# via jinja2
matplotlib==3.10.1
# via doxysphinx
matplotlib-inline==0.1.7
matplotlib-inline==0.2.1
# via
# ipykernel
# ipython
mdit-py-plugins==0.4.2
mdit-py-plugins==0.5.0
# via myst-parser
mdurl==0.1.2
# via markdown-it-py
mpire==2.10.2
# via doxysphinx
myst-nb==1.2.0
myst-nb==1.3.0
# via rocm-docs-core
myst-parser==4.0.1
# via myst-nb
nbclient==0.10.2
nbclient==0.10.4
# via
# jupyter-cache
# myst-nb
@@ -155,28 +131,20 @@ nbformat==5.10.4
# nbclient
nest-asyncio==1.6.0
# via ipykernel
numpy==1.26.4
# via
# contourpy
# doxysphinx
# matplotlib
packaging==25.0
# via
# ipykernel
# matplotlib
# pydata-sphinx-theme
# sphinx
parso==0.8.4
parso==0.8.5
# via jedi
pexpect==4.9.0
# via ipython
pillow==11.2.1
# via matplotlib
platformdirs==4.3.7
platformdirs==4.5.1
# via jupyter-core
prompt-toolkit==3.0.51
prompt-toolkit==3.0.52
# via ipython
psutil==7.0.0
psutil==7.2.1
# via ipykernel
ptyprocess==0.7.0
# via pexpect
@@ -188,36 +156,27 @@ pybtex==0.25.1
# sphinxcontrib-bibtex
pybtex-docutils==1.0.3
# via sphinxcontrib-bibtex
pycparser==2.22
pycparser==2.23
# via cffi
pydata-sphinx-theme==0.15.4
# via
# rocm-docs-core
# sphinx-book-theme
pygithub==2.6.1
pygithub==2.8.1
# via rocm-docs-core
pygments==2.19.1
pygments==2.19.2
# via
# accessible-pygments
# ipython
# mpire
# pydata-sphinx-theme
# sphinx
pyjson5==1.6.8
# via doxysphinx
pyjwt[crypto]==2.10.1
# via pygithub
pynacl==1.5.0
pynacl==1.6.2
# via pygithub
pyparsing==3.2.3
# via
# doxysphinx
# matplotlib
python-dateutil==2.9.0.post0
# via
# jupyter-client
# matplotlib
pyyaml==6.0.2
# via jupyter-client
pyyaml==6.0.3
# via
# jupyter-cache
# myst-nb
@@ -225,21 +184,21 @@ pyyaml==6.0.2
# pybtex
# rocm-docs-core
# sphinx-external-toc
pyzmq==26.4.0
pyzmq==27.1.0
# via
# ipykernel
# jupyter-client
referencing==0.36.2
referencing==0.37.0
# via
# jsonschema
# jsonschema-specifications
requests==2.32.3
requests==2.32.5
# via
# pygithub
# sphinx
rocm-docs-core[api-reference]==1.31.3
# via -r requirements.in
rpds-py==0.24.0
rpds-py==0.30.0
# via
# jsonschema
# referencing
@@ -247,9 +206,9 @@ six==1.17.0
# via python-dateutil
smmap==5.0.2
# via gitdb
snowballstemmer==2.2.0
snowballstemmer==3.0.1
# via sphinx
soupsieve==2.7
soupsieve==2.8.1
# via beautifulsoup4
sphinx==8.1.3
# via
@@ -288,23 +247,20 @@ sphinxcontrib-qthelp==2.0.0
# via sphinx
sphinxcontrib-serializinghtml==2.0.0
# via sphinx
sqlalchemy==2.0.40
sqlalchemy==2.0.45
# via jupyter-cache
stack-data==0.6.3
# via ipython
tabulate==0.9.0
# via jupyter-cache
tomli==2.2.1
tomli==2.4.0
# via sphinx
tornado==6.4.2
tornado==6.5.4
# via
# ipykernel
# jupyter-client
tqdm==4.67.1
# via mpire
traitlets==5.14.3
# via
# comm
# ipykernel
# ipython
# jupyter-client
@@ -312,22 +268,22 @@ traitlets==5.14.3
# matplotlib-inline
# nbclient
# nbformat
typing-extensions==4.13.2
typing-extensions==4.15.0
# via
# beautifulsoup4
# cryptography
# exceptiongroup
# ipython
# myst-nb
# pydata-sphinx-theme
# pygithub
# referencing
# sqlalchemy
urllib3==2.4.0
urllib3==2.6.3
# via
# pygithub
# requests
wcwidth==0.2.13
wcwidth==0.2.14
# via prompt-toolkit
wrapt==1.17.2
# via deprecated
zipp==3.21.0
zipp==3.23.0
# via importlib-metadata