Merge commit '0cc83cb8e8c9d9d926469f862bc1272ef0cf0dc8' into develop

This commit is contained in:
assistant-librarian[bot]
2026-01-27 16:17:03 +00:00
parent fea562ee53
commit 21657a1f32
25 changed files with 130 additions and 3160 deletions

View File

@@ -1,14 +1,13 @@
.. _ck_tile_index:
************************
CK Tile Index
************************
CK Tile documentation structure:
****************************************************
CK Tile conceptual documentation table of contents
****************************************************
.. toctree::
:maxdepth: 2
index
introduction_motivation
buffer_views
tensor_views

View File

@@ -1,156 +0,0 @@
# Mermaid Diagram Management
This document explains how to manage mermaid diagrams in the CK Tile documentation.
## Overview
All mermaid diagrams in the CK Tile documentation have been converted to SVG files for better rendering compatibility. The original mermaid source code is preserved as commented blocks in the RST files, allowing easy updates when needed.
## Directory Structure
- `docs/conceptual/ck_tile/diagrams/` - Contains all SVG diagram files
- `docs/conceptual/ck_tile/convert_mermaid_to_svg.py` - Initial conversion script (one-time use)
- `docs/conceptual/ck_tile/update_diagrams.py` - Helper script to regenerate diagrams from comments
## Diagram Format in RST Files
Each diagram follows this format:
```rst
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph TB
A --> B
B --> C
.. image:: diagrams/diagram_name.svg
:alt: Diagram
:align: center
```
The commented mermaid block won't appear in the rendered documentation but serves as the source for regenerating the SVG.
## Updating Diagrams
### When to Update
You need to regenerate SVG files when:
- Modifying the mermaid source in a commented block
- Adding new diagrams
- Updating diagram styling
### How to Update
1. **Edit the commented mermaid source** in the RST file
2. **Run the update script**:
```bash
# Update all diagrams
python docs/conceptual/ck_tile/update_diagrams.py
# Update diagrams in a specific file
python docs/conceptual/ck_tile/update_diagrams.py transforms.rst
# Force regenerate all diagrams (even if SVGs exist)
python docs/conceptual/ck_tile/update_diagrams.py --force
```
### Prerequisites
The update script requires [mermaid-cli](https://github.com/mermaid-js/mermaid-cli):
```bash
npm install -g @mermaid-js/mermaid-cli
```
## Adding New Diagrams
To add a new mermaid diagram:
1. **Create the commented block** in your RST file:
```rst
..
Original mermaid diagram (edit here, then run update_diagrams.py)
.. mermaid::
graph TB
A --> B
```
2. **Add the image reference** immediately after:
```rst
.. image:: diagrams/my_new_diagram.svg
:alt: My New Diagram
:align: center
```
3. **Generate the SVG**:
```bash
python docs/conceptual/ck_tile/update_diagrams.py your_file.rst
```
## Current Diagrams
The following RST files contain mermaid diagrams (40 total):
- `adaptors.rst` (2 diagrams)
- `convolution_example.rst` (1 diagram)
- `coordinate_movement.rst` (1 diagram)
- `descriptors.rst` (2 diagrams)
- `encoding_internals.rst` (2 diagrams)
- `lds_index_swapping.rst` (3 diagrams)
- `load_store_traits.rst` (2 diagrams)
- `space_filling_curve.rst` (1 diagram)
- `static_distributed_tensor.rst` (1 diagram)
- `sweep_tile.rst` (4 diagrams)
- `tensor_coordinates.rst` (2 diagrams)
- `thread_mapping.rst` (2 diagrams)
- `tile_window.rst` (5 diagrams)
- `transforms.rst` (12 diagrams)
## Troubleshooting
### SVG not generated
- Check that mermaid-cli is installed: `mmdc --version`
- Verify the mermaid syntax is valid
- Look for error messages in the script output
### Diagram not updating
- Use `--force` flag to regenerate: `python docs/conceptual/ck_tile/update_diagrams.py --force`
- Check that the image reference matches the generated filename
### Pattern not matching
If the update script can't find your commented diagram:
- Ensure proper indentation (3 spaces for comment block content)
- Verify the `.. mermaid::` directive is commented
- Check that the image reference immediately follows the comment block
## Script Details
### update_diagrams.py
This script:
1. Scans RST files for commented mermaid blocks
2. Extracts the mermaid source code
3. Converts to SVG using `mmdc`
4. Saves to the diagrams directory
**Usage:**
- `python docs/conceptual/ck_tile/update_diagrams.py` - Check all files, update missing SVGs
- `python docs/conceptual/ck_tile/update_diagrams.py --force` - Regenerate all SVGs
- `python docs/conceptual/ck_tile/update_diagrams.py <file.rst>` - Update specific file
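The scanning-and-extraction step the script performs can be sketched in plain Python. This is a hypothetical reimplementation for illustration only, not the actual `update_diagrams.py` source; it extracts the mermaid text from commented RST blocks and leaves the `mmdc` invocation out.

```python
def extract_commented_mermaid(rst_text):
    """Return mermaid sources found inside RST comment blocks (sketch only)."""
    blocks, current, in_comment = [], None, False
    for line in rst_text.splitlines():
        if line.strip() == "..":            # start of an RST comment block
            if current:
                blocks.append("\n".join(current))
            in_comment, current = True, None
        elif in_comment and line.startswith("   "):
            content = line[3:]
            if content.strip() == ".. mermaid::":
                current = []                # found the directive; collect its body
            elif current is not None and content.strip():
                # Note: strips inner indentation; mermaid is mostly whitespace-insensitive.
                current.append(content.strip())
        elif line.strip():                  # unindented text ends the comment block
            if current:
                blocks.append("\n".join(current))
            in_comment, current = False, None
    if current:
        blocks.append("\n".join(current))
    return blocks

rst = (
    "..\n"
    "   Original mermaid diagram\n"
    "\n"
    "   .. mermaid::\n"
    "\n"
    "      graph TB\n"
    "          A --> B\n"
    "\n"
    ".. image:: diagrams/x.svg\n"
)
print(extract_commented_mermaid(rst))   # ['graph TB\nA --> B']
```

The real script would then feed each extracted source to `mmdc` and write the SVG into the diagrams directory.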
### convert_mermaid_to_svg.py
This was the initial conversion script. It:
1. Found all active `.. mermaid::` directives
2. Converted them to SVGs
3. Replaced directives with commented source + image references
This script was used once for the initial conversion and typically doesn't need to be run again.

View File

@@ -59,8 +59,8 @@ A TensorAdaptor encapsulates a sequence of :ref:`coordinate transformations <ck_
.. image:: diagrams/adaptors_1.svg
:alt: Diagram
:align: center
Core Components
~~~~~~~~~~~~~~~
Each TensorAdaptor contains:
@@ -115,7 +115,7 @@ Custom adaptors can be created by specifying which transforms to use and how the
make_tuple(sequence<0>{}) // to single dim 0
);
// The adaptor is embedded in the :ref:`descriptor <ck_tile_descriptors>`
// The adaptor is embedded in the descriptor
// To use it:
multi_index<1> top_coord{5}; // 1D coordinate
// This internally calculates: row = 5/3 = 1, col = 5%3 = 2
@@ -309,7 +309,6 @@ A practical example showing how adaptors create efficient :ref:`GPU memory acces
// - Dimension 0,1: Thread indices
// - Dimension 2,3: Vector indices within thread
// Enables coalesced memory access on GPU
// See :ref:`ck_tile_thread_mapping` for thread mapping details
Common Transform Chains
-----------------------

View File

@@ -1,6 +1,25 @@
.. _ck_tile_buffer_views:
**********************************
Buffer Views - Raw Memory Access
**********************************
Overview
--------
At the foundation of the CK Tile system lies BufferView, a compile-time abstraction that provides structured access to raw memory regions within GPU kernels. It serves as the bridge between the hardware's physical memory model and the higher-level abstractions that enable efficient GPU programming. BufferView encapsulates the complexity of GPU memory hierarchies while exposing a unified interface that works seamlessly across different memory address spaces: global memory shared across the entire device, local data share (LDS) memory shared within a workgroup, and the register files private to each thread.
BufferView serves as the foundation for :ref:`ck_tile_tensor_views`, which add multi-dimensional structure on top of raw memory access. Understanding BufferView is essential before moving on to more complex abstractions like :ref:`ck_tile_distribution` and :ref:`ck_tile_tile_window`.
By providing compile-time knowledge of buffer properties through template metaprogramming, BufferView enables the compiler to generate optimal machine code for each specific use case. This zero-overhead abstraction ensures that the convenience of a high-level interface comes with no runtime performance penalty.
One of BufferView's most important features is its handling of out-of-bounds memory access. Unlike CPU programming, where such accesses typically result in segmentation faults or undefined behavior, GPU programming must gracefully handle cases where threads attempt to access memory beyond allocated boundaries. BufferView provides configurable strategies for these scenarios: developers can choose to return either zero or a custom sentinel value for invalid accesses. This flexibility is important for algorithms that naturally extend beyond data boundaries, such as convolutions with padding or matrix operations with non-aligned dimensions.
The abstraction extends beyond simple memory access to encompass both scalar and vector data types. GPUs achieve their highest efficiency when loading or storing multiple data elements in a single instruction. BufferView seamlessly supports these vectorized operations, automatically selecting the appropriate hardware instructions based on the data type and access pattern. This capability transforms what would be multiple memory transactions into single, efficient operations that fully utilize the available memory bandwidth.
BufferView also incorporates AMD GPU-specific optimizations that leverage unique hardware features. The AMD buffer addressing mode, for instance, provides hardware-accelerated bounds checking that ensures memory safety without the performance overhead of software-based checks. Similarly, BufferView exposes atomic operations that are crucial for parallel algorithms requiring thread-safe updates to shared data structures. These hardware-specific optimizations are abstracted behind a portable interface, ensuring that code remains maintainable while achieving optimal performance.
Memory coherence and caching policies represent another layer of complexity that BufferView manages transparently. Different GPU memory spaces have different coherence guarantees and caching behaviors. Global memory accesses can be cached in L1 and L2 caches with various coherence protocols, while LDS memory provides workgroup-level coherence with specialized banking structures (see :ref:`ck_tile_lds_bank_conflicts` for details on avoiding bank conflicts). BufferView encapsulates these details, automatically applying the appropriate memory ordering constraints and cache control directives based on the target address space and operation type.
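The configurable out-of-bounds behavior described above can be modeled in a few lines of Python. This is a conceptual sketch, not the CK Tile C++ API; the class name and interface are invented for illustration.

```python
# Conceptual model (not CK Tile) of a buffer view whose out-of-bounds reads
# return a configurable value instead of faulting.
class BufferViewModel:
    def __init__(self, data, oob_value=0):
        self.data = list(data)
        self.oob_value = oob_value   # zero by default, or a custom sentinel

    def get(self, index):
        # Mirrors hardware bounds checking: invalid indices never fault,
        # they yield the configured out-of-bounds value.
        if 0 <= index < len(self.data):
            return self.data[index]
        return self.oob_value

buf = BufferViewModel([1.0, 2.0, 3.0], oob_value=float("-inf"))
print(buf.get(1))    # 2.0
print(buf.get(10))   # -inf (sentinel, no fault)
```

Algorithms such as padded convolutions rely on exactly this guarantee: threads on the boundary read "past the end" and receive a well-defined value.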
Address Space Usage Patterns
----------------------------
@@ -51,6 +70,7 @@ Address Space Usage Patterns
.. image:: diagrams/buffer_views_1.svg
:alt: Diagram
:align: center
C++ Implementation
------------------

View File

@@ -59,10 +59,6 @@ The key insight is that convolution can be transformed from a complex nested loo
.. image:: diagrams/convolution_example.svg
:alt: Diagram
:align: center
@@ -88,7 +84,6 @@ Non-overlapping tiles:
// Original matrix: shape=(6, 6), strides=(6, 1)
// Tiled view: shape=(3, 3, 2, 2), strides=(12, 2, 6, 1)
// See :ref:`ck_tile_descriptors` for descriptor details
using TileDescriptor = TensorDescriptor<
Sequence<kNumTiles, kNumTiles, kTileSize, kTileSize>,
Sequence<12, 2, 6, 1>
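The stride arithmetic behind this tiled view can be verified outside the kernel. The following Python sketch (names are illustrative) checks that every (tile_row, tile_col, i, j) index of the tiled view addresses the same element as the original 6x6 row-major matrix:

```python
# Verify that the tiled-view strides (12, 2, 6, 1) for shape (3, 3, 2, 2)
# address the same elements as the original matrix with strides (6, 1).
def offset(idx, strides):
    return sum(i * s for i, s in zip(idx, strides))

orig_strides = (6, 1)           # original shape (6, 6)
tiled_strides = (12, 2, 6, 1)   # tiled shape (3, 3, 2, 2)

for tr in range(3):
    for tc in range(3):
        for i in range(2):
            for j in range(2):
                row, col = 2 * tr + i, 2 * tc + j
                assert offset((tr, tc, i, j), tiled_strides) == \
                       offset((row, col), orig_strides)
print("tiled view matches original layout")
```

Because 12·tr + 2·tc + 6·i + j equals 6·(2·tr + i) + (2·tc + j), the tiled descriptor is a pure re-indexing with no data movement.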
@@ -243,7 +238,6 @@ The im2col transformation converts the 4D windows tensor into a 2D matrix suitab
>;
// Step 2: Apply merge transforms to create 2D im2col layout
// See :ref:`ck_tile_transforms` for transform operations
using Im2colDescriptor = decltype(
transform_tensor_descriptor(
WindowsDescriptor{},
@@ -312,7 +306,6 @@ Combining all components into an optimized convolution implementation:
>;
// Tile distribution for matrix multiplication
// See :ref:`ck_tile_tile_distribution` for details
using ATileDist = TileDistribution<
Sequence<TileM, TileK>,
Sequence<BlockM, 1>
@@ -327,7 +320,6 @@ Combining all components into an optimized convolution implementation:
>;
// Thread-local accumulator
// See :ref:`ck_tile_static_distributed_tensor`
StaticDistributedTensor<DataType, CTileDist> c_accumulator;
// Initialize accumulator
@@ -339,7 +331,6 @@ Combining all components into an optimized convolution implementation:
// Main GEMM loop over K dimension
for (index_t k_tile = 0; k_tile < PatchSize; k_tile += TileK) {
// Create tile windows for im2col matrix and kernel
// See :ref:`ck_tile_tile_window` for window operations
auto a_window = make_tile_window<ATileDist>(
input, Im2colDesc{H, W, K},
{blockIdx.y * TileM, k_tile}
@@ -350,7 +341,7 @@ Combining all components into an optimized convolution implementation:
{k_tile, 0}
);
// Load tiles - see :ref:`ck_tile_load_store_traits` for optimization
// Load tiles
auto a_tile = a_window.load();
auto b_tile = b_window.load();
@@ -476,7 +467,6 @@ CK Tile enables several optimizations for convolution:
__shared__ float smem_b[TileK][TileN];
// Collaborative loading with proper bank conflict avoidance
// See :ref:`ck_tile_lds_bank_conflicts` for optimization
auto load_tile_to_smem = [&](auto& window, float smem[][TileK]) {
#pragma unroll
for (index_t i = threadIdx.y; i < TileM; i += blockDim.y) {
@@ -560,7 +550,7 @@ This example demonstrates how CK Tile transforms convolution from a memory-bound
- **Sliding windows** can be efficiently represented using tensor descriptors with appropriate strides
- **Im2col transformation** converts convolution to matrix multiplication without data copies
- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_tile_distribution`)
- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_distribution`)
- **Multi-channel support** extends naturally through higher-dimensional descriptors
- **Performance optimizations** like vectorization and shared memory are seamlessly integrated (see :ref:`ck_tile_gemm_optimization` for similar techniques)

View File

@@ -317,7 +317,7 @@ Movement Through Adaptors
Advanced Movement Patterns
==========================
Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_tile_distribution` concepts:
Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_distribution` concepts:
Tiled Access Pattern
--------------------

View File

@@ -315,18 +315,18 @@ Padding for Convolution
.. code-block:: cpp
// Add padding to spatial dimensions
auto padded = transform_tensor_descriptor(
input_tensor,
make_tuple(
make_pass_through_transform(N), // Batch
make_pass_through_transform(C), // Channel
make_pad_transform(H, pad_h, pad_h), // Height
make_pad_transform(W, pad_w, pad_w) // Width
),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
);
// Add padding to spatial dimensions
auto padded = transform_tensor_descriptor(
input_tensor,
make_tuple(
make_pass_through_transform(N), // Batch
make_pass_through_transform(C), // Channel
make_pad_transform(H, pad_h, pad_h), // Height
make_pad_transform(W, pad_w, pad_w) // Width
),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
);
For a complete convolution example, see :ref:`ck_tile_convolution_example`.

View File

@@ -260,7 +260,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
index_t K)
{
// Define tile distribution encoding
// See :ref:`ck_tile_encoding_internals` and :ref:`ck_tile_tile_distribution`
using Encoding = tile_distribution_encoding<
sequence<>, // No replication
tuple<sequence<4, 2, 8, 4>, // M dimension hierarchy
@@ -274,7 +273,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
constexpr auto tile_dist = make_static_tile_distribution(Encoding{});
// Create tensor views for global memory
// See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_buffer_views`
auto a_global_view = make_naive_tensor_view<address_space_enum::global>(
a_global, make_tuple(M, K), make_tuple(K, 1));
auto b_global_view = make_naive_tensor_view<address_space_enum::global>(
@@ -287,7 +285,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
const index_t block_n_id = blockIdx.x;
// Create tile windows for loading
// See :ref:`ck_tile_tile_window` for tile window details
auto a_window = make_tile_window(
a_global_view,
make_tuple(number<MPerBlock>{}, number<KPerBlock>{}),
@@ -301,7 +298,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
tile_dist);
// Allocate LDS storage
// See :ref:`ck_tile_static_distributed_tensor` for distributed tensors
auto a_lds = make_static_distributed_tensor<ADataType,
decltype(tile_dist)>();
auto b_lds = make_static_distributed_tensor<BDataType,
@@ -310,7 +306,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
// Initialize accumulator
auto c_reg = make_static_distributed_tensor<CDataType,
decltype(tile_dist)>();
// See :ref:`ck_tile_sweep_tile` for sweep operations
sweep_tile(c_reg, [](auto idx, auto& val) { val = 0; });
// Main GEMM loop with pipelining
@@ -324,7 +319,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
// Pipeline loop
for(index_t k_tile = 0; k_tile < num_k_tiles - 1; ++k_tile) {
// Move windows for next iteration
// See :ref:`ck_tile_coordinate_movement` for window movement
a_window.move_slice_window(make_tuple(0, KPerBlock));
b_window.move_slice_window(make_tuple(0, KPerBlock));

View File

@@ -172,7 +172,6 @@ Example usage in CK Tile:
a_window.load(a_lds_tensor);
// Subsequent reads from LDS are conflict-free
// See :ref:`ck_tile_sweep_tile` for sweep operations
sweep_tile(a_lds_tensor, [](auto idx, auto& val) {
// Process data...
});

View File

@@ -276,7 +276,7 @@ The foundation of the exploration begins with raw memory access through :ref:`ck
With these foundational concepts established, the documentation delves into the :ref:`ck_tile_coordinate_systems` that powers tile distribution. This engine implements the mathematical framework that has been introduced, providing compile-time transformations between P-space, Y-space, X-space, and D-space. Understanding these transformations at a deep level enables developers to reason about performance implications and design custom distribution strategies for novel algorithms. The :ref:`ck_tile_transforms` and :ref:`ck_tile_adaptors` provide the building blocks for these transformations.
The high-level :ref:`ck_tile_distribution` APIs represent the culmination of these lower-level abstractions. These APIs provide an accessible interface for common patterns while exposing enough flexibility for advanced optimizations. Through concrete examples and detailed explanations, the documentation will demonstrate how to leverage these APIs to achieve near-optimal performance across a variety of computational patterns. The :ref:`ck_tile_window` abstraction provides the gateway for efficient data access.
The high-level :ref:`ck_tile_distribution` APIs represent the culmination of these lower-level abstractions. These APIs provide an accessible interface for common patterns while exposing enough flexibility for advanced optimizations. Through concrete examples and detailed explanations, the documentation will demonstrate how to leverage these APIs to achieve near-optimal performance across a variety of computational patterns. The :ref:`ck_tile_tile_window` abstraction provides the gateway for efficient data access.
The exploration of coordinate systems goes beyond the basic P, Y, X, D framework to encompass advanced topics such as multi-level tiling, replication strategies, and specialized coordinate systems for specific algorithm classes. The :ref:`ck_tile_encoding_internals` reveals the mathematical foundations, while :ref:`ck_tile_thread_mapping` shows how these abstractions map to hardware. This comprehensive treatment ensures that developers can handle not just common cases but also novel algorithms that require custom distribution strategies.

View File

@@ -5,7 +5,7 @@
.. _ck_tile_lds_index_swapping:
********************************
Load Datat Share Index Swapping
Load Data Share Index Swapping
********************************
Overview
@@ -70,9 +70,9 @@ The original K coordinate is split into K0 and K1, where K1 represents the threa
The XOR transformation updates the K0 coordinate using the formula:
.. code-block:: cpp
.. math::
K0' = K0 ^ (M % (KPerBlock / KPack * MLdsLayer))
K0' = K0 \oplus (M \bmod (KPerBlock / KPack \cdot MLdsLayer))
This XOR operation redistributes accesses across memory banks by mixing bits from the M and K dimensions.
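A quick numeric sketch (in Python, with illustrative parameter values rather than a real kernel configuration) shows how the XOR spreads accesses that would otherwise all collide on K0 = 0:

```python
# Illustrative parameters, not a real kernel configuration.
KPerBlock, KPack, MLdsLayer = 32, 8, 4
span = KPerBlock // KPack * MLdsLayer   # 16

def swizzle_k0(k0, m):
    # K0' = K0 XOR (M mod span), as in the formula above.
    return k0 ^ (m % span)

# Without swizzling, K0 = 0 lands in the same slot for every M row.
# With swizzling, consecutive rows are spread across different slots:
print([swizzle_k0(0, m) for m in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because XOR is its own inverse, the transformation is a bijection: no two (M, K0) pairs are mapped onto the same storage location.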
@@ -132,10 +132,10 @@ The transformed K0' is split into L and K0'' components, creating an intermediat
The unmerge operation:
.. code-block:: cpp
.. math::
L = K0' / (KPerBlock/KPack)
K0'' = K0' % (KPerBlock/KPack)
K0'' = K0' \bmod (KPerBlock/KPack)
When MLdsLayer == 1, this simplifies to L=0 and K0''=K0'.
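The unmerge arithmetic can be sketched the same way, again with illustrative parameters:

```python
# Illustrative parameters, not a real kernel configuration.
KPerBlock, KPack = 32, 8
k_slots = KPerBlock // KPack   # 4

def unmerge(k0_prime):
    # Split K0' into (L, K0'') per the formulas above.
    return k0_prime // k_slots, k0_prime % k_slots

print(unmerge(9))   # (2, 1)
# With MLdsLayer == 1, K0' stays below k_slots, so L is always 0:
print(unmerge(3))   # (0, 3)
```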

View File

@@ -71,7 +71,6 @@ The LoadStoreTraits class analyzes distribution patterns at compile time:
static constexpr index_t scalars_per_access = scalar_per_vector;
// Space-filling curve for optimal traversal
// See :ref:`ck_tile_space_filling_curve` for details
using sfc_type = space_filling_curve<ndim_y>;
static constexpr sfc_type sfc_ys = make_space_filling_curve<Distribution>();
@@ -274,7 +273,7 @@ LoadStoreTraits optimizes for several performance metrics:
return Traits::num_access;
}
// Check coalescing efficiency (see :ref:`ck_tile_gpu_basics`)
// Check coalescing efficiency
static constexpr bool is_perfectly_coalesced()
{
// Perfect coalescing when adjacent threads access adjacent memory
@@ -316,7 +315,6 @@ Comparing Different Configurations
static_assert(OptimizedAnalyzer::bandwidth_utilization() == 50.0f); // 8*4/64 = 50%
// Better bandwidth utilization leads to improved performance
// See :ref:`ck_tile_gemm_optimization` for real-world examples
Integration with Space-Filling Curves
-------------------------------------

View File

@@ -254,7 +254,6 @@ For :ref:`matrix multiplication <ck_tile_gemm_optimization>`, optimal access pat
// GEMM tile: 16x32 with vector-8 loads
// Column-major for coalesced access in GEMM
// See :ref:`ck_tile_gemm_optimization` for complete example
using GemmTileCurve = space_filling_curve<
2,
sequence<16, 32>, // Tile size
@@ -336,7 +335,7 @@ Optimizing for Hardware
.. code-block:: cpp
// Optimize for GPU memory coalescing (see :ref:`ck_tile_gpu_basics`)
// Optimize for GPU memory coalescing
template <typename DataType, index_t WarpSize = 32>
struct coalesced_access_pattern
{
@@ -411,7 +410,6 @@ LoadStoreTraits Integration
struct load_store_traits
{
// Create optimized space-filling curve
// See :ref:`ck_tile_tile_distribution` for Distribution details
using sfc_type = space_filling_curve<
Distribution::ndim_y,
typename Distribution::y_lengths,
@@ -461,7 +459,6 @@ Best Practices
.. code-block:: cpp
// Match vector size to cache line for optimal bandwidth
// See :ref:`ck_tile_lds_bank_conflicts` for cache optimization
constexpr index_t optimal_vector = min(
tensor_length_fast_dim,
cache_line_size / sizeof(DataType)
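The heuristic can be checked with plain integer arithmetic; the following Python sketch uses illustrative sizes:

```python
# Model of the vector-size heuristic above: the vector width is capped by
# both the fast-dimension length and the cache line size (sizes illustrative).
def optimal_vector(tensor_length_fast_dim, cache_line_size, elem_size):
    return min(tensor_length_fast_dim, cache_line_size // elem_size)

# 64-byte cache line, 4-byte floats -> at most 16 elements per access,
# clamped by the fast-dimension length when it is shorter.
print(optimal_vector(256, 64, 4))   # 16
print(optimal_vector(8, 64, 4))     # 8
```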

View File

@@ -17,9 +17,9 @@ Each thread in a workgroup owns a portion of the overall tensor data, stored in
This design enables three critical optimizations:
* It maximizes register utilization by keeping frequently accessed data in the fastest memory hierarchy.
* It eliminates redundant memory accesses since each thread maintains its own working set.
* It provides a clean abstraction for complex algorithms like matrix multiplication where each thread accumulates partial results that eventually combine into the final output.
Thread-Local Storage Model
==========================
@@ -384,8 +384,7 @@ Static distributed tensors integrate seamlessly with other CK Tile components:
// Main GEMM loop
for(index_t k_tile = 0; k_tile < K; k_tile += kTileK) {
// Create tile windows for this iteration
// See :ref:`ck_tile_tile_window` for details
auto a_window = make_tile_window(
a_ptr, ALayout{M, K},
ATileDist{},
{blockIdx.y * kTileM, k_tile}
@@ -398,7 +397,6 @@ Static distributed tensors integrate seamlessly with other CK Tile components:
);
// Load tiles to distributed tensors
// See :ref:`ck_tile_load_store_traits` for optimized loading
auto a_tile = a_window.load();
auto b_tile = b_window.load();

View File

@@ -356,7 +356,6 @@ CK uses several techniques to optimize memory access:
float>>>;
// 2. Swizzling to avoid bank conflicts
// See :ref:`ck_tile_lds_index_swapping` and :ref:`ck_tile_swizzling_example`
template <index_t BankSize = 32>
__device__ index_t swizzle_offset(index_t tid, index_t offset)
{
@@ -434,7 +433,6 @@ The following example shows how thread mapping works in a CK kernel:
__shared__ ComputeType shared_sum[BlockSize];
// 5. Create tensor view and tile window
// See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_tile_window`
auto x_view = make_naive_tensor_view<address_space_enum::global>(
x + bid * hidden_size,
make_tuple(hidden_size),

View File

@@ -1,4 +1,4 @@
.. _ck_tile_distribution:
.. _ck_tile_tile_distribution:
Tile Distribution - The Core API
================================

View File

@@ -283,7 +283,7 @@ Creating and Using TileWindow
using namespace ck_tile;
// Create a tensor view for input data (see :ref:`ck_tile_tensor_views`)
// Create a tensor view for input data
auto tensor_view = make_naive_tensor_view(
data_ptr,
make_tuple(256, 256), // Shape
@@ -314,7 +314,7 @@ Creating and Using TileWindow
distribution
);
// Load data into distributed tensor (see :ref:`ck_tile_static_distributed_tensor`)
// Load data into distributed tensor
auto distributed_data = make_static_distributed_tensor<float>(distribution);
window.load(distributed_data);
@@ -558,7 +558,6 @@ Complete Load-Compute-Store Pipeline
c_dist);
// Create distributed tensors for register storage
// See :ref:`ck_tile_static_distributed_tensor` for details
auto a_reg = make_static_distributed_tensor<AType>(a_dist);
auto b_reg = make_static_distributed_tensor<BType>(b_dist);
auto c_reg = make_static_distributed_tensor<CType>(c_dist);
@@ -620,6 +619,8 @@ Performance Characteristics
.. image:: diagrams/tile_window_5.svg
:alt: Diagram
:align: center
Best Practices
--------------

View File

@@ -302,7 +302,7 @@ EmbedTransform expands linear indices from the lower coordinate space into multi
using namespace ck_tile;
// Create embed transform for 2x3 tensor with strides [12, 1]
// This is commonly used in :ref:`descriptors <ck_tile_descriptors>`
// This is commonly used in descriptors
auto transform = make_embed_transform(make_tuple(2, 3), make_tuple(12, 1));
// Forward: Linear → 2D (Manual calculation)
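The stride arithmetic of this embed transform can be reproduced in a few lines of Python. This models only the coordinate math for lengths (2, 3) and strides (12, 1), not the CK Tile API itself:

```python
# Coordinate math of the embed transform above: a (row, col) index in the
# 2x3 space maps to a linear offset via the strides (12, 1).
lengths, strides = (2, 3), (12, 1)

def embed_forward(i, j):
    return i * strides[0] + j * strides[1]

# Enumerate the 2x3 coordinate space and the offsets it embeds to.
offsets = [embed_forward(i, j) for i in range(lengths[0]) for j in range(lengths[1])]
print(offsets)   # [0, 1, 2, 12, 13, 14]
```

Note the gap between offsets 2 and 12: with a row stride of 12, the embedded coordinates are sparse in the underlying linear space.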

View File

@@ -30,8 +30,6 @@ release = version_number
external_toc_path = "./sphinx/_toc.yml"
docs_core = ROCmDocs(left_nav_title)
docs_core.run_doxygen(doxygen_root="doxygen", doxygen_path="doxygen/xml")
docs_core.enable_api_reference()
docs_core.setup()
external_projects_current_project = "composable_kernel"
@@ -50,4 +48,4 @@ for sphinx_var in ROCmDocs.SPHINX_VARS:
extensions += ['sphinxcontrib.bibtex']
bibtex_bibfiles = ['refs.bib']
cpp_id_attributes = ["__global__", "__device__", "__host__"]

File diff suppressed because it is too large

View File

@@ -25,7 +25,7 @@ The Composable Kernel repository is located at `https://github.com/ROCm/composab
* :doc:`Composable Kernel structure <./conceptual/Composable-Kernel-structure>`
* :doc:`Composable Kernel mathematical basis <./conceptual/Composable-Kernel-math>`
* :doc:`CK Tile conceptual documentation <./conceptual/ck_tile/index>`
* :doc:`CK Tile conceptual documentation <./conceptual/ck_tile/CK-tile-index>`
.. grid-item-card:: Tutorials
@@ -37,9 +37,6 @@ The Composable Kernel repository is located at `https://github.com/ROCm/composab
* :doc:`Composable Kernel custom types <./reference/Composable_Kernel_custom_types>`
* :doc:`Composable Kernel vector utilities <./reference/Composable_Kernel_vector_utilities>`
* :ref:`wrapper`
* :doc:`Composable Kernel API reference <./doxygen/html/namespace_c_k>`
* :doc:`CK Tile API reference <./doxygen/html/namespaceck__tile>`
* :doc:`Composable Kernel complete API class list <./doxygen/html/annotated>`
* :doc:`Composable Kernel glossary <./reference/Composable-Kernel-Glossary>`
To contribute to the documentation refer to `Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.

View File

@@ -4,7 +4,6 @@
***************************************************
Composable Kernel glossary
***************************************************
.. glossary::
@@ -14,7 +13,7 @@ Composable Kernel glossary
The arithmetic logic unit (ALU) is the GPU component responsible for arithmetic and logic operations.
compute unit
The compute unit (CU) is the parallel vector processor in an AMD GPU with multiple :term:`ALUs<arithmetic logic unit>`. Each compute unit will run all the :term:`wavefronts<wavefront>` in a :term:`work group>`. A compute unit is equivalent to NVIDIA's streaming multiprocessor.
The compute unit (CU) is the parallel vector processor in an AMD GPU with multiple :term:`ALUs<arithmetic logic unit>`. Each compute unit will run all the :term:`wavefronts<wavefront>` in a :term:`work group`. A compute unit is equivalent to NVIDIA's streaming multiprocessor.
matrix core
A matrix core is a specialized GPU unit that accelerates matrix operations for AI and deep learning tasks. A GPU contains multiple matrix cores.
@@ -32,7 +31,7 @@ Composable Kernel glossary
See :term:`scalar general purpose register`.
scalar general purpose register
A scalar general purpose register (SGPR) is a :term:`register` shared by all the :term:`work items<work item>` in a :term:`wave<wavefront>`. SGPRs are used for constants, addresses, and control flow common across the entire wave.
A scalar general purpose register (SGPR) is a :term:`register` shared by all the :term:`work-items<work-item>` in a :term:`wave<wavefront>`. SGPRs are used for constants, addresses, and control flow common across the entire wave.
LDS
See :term:`local data share`.
@@ -101,7 +100,7 @@ Composable Kernel glossary
A Composable Kernel pipeline schedules the sequence of operations for a :term:`kernel`, such as the data loading, computation, and storage phases. A pipeline consists of a :term:`problem` and a :term:`policy`.
tile partitioner
The tile partitioner defines the mapping between the :term:`problem` dimensions and GPU hierarchy. It specifies :term:`workgroup`-level :term:`tile` sizes and determines :term:`grid` dimensions by dividing the problem size by the tile sizes.
The tile partitioner defines the mapping between the :term:`problem` dimensions and GPU hierarchy. It specifies :term:`work group`-level :term:`tile` sizes and determines :term:`grid` dimensions by dividing the problem size by the tile sizes.
problem
The problem is the part of the :term:`pipeline` that defines input and output shapes, data types, and mathematical :term:`operations<operation>`.
@@ -186,10 +185,10 @@ Composable Kernel glossary
Viewport into a larger tensor that defines the current tile's position and boundaries for computation.
load tile
Load tile is an operation that transfers data from :term:`global memory` or the :term:`load data share` to :term:`vector general purpose registers<vector general purpose register>`.
Load tile is an operation that transfers data from :term:`global memory` or the :term:`local data share` to :term:`vector general purpose registers<vector general purpose register>`.
store tile
Store tile is an operation that transfers data from :term:`vector general purpose registers<vector general purpose register>` to :term:`global memory` or the :term:`load data share`.
Store tile is an operation that transfers data from :term:`vector general purpose registers<vector general purpose register>` to :term:`global memory` or the :term:`local data share`.
descriptor
Metadata structure that defines :term:`tile` properties, memory layouts, and coordinate transformations for Composable Kernel :term:`operations<operation>`.

View File

@@ -54,36 +54,3 @@ Advanced examples:
* `Image to column <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_img2col.cpp>`_
* `Basic gemm <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_basic_gemm.cpp>`_
* `Optimized gemm <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_optimized_gemm.cpp>`_
-------------------------------------
Layout
-------------------------------------
.. doxygenstruct:: Layout
-------------------------------------
Layout helpers
-------------------------------------
.. doxygenfile:: include/ck/wrapper/utils/layout_utils.hpp
-------------------------------------
Tensor
-------------------------------------
.. doxygenstruct:: Tensor
-------------------------------------
Tensor helpers
-------------------------------------
.. doxygenfile:: include/ck/wrapper/utils/tensor_utils.hpp
.. doxygenfile:: include/ck/wrapper/utils/tensor_partition.hpp
-------------------------------------
Operations
-------------------------------------
.. doxygenfile:: include/ck/wrapper/operations/copy.hpp
.. doxygenfile:: include/ck/wrapper/operations/gemm.hpp
@@ -6,42 +6,36 @@ subtrees:
- caption: Install
entries:
- file: install/Composable-Kernel-prerequisites.rst
title: Composable Kernel prerequisites
title: Prerequisites
- file: install/Composable-Kernel-install.rst
title: Build and install Composable Kernel
- file: install/Composable-Kernel-Docker.rst
title: Composable Kernel Docker images
title: Docker images
- caption: Conceptual
entries:
- file: conceptual/Composable-Kernel-structure.rst
title: Composable Kernel structure
title: Structure
- file: conceptual/Composable-Kernel-math.rst
title: Composable Kernel mathematical basis
- file: conceptual/ck_tile/index.rst
title: Mathematical basis
- file: conceptual/ck_tile/CK-tile-index.rst
title: CK Tile conceptual documentation
- caption: Tutorial
entries:
- file: tutorial/Composable-Kernel-examples.rst
title: Composable Kernel examples
title: Examples
- caption: Reference
entries:
- file: reference/Composable_Kernel_supported_scalar_types.rst
title: Composable Kernel scalar types
title: Scalar types
- file: reference/Composable_Kernel_custom_types.rst
title: Composable Kernel custom types
title: Custom types
- file: reference/Composable_Kernel_vector_utilities.rst
title: Composable Kernel vector utilities
title: Vector utilities
- file: reference/Composable-Kernel-wrapper.rst
title: Composable Kernel wrapper
- file: doxygen/html/namespace_c_k.rst
title: CK API reference
- file: doxygen/html/namespaceck__tile.rst
title: CK Tile API reference
- file: doxygen/html/annotated.rst
title: Full API class list
title: Wrapper
- file: reference/Composable-Kernel-Glossary.rst
title: Glossary
@@ -8,9 +8,9 @@ accessible-pygments==0.0.5
# via pydata-sphinx-theme
alabaster==1.0.0
# via sphinx
asttokens==3.0.0
asttokens==3.0.1
# via stack-data
attrs==25.3.0
attrs==25.4.0
# via
# jsonschema
# jupyter-cache
@@ -19,40 +19,30 @@ babel==2.17.0
# via
# pydata-sphinx-theme
# sphinx
beautifulsoup4==4.13.4
beautifulsoup4==4.14.3
# via pydata-sphinx-theme
breathe==4.36.0
# via rocm-docs-core
certifi==2025.1.31
certifi==2026.1.4
# via requests
cffi==1.17.1
cffi==2.0.0
# via
# cryptography
# pynacl
charset-normalizer==3.4.1
charset-normalizer==3.4.4
# via requests
click==8.1.8
click==8.3.1
# via
# click-log
# doxysphinx
# jupyter-cache
# sphinx-external-toc
click-log==0.4.0
# via doxysphinx
comm==0.2.2
comm==0.2.3
# via ipykernel
contourpy==1.3.2
# via matplotlib
cryptography==44.0.2
cryptography==46.0.3
# via pyjwt
cycler==0.12.1
# via matplotlib
debugpy==1.8.14
debugpy==1.8.19
# via ipykernel
decorator==5.2.1
# via ipython
deprecated==1.2.18
# via pygithub
docutils==0.21.2
# via
# myst-parser
@@ -60,35 +50,31 @@ docutils==0.21.2
# pydata-sphinx-theme
# sphinx
# sphinxcontrib-bibtex
doxysphinx==3.3.12
# via rocm-docs-core
exceptiongroup==1.2.2
exceptiongroup==1.3.1
# via ipython
executing==2.2.0
executing==2.2.1
# via stack-data
fastjsonschema==2.21.1
fastjsonschema==2.21.2
# via
# nbformat
# rocm-docs-core
fonttools==4.57.0
# via matplotlib
gitdb==4.0.12
# via gitpython
gitpython==3.1.44
gitpython==3.1.46
# via rocm-docs-core
greenlet==3.2.1
greenlet==3.3.0
# via sqlalchemy
idna==3.10
idna==3.11
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==8.6.1
importlib-metadata==8.7.1
# via
# jupyter-cache
# myst-nb
ipykernel==6.29.5
ipykernel==7.1.0
# via myst-nb
ipython==8.35.0
ipython==8.38.0
# via
# ipykernel
# myst-nb
@@ -98,53 +84,43 @@ jinja2==3.1.6
# via
# myst-parser
# sphinx
jsonschema==4.23.0
jsonschema==4.26.0
# via nbformat
jsonschema-specifications==2024.10.1
jsonschema-specifications==2025.9.1
# via jsonschema
jupyter-cache==1.0.1
# via myst-nb
jupyter-client==8.6.3
jupyter-client==8.8.0
# via
# ipykernel
# nbclient
jupyter-core==5.7.2
jupyter-core==5.9.1
# via
# ipykernel
# jupyter-client
# nbclient
# nbformat
kiwisolver==1.4.8
# via matplotlib
latexcodec==3.0.0
latexcodec==3.0.1
# via pybtex
libsass==0.22.0
# via doxysphinx
lxml==5.2.1
# via doxysphinx
markdown-it-py==3.0.0
# via
# mdit-py-plugins
# myst-parser
markupsafe==3.0.2
markupsafe==3.0.3
# via jinja2
matplotlib==3.10.1
# via doxysphinx
matplotlib-inline==0.1.7
matplotlib-inline==0.2.1
# via
# ipykernel
# ipython
mdit-py-plugins==0.4.2
mdit-py-plugins==0.5.0
# via myst-parser
mdurl==0.1.2
# via markdown-it-py
mpire==2.10.2
# via doxysphinx
myst-nb==1.2.0
myst-nb==1.3.0
# via rocm-docs-core
myst-parser==4.0.1
# via myst-nb
nbclient==0.10.2
nbclient==0.10.4
# via
# jupyter-cache
# myst-nb
@@ -155,28 +131,20 @@ nbformat==5.10.4
# nbclient
nest-asyncio==1.6.0
# via ipykernel
numpy==1.26.4
# via
# contourpy
# doxysphinx
# matplotlib
packaging==25.0
# via
# ipykernel
# matplotlib
# pydata-sphinx-theme
# sphinx
parso==0.8.4
parso==0.8.5
# via jedi
pexpect==4.9.0
# via ipython
pillow==11.2.1
# via matplotlib
platformdirs==4.3.7
platformdirs==4.5.1
# via jupyter-core
prompt-toolkit==3.0.51
prompt-toolkit==3.0.52
# via ipython
psutil==7.0.0
psutil==7.2.1
# via ipykernel
ptyprocess==0.7.0
# via pexpect
@@ -188,36 +156,27 @@ pybtex==0.25.1
# sphinxcontrib-bibtex
pybtex-docutils==1.0.3
# via sphinxcontrib-bibtex
pycparser==2.22
pycparser==2.23
# via cffi
pydata-sphinx-theme==0.15.4
# via
# rocm-docs-core
# sphinx-book-theme
pygithub==2.6.1
pygithub==2.8.1
# via rocm-docs-core
pygments==2.19.1
pygments==2.19.2
# via
# accessible-pygments
# ipython
# mpire
# pydata-sphinx-theme
# sphinx
pyjson5==1.6.8
# via doxysphinx
pyjwt[crypto]==2.10.1
# via pygithub
pynacl==1.5.0
pynacl==1.6.2
# via pygithub
pyparsing==3.2.3
# via
# doxysphinx
# matplotlib
python-dateutil==2.9.0.post0
# via
# jupyter-client
# matplotlib
pyyaml==6.0.2
# via jupyter-client
pyyaml==6.0.3
# via
# jupyter-cache
# myst-nb
@@ -225,21 +184,21 @@ pyyaml==6.0.2
# pybtex
# rocm-docs-core
# sphinx-external-toc
pyzmq==26.4.0
pyzmq==27.1.0
# via
# ipykernel
# jupyter-client
referencing==0.36.2
referencing==0.37.0
# via
# jsonschema
# jsonschema-specifications
requests==2.32.3
requests==2.32.5
# via
# pygithub
# sphinx
rocm-docs-core[api-reference]==1.31.3
# via -r requirements.in
rpds-py==0.24.0
rpds-py==0.30.0
# via
# jsonschema
# referencing
@@ -247,9 +206,9 @@ six==1.17.0
# via python-dateutil
smmap==5.0.2
# via gitdb
snowballstemmer==2.2.0
snowballstemmer==3.0.1
# via sphinx
soupsieve==2.7
soupsieve==2.8.1
# via beautifulsoup4
sphinx==8.1.3
# via
@@ -288,23 +247,20 @@ sphinxcontrib-qthelp==2.0.0
# via sphinx
sphinxcontrib-serializinghtml==2.0.0
# via sphinx
sqlalchemy==2.0.40
sqlalchemy==2.0.45
# via jupyter-cache
stack-data==0.6.3
# via ipython
tabulate==0.9.0
# via jupyter-cache
tomli==2.2.1
tomli==2.4.0
# via sphinx
tornado==6.4.2
tornado==6.5.4
# via
# ipykernel
# jupyter-client
tqdm==4.67.1
# via mpire
traitlets==5.14.3
# via
# comm
# ipykernel
# ipython
# jupyter-client
@@ -312,22 +268,22 @@ traitlets==5.14.3
# matplotlib-inline
# nbclient
# nbformat
typing-extensions==4.13.2
typing-extensions==4.15.0
# via
# beautifulsoup4
# cryptography
# exceptiongroup
# ipython
# myst-nb
# pydata-sphinx-theme
# pygithub
# referencing
# sqlalchemy
urllib3==2.4.0
urllib3==2.6.3
# via
# pygithub
# requests
wcwidth==0.2.13
wcwidth==0.2.14
# via prompt-toolkit
wrapt==1.17.2
# via deprecated
zipp==3.21.0
zipp==3.23.0
# via importlib-metadata