mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-04-20 06:49:15 +00:00
CK: removed the api reference (#3571)
* removed the api reference
* updating to the latest rocm-docs-core min version
* fixed a formatting issue with buffer views
* removed reference links from code snippets
* removed reference links from code snippets

---------

Co-authored-by: John Afaganis <john.afaganis@amd.com>
@@ -1,14 +1,13 @@
.. _ck_tile_index:

************************
CK Tile Index
************************

CK Tile documentation structure:
****************************************************
CK Tile conceptual documentation table of contents
****************************************************

.. toctree::
   :maxdepth: 2

   index
   introduction_motivation
   buffer_views
   tensor_views
@@ -1,156 +0,0 @@
# Mermaid Diagram Management

This document explains how to manage mermaid diagrams in the CK Tile documentation.

## Overview

All mermaid diagrams in the CK Tile documentation have been converted to SVG files for better rendering compatibility. The original mermaid source code is preserved as commented blocks in the RST files, allowing easy updates when needed.

## Directory Structure

- `docs/conceptual/ck_tile/diagrams/` - Contains all SVG diagram files
- `docs/conceptual/ck_tile/convert_mermaid_to_svg.py` - Initial conversion script (one-time use)
- `docs/conceptual/ck_tile/update_diagrams.py` - Helper script to regenerate diagrams from comments

## Diagram Format in RST Files

Each diagram follows this format:

```rst
..
   Original mermaid diagram (edit here, then run update_diagrams.py)

   .. mermaid::

      graph TB
          A --> B
          B --> C

.. image:: diagrams/diagram_name.svg
   :alt: Diagram
   :align: center
```

The commented mermaid block won't appear in the rendered documentation but serves as the source for regenerating the SVG.

## Updating Diagrams

### When to Update

You need to regenerate SVG files when:

- Modifying the mermaid source in a commented block
- Adding new diagrams
- Updating diagram styling

### How to Update

1. **Edit the commented mermaid source** in the RST file
2. **Run the update script**:

   ```bash
   # Update all diagrams
   python docs/conceptual/ck_tile/update_diagrams.py

   # Update diagrams in a specific file
   python docs/conceptual/ck_tile/update_diagrams.py transforms.rst

   # Force regenerate all diagrams (even if SVGs exist)
   python docs/conceptual/ck_tile/update_diagrams.py --force
   ```

### Prerequisites

The update script requires [mermaid-cli](https://github.com/mermaid-js/mermaid-cli):

```bash
npm install -g @mermaid-js/mermaid-cli
```

## Adding New Diagrams

To add a new mermaid diagram:

1. **Create the commented block** in your RST file:

   ```rst
   ..
      Original mermaid diagram (edit here, then run update_diagrams.py)

      .. mermaid::

         graph TB
             A --> B
   ```

2. **Add the image reference** immediately after:

   ```rst
   .. image:: diagrams/my_new_diagram.svg
      :alt: My New Diagram
      :align: center
   ```

3. **Generate the SVG**:

   ```bash
   python docs/conceptual/ck_tile/update_diagrams.py your_file.rst
   ```

## Current Diagrams

The following RST files contain mermaid diagrams (40 total):

- `adaptors.rst` (2 diagrams)
- `convolution_example.rst` (1 diagram)
- `coordinate_movement.rst` (1 diagram)
- `descriptors.rst` (2 diagrams)
- `encoding_internals.rst` (2 diagrams)
- `lds_index_swapping.rst` (3 diagrams)
- `load_store_traits.rst` (2 diagrams)
- `space_filling_curve.rst` (1 diagram)
- `static_distributed_tensor.rst` (1 diagram)
- `sweep_tile.rst` (4 diagrams)
- `tensor_coordinates.rst` (2 diagrams)
- `thread_mapping.rst` (2 diagrams)
- `tile_window.rst` (5 diagrams)
- `transforms.rst` (12 diagrams)

## Troubleshooting

### SVG not generated

- Check that mermaid-cli is installed: `mmdc --version`
- Verify the mermaid syntax is valid
- Look for error messages in the script output

### Diagram not updating

- Use `--force` flag to regenerate: `python docs/update_diagrams.py --force`
- Check that the image reference matches the generated filename

### Pattern not matching

If the update script can't find your commented diagram:

- Ensure proper indentation (3 spaces for comment block content)
- Verify the `.. mermaid::` directive is commented
- Check that the image reference immediately follows the comment block

## Script Details

### update_diagrams.py

This script:

1. Scans RST files for commented mermaid blocks
2. Extracts the mermaid source code
3. Converts to SVG using `mmdc`
4. Saves to the diagrams directory

**Usage:**

- `python docs/conceptual/ck_tile/update_diagrams.py` - Check all files, update missing SVGs
- `python docs/conceptual/ck_tile/update_diagrams.py --force` - Regenerate all SVGs
- `python docs/conceptual/ck_tile/update_diagrams.py <file.rst>` - Update specific file

### convert_mermaid_to_svg.py

This was the initial conversion script. It:

1. Found all active `.. mermaid::` directives
2. Converted them to SVGs
3. Replaced directives with commented source + image references

This script was used once for the initial conversion and typically doesn't need to be run again.
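The extract-then-regenerate workflow the deleted README describes (scan RST files for commented mermaid blocks, pull out the diagram source, feed it to `mmdc`) can be sketched in Python. This is an illustrative sketch only; the actual `update_diagrams.py` is not shown in this diff, so the parsing strategy and function names here are assumptions.

```python
import subprocess
from pathlib import Path


def extract_mermaid_blocks(rst_text: str) -> list:
    """Collect mermaid sources from commented blocks: a bare ``..`` line
    followed by an indented body containing a ``.. mermaid::`` directive."""
    blocks, lines = [], rst_text.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].strip() == "..":
            # Gather the indented comment body under the ".." marker.
            j = i + 1
            body = []
            while j < len(lines) and (not lines[j] or lines[j][0] in " \t"):
                body.append(lines[j])
                j += 1
            text = "\n".join(body)
            if ".. mermaid::" in text:
                # Everything after the directive is the diagram source;
                # strip its common indentation.
                src = text.split(".. mermaid::", 1)[1]
                src_lines = [l for l in src.splitlines() if l.strip()]
                indent = min(len(l) - len(l.lstrip()) for l in src_lines)
                blocks.append("\n".join(l[indent:] for l in src_lines))
            i = j
        else:
            i += 1
    return blocks


def render_svg(mermaid_src: str, out_svg: Path) -> None:
    """Regenerate one SVG via mermaid-cli (requires ``mmdc`` on PATH)."""
    mmd = out_svg.with_suffix(".mmd")
    mmd.write_text(mermaid_src)
    subprocess.run(["mmdc", "-i", str(mmd), "-o", str(out_svg)], check=True)
```

`extract_mermaid_blocks` handles exactly the commented format shown in the README; `render_svg` is the one step that needs the `@mermaid-js/mermaid-cli` prerequisite installed.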
@@ -59,8 +59,8 @@ A TensorAdaptor encapsulates a sequence of :ref:`coordinate transformations <ck_
.. image:: diagrams/adaptors_1.svg
   :alt: Diagram
   :align: center
Core Components

Core Components
~~~~~~~~~~~~~~~

Each TensorAdaptor contains:
@@ -115,7 +115,7 @@ Custom adaptors can be created by specifying which transforms to use and how the
    make_tuple(sequence<0>{})  // to single dim 0
);

// The adaptor is embedded in the :ref:`descriptor <ck_tile_descriptors>`
// The adaptor is embedded in the descriptor
// To use it:
multi_index<1> top_coord{5};  // 1D coordinate
// This internally calculates: row = 5/3 = 1, col = 5%3 = 2
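The divide/modulo in the snippet's last comment is simply an unmerge of the linear coordinate back into two dimensions. A quick Python check of that arithmetic (illustration only, not the CK Tile API):

```python
def unmerge_1d_to_2d(linear_idx: int, num_cols: int = 3):
    """Split a merged 1D coordinate into (row, col) for a 2x3 tensor:
    e.g. 5 -> row = 5 // 3 = 1, col = 5 % 3 = 2."""
    return divmod(linear_idx, num_cols)
```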
@@ -309,7 +309,6 @@ A practical example showing how adaptors create efficient :ref:`GPU memory acces
// - Dimension 0,1: Thread indices
// - Dimension 2,3: Vector indices within thread
// Enables coalesced memory access on GPU
// See :ref:`ck_tile_thread_mapping` for thread mapping details

Common Transform Chains
-----------------------
@@ -1,6 +1,25 @@
.. _ck_tile_buffer_views:

**********************************
Buffer Views - Raw Memory Access
**********************************

Overview
--------

At the foundation of the CK Tile system lies BufferView, a compile-time abstraction that provides structured access to raw memory regions within GPU kernels. This serves as the bridge between the hardware's physical memory model and the higher-level abstractions that enable efficient GPU programming. BufferView encapsulates the complexity of GPU memory hierarchies while exposing a unified interface that works seamlessly across different memory address spaces including global memory shared across the entire device, local data share (LDS) memory shared within a workgroup, or the ultra-fast register files private to each thread.

BufferView serves as the foundation for :ref:`ck_tile_tensor_views`, which add multi-dimensional structure on top of raw memory access. Understanding BufferView is essential before moving on to more complex abstractions like :ref:`ck_tile_distribution` and :ref:`ck_tile_tile_window`.

By providing compile-time knowledge of buffer properties through template metaprogramming, BufferView enables the compiler to generate optimal machine code for each specific use case. This zero-overhead abstraction ensures that the convenience of a high-level interface comes with no runtime performance penalty.

One of BufferView's most important features is its advanced handling of out-of-bounds memory access. Unlike CPU programming where such accesses typically result in segmentation faults or undefined behavior, GPU programming must gracefully handle cases where threads attempt to access memory beyond allocated boundaries. BufferView provides configurable strategies for these scenarios, where developers can choose between returning either numerical zero values or custom sentinel values for invalid accesses. This flexibility is important for algorithms that naturally extend beyond data boundaries, such as convolutions with padding or matrix operations with non-aligned dimensions.

The abstraction extends beyond simple memory access to encompass both scalar and vector data types. GPUs achieve their highest efficiency when loading or storing multiple data elements in a single instruction. BufferView seamlessly supports these vectorized operations, automatically selecting the appropriate hardware instructions based on the data type and access pattern. This capability transforms what would be multiple memory transactions into single, efficient operations that fully utilize the available memory bandwidth.

BufferView also incorporates AMD GPU-specific optimizations that leverage unique hardware features. The AMD buffer addressing mode, for instance, provides hardware-accelerated bounds checking that ensures memory safety without the performance overhead of software-based checks. Similarly, BufferView exposes atomic operations that are crucial for parallel algorithms requiring thread-safe updates to shared data structures. These hardware-specific optimizations are abstracted behind a portable interface, ensuring that code remains maintainable while achieving optimal performance.

Memory coherence and caching policies represent another layer of complexity that BufferView manages transparently. Different GPU memory spaces have different coherence guarantees and caching behaviors. Global memory accesses can be cached in L1 and L2 caches with various coherence protocols, while LDS memory provides workgroup-level coherence with specialized banking structures (see :ref:`ck_tile_lds_bank_conflicts` for details on avoiding bank conflicts). BufferView encapsulates these details, automatically applying the appropriate memory ordering constraints and cache control directives based on the target address space and operation type.
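The configurable out-of-bounds behavior described above can be pictured with a tiny sketch. This is plain Python for illustration only; the real BufferView is a C++ template whose bounds checking is backed by AMD buffer-addressing hardware.

```python
def guarded_load(buffer, index, invalid_value=0.0):
    """Mimic BufferView's out-of-bounds policy: in-range accesses return
    the element, out-of-range accesses return a configurable value
    (numerical zero or a custom sentinel) instead of faulting."""
    if 0 <= index < len(buffer):
        return buffer[index]
    return invalid_value
```

For example, a convolution reading past a row edge with `guarded_load(row, -1)` would see the zero padding value rather than trigger a fault.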
Address Space Usage Patterns
----------------------------
@@ -51,6 +70,7 @@ Address Space Usage Patterns
.. image:: diagrams/buffer_views_1.svg
   :alt: Diagram
   :align: center

C++ Implementation
------------------
@@ -59,10 +59,6 @@ The key insight is that convolution can be transformed from a complex nested loo


.. image:: diagrams/convolution_example.svg
   :alt: Diagram
   :align: center

.. image:: diagrams/convolution_example.svg
   :alt: Diagram
   :align: center
@@ -88,7 +84,6 @@ Non-overlapping tiles:

// Original matrix: shape=(6, 6), strides=(6, 1)
// Tiled view: shape=(3, 3, 2, 2), strides=(12, 2, 6, 1)
// See :ref:`ck_tile_descriptors` for descriptor details
using TileDescriptor = TensorDescriptor<
    Sequence<kNumTiles, kNumTiles, kTileSize, kTileSize>,
    Sequence<12, 2, 6, 1>
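The stride arithmetic behind that tiled view can be verified in plain Python. This is an illustration of the descriptor's indexing math, not CK Tile code: a 6x6 row-major matrix is reinterpreted as (3, 3, 2, 2) non-overlapping 2x2 tiles using element strides (12, 2, 6, 1), without copying any data.

```python
# A 6x6 row-major matrix stored flat: element (r, c) lives at offset 6*r + c.
flat = list(range(36))


def tile_elem(ti, tj, a, b):
    """Index the (3, 3, 2, 2) tiled view with strides (12, 2, 6, 1):
    element (a, b) inside tile (ti, tj) of the original matrix."""
    return flat[12 * ti + 2 * tj + 6 * a + 1 * b]
```

Tile (0, 0) recovers the matrix's top-left 2x2 block {0, 1, 6, 7}, and tile (1, 2) the block at rows 2-3, columns 4-5, confirming the stride choice.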
@@ -243,7 +238,6 @@ The im2col transformation converts the 4D windows tensor into a 2D matrix suitab
>;

// Step 2: Apply merge transforms to create 2D im2col layout
// See :ref:`ck_tile_transforms` for transform operations
using Im2colDescriptor = decltype(
    transform_tensor_descriptor(
        WindowsDescriptor{},
@@ -312,7 +306,6 @@ Combining all components into an optimized convolution implementation:
>;

// Tile distribution for matrix multiplication
// See :ref:`ck_tile_tile_distribution` for details
using ATileDist = TileDistribution<
    Sequence<TileM, TileK>,
    Sequence<BlockM, 1>
@@ -327,7 +320,6 @@ Combining all components into an optimized convolution implementation:
>;

// Thread-local accumulator
// See :ref:`ck_tile_static_distributed_tensor`
StaticDistributedTensor<DataType, CTileDist> c_accumulator;

// Initialize accumulator
@@ -339,7 +331,6 @@ Combining all components into an optimized convolution implementation:
// Main GEMM loop over K dimension
for (index_t k_tile = 0; k_tile < PatchSize; k_tile += TileK) {
    // Create tile windows for im2col matrix and kernel
    // See :ref:`ck_tile_tile_window` for window operations
    auto a_window = make_tile_window<ATileDist>(
        input, Im2colDesc{H, W, K},
        {blockIdx.y * TileM, k_tile}
@@ -350,7 +341,7 @@ Combining all components into an optimized convolution implementation:
        {k_tile, 0}
    );

    // Load tiles - see :ref:`ck_tile_load_store_traits` for optimization
    // Load tiles
    auto a_tile = a_window.load();
    auto b_tile = b_window.load();

@@ -476,7 +467,6 @@ CK Tile enables several optimizations for convolution:
__shared__ float smem_b[TileK][TileN];

// Collaborative loading with proper bank conflict avoidance
// See :ref:`ck_tile_lds_bank_conflicts` for optimization
auto load_tile_to_smem = [&](auto& window, float smem[][TileK]) {
    #pragma unroll
    for (index_t i = threadIdx.y; i < TileM; i += blockDim.y) {
@@ -560,7 +550,7 @@ This example demonstrates how CK Tile transforms convolution from a memory-bound

- **Sliding windows** can be efficiently represented using tensor descriptors with appropriate strides
- **Im2col transformation** converts convolution to matrix multiplication without data copies
- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_tile_distribution`)
- **Tile distribution** enables optimal work distribution across GPU threads (see :ref:`ck_tile_distribution`)
- **Multi-channel support** extends naturally through higher-dimensional descriptors
- **Performance optimizations** like vectorization and shared memory are seamlessly integrated (see :ref:`ck_tile_gemm_optimization` for similar techniques)
@@ -317,7 +317,7 @@ Movement Through Adaptors
Advanced Movement Patterns
==========================

Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_tile_distribution` concepts:
Real-world applications use advanced movement patterns for optimal memory access. These patterns often relate to :ref:`ck_tile_tile_window` operations and :ref:`ck_tile_distribution` concepts:

Tiled Access Pattern
--------------------
@@ -315,18 +315,18 @@ Padding for Convolution

.. code-block:: cpp

   // Add padding to spatial dimensions
   auto padded = transform_tensor_descriptor(
       input_tensor,
       make_tuple(
           make_pass_through_transform(N),      // Batch
           make_pass_through_transform(C),      // Channel
           make_pad_transform(H, pad_h, pad_h), // Height
           make_pad_transform(W, pad_w, pad_w)  // Width
       ),
       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
   );
   // Add padding to spatial dimensions
   auto padded = transform_tensor_descriptor(
       input_tensor,
       make_tuple(
           make_pass_through_transform(N),      // Batch
           make_pass_through_transform(C),      // Channel
           make_pad_transform(H, pad_h, pad_h), // Height
           make_pad_transform(W, pad_w, pad_w)  // Width
       ),
       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{}),
       make_tuple(sequence<0>{}, sequence<1>{}, sequence<2>{}, sequence<3>{})
   );

For a complete convolution example, see :ref:`ck_tile_convolution_example`.
@@ -260,7 +260,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
    index_t K)
{
    // Define tile distribution encoding
    // See :ref:`ck_tile_encoding_internals` and :ref:`ck_tile_tile_distribution`
    using Encoding = tile_distribution_encoding<
        sequence<>,                  // No replication
        tuple<sequence<4, 2, 8, 4>,  // M dimension hierarchy
@@ -274,7 +273,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
    constexpr auto tile_dist = make_static_tile_distribution(Encoding{});

    // Create tensor views for global memory
    // See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_buffer_views`
    auto a_global_view = make_naive_tensor_view<address_space_enum::global>(
        a_global, make_tuple(M, K), make_tuple(K, 1));
    auto b_global_view = make_naive_tensor_view<address_space_enum::global>(
@@ -287,7 +285,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
    const index_t block_n_id = blockIdx.x;

    // Create tile windows for loading
    // See :ref:`ck_tile_tile_window` for tile window details
    auto a_window = make_tile_window(
        a_global_view,
        make_tuple(number<MPerBlock>{}, number<KPerBlock>{}),
@@ -301,7 +298,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
        tile_dist);

    // Allocate LDS storage
    // See :ref:`ck_tile_static_distributed_tensor` for distributed tensors
    auto a_lds = make_static_distributed_tensor<ADataType,
                                                decltype(tile_dist)>();
    auto b_lds = make_static_distributed_tensor<BDataType,
@@ -310,7 +306,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
    // Initialize accumulator
    auto c_reg = make_static_distributed_tensor<CDataType,
                                                decltype(tile_dist)>();
    // See :ref:`ck_tile_sweep_tile` for sweep operations
    sweep_tile(c_reg, [](auto idx, auto& val) { val = 0; });

    // Main GEMM loop with pipelining
@@ -324,7 +319,6 @@ Here's how CK Tile implements an optimized GEMM kernel:
    // Pipeline loop
    for(index_t k_tile = 0; k_tile < num_k_tiles - 1; ++k_tile) {
        // Move windows for next iteration
        // See :ref:`ck_tile_coordinate_movement` for window movement
        a_window.move_slice_window(make_tuple(0, KPerBlock));
        b_window.move_slice_window(make_tuple(0, KPerBlock));
@@ -172,7 +172,6 @@ Example usage in CK Tile:
    a_window.load(a_lds_tensor);

    // Subsequent reads from LDS are conflict-free
    // See :ref:`ck_tile_sweep_tile` for sweep operations
    sweep_tile(a_lds_tensor, [](auto idx, auto& val) {
        // Process data...
    });
@@ -276,7 +276,7 @@ The foundation of the exploration begins with raw memory access through :ref:`ck

With these foundational concepts established, the documentation delves into the :ref:`ck_tile_coordinate_systems` that powers tile distribution. This engine implements the mathematical framework that has been introduced, providing compile-time transformations between P-space, Y-space, X-space, and D-space. Understanding these transformations at a deep level enables developers to reason about performance implications and design custom distribution strategies for novel algorithms. The :ref:`ck_tile_transforms` and :ref:`ck_tile_adaptors` provide the building blocks for these transformations.

The high-level :ref:`ck_tile_distribution` APIs represent the culmination of these lower-level abstractions. These APIs provide an accessible interface for common patterns while exposing enough flexibility for advanced optimizations. Through concrete examples and detailed explanations, the documentation will demonstrate how to leverage these APIs to achieve near-optimal performance across a variety of computational patterns. The :ref:`ck_tile_window` abstraction provides the gateway for efficient data access.
The high-level :ref:`ck_tile_distribution` APIs represent the culmination of these lower-level abstractions. These APIs provide an accessible interface for common patterns while exposing enough flexibility for advanced optimizations. Through concrete examples and detailed explanations, the documentation will demonstrate how to leverage these APIs to achieve near-optimal performance across a variety of computational patterns. The :ref:`ck_tile_tile_window` abstraction provides the gateway for efficient data access.

The exploration of coordinate systems goes beyond the basic P, Y, X, D framework to encompass advanced topics such as multi-level tiling, replication strategies, and specialized coordinate systems for specific algorithm classes. The :ref:`ck_tile_encoding_internals` reveals the mathematical foundations, while :ref:`ck_tile_thread_mapping` shows how these abstractions map to hardware. This comprehensive treatment ensures that developers can handle not just common cases but also novel algorithms that require custom distribution strategies.
@@ -5,7 +5,7 @@
.. _ck_tile_lds_index_swapping:

********************************
Load Datat Share Index Swapping
Load Data Share Index Swapping
********************************

Overview
@@ -70,9 +70,9 @@ The original K coordinate is split into K0 and K1, where K1 represents the threa

The XOR transformation updates the K0 coordinate using the formula:

.. code-block:: cpp
.. math::

   K0' = K0 ^ (M % (KPerBlock / KPack * MLdsLayer))
   K0' = K0 \oplus (M \% (KPerBlock / KPack \cdot MLdsLayer))

This XOR operation redistributes accesses across memory banks by mixing bits from the M and K dimensions.
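The XOR formula can be exercised numerically. In the sketch below, KPerBlock=32, KPack=8, and MLdsLayer=4 are assumed example values (not taken from this page); the point is that rows which would otherwise all land on the same K0, and therefore the same bank, are spread across distinct K0' values.

```python
def swizzle_k0(k0: int, m: int, k_per_block: int = 32,
               k_pack: int = 8, m_lds_layer: int = 4) -> int:
    """K0' = K0 XOR (M % (KPerBlock / KPack * MLdsLayer))."""
    return k0 ^ (m % (k_per_block // k_pack * m_lds_layer))


# Eight consecutive M rows all targeting K0 == 0 (a guaranteed bank
# conflict) now map to eight distinct swizzled indices.
spread = [swizzle_k0(0, m) for m in range(8)]
```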
@@ -132,10 +132,10 @@ The transformed K0' is split into L and K0'' components, creating an intermediat

The unmerge operation:

.. code-block:: cpp
.. math::

   L = K0' / (KPerBlock/KPack)
   K0'' = K0' % (KPerBlock/KPack)
   K0'' = K0' \% (KPerBlock/KPack)

When MLdsLayer == 1, this simplifies to L=0 and K0''=K0'.
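The unmerge is a plain quotient/remainder split. With the same assumed example values as above (KPerBlock=32, KPack=8, so KPerBlock/KPack = 4), it reads:

```python
def unmerge_k0(k0_prime: int, k_per_block: int = 32, k_pack: int = 8):
    """Split K0' into (L, K0''):
    L = K0' // (KPerBlock/KPack), K0'' = K0' % (KPerBlock/KPack)."""
    return divmod(k0_prime, k_per_block // k_pack)
```

When K0' never reaches KPerBlock/KPack (the MLdsLayer == 1 case), the quotient L is always 0 and K0'' equals K0', matching the simplification noted above.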
@@ -71,7 +71,6 @@ The LoadStoreTraits class analyzes distribution patterns at compile time:
    static constexpr index_t scalars_per_access = scalar_per_vector;

    // Space-filling curve for optimal traversal
    // See :ref:`ck_tile_space_filling_curve` for details
    using sfc_type = space_filling_curve<ndim_y>;
    static constexpr sfc_type sfc_ys = make_space_filling_curve<Distribution>();

@@ -274,7 +273,7 @@ LoadStoreTraits optimizes for several performance metrics:
        return Traits::num_access;
    }

    // Check coalescing efficiency (see :ref:`ck_tile_gpu_basics`)
    // Check coalescing efficiency
    static constexpr bool is_perfectly_coalesced()
    {
        // Perfect coalescing when adjacent threads access adjacent memory
@@ -316,7 +315,6 @@ Comparing Different Configurations
    static_assert(OptimizedAnalyzer::bandwidth_utilization() == 50.0f); // 8*4/64

    // Better bandwidth utilization leads to improved performance
    // See :ref:`ck_tile_gemm_optimization` for real-world examples

Integration with Space-Filling Curves
-------------------------------------
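The 50.0f in that static_assert follows from the numbers in its comment: 8 scalars of 4 bytes each against a 64-byte memory transaction. A sketch of the arithmetic, expressed as a percentage as the assert implies (the function name here is illustrative, not the CK Tile signature):

```python
def bandwidth_utilization(scalars_per_access: int, bytes_per_scalar: int,
                          transaction_bytes: int = 64) -> float:
    """Useful bytes moved per memory transaction, as a percentage."""
    return 100.0 * scalars_per_access * bytes_per_scalar / transaction_bytes
```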
@@ -254,7 +254,6 @@ For :ref:`matrix multiplication <ck_tile_gemm_optimization>`, optimal access pat

// GEMM tile: 16x32 with vector-8 loads
// Column-major for coalesced access in GEMM
// See :ref:`ck_tile_gemm_optimization` for complete example
using GemmTileCurve = space_filling_curve<
    2,
    sequence<16, 32>,  // Tile size
@@ -336,7 +335,7 @@ Optimizing for Hardware

.. code-block:: cpp

   // Optimize for GPU memory coalescing (see :ref:`ck_tile_gpu_basics`)
   // Optimize for GPU memory coalescing
   template <typename DataType, index_t WarpSize = 32>
   struct coalesced_access_pattern
   {
@@ -411,7 +410,6 @@ LoadStoreTraits Integration
struct load_store_traits
{
    // Create optimized space-filling curve
    // See :ref:`ck_tile_tile_distribution` for Distribution details
    using sfc_type = space_filling_curve<
        Distribution::ndim_y,
        typename Distribution::y_lengths,
@@ -461,7 +459,6 @@ Best Practices

.. code-block:: cpp

   // Match vector size to cache line for optimal bandwidth
   // See :ref:`ck_tile_lds_bank_conflicts` for cache optimization
   constexpr index_t optimal_vector = min(
       tensor_length_fast_dim,
       cache_line_size / sizeof(DataType)
@@ -17,9 +17,9 @@ Each thread in a workgroup owns a portion of the overall tensor data, stored in

This design enables three critical optimizations:

* It maximizes register utilization by keeping frequently accessed data in the fastest memory hierarchy.
* It eliminates redundant memory accesses since each thread maintains its own working set.
* It provides a clean abstraction for complex algorithms like matrix multiplication where each thread accumulates partial results that eventually combine into the final output.
* It maximizes register utilization by keeping frequently accessed data in the fastest memory hierarchy.
* It eliminates redundant memory accesses since each thread maintains its own working set.
* It provides a clean abstraction for complex algorithms like matrix multiplication where each thread accumulates partial results that eventually combine into the final output.

Thread-Local Storage Model
==========================
@@ -384,8 +384,7 @@ Static distributed tensors integrate seamlessly with other CK Tile components:
    // Main GEMM loop
    for(index_t k_tile = 0; k_tile < K; k_tile += kTileK) {
        // Create tile windows for this iteration
        // See :ref:`ck_tile_tile_window` for details
        auto a_window = make_tile_window(
        auto a_window = make_tile_window(
            a_ptr, ALayout{M, K},
            ATileDist{},
            {blockIdx.y * kTileM, k_tile}
@@ -398,7 +397,6 @@ Static distributed tensors integrate seamlessly with other CK Tile components:
        );

        // Load tiles to distributed tensors
        // See :ref:`ck_tile_load_store_traits` for optimized loading
        auto a_tile = a_window.load();
        auto b_tile = b_window.load();
@@ -356,7 +356,6 @@ CK uses several techniques to optimize memory access:
                float>>>;

// 2. Swizzling to avoid bank conflicts
// See :ref:`ck_tile_lds_index_swapping` and :ref:`ck_tile_swizzling_example`
template <index_t BankSize = 32>
__device__ index_t swizzle_offset(index_t tid, index_t offset)
{
@@ -434,7 +433,6 @@ The following example shows how thread mapping works in a CK kernel:
    __shared__ ComputeType shared_sum[BlockSize];

    // 5. Create tensor view and tile window
    // See :ref:`ck_tile_tensor_views` and :ref:`ck_tile_tile_window`
    auto x_view = make_naive_tensor_view<address_space_enum::global>(
        x + bid * hidden_size,
        make_tuple(hidden_size),
@@ -1,4 +1,4 @@
.. _ck_tile_distribution:
.. _ck_tile_tile_distribution:

Tile Distribution - The Core API
================================
@@ -283,7 +283,7 @@ Creating and Using TileWindow

using namespace ck_tile;

// Create a tensor view for input data (see :ref:`ck_tile_tensor_views`)
// Create a tensor view for input data
auto tensor_view = make_naive_tensor_view(
    data_ptr,
    make_tuple(256, 256),  // Shape
@@ -314,7 +314,7 @@ Creating and Using TileWindow
    distribution
);

// Load data into distributed tensor (see :ref:`ck_tile_static_distributed_tensor`)
// Load data into distributed tensor
auto distributed_data = make_static_distributed_tensor<float>(distribution);
window.load(distributed_data);

@@ -558,7 +558,6 @@ Complete Load-Compute-Store Pipeline
    c_dist);

// Create distributed tensors for register storage
// See :ref:`ck_tile_static_distributed_tensor` for details
auto a_reg = make_static_distributed_tensor<AType>(a_dist);
auto b_reg = make_static_distributed_tensor<BType>(b_dist);
auto c_reg = make_static_distributed_tensor<CType>(c_dist);
@@ -620,6 +619,8 @@ Performance Characteristics
.. image:: diagrams/tile_window_5.svg
   :alt: Diagram
   :align: center


Best Practices
--------------
@@ -302,7 +302,7 @@ EmbedTransform expands linear indices from the lower coordinate space into multi

using namespace ck_tile;

// Create embed transform for 2x3 tensor with strides [12, 1]
// This is commonly used in :ref:`descriptors <ck_tile_descriptors>`
// This is commonly used in descriptors
auto transform = make_embed_transform(make_tuple(2, 3), make_tuple(12, 1));

// Forward: Linear → 2D (Manual calculation)
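The forward calculation this snippet sets up is a dot product of the multi-dimensional coordinate with the strides. A Python check of that arithmetic for the 2x3 / strides [12, 1] example (an illustration of the math, not the C++ `make_embed_transform` API):

```python
def embed_forward(coord, strides=(12, 1)):
    """EmbedTransform forward direction: multi-dimensional coordinate
    -> linear offset, as coord . strides."""
    return sum(c * s for c, s in zip(coord, strides))
```

For the 2x3 tensor, coordinate (1, 2) lands at offset 1*12 + 2*1 = 14.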