Supplement the documentation

2026-06-30 11:47:48 +00:00 · 2025-11-03 21:59:29 +00:00
parent 88f97ab36b
commit 29bbf41353
4 changed files with 27 additions and 10 deletions
--- a/docs/conceptual/ck_tile/buffer_views.rst
+++ b/docs/conceptual/ck_tile/buffer_views.rst
@@ -443,15 +443,13 @@ C++ Atomic Operations
 Summary
 -------

-The BufferView abstraction represents a foundational achievement in GPU programming, providing a advanced yet accessible interface to the complex memory hierarchies of modern GPUs. Through its careful design and implementation, BufferView demonstrates that high-level abstractions need not come at the cost of performance—indeed, they can enable optimizations that would be impractical or impossible with lower-level approaches.
+BufferView abstracts GPU memory hierarchies behind a concise interface. The approach is intended to keep overhead small while enabling optimizations that are otherwise awkward in low-level code.

-The unified interface that BufferView provides across all memory spaces represents a significant simplification for GPU developers. Whether accessing global memory with its large capacity and high latency, shared memory with its limited size but superior bandwidth, or register files with their extreme speed but strict limitations, developers use the same consistent API. This uniformity reduces cognitive load, prevents errors, and enables code reuse across different memory hierarchies. The abstraction ensures that algorithms can be written once and adapted to different memory configurations through simple template parameter changes.
+BufferView offers a unified interface across global, shared, and register memory. Using the same API for each space can lower cognitive overhead, reduce certain classes of mistakes, and support code reuse via template parameters.

-Type safety, achieved through encoding the address space directly in the type system, transforms a common source of runtime errors into compile-time diagnostics. This design decision exemplifies the zero-overhead principle that permeates CK Tile—safety and correctness are achieved not through runtime checks but through compile-time verification. The C++ type system becomes an ally in ensuring correctness, preventing operations that would be invalid or inefficient for specific memory spaces while enabling aggressive optimizations for operations that are valid.
+Address spaces are encoded in types so that common errors are reported at compile time. Consistent with CK Tile’s zero-overhead design aim, we favor compile-time checks over runtime guards. The C++ type system enforces memory-space constraints and can make valid cases more amenable to compiler optimization.

-The performance characteristics of BufferView demonstrate the power of thoughtful abstraction design. Through support for vectorized operations, BufferView enables kernels to achieve near-theoretical memory bandwidth utilization. The automatic selection of optimal access patterns, the elimination of unnecessary bounds checks through compile-time knowledge, and the use of hardware-specific features like AMD buffer addressing all contribute to performance that matches or exceeds hand-optimized code. The abstraction actually enables optimizations by providing the compiler with more semantic information than raw pointer manipulation would allow.
-
-Flexibility in handling edge cases and boundary conditions proves crucial for real-world algorithms. BufferView's support for configurable invalid value handling, runtime bounds checking when needed, and conditional access patterns accommodates the full spectrum of GPU algorithms. From simple, regular computations to complex, data-dependent access patterns, BufferView provides the tools needed while maintaining performance. This flexibility extends to the atomic operations that are essential for parallel algorithms, enabling thread-safe updates without sacrificing the benefits of the abstraction.
+BufferView supports configurable handling of invalid values, optional runtime bounds checks, and conditional access patterns. It also provides atomic operations for thread-safe updates. These features are intended to cover common edge cases without adding unnecessary overhead.

 The true achievement of BufferView lies not in any single feature but in the harmonious integration of all these capabilities. By hiding the complexity of different memory spaces while exposing precisely the operations needed for high-performance GPU computing, BufferView establishes a pattern that the rest of CK Tile follows: compile-time abstractions that enhance rather than compromise performance. As we build upon this foundation with :ref:`ck_tile_tensor_views` and :ref:`ck_tile_distribution`, the wisdom of this approach becomes increasingly apparent—each layer adds capability while maintaining the efficiency established at the base. For hardware-specific details about memory hierarchies, see :ref:`ck_tile_gpu_basics`.

--- a/docs/conceptual/ck_tile/hardware/gemm_optimization.rst
+++ b/docs/conceptual/ck_tile/hardware/gemm_optimization.rst
@@ -133,7 +133,7 @@ There is a limitation, though: we cannot preload entire rows or columns of ``A``
 - Once done, move the tile window forward along the ``K`` dimension
 - Repeat until the entire **16×16 output block** of ``C`` is computed

-This technique of **tiling with cooperative loading** dramatically reduces global memory traffic and improves GPU efficiency by leveraging fast, on-chip LDS.
+This technique of **tiling with cooperative loading** dramatically reduces global memory traffic and improves GPU efficiency by leveraging fast, on-chip LDS, which offers faster access than global memory and lets the loaded data be reused.

 Tiling Mathematics
 ------------------
@@ -186,6 +186,18 @@ How much space in LDS would this tiling use? Recall that matrices **A** and **B*

 We have much more space in LDS, so why not try a bigger tile size? We can afford 32 KB for each matrix, which allows us to increase the tile size to **256×64**. With this tile size, each compute unit (CU) will output a **256×256 block in C**. With this approach, the number of global memory reads will be **256 times smaller per element in C** compared to a brute-force approach.

+Variation of the GEMM in Inference
+----------------------------------
+
+In inference, the B matrix holds the weights, which are static, so it can be preshuffled ahead of time into the layout of the warp GEMM MFMA shape. This gives the registers faster access to the data needed for the MFMA operations. This strategy enables the following optimizations:
+
+- Shared memory bypass for the B matrix.
+- Looping over the A matrix stored in shared memory while B stays in registers.
+- Ping-pong buffering for the GEMM pipeline.
+
+In general, this variant can deliver a performance improvement of more than 20% compared to the universal GEMM.
+
+
 Utilization Considerations
 --------------------------

--- a/docs/conceptual/ck_tile/hardware/lds_bank_conflicts.rst
+++ b/docs/conceptual/ck_tile/hardware/lds_bank_conflicts.rst
@@ -11,7 +11,7 @@ Understanding AMD GPU LDS and Bank Conflicts
 Introduction
 ============

-Local Data Share (**LDS**) is AMD's shared memory within a compute unit (see :ref:`ck_tile_gpu_basics` for architecture details). It is organized into **32 banks**, each serving 4 bytes per cycle. Understanding how memory addresses map to banks is key to avoiding **bank conflicts**.
+Local Data Share (**LDS**) is AMD's shared memory within a compute unit (see :ref:`ck_tile_gpu_basics` for architecture details). It is organized into **32 or 64 banks**, depending on the hardware architecture, and each bank is 4 bytes wide. Understanding how memory addresses map to banks is key to avoiding **bank conflicts**.

 Bank Mapping
 ============
@@ -24,10 +24,10 @@ For AMD GCN architecture, the LDS bank mapping is typically:

 This means:

-- Addresses that differ by multiples of ``32 * 4 bytes = 128 bytes`` map to the same bank.
+- Addresses that differ by multiples of ``number of banks * 4 bytes`` (for example, ``32 * 4 bytes = 128 bytes`` with 32 banks) map to the same bank.
 - Conflicts occur when multiple threads in the same wave access the same bank **in the same cycle**.

-Not all the lanes can produce bank conflicts. HW divides access to LDS from wavefront into phases. Which lanes would be considered in each phase depends on the width of the instruction. Let us consider ``ds_write_b128`` as an example. Here access will be divided into 8 phases for 64 lane wavefront:
+Not all the lanes can produce bank conflicts. The hardware divides a wavefront's access to LDS into phases, and a bank conflict occurs only when two threads in the same phase access the same bank. Which lanes are considered in each phase depends on the width of the instruction. Let us consider ``ds_write_b128`` as an example, since it is the write instruction with the largest granularity and the highest performance. Here, access is divided into 8 phases for a 64-lane wavefront:

 - lane0~lane7
 - lane8~lane15
@@ -66,6 +66,8 @@ Similarly for LDS read instruction ``ds_read_b128``, when there is no bank confl

 then it's bank conflict-free for LDS reading.

+The data is accessed vertically because, in most cases, the LDS access is followed by an MFMA instruction, and MFMA requires the data to be accessed vertically, as shown above.
+
 The LDS read access pattern illustrated below is typical for LDS usage in machine learning workloads. The read pattern can generate 4-way bank conflicts in every phase of access. You can experiment with ``row_padding`` (padding in a number of banks) to see if the problem can be solved this way, but also remember that in practice this will require additional LDS storage. The bigger the padding, the more additional storage is necessary.

 XOR Preshuffle: An Alternative to Padding
--- a/docs/conceptual/ck_tile/space_filling_curve.rst
+++ b/docs/conceptual/ck_tile/space_filling_curve.rst
@@ -33,6 +33,11 @@ Snake Patterns

 Snake (or serpentine) patterns reverse the traversal direction on alternate rows/planes, keeping consecutive accesses spatially close. This pattern is particularly effective for maintaining cache locality when moving between rows or higher-dimensional boundaries.

+Usage
+~~~~~
+
+Space-filling curves (SFC) are mainly used in the following parts of CK Tile: tile transpose, tile shuffling iteration, and CShuffle. In each case, they let the tile data be accessed in the discrete order the application requires while keeping the cache hit rate high.
+
 C++ Implementation
 ------------------