[composable_kernel] initial draft of the ck tile conceptual doc (#3242)
* Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc * Fixes for ruff check. * Fixes for ruff check 2. * Fixes for ruff check 3. --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
===================================
Cache Flushing for GPU Benchmarking
===================================

Overview
========

When benchmarking GPU kernels, accurate performance measurements require understanding and controlling cache behavior. Running a kernel multiple times with the same input data can lead to artificially fast results due to **cache hits**, where data and instructions are served from the fast GPU caches rather than from the much slower High Bandwidth Memory (HBM).

Composable Kernel provides two complementary mechanisms to ensure realistic "cold cache" performance measurements:

1. **Instruction Cache Flushing** - Invalidates cached GPU instructions
2. **Rotating Memory Buffers** - Cycles through multiple copies of the data buffers at different memory addresses

This document explains how these mechanisms work and how to use them in benchmarks.
The Problem: Hot vs. Cold Cache
===============================

GPU Memory Hierarchy
--------------------

GPUs have a multi-level cache hierarchy:

.. code-block:: text

   Fast → Slow, Small → Large

   ┌─────────────────┐
   │ Register File   │  ~1 cycle
   ├─────────────────┤
   │ L1 I-Cache      │  ~4 cycles    ← Instruction cache
   ├─────────────────┤
   │ L1 Data Cache   │  ~4 cycles    ← Data cache
   ├─────────────────┤
   │ L2 Cache        │  ~50 cycles
   ├─────────────────┤
   │ HBM (VRAM)      │  ~400 cycles
   └─────────────────┘

Cache Behavior Without Flushing
-------------------------------

When running a kernel repeatedly without cache management:

.. code-block:: text

   Run 1: [Cache MISS] → Fetch from HBM  → ~400 cycles → 5.2 ms
   Run 2: [Cache HIT!] → Read from L1/L2 → ~4 cycles   → 3.8 ms  ← Artificially fast!
   Run 3: [Cache HIT!] → Read from L1/L2 → ~4 cycles   → 3.8 ms
   ...
   Average: 4.1 ms (misleading - not representative of real-world performance)

This leads to:

- ✗ Inflated performance numbers
- ✗ Inconsistent timing between first and subsequent runs
- ✗ Unfair comparisons between different kernels
- ✗ Misleading optimization decisions
Solution 1: Instruction Cache Flushing
======================================

What is the Instruction Cache?
------------------------------

The **instruction cache (I-cache)** is a small, fast memory on each GPU compute unit that stores recently executed machine code instructions. When a thread needs to execute an instruction:

1. The **Program Counter (PC)** holds the instruction's memory address
2. The GPU checks whether that address exists in the I-cache
3. **Cache HIT**: the instruction is read directly from the I-cache (~4 cycles)
4. **Cache MISS**: the instruction is fetched from HBM (~400 cycles), then cached

How It Works
------------

The GPU uses **address-based caching**: when you launch the same kernel multiple times, the kernel code resides at the same memory address, allowing the I-cache to serve cached instructions.

.. code-block:: text

   First Kernel Run:
     PC = 0x7F8A0000 → I-Cache lookup → MISS → Fetch from HBM → Cache it

   Second Kernel Run (without flush):
     PC = 0x7F8A0000 → I-Cache lookup → HIT! → Read from cache (fast!)

   Second Kernel Run (with flush):
     PC = 0x7F8A0000 → I-Cache lookup → MISS → Fetch from HBM again

The ``flush_icache()`` Function
-------------------------------

Located in ``include/ck_tile/host/flush_icache.hpp``:

.. code-block:: cpp

   namespace ck_tile {
   // GPU kernel to invalidate the instruction cache for accurate benchmarking.
   static __global__ void flush_cache()
   {
       asm __volatile__("s_icache_inv \n\t" // Invalidate I-cache
                        "s_nop 0 \n\t"      // Wait cycles (16 NOPs)
                        "s_nop 0 \n\t"
                        // ... 14 more NOPs
                        "s_nop 0 \n\t" ::
                            :);
   }
   } // namespace ck_tile

**Key Components:**

- ``s_icache_inv``: AMD GPU instruction that invalidates the L1 instruction cache on the current compute unit
- ``s_nop 0`` (×16): no-operation instructions (NOPs) that create a 16-cycle delay to ensure the cache invalidation completes before the kernel exits

**Why 16 NOPs?**

The ``s_icache_inv`` instruction is **asynchronous**: it initiates cache invalidation but doesn't wait for completion. Without the NOPs, the kernel might exit before the flush finishes, leading to race conditions and incomplete cache invalidation.
Launching the Flush Kernel
--------------------------

From ``include/ck_tile/host/rotating_buffers.hpp``:

.. code-block:: cpp

   inline void flush_icache()
   {
       hipDeviceProp_t deviceProps;
       HIP_CHECK_ERROR(hipGetDeviceProperties(&deviceProps, 0));

       // Over-provision blocks to ensure all CUs execute the flush instruction.
       // With imperfect scheduling, launching exactly 1 block per CU doesn't guarantee coverage.
       // 60x over-provisioning provides statistical certainty that every CU gets at least one block.
       constexpr int32_t blocks_per_cu = 60;
       int32_t gpu_block3 = deviceProps.multiProcessorCount * blocks_per_cu;

       ck_tile::flush_cache<<<dim3(gpu_block3), dim3(64), 0, nullptr>>>();
       HIP_CHECK_ERROR(hipGetLastError());
   }

**Why 60× Over-provisioning?**

The I-cache is **per-compute-unit** (CU). To flush all CUs, we must ensure every CU executes at least one instance of ``s_icache_inv``.

- Launching exactly 1 block per CU doesn't guarantee a 1:1 mapping due to GPU scheduler behavior
- Launching 60 blocks per CU provides statistical certainty that every CU receives work
- For a 120-CU GPU: 120 × 60 = 7,200 blocks × 64 threads = 460,800 total threads

This ensures comprehensive instruction cache flushing across all compute units.
Solution 2: Rotating Memory Buffers
===================================

What is the Data Cache?
-----------------------

While the I-cache stores instructions, the **data cache** (L1 data, L2) stores matrix data (inputs A, B and output C). When a kernel reads the same matrix repeatedly, the data is served from cache rather than HBM.

The RotatingMemWrapper Struct
-----------------------------

Located in ``include/ck_tile/host/rotating_buffers.hpp``:

.. code-block:: cpp

   template <typename ADataType, typename BDataType>
   struct RotatingMemWrapper
   {
       RotatingMemWrapper(const void* a_ptr_,
                          const void* b_ptr_,
                          std::size_t rotating_count_,
                          std::size_t size_a_,
                          std::size_t size_b_);

       void Next();                    // Rotate to the next buffer copy
       ~RotatingMemWrapper() noexcept; // Cleanup
   };

**Purpose**: prevents data cache reuse by cycling through multiple copies of the input matrices at different memory addresses.

How It Works
------------

**Constructor: Create Buffer Copies**

.. code-block:: cpp

   RotatingMemWrapper(a_ptr, b_ptr, rotating_count=3, size_a, size_b)
   {
       // Store the original buffer pointers as the first entry
       p_a_grids.push_back(a_ptr);
       p_b_grids.push_back(b_ptr);

       // Create (rotating_count - 1) additional copies at different memory addresses
       for(size_t i = 1; i < rotating_count; i++)
       {
           void* pADeviceBuf;
           hipMalloc(&pADeviceBuf, size_a);
           hipMemcpy(pADeviceBuf, p_a_grids[0], size_a, hipMemcpyDeviceToDevice);
           p_a_grids.push_back(pADeviceBuf);

           // Same for the B matrix...
       }
   }

Result:

.. code-block:: text

   GPU Memory:
   ┌─────────────────────────┐
   │ Matrix A (original)     │ Address: 0x1000
   │ Matrix A (copy 1)       │ Address: 0x2000
   │ Matrix A (copy 2)       │ Address: 0x3000
   │ Matrix B (original)     │ Address: 0x4000
   │ Matrix B (copy 1)       │ Address: 0x5000
   │ Matrix B (copy 2)       │ Address: 0x6000
   └─────────────────────────┘
**Next(): Rotate to the Next Buffer**

.. code-block:: cpp

   void Next()
   {
       if(rotating_count > 1)
       {
           std::size_t idx = iter++ % rotating_count; // Cycle: 0,1,2,0,1,2,...
           a_ptr = p_a_grids[idx];
           b_ptr = p_b_grids[idx];
       }
   }

Usage in the benchmarking loop:

.. code-block:: text

   Iteration 1: Next() → Use buffers at 0x1000, 0x4000 → Kernel reads → Cache miss
   Iteration 2: Next() → Use buffers at 0x2000, 0x5000 → Kernel reads → Cache miss
   Iteration 3: Next() → Use buffers at 0x3000, 0x6000 → Kernel reads → Cache miss
   Iteration 4: Next() → Use buffers at 0x1000, 0x4000 → Kernel reads → Cache miss
   ...

By the time the buffers cycle back to the first copy, the cache has likely evicted the old data.
**Destructor: Cleanup**

.. code-block:: cpp

   ~RotatingMemWrapper() noexcept
   {
       // Restore the original buffer pointers
       a_ptr = p_a_grids[0];
       b_ptr = p_b_grids[0];

       // Free the extra buffer copies (index 0 is the original, don't free it)
       for(size_t i = 1; i < rotating_count; i++)
       {
           hipFree(p_a_grids[i]);
           hipFree(p_b_grids[i]);
       }
   }
Using Cache Flushing in Practice
================================

Command-Line Argument
---------------------

The ``flush_cache`` command-line argument controls whether cache flushing is enabled:

.. code-block:: bash

   # Enable cache flushing (cold cache benchmarking)
   ./gemm_example --flush_cache=1 --rotating_count=3

   # Disable cache flushing (hot cache benchmarking)
   ./gemm_example --flush_cache=0

In ``run_gemm_quant_example.inc``:

.. code-block:: cpp

   bool flush_cache   = arg_parser.get_bool("flush_cache");
   int rotating_count = arg_parser.get_int("rotating_count");

   // Pass to stream_config
   ck_tile::stream_config{
       nullptr,        // stream
       true,           // time_kernel
       1,              // log_level
       n_warmup,       // cold_niters (warmup iterations)
       n_repeat,       // nrepeat (timed iterations)
       true,           // is_gpu_timer
       flush_cache,    // flush_cache_    ← Controls cache flushing
       rotating_count  // rotating_count_ ← Number of buffer copies
   }

Integration with the Timing Loop
--------------------------------

The ``launch_kernel_time_mask`` function integrates both mechanisms:

.. code-block:: cpp

   // From include/ck_tile/host/kernel_launch.hpp
   template <typename PreprocessFunc, typename... Callables>
   float launch_kernel_time_mask(const stream_config& s,
                                 PreprocessFunc preprocess,
                                 Callables&&... callables)
   {
       // Timing loop (simplified)
       for(int i = 0; i < s.nrepeat_; i++)
       {
           preprocess();     // 1. Flush I-cache + rotate buffers
           callables_func(); // 2. Launch kernel
       }

       return average_time;
   }
Complete Example
----------------

From ``example/ck_tile/38_block_scale_gemm/run_gemm_quant_example.inc``:

.. code-block:: cpp

   // Set up the rotating memory wrapper
   RotatingMemWrapper<ADataType, BDataType> rotating_mem(
       a_ptr, b_ptr, rotating_count, size_a, size_b);

   // Define preprocessing: flush I-cache + rotate buffers
   auto preprocess = [&]() {
       if(flush_cache) {
           flush_icache();      // Invalidate instruction cache
           rotating_mem.Next(); // Switch to the next buffer copy
       }
   };

   // Define the kernel launch
   auto kernel_launch = [&]() {
       gemm_kernel<<<grid, block>>>(a_ptr, b_ptr, c_ptr, M, N, K);
   };

   // Benchmark with cache control
   float avg_time = launch_kernel_time_mask(
       stream_config, // Config with flush_cache and rotating_count
       preprocess,    // Flush + rotate before each iteration
       kernel_launch  // Kernel to benchmark
   );

Execution Flow
--------------

With ``flush_cache=true``, ``rotating_count=3``, and ``nrepeat=100``:

.. code-block:: text

   Warmup Phase (n_warmup iterations):
     - Run the kernel without timing
     - Prime the GPU, warm up the scheduler

   Timed Phase (100 iterations):
     Iteration 1:   flush_icache() → rotating_mem.Next() → Use buffer copy 0 → kernel() → Measure
     Iteration 2:   flush_icache() → rotating_mem.Next() → Use buffer copy 1 → kernel() → Measure
     Iteration 3:   flush_icache() → rotating_mem.Next() → Use buffer copy 2 → kernel() → Measure
     Iteration 4:   flush_icache() → rotating_mem.Next() → Use buffer copy 0 → kernel() → Measure
     ...
     Iteration 100: flush_icache() → rotating_mem.Next() → Use buffer copy 0 → kernel() → Measure

   Return: Average time per iteration (excluding preprocess overhead)
References
==========

Related Files
-------------

- ``include/ck_tile/host/flush_icache.hpp`` - I-cache flush kernel implementation
- ``include/ck_tile/host/rotating_buffers.hpp`` - ``RotatingMemWrapper`` implementation
- ``include/ck_tile/host/kernel_launch.hpp`` - Timing loop integration

Conclusion
==========

Accurate GPU kernel benchmarking requires careful control of cache behavior. The combination of **instruction cache flushing** (``flush_icache``) and **rotating memory buffers** (``RotatingMemWrapper``) ensures realistic "cold cache" performance measurements that represent real-world application behavior.

By understanding and using these mechanisms through the ``flush_cache`` command-line argument, you can obtain trustworthy performance data for optimization decisions and fair kernel comparisons.