.. meta::
   :description: CK Tile Hardware-Specific Documentation
   :keywords: CDNA, GPU architecture, LDS, GEMM, CK, Composable Kernel

.. _ck_tile_hardware:

********************************************************************
CK Tile Hardware Documentation
********************************************************************

This section provides in-depth coverage of hardware-specific concepts and optimizations for CK Tile on AMD GPUs.

Overview
========

Understanding the underlying hardware architecture is crucial for achieving optimal performance with CK Tile. This documentation covers:

- AMD CDNA architecture fundamentals
- Memory hierarchy and optimization techniques
- Practical examples of high-performance kernels

Documentation Structure
=======================

.. toctree::
   :maxdepth: 2
   :caption: Hardware Topics

   gpu_basics
   lds_bank_conflicts
   gemm_optimization

GPU Architecture Basics
-----------------------

:ref:`ck_tile_gpu_basics` provides an introduction to AMD CDNA architecture.

LDS and Bank Conflicts
----------------------

:ref:`ck_tile_lds_bank_conflicts` explains Local Data Share (LDS) optimization.

GEMM Optimization Case Study
----------------------------

:ref:`ck_tile_gemm_optimization` demonstrates a complete optimization journey.

Key Hardware Considerations
===========================

Memory Hierarchy
----------------

1. **Global Memory**: High latency, high bandwidth

   - Optimize with coalesced access patterns
   - Use tile windows for automatic optimization

2. **L2/Infinity Cache**: Intermediate storage

   - Benefits from spatial and temporal locality
   - CK Tile's tiling naturally improves cache hit rates

3. **LDS**: Low latency, shared within a CU

   - 64 KB per CU, organized in 32 banks
   - CK Tile handles bank conflict avoidance

4. **Registers**: Lowest latency, per-thread storage

   - Up to 512 VGPRs available per wavefront
   - CK Tile's compile-time optimization minimizes usage

Compute Resources
-----------------

1. **Wavefront Execution**: 64 threads in lockstep

   - CK Tile ensures coalesced memory access
   - Automatic wavefront-level synchronization

2. **Matrix Units**: Specialized MFMA instructions

   - 16x16x16 operations in 16 cycles
   - CK Tile can leverage these automatically

3. **Occupancy**: Balancing threads vs resources

   - Register pressure affects occupancy
   - CK Tile helps through efficient register use

Performance Guidelines
======================

To achieve optimal performance with CK Tile:

1. **Choose appropriate tile sizes**:

   - Match hardware capabilities (e.g., 256x256 for GEMM)
   - Consider LDS capacity and register pressure

2. **Align problem dimensions**:

   - Match the CU count when possible (304 on MI300X)
   - Use padding for non-aligned sizes

3. **Enable pipelining**:

   - Use double buffering for latency hiding
   - CK Tile supports async operations

4. **Profile and verify**:

   - Use rocprof to check for bottlenecks
   - Verify bank conflict avoidance
   - Monitor occupancy and register usage

Next Steps
==========

- Review :ref:`ck_tile_gpu_basics` for architecture fundamentals
- Study :ref:`ck_tile_lds_bank_conflicts` for shared memory optimization
- Explore :ref:`ck_tile_gemm_optimization` for a complete optimization example

For practical implementation, refer back to the main :ref:`ck_tile_conceptual` documentation to see how these hardware concepts integrate with CK Tile's abstractions.
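
The LDS figures under Memory Hierarchy (64 KB per CU, 32 banks) and the tile-size guideline under Performance Guidelines can be sanity-checked with a quick back-of-envelope calculation. The sketch below is illustrative Python, not CK Tile code. The helper names, the K depths of 32 and 64, the 4-byte bank width, and the simplified conflict model (all listed accesses treated as one group, identical words treated as a broadcast) are assumptions made for this example only.

```python
from collections import defaultdict

LDS_BYTES_PER_CU = 64 * 1024  # from the Memory Hierarchy list in this page
LDS_BANKS = 32                # from the Memory Hierarchy list in this page
BANK_WIDTH_BYTES = 4          # assumption: each bank serves one 4-byte word


def lds_tile_bytes(tile_m, tile_n, tile_k, elem_bytes, stages=2):
    """LDS bytes for an A tile (M x K) plus a B tile (N x K), kept in `stages` buffers."""
    return (tile_m * tile_k + tile_n * tile_k) * elem_bytes * stages


def fits_in_lds(tile_m, tile_n, tile_k, elem_bytes, stages=2):
    """True if the buffered tiles fit in one CU's LDS."""
    return lds_tile_bytes(tile_m, tile_n, tile_k, elem_bytes, stages) <= LDS_BYTES_PER_CU


def max_conflict_degree(byte_offsets):
    """Largest number of distinct 4-byte words that map to the same bank.

    Simplified model: every offset in the list is assumed to be issued together.
    """
    words_per_bank = defaultdict(set)
    for offset in byte_offsets:
        word = offset // BANK_WIDTH_BYTES
        words_per_bank[word % LDS_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    # 256x256 fp16 tile with double buffering; the K depths are hypothetical.
    for tile_k in (32, 64):
        size = lds_tile_bytes(256, 256, tile_k, elem_bytes=2)
        print(f"K={tile_k}: {size} bytes, fits={fits_in_lds(256, 256, tile_k, 2)}")

    # 32 lanes each read column 0 of their own row in an fp32 tile.
    unpadded = [lane * 32 * 4 for lane in range(32)]  # row stride of 32 floats
    padded = [lane * 33 * 4 for lane in range(32)]    # row stride padded to 33 floats
    print("unpadded conflict degree:", max_conflict_degree(unpadded))
    print("padded conflict degree:", max_conflict_degree(padded))
```

Under these assumptions, a double-buffered 256x256 fp16 tile with a K depth of 32 exactly fills the 64 KB LDS, while a K depth of 64 does not fit. The second half shows why padding a row by one element is a common way to spread column accesses across banks. CK Tile is described above as handling bank conflict avoidance itself, so this is only a mental model for reading profiler output.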