mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-05 14:11:29 +00:00
* Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc * Fixes for ruff check. * Fixes for ruff check 2. * Fixes for ruff check 3. --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
128 lines
3.4 KiB
ReStructuredText
128 lines
3.4 KiB
ReStructuredText
.. meta::
|
|
:description: CK Tile Hardware-Specific Documentation
|
|
:keywords: CDNA, GPU architecture, LDS, GEMM, CK, Composable Kernel
|
|
|
|
.. _ck_tile_hardware:
|
|
|
|
********************************************************************
|
|
CK Tile Hardware Documentation
|
|
********************************************************************
|
|
|
|
This section provides in-depth coverage of hardware-specific concepts and optimizations for CK Tile on AMD GPUs.
|
|
|
|
Overview
|
|
========
|
|
|
|
Understanding the underlying hardware architecture is crucial for achieving optimal performance with CK Tile. This documentation covers:
|
|
|
|
- AMD CDNA architecture fundamentals
|
|
- Memory hierarchy and optimization techniques
|
|
- Practical examples of high-performance kernels
|
|
|
|
Documentation Structure
|
|
=======================
|
|
|
|
.. toctree::
|
|
:maxdepth: 2
|
|
:caption: Hardware Topics
|
|
|
|
gpu_basics
|
|
lds_bank_conflicts
|
|
gemm_optimization
|
|
|
|
GPU Architecture Basics
|
|
-----------------------
|
|
|
|
:ref:`ck_tile_gpu_basics` provides an introduction to AMD CDNA architecture.
|
|
|
|
LDS and Bank Conflicts
|
|
----------------------
|
|
|
|
:ref:`ck_tile_lds_bank_conflicts` explains Local Data Share (LDS) optimization.
|
|
|
|
GEMM Optimization Case Study
|
|
----------------------------
|
|
|
|
:ref:`ck_tile_gemm_optimization` demonstrates a complete optimization journey.
|
|
|
|
|
|
Key Hardware Considerations
|
|
===========================
|
|
|
|
|
|
Memory Hierarchy
|
|
----------------
|
|
|
|
1. **Global Memory**: High latency, high bandwidth
|
|
|
|
- Optimize with coalesced access patterns
|
|
- Use tile windows for automatic optimization
|
|
|
|
2. **L2/Infinity Cache**: Intermediate storage
|
|
|
|
- Benefits from spatial and temporal locality
|
|
- CK Tile's tiling naturally improves cache hit rates
|
|
|
|
3. **LDS**: Low latency, shared within CU
|
|
|
|
- 64KB per CU, organized in 32 banks
|
|
- CK Tile handles bank conflict avoidance
|
|
|
|
4. **Registers**: Lowest latency, per-thread storage
|
|
|
|
- 512 VGPRs available per wavefront
|
|
- CK Tile's compile-time optimization minimizes usage
|
|
|
|
Compute Resources
|
|
-----------------
|
|
|
|
1. **Wavefront Execution**: 64 threads in lockstep
|
|
|
|
- CK Tile ensures coalesced memory access
|
|
- Automatic warp-level synchronization
|
|
|
|
2. **Matrix Units**: Specialized MFMA instructions
|
|
|
|
- 16x16x16 operations in 16 cycles
|
|
- CK Tile can leverage these automatically
|
|
|
|
3. **Occupancy**: Balancing threads vs resources
|
|
|
|
- Register pressure affects occupancy
|
|
- CK Tile helps through efficient register use
|
|
|
|
Performance Guidelines
|
|
======================
|
|
|
|
To achieve optimal performance with CK Tile:
|
|
|
|
1. **Choose appropriate tile sizes**:
|
|
|
|
- Match hardware capabilities (e.g., 256x256 for GEMM)
|
|
- Consider LDS capacity and register pressure
|
|
|
|
2. **Align problem dimensions**:
|
|
|
|
- Match CU count when possible (304 for MI300)
|
|
- Use padding for non-aligned sizes
|
|
|
|
3. **Enable pipelining**:
|
|
|
|
- Use double buffering for latency hiding
|
|
- CK Tile supports async operations
|
|
|
|
4. **Profile and verify**:
|
|
|
|
- Use rocprof to check for bottlenecks
|
|
- Verify bank conflict avoidance
|
|
- Monitor occupancy and register usage
|
|
|
|
Next Steps
|
|
==========
|
|
|
|
- Review :ref:`ck_tile_gpu_basics` for architecture fundamentals
|
|
- Study :ref:`ck_tile_lds_bank_conflicts` for shared memory optimization
|
|
- Explore :ref:`ck_tile_gemm_optimization` for a complete optimization example
|
|
|
|
For practical implementation, refer back to the main :ref:`ck_tile_conceptual` documentation to see how these hardware concepts integrate with CK Tile's abstractions.
|