Files
composable_kernel/docs/conceptual/ck_tile/hardware/index.rst
spolifroni-amd d9d4c9c3df [composable_kernel] initial draft of the ck tile conceptual doc (#3242)
* Adding CK Tile documentation

* Updates based on feedback

* Fix tile window API description

* Fix remaining images

* add documentation about flush_cache and rotating_buffer functionality in ck_tile

* Supplement the documentation

* light edit of the ck tile conceptual doc

* Fixes for ruff check.

* Fixes for ruff check 2.

* Fixes for ruff check 3.

---------

Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
2025-12-04 11:09:21 -08:00

128 lines
3.4 KiB
ReStructuredText

.. meta::
:description: CK Tile Hardware-Specific Documentation
:keywords: CDNA, GPU architecture, LDS, GEMM, CK, Composable Kernel
.. _ck_tile_hardware:
********************************************************************
CK Tile Hardware Documentation
********************************************************************
This section provides in-depth coverage of hardware-specific concepts and optimizations for CK Tile on AMD GPUs.
Overview
========
Understanding the underlying hardware architecture is crucial for achieving optimal performance with CK Tile. This documentation covers:
- AMD CDNA architecture fundamentals
- Memory hierarchy and optimization techniques
- Practical examples of high-performance kernels
Documentation Structure
=======================
.. toctree::
:maxdepth: 2
:caption: Hardware Topics
gpu_basics
lds_bank_conflicts
gemm_optimization
GPU Architecture Basics
-----------------------
:ref:`ck_tile_gpu_basics` provides an introduction to AMD CDNA architecture.
LDS and Bank Conflicts
----------------------
:ref:`ck_tile_lds_bank_conflicts` explains Local Data Share (LDS) optimization.
GEMM Optimization Case Study
----------------------------
:ref:`ck_tile_gemm_optimization` demonstrates a complete optimization journey.
Key Hardware Considerations
===========================
Memory Hierarchy
----------------
1. **Global Memory**: High latency, high bandwidth
- Optimize with coalesced access patterns
- Use tile windows for automatic optimization
2. **L2/Infinity Cache**: Intermediate storage
- Benefits from spatial and temporal locality
- CK Tile's tiling naturally improves cache hit rates
3. **LDS**: Low latency, shared within CU
- 64KB per CU, organized in 32 banks
- CK Tile handles bank conflict avoidance
4. **Registers**: Lowest latency, per-thread storage
- 512 VGPRs available per wavefront
- CK Tile's compile-time optimization minimizes usage
Compute Resources
-----------------
1. **Wavefront Execution**: 64 threads in lockstep
- CK Tile ensures coalesced memory access
- Automatic warp-level synchronization
2. **Matrix Units**: Specialized MFMA instructions
- 16x16x16 operations in 16 cycles
- CK Tile can leverage these automatically
3. **Occupancy**: Balancing threads vs resources
- Register pressure affects occupancy
- CK Tile helps through efficient register use
Performance Guidelines
======================
To achieve optimal performance with CK Tile:
1. **Choose appropriate tile sizes**:
- Match hardware capabilities (e.g., 256x256 for GEMM)
- Consider LDS capacity and register pressure
2. **Align problem dimensions**:
- Match CU count when possible (304 for MI300)
- Use padding for non-aligned sizes
3. **Enable pipelining**:
- Use double buffering for latency hiding
- CK Tile supports async operations
4. **Profile and verify**:
- Use rocprof to check for bottlenecks
- Verify bank conflict avoidance
- Monitor occupancy and register usage
Next Steps
==========
- Review :ref:`ck_tile_gpu_basics` for architecture fundamentals
- Study :ref:`ck_tile_lds_bank_conflicts` for shared memory optimization
- Explore :ref:`ck_tile_gemm_optimization` for a complete optimization example
For practical implementation, refer back to the main :ref:`ck_tile_conceptual` documentation to see how these hardware concepts integrate with CK Tile's abstractions.