Backup commit grouping all in-progress local work so nothing is lost: - Modified CK-UA kernel + example sources (unified_attention.cpp, unified_attention_kernel.hpp) and CMake/build files. - Updated dispatcher README and ctypes_utils.py. - New unified_attention example notes: PARAMETERS.md, VARIABLES.md. - New unified_attention instances for d128 fp16/bf16 (mask/nmask, gqa6). - New 99_toy_tutorial/ collection: bank-conflict investigations (test_*.cpp, *.js, *.gdb, *.asm, *.md), tile distribution / row reduction / calling_gemm / thread_buffer tutorials. - Slide decks and supporting assets (bank_conflict_slides.qmd/.html, tile_distribution_slides.qmd, assets/, *_files/, step1_reshape_only, xor_full_steps_simple). - GDB helper script (break_on_ds_read.gdb). Not intended for upstream review; pure WIP snapshot.
CK Tile Tutorials
This directory contains step-by-step tutorials for learning the CK Tile API, progressing from fundamental concepts to production-ready optimizations.
Tutorial Overview
Tutorial 01: Tensor Fundamentals
Introduction to CK Tile's core tensor concepts.
Files: tutorial_01_tensor_fundamentals/
Tutorial 02: Tensor Adaptors
Learn how to transform tensor layouts using adaptors.
Files: tutorial_02_tensor_adaptors/
Documentation: tutorial_02_tensor_adaptors/XOR_TRANSFORM_EXPLAINED.md
Tutorial 03: Padding and Tiles
Understanding tile operations and padding strategies.
Files: tutorial_03_padding_and_tiles/
Tutorial 04: Descriptor vs Adaptor
Deep dive into the differences between descriptors and adaptors.
Files: tutorial_04_descriptor_vs_adaptor/
Documentation: tutorial_04_descriptor_vs_adaptor/DESCRIPTOR_VS_ADAPTOR.md
Tutorial 05: Basic Distributed GEMM
Introduction to distributed matrix multiplication.
Files: tutorial_05_basic_distributed_gemm/
Tutorial 06: Tile Sweeping GEMM
Optimized GEMM using tile sweeping techniques.
Files: tutorial_06_tile_sweeping_gemm/
Tutorial 07: Tile Sweeping with Y Repetition
Advanced tile sweeping with dimension repetition.
Files: tutorial_07_tile_sweeping_with_y_repetition/
Documentation: tutorial_07_tile_sweeping_with_y_repetition/Y_REPETITION_EXPLAINED.md
Tutorial 08: LDS Staging
Introduction to Local Data Share (shared memory) staging.
Files: tutorial_08_lds_staging/
Tutorial 09: Optimized LDS
Advanced LDS optimization techniques.
Files: tutorial_09_optimized_lds/
Tutorial 10: XOR LDS
First introduction to XOR swizzling for bank conflict reduction.
Files: tutorial_10_xor_lds/
Tutorial 11: Bank Conflicts and XOR Swizzling ⭐
Complete guide to understanding and eliminating LDS bank conflicts on AMD GPUs.
This tutorial provides comprehensive coverage of bank conflicts, from theory to implementation.
Files:
tutorial_11_xor_test/xor_test_plain_only.cpp- Baseline transpose (no XOR)tutorial_11_xor_test/xor_test_production_transpose.cpp- XOR optimized transpose
Documentation:
- BANK_CONFLICT_TUTORIAL.md - Comprehensive guide (START HERE!)
tutorial_11_xor_test/BANK_CONFLICT_SUMMARY.md- Quick referencetutorial_11_xor_test/XOR_TRANSPOSE_SUMMARY.md- Implementation details
Scripts:
scripts/profile_bank_conflicts.sh- Automated profilingscripts/analyze_bank_conflicts.py- Results analysis
What You'll Learn:
- LDS bank conflict architecture on AMD MI300 GPUs
- Constraint satisfaction problem (CSP) framing of optimization
- Measuring conflicts with rocprofv3 profiling tools
- XOR swizzling technique in CK Tile API
- Trade-offs between different optimization approaches
- Mathematical limits of bank conflict elimination
Key Results:
Plain LDS: 1,244% conflict rate (12.4 conflicts per instruction)
XOR LDS: 533% conflict rate (5.3 conflicts per instruction)
Improvement: 57% reduction in bank conflicts
Theoretical minimum: 2-way conflicts (64 threads / 32 banks)
Gap to optimal: 2.5× (good practical result!)
Quick Start:
1. Build the tutorials:
cd relbuild
cmake --build . --target aa_tutorial_11_plain_transpose -j$(nproc)
cmake --build . --target aa_tutorial_11_production_transpose -j$(nproc)
2. Run baseline (plain transpose):
./bin/aa_tutorial_11_plain_transpose
3. Run optimized (XOR transpose):
./bin/aa_tutorial_11_production_transpose
4. Profile and analyze (requires rocprofv3):
bash ../example/ck_tile/99_toy_tutorial/scripts/profile_bank_conflicts.sh
This will:
- Build both versions
- Profile with AMD performance counters
- Generate comprehensive analysis report
- Show 57% conflict reduction
Prerequisites:
- Basic GPU programming knowledge (threads, blocks, wavefronts)
- Understanding of shared memory concepts
- Tutorial 08 (LDS staging) recommended
- AMD GPU with ROCm for profiling
Next Steps:
After completing this tutorial, you can:
- Apply XOR swizzling to your own kernels
- Experiment with different tile sizes (32×32 for near-zero conflicts)
- Explore advanced optimizations (double buffering, padding)
- Read the complete BANK_CONFLICT_TUTORIAL.md for deep dive
Tutorial 12: XOR Correct
Verification and testing of XOR implementations.
Files: tutorial_12_xor_correct/
Tutorial 13: Production XOR
Production-ready XOR swizzling implementation.
Files: tutorial_13_production_xor/
Learning Path
Beginner (Start Here)
- Tutorial 01: Tensor Fundamentals
- Tutorial 02: Tensor Adaptors
- Tutorial 03: Padding and Tiles
- Tutorial 04: Descriptor vs Adaptor
Intermediate (GEMM Basics)
- Tutorial 05: Basic Distributed GEMM
- Tutorial 06: Tile Sweeping GEMM
- Tutorial 07: Tile Sweeping with Y Repetition
Advanced (Performance Optimization)
- Tutorial 08: LDS Staging
- Tutorial 09: Optimized LDS
- Tutorial 10: XOR LDS
- Tutorial 11: Bank Conflicts (Comprehensive) ⭐
- Tutorial 12: XOR Correct
- Tutorial 13: Production XOR
Building Tutorials
All tutorials can be built using CMake from the repository root:
# Create build directory
mkdir -p relbuild && cd relbuild
# Configure with CMake
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_COMPILER=hipcc \
-DGPU_TARGETS="gfx942" \
..
# Build specific tutorial (example)
cmake --build . --target aa_tutorial_11_plain_transpose -j$(nproc)
# Or build all tutorials
cmake --build . -j$(nproc)
Profiling Tutorials
For performance analysis of Tutorial 11 (bank conflicts):
# Use the automated profiling script
bash example/ck_tile/99_toy_tutorial/scripts/profile_bank_conflicts.sh relbuild /tmp/my_analysis
# Or manually profile a specific tutorial
rocprofv3 --pmc SQ_LDS_BANK_CONFLICT,SQ_INSTS_LDS \
-d /tmp/profile_output \
-- ./bin/aa_tutorial_11_plain_transpose
Documentation
Each tutorial may include:
- Source code with detailed comments
- README or markdown docs explaining concepts
- CMakeLists.txt for building
- Analysis scripts for performance evaluation
Comprehensive guides:
- BANK_CONFLICT_TUTORIAL.md - Complete bank conflict guide
Contributing
When adding new tutorials:
- Follow the naming convention:
tutorial_XX_descriptive_name/ - Include clear comments in source code
- Add documentation for complex concepts
- Update this README with tutorial summary
- Ensure tutorials build successfully
Getting Help
- See individual tutorial README files for specific guidance
- Refer to CK Tile API documentation
- Check the main repository README for general setup
- Open issues on GitHub for bugs or questions
Happy Learning!
For questions or feedback about these tutorials, please refer to the CK Tile documentation or open an issue.