Files
composable_kernel/example/ck_tile/99_toy_tutorial
root 393ebc1a50 WIP backup: snapshot all local notes, slides, tutorials, and kernel work
Backup commit grouping all in-progress local work so nothing is lost:

- Modified CK-UA kernel + example sources (unified_attention.cpp,
  unified_attention_kernel.hpp) and CMake/build files.
- Updated dispatcher README and ctypes_utils.py.
- New unified_attention example notes: PARAMETERS.md, VARIABLES.md.
- New unified_attention instances for d128 fp16/bf16 (mask/nmask, gqa6).
- New 99_toy_tutorial/ collection: bank-conflict investigations
  (test_*.cpp, *.js, *.gdb, *.asm, *.md), tile distribution / row
  reduction / calling_gemm / thread_buffer tutorials.
- Slide decks and supporting assets (bank_conflict_slides.qmd/.html,
  tile_distribution_slides.qmd, assets/, *_files/, step1_reshape_only,
  xor_full_steps_simple).
- GDB helper script (break_on_ds_read.gdb).

Not intended for upstream review; pure WIP snapshot.
2026-05-11 20:34:52 +00:00
..

CK Tile Tutorials

This directory contains step-by-step tutorials for learning the CK Tile API, progressing from fundamental concepts to production-ready optimizations.

Tutorial Overview

Tutorial 01: Tensor Fundamentals

Introduction to CK Tile's core tensor concepts.

Files: tutorial_01_tensor_fundamentals/

Tutorial 02: Tensor Adaptors

Learn how to transform tensor layouts using adaptors.

Files: tutorial_02_tensor_adaptors/ Documentation: tutorial_02_tensor_adaptors/XOR_TRANSFORM_EXPLAINED.md

Tutorial 03: Padding and Tiles

Understanding tile operations and padding strategies.

Files: tutorial_03_padding_and_tiles/

Tutorial 04: Descriptor vs Adaptor

Deep dive into the differences between descriptors and adaptors.

Files: tutorial_04_descriptor_vs_adaptor/ Documentation: tutorial_04_descriptor_vs_adaptor/DESCRIPTOR_VS_ADAPTOR.md

Tutorial 05: Basic Distributed GEMM

Introduction to distributed matrix multiplication.

Files: tutorial_05_basic_distributed_gemm/

Tutorial 06: Tile Sweeping GEMM

Optimized GEMM using tile sweeping techniques.

Files: tutorial_06_tile_sweeping_gemm/

Tutorial 07: Tile Sweeping with Y Repetition

Advanced tile sweeping with dimension repetition.

Files: tutorial_07_tile_sweeping_with_y_repetition/ Documentation: tutorial_07_tile_sweeping_with_y_repetition/Y_REPETITION_EXPLAINED.md

Tutorial 08: LDS Staging

Introduction to Local Data Share (shared memory) staging.

Files: tutorial_08_lds_staging/

Tutorial 09: Optimized LDS

Advanced LDS optimization techniques.

Files: tutorial_09_optimized_lds/

Tutorial 10: XOR LDS

First introduction to XOR swizzling for bank conflict reduction.

Files: tutorial_10_xor_lds/

Tutorial 11: Bank Conflicts and XOR Swizzling

Complete guide to understanding and eliminating LDS bank conflicts on AMD GPUs.

This tutorial provides comprehensive coverage of bank conflicts, from theory to implementation.

Files:

  • tutorial_11_xor_test/xor_test_plain_only.cpp - Baseline transpose (no XOR)
  • tutorial_11_xor_test/xor_test_production_transpose.cpp - XOR optimized transpose

Documentation:

  • BANK_CONFLICT_TUTORIAL.md - Comprehensive guide (START HERE!)
  • tutorial_11_xor_test/BANK_CONFLICT_SUMMARY.md - Quick reference
  • tutorial_11_xor_test/XOR_TRANSPOSE_SUMMARY.md - Implementation details

Scripts:

  • scripts/profile_bank_conflicts.sh - Automated profiling
  • scripts/analyze_bank_conflicts.py - Results analysis

What You'll Learn:

  • LDS bank conflict architecture on AMD MI300 GPUs
  • Constraint satisfaction problem (CSP) framing of optimization
  • Measuring conflicts with rocprofv3 profiling tools
  • XOR swizzling technique in CK Tile API
  • Trade-offs between different optimization approaches
  • Mathematical limits of bank conflict elimination

Key Results:

Plain LDS:  1,244% conflict rate (12.4 conflicts per instruction)
XOR LDS:      533% conflict rate (5.3 conflicts per instruction)
Improvement:  57% reduction in bank conflicts

Theoretical minimum: 2-way conflicts (64 threads / 32 banks)
Gap to optimal: 2.5× (good practical result!)

Quick Start:

1. Build the tutorials:

cd relbuild
cmake --build . --target aa_tutorial_11_plain_transpose -j$(nproc)
cmake --build . --target aa_tutorial_11_production_transpose -j$(nproc)

2. Run baseline (plain transpose):

./bin/aa_tutorial_11_plain_transpose

3. Run optimized (XOR transpose):

./bin/aa_tutorial_11_production_transpose

4. Profile and analyze (requires rocprofv3):

bash ../example/ck_tile/99_toy_tutorial/scripts/profile_bank_conflicts.sh

This will:

  • Build both versions
  • Profile with AMD performance counters
  • Generate comprehensive analysis report
  • Show 57% conflict reduction

Prerequisites:

  • Basic GPU programming knowledge (threads, blocks, wavefronts)
  • Understanding of shared memory concepts
  • Tutorial 08 (LDS staging) recommended
  • AMD GPU with ROCm for profiling

Next Steps:

After completing this tutorial, you can:

  • Apply XOR swizzling to your own kernels
  • Experiment with different tile sizes (32×32 for near-zero conflicts)
  • Explore advanced optimizations (double buffering, padding)
  • Read the complete BANK_CONFLICT_TUTORIAL.md for deep dive

Tutorial 12: XOR Correct

Verification and testing of XOR implementations.

Files: tutorial_12_xor_correct/

Tutorial 13: Production XOR

Production-ready XOR swizzling implementation.

Files: tutorial_13_production_xor/


Learning Path

Beginner (Start Here)

  1. Tutorial 01: Tensor Fundamentals
  2. Tutorial 02: Tensor Adaptors
  3. Tutorial 03: Padding and Tiles
  4. Tutorial 04: Descriptor vs Adaptor

Intermediate (GEMM Basics)

  1. Tutorial 05: Basic Distributed GEMM
  2. Tutorial 06: Tile Sweeping GEMM
  3. Tutorial 07: Tile Sweeping with Y Repetition

Advanced (Performance Optimization)

  1. Tutorial 08: LDS Staging
  2. Tutorial 09: Optimized LDS
  3. Tutorial 10: XOR LDS
  4. Tutorial 11: Bank Conflicts (Comprehensive)
  5. Tutorial 12: XOR Correct
  6. Tutorial 13: Production XOR

Building Tutorials

All tutorials can be built using CMake from the repository root:

# Create build directory
mkdir -p relbuild && cd relbuild

# Configure with CMake
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_COMPILER=hipcc \
      -DGPU_TARGETS="gfx942" \
      ..

# Build specific tutorial (example)
cmake --build . --target aa_tutorial_11_plain_transpose -j$(nproc)

# Or build all tutorials
cmake --build . -j$(nproc)

Profiling Tutorials

For performance analysis of Tutorial 11 (bank conflicts):

# Use the automated profiling script
bash example/ck_tile/99_toy_tutorial/scripts/profile_bank_conflicts.sh relbuild /tmp/my_analysis

# Or manually profile a specific tutorial
rocprofv3 --pmc SQ_LDS_BANK_CONFLICT,SQ_INSTS_LDS \
          -d /tmp/profile_output \
          -- ./bin/aa_tutorial_11_plain_transpose

Documentation

Each tutorial may include:

  • Source code with detailed comments
  • README or markdown docs explaining concepts
  • CMakeLists.txt for building
  • Analysis scripts for performance evaluation

Comprehensive guides:

Contributing

When adding new tutorials:

  1. Follow the naming convention: tutorial_XX_descriptive_name/
  2. Include clear comments in source code
  3. Add documentation for complex concepts
  4. Update this README with tutorial summary
  5. Ensure tutorials build successfully

Getting Help

  • See individual tutorial README files for specific guidance
  • Refer to CK Tile API documentation
  • Check the main repository README for general setup
  • Open issues on GitHub for bugs or questions

Happy Learning!

For questions or feedback about these tutorials, please refer to the CK Tile documentation or open an issue.