WIP backup: snapshot all local notes, slides, tutorials, and kernel work

Backup commit grouping all in-progress local work so nothing is lost:

- Modified CK-UA kernel + example sources (unified_attention.cpp, unified_attention_kernel.hpp) and CMake/build files.
- Updated dispatcher README and ctypes_utils.py.
- New unified_attention example notes: PARAMETERS.md, VARIABLES.md.
- New unified_attention instances for d128 fp16/bf16 (mask/nmask, gqa6).
- New 99_toy_tutorial/ collection: bank-conflict investigations (test_*.cpp, *.js, *.gdb, *.asm, *.md), tile distribution / row reduction / calling_gemm / thread_buffer tutorials.
- Slide decks and supporting assets (bank_conflict_slides.qmd/.html, tile_distribution_slides.qmd, assets/, *_files/, step1_reshape_only, xor_full_steps_simple).
- GDB helper script (break_on_ds_read.gdb).

Not intended for upstream review; pure WIP snapshot.
2026-05-19 12:30:16 +00:00 · 2026-05-11 20:34:52 +00:00
parent 3f076a6fc1
commit 393ebc1a50
664 changed files with 257117 additions and 69 deletions
--- a/include/ck_tile/ops/unified_attention/kernel/unified_attention_kernel.hpp
+++ b/include/ck_tile/ops/unified_attention/kernel/unified_attention_kernel.hpp
@@ -307,7 +307,7 @@ struct UnifiedAttentionKernel
        const index_t context_len = amd_wave_read_first_lane(seq_len - cur_batch_query_len);

        index_t _max_seq_prefix_len = amd_wave_read_first_lane(
-            (context_len + q_block_local_idx * kBlockQ + (kBlockM - 1) + 1));
+            (context_len + q_block_local_idx * kBlockQ + (kBlockQ - 1) + 1)); // this should be kBlockQ instead of kBlockM

        if(seq_len < _max_seq_prefix_len)
        {