WIP backup: snapshot all local notes, slides, tutorials, and kernel work

Backup commit grouping all in-progress local work so nothing is lost:

- Modified CK-UA kernel + example sources (unified_attention.cpp,
  unified_attention_kernel.hpp) and CMake/build files.
- Updated dispatcher README and ctypes_utils.py.
- New unified_attention example notes: PARAMETERS.md, VARIABLES.md.
- New unified_attention instances for d128 fp16/bf16 (mask/nmask, gqa6).
- New 99_toy_tutorial/ collection: bank-conflict investigations
  (test_*.cpp, *.js, *.gdb, *.asm, *.md), tile distribution / row
  reduction / calling_gemm / thread_buffer tutorials.
- Slide decks and supporting assets (bank_conflict_slides.qmd/.html,
  tile_distribution_slides.qmd, assets/, *_files/, step1_reshape_only,
  xor_full_steps_simple).
- GDB helper script (break_on_ds_read.gdb).

Not intended for upstream review; pure WIP snapshot.
root
2026-05-11 20:34:52 +00:00
parent 3f076a6fc1
commit 393ebc1a50
664 changed files with 257117 additions and 69 deletions
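
The one kernel change visible in this commit swaps `kBlockM` for `kBlockQ` in the `_max_seq_prefix_len` bound. The sketch below only reproduces that arithmetic so the size of the difference is easy to see. The helper names, the `index_t` alias, and the tile sizes (16 and 64) are assumptions for illustration and are not taken from the kernel sources.

```cpp
#include <cstdint>

using index_t = std::int32_t; // assumption: stand-in for the kernel's index type

// Hypothetical helper mirroring the corrected expression in the hunk:
//   context_len + q_block_local_idx * kBlockQ + (kBlockQ - 1) + 1
constexpr index_t max_seq_prefix_len(index_t context_len,
                                     index_t q_block_local_idx,
                                     index_t block_q)
{
    return context_len + q_block_local_idx * block_q + (block_q - 1) + 1;
}

// Hypothetical helper mirroring the pre-fix expression, where the
// last term was derived from kBlockM instead of kBlockQ.
constexpr index_t max_seq_prefix_len_old(index_t context_len,
                                         index_t q_block_local_idx,
                                         index_t block_q,
                                         index_t block_m)
{
    return context_len + q_block_local_idx * block_q + (block_m - 1) + 1;
}

// Illustrative values only: context_len = 100, q_block_local_idx = 1,
// kBlockQ = 16, kBlockM = 64.
static_assert(max_seq_prefix_len(100, 1, 16) == 132, "corrected bound");
static_assert(max_seq_prefix_len_old(100, 1, 16, 64) == 180, "pre-fix bound");
```

With these assumed sizes the pre-fix expression overshoots by `kBlockM - kBlockQ` (48 here). The `if(seq_len < _max_seq_prefix_len)` check that follows in the hunk appears to cap the value against `seq_len`, but its body is not shown, so how the overshoot affected the kernel in practice is not established here.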

@@ -307,7 +307,7 @@ struct UnifiedAttentionKernel
const index_t context_len = amd_wave_read_first_lane(seq_len - cur_batch_query_len);
index_t _max_seq_prefix_len = amd_wave_read_first_lane(
-(context_len + q_block_local_idx * kBlockQ + (kBlockM - 1) + 1));
+(context_len + q_block_local_idx * kBlockQ + (kBlockQ - 1) + 1)); // this should be kBlockQ instead of kBlockM
if(seq_len < _max_seq_prefix_len)
{