Attention mask tweaks for better long context performance (#825)

* Parallelize mask

We see non-negligible prompt processing (PP) gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS at contexts >= 32k tokens is gone.
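
A minimal sketch of the idea, not the actual PR code: the KQ mask buffer is filled by several threads instead of one, each taking an interleaved subset of the query rows. The function name, the causal-mask logic, and the threading scheme below are illustrative assumptions.

```cpp
// Illustrative only: parallel fill of a causal KQ mask of shape [n_kv, n_tokens],
// assuming the batch starts at KV position 0.
#include <cstdint>
#include <cmath>
#include <thread>
#include <vector>

static void fill_kq_mask(float * mask, int64_t n_kv, int64_t n_tokens, int n_threads) {
    auto work = [&](int ith) {
        // each thread handles an interleaved subset of the query rows
        for (int64_t i = ith; i < n_tokens; i += n_threads) {
            float * row = mask + i*n_kv;
            for (int64_t j = 0; j < n_kv; ++j) {
                // causal mask: query token i may attend to KV positions 0..i
                row[j] = j <= i ? 0.0f : -INFINITY;
            }
        }
    };
    std::vector<std::thread> workers;
    for (int ith = 1; ith < n_threads; ++ith) workers.emplace_back(work, ith);
    work(0); // the calling thread does its share too
    for (auto & t : workers) t.join();
}

int main() {
    const int64_t n_kv = 4096, n_tokens = 512;
    std::vector<float> mask(n_kv*n_tokens);
    fill_kq_mask(mask.data(), n_kv, n_tokens, 8);
    return 0;
}
```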

* With FA (flash attention) on, create the mask as f16 directly
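
A hedged sketch of what this could look like with the public ggml API (the PR's actual code may differ); the helper name is hypothetical, and `ctx`, `flash_attn`, `n_kv`, and `n_tokens` are assumed to come from the surrounding graph-building code:

```cpp
#include "ggml.h"

// Allocate the KQ mask once, in the type flash attention consumes,
// instead of building an F32 mask and casting it to F16 afterwards.
static ggml_tensor * make_kq_mask(ggml_context * ctx, bool flash_attn,
                                  int64_t n_kv, int64_t n_tokens) {
    const ggml_type type = flash_attn ? GGML_TYPE_F16 : GGML_TYPE_F32;
    // the number of rows is padded to a multiple of GGML_KQ_MASK_PAD
    return ggml_new_tensor_2d(ctx, type, n_kv, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD));
}
```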

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while token generation (TG) performance
for long contexts improves by 2-4%.
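
For a concrete sense of why this helps token generation: in llama.cpp-style code the mask's row count is rounded up to a multiple of GGML_KQ_MASK_PAD, and a TG step decodes a single token, so the pad value directly sets how many mask rows get built per generated token. A standalone illustration (GGML_PAD copied from ggml.h so the snippet compiles on its own):

```cpp
#include <cstdint>
#include <cstdio>

#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    const int64_t n_tokens = 1; // typical token-generation batch
    std::printf("mask rows with PAD=64: %lld\n", (long long) GGML_PAD(n_tokens, 64)); // 64
    std::printf("mask rows with PAD=16: %lld\n", (long long) GGML_PAD(n_tokens, 16)); // 16
    return 0;
}
```

With a 32k-token KV cache that is 64x32k vs 16x32k mask entries prepared per generated token, which is consistent with the observed 2-4% TG gain.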

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Kawrakow authored 2025-10-13 14:01:11 +03:00 (committed by GitHub)
parent 21a0bfb1c0
commit 4e24d48e63
3 changed files with 277 additions and 25 deletions

@@ -2235,7 +2235,7 @@ extern "C" {
int min_entries,
float thresh);
-#define GGML_KQ_MASK_PAD 64
+#define GGML_KQ_MASK_PAD 16
// q: [n_embd, n_batch, n_head, 1]
// k: [n_embd, n_kv, n_head_kv, 1]