Attention mask tweaks for better long context performance (#825)

* Parallelize mask

We see non-negligible prompt processing (PP) gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS at contexts >= 32k tokens is gone.
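
A minimal sketch of the idea, not the actual PR code: the KQ mask buffer is filled by several threads instead of one, each taking an interleaved subset of the query rows. The function name, the causal-mask logic, and the threading scheme below are illustrative assumptions.

```cpp
// Illustrative only: parallel fill of a causal KQ mask of shape [n_kv, n_tokens],
// assuming the batch starts at KV position 0.
#include <cstdint>
#include <cmath>
#include <thread>
#include <vector>

static void fill_kq_mask(float * mask, int64_t n_kv, int64_t n_tokens, int n_threads) {
    auto work = [&](int ith) {
        // each thread handles an interleaved subset of the query rows
        for (int64_t i = ith; i < n_tokens; i += n_threads) {
            float * row = mask + i*n_kv;
            for (int64_t j = 0; j < n_kv; ++j) {
                // causal mask: query token i may attend to KV positions 0..i
                row[j] = j <= i ? 0.0f : -INFINITY;
            }
        }
    };
    std::vector<std::thread> workers;
    for (int ith = 1; ith < n_threads; ++ith) workers.emplace_back(work, ith);
    work(0); // the calling thread does its share too
    for (auto & t : workers) t.join();
}

int main() {
    const int64_t n_kv = 4096, n_tokens = 512;
    std::vector<float> mask(n_kv*n_tokens);
    fill_kq_mask(mask.data(), n_kv, n_tokens, 8);
    return 0;
}
```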

* With FA (flash attention) on, create the mask as f16 directly
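
A hedged sketch of what this could look like with the public ggml API (the PR's actual code may differ); the helper name is hypothetical, and `ctx`, `flash_attn`, `n_kv`, and `n_tokens` are assumed to come from the surrounding graph-building code:

```cpp
#include "ggml.h"

// Allocate the KQ mask once, in the type flash attention consumes,
// instead of building an F32 mask and casting it to F16 afterwards.
static ggml_tensor * make_kq_mask(ggml_context * ctx, bool flash_attn,
                                  int64_t n_kv, int64_t n_tokens) {
    const ggml_type type = flash_attn ? GGML_TYPE_F16 : GGML_TYPE_F32;
    // the number of rows is padded to a multiple of GGML_KQ_MASK_PAD
    return ggml_new_tensor_2d(ctx, type, n_kv, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD));
}
```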

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while token generation (TG) performance
for long contexts improves by 2-4%.
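
For a concrete sense of why this helps token generation: in llama.cpp-style code the mask's row count is rounded up to a multiple of GGML_KQ_MASK_PAD, and a TG step decodes a single token, so the pad value directly sets how many mask rows get built per generated token. A standalone illustration (GGML_PAD copied from ggml.h so the snippet compiles on its own):

```cpp
#include <cstdint>
#include <cstdio>

#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    const int64_t n_tokens = 1; // typical token-generation batch
    std::printf("mask rows with PAD=64: %lld\n", (long long) GGML_PAD(n_tokens, 64)); // 64
    std::printf("mask rows with PAD=16: %lld\n", (long long) GGML_PAD(n_tokens, 16)); // 16
    return 0;
}
```

With a 32k-token KV cache that is 64x32k vs 16x32k mask entries prepared per generated token, which is consistent with the observed 2-4% TG gain.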

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Kawrakow authored 2025-10-13 14:01:11 +03:00 (committed by GitHub)
parent 21a0bfb1c0
commit 4e24d48e63
3 changed files with 277 additions and 25 deletions

@@ -2235,7 +2235,7 @@ extern "C" {
int min_entries,
float thresh);
-#define GGML_KQ_MASK_PAD 64
+#define GGML_KQ_MASK_PAD 16
// q: [n_embd, n_batch, n_head, 1]
// k: [n_embd, n_kv, n_head_kv, 1]