Files
ik_llama.cpp/ggml
Kawrakow 4e24d48e63 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed with GPT-OSS at contexts >= 32k tokens is gone.
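
For context, the mask is filled on the CPU row-by-row before the graph runs, so the change amounts to splitting that row loop across threads. A minimal sketch under that assumption; the names (`fill_kq_mask`, `pos`, `n_threads`) are illustrative, not the fork's actual symbols:

```cpp
// Illustrative only: fill a causal KQ mask in parallel by splitting the
// query rows across threads (interleaved row assignment).
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

static void fill_kq_mask(float * mask, int64_t n_tokens, int64_t n_kv,
                         const int32_t * pos, int n_threads) {
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=]() {
            for (int64_t i = t; i < n_tokens; i += n_threads) {
                float * row = mask + i*n_kv;
                for (int64_t j = 0; j < n_kv; ++j) {
                    // causal mask: keys past the query position are masked out
                    row[j] = (j <= pos[i]) ? 0.0f : -INFINITY;
                }
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```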

* With FA on, create the mask as f16 directly
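
The flash-attention path consumes the mask as F16, so writing the half-precision values directly skips filling an F32 buffer and converting it in a second pass. A rough sketch, assuming ggml's public `ggml_fp32_to_fp16()` helper; the surrounding names are hypothetical:

```cpp
// Illustrative only: with FA the mask tensor is GGML_TYPE_F16, so the
// 0 / -inf entries can be written as half precision up front.
#include <cmath>
#include <cstdint>
#include "ggml.h"

static void fill_kq_mask_f16(ggml_fp16_t * mask, int64_t n_tokens, int64_t n_kv,
                             const int32_t * pos) {
    const ggml_fp16_t zero    = ggml_fp32_to_fp16(0.0f);
    const ggml_fp16_t neg_inf = ggml_fp32_to_fp16(-INFINITY);
    for (int64_t i = 0; i < n_tokens; ++i) {
        for (int64_t j = 0; j < n_kv; ++j) {
            mask[i*n_kv + j] = (j <= pos[i]) ? zero : neg_inf;
        }
    }
}
```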

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, and TG performance
for long contexts improves by 2-4%.
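
The padding rounds the mask's KV dimension up so that kernels can assume an aligned length; dropping it from 64 to 16 just wastes fewer masked-out columns per row. A small illustration using ggml's `GGML_PAD` macro (the constant name and example length are made up for the sketch):

```cpp
#include <cstdint>
#include "ggml.h"  // for GGML_PAD

// Illustrative only: pad the KV length of the mask to the new, smaller multiple.
enum { KQ_MASK_PAD = 16 };               // previously 64
const int64_t n_kv        = 33000;       // example current KV cache length
const int64_t n_kv_padded = GGML_PAD(n_kv, KQ_MASK_PAD);  // 33008 instead of 33024
```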

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00