WIP KQ binary mask: make it a parameter, turn on via command line

It is a pain to implement binary mask to 32-bit value conversion on NEON and AVX2, so I decided to make the binary mask optional. There is also a commented-out (and not working) attempt for NEON in this commit.
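
For reference, the conversion in question can be sketched in scalar C as follows. This is only an illustration, not the code in this commit: the helper name `expand_binary_mask_f32` is hypothetical, and the bit layout (bit j set means position j is visible, least significant bit first within each 32-bit word) is an assumption. Each bit of the packed mask becomes one 32-bit float, 0.0f for a visible position and -INFINITY for a masked one.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of this commit): expand a packed binary
// KQ mask into one 32-bit float per position.
// Assumption: bit j set -> position j is visible (0.0f),
//             bit j clear -> position j is masked (-INFINITY).
static void expand_binary_mask_f32(const uint32_t * bits, size_t n, float * out) {
    for (size_t j = 0; j < n; ++j) {
        const uint32_t word    = bits[j / 32];            // word holding bit j
        const uint32_t visible = (word >> (j % 32)) & 1u; // extract bit j
        out[j] = visible ? 0.0f : -INFINITY;
    }
}
```

The scalar loop is trivial; the pain is in doing the same per-bit expansion across a whole SIMD register, since it needs a broadcast, a per-lane bit test, and a select.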
2026-03-06 12:00:29 +00:00 · 2024-08-28 15:01:02 +02:00
parent fe825ecbe4
commit 05f95229a7
5 changed files with 72 additions and 2 deletions
--- a/include/llama.h
+++ b/include/llama.h
@@ -340,6 +340,7 @@ extern "C" {
        bool embeddings;  // if true, extract embeddings (together with logits)
        bool offload_kqv; // whether to offload the KQV ops (including the KV cache) to GPU
        bool flash_attn;  // whether to use flash attention [EXPERIMENTAL]
+        bool binary_kq;   // whether to use binary KQ mask [EXPERIMENTAL]

        // Abort callback
        // if it returns true, execution of llama_decode() will be aborted