WIP KQ binary mask: make it a parameter, turn on via command line

It is a pain to implement the conversion from a binary mask to 32-bit
values on NEON and AVX2, so I decided to make the binary mask optional.
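To illustrate the conversion the message refers to, here is a minimal scalar sketch in C. It assumes a hypothetical layout in which bit (j % 32) of word j / 32 marks key j as visible. It also assumes that the expanded 32-bit values are the usual additive KQ mask: 0.0f for visible keys and -INFINITY for masked ones. The function name `expand_binary_kq_row` is illustrative and is not taken from the commit.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Expand one row of a packed binary KQ mask into 32-bit float mask values
 * that are added to the KQ scores before softmax.
 * Hypothetical layout: key j is visible when bit (j % 32) of bits[j / 32]
 * is set. The layout actually used by the commit is not shown here. */
static void expand_binary_kq_row(const uint32_t *bits, size_t n_kv, float *out) {
    assert(bits != NULL && out != NULL);
    for (size_t j = 0; j < n_kv; ++j) {
        uint32_t word = bits[j / 32];
        /* visible -> 0.0f leaves the score unchanged;
         * masked  -> -INFINITY makes exp() vanish, so the softmax weight is 0 */
        out[j] = ((word >> (j % 32)) & 1u) ? 0.0f : -INFINITY;
    }
}
```

For instance, a causal row for query position 3 over 40 keys would be packed as `{0x0000000F, 0x00000000}` and expand to four zeros followed by 36 `-INFINITY` values. The per-element loop is trivial in scalar code. Producing the same 32-bit values with NEON or AVX2 vector instructions is the part the message describes as painful.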

There is also a commented-out (and not working) attempt for NEON
in this commit.
Iwan Kawrakow
2024-08-28 15:01:02 +02:00
parent fe825ecbe4
commit 05f95229a7
5 changed files with 72 additions and 2 deletions


@@ -340,6 +340,7 @@ extern "C" {
bool embeddings; // if true, extract embeddings (together with logits)
bool offload_kqv; // whether to offload the KQV ops (including the KV cache) to GPU
bool flash_attn; // whether to use flash attention [EXPERIMENTAL]
+bool binary_kq; // whether to use binary KQ mask [EXPERIMENTAL]
// Abort callback
// if it returns true, execution of llama_decode() will be aborted
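The hunk only adds the `binary_kq` field. Below is a minimal sketch of how a flag like this might be wired to the command line and what the option saves. `struct demo_ctx_params`, the `--binary-kq` option, and both helper functions are hypothetical stand-ins, because the actual command-line option is not visible in this hunk. The size comparison assumes that the non-binary path stores one 32-bit float per key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the field added to llama_context_params;
 * the real struct has many more members. */
struct demo_ctx_params {
    bool binary_kq; /* off by default, matching "make the binary mask optional" */
};

/* Hypothetical option name: the real flag added by the commit is not shown. */
static struct demo_ctx_params parse_demo_args(int argc, char **argv) {
    struct demo_ctx_params p = { .binary_kq = false };
    for (int i = 1; i < argc; ++i) {
        if (strcmp(argv[i], "--binary-kq") == 0) {
            p.binary_kq = true;
        }
    }
    return p;
}

/* Bytes for one mask row of n_kv keys: one bit per key packed into 32-bit
 * words when binary_kq is set, otherwise one 32-bit float per key. */
static size_t kq_mask_row_bytes(const struct demo_ctx_params *p, size_t n_kv) {
    assert(p != NULL);
    if (p->binary_kq) {
        return ((n_kv + 31) / 32) * sizeof(uint32_t);
    }
    return n_kv * sizeof(float);
}
```

With 40 keys, the sketch needs 8 bytes per row in binary mode versus 160 bytes with float values. Keeping the flag off by default means that backends without a vectorized expansion can continue to use the float mask.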