🔀 #310 - Metal: FA and FlashMLA

Author ikawrakow
State Closed
Created 2025-04-03
Updated 2025-04-03

Description

Performance is not great, but it works with standard attention and all 3 MLA options.

"Works" as:

  • f16 KV cache works for all combinations of -fa and -mla
  • I have only allowed Q8_0 for the quantized cache
  • The quantized cache works only with standard attention (-mla 0) and without FA
  • With FA, the quantized cache kind of works, but we get messages such as ggml_metal_get_buffer: error: tensor 'v-26' buffer is nil. Not sure why. PPL is also slightly higher than without FA
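
As a rough sketch of how these combinations might be exercised (assuming the usual llama.cpp-style flags -fa, -ctk/-ctv for the KV cache type, and this repo's -mla option; the binary name, model path, and context size are placeholders):

```sh
# f16 KV cache: expected to work for any combination of -fa and -mla
./bin/llama-perplexity -m model.gguf -f wiki.test.raw -c 512 -fa -mla 3

# Q8_0 quantized cache without FA: only standard attention (-mla 0) is expected to work
./bin/llama-perplexity -m model.gguf -f wiki.test.raw -c 512 -mla 0 -ctk q8_0 -ctv q8_0

# Q8_0 quantized cache with FA: runs, but may print
# "ggml_metal_get_buffer: error: tensor 'v-26' buffer is nil" and give slightly higher PPL
./bin/llama-perplexity -m model.gguf -f wiki.test.raw -c 512 -fa -ctk q8_0 -ctv q8_0
```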