Mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-03-03 18:40:14 +00:00)
🔀 #310 - Metal: FA and FlashMLA
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-04-03 |
| Updated | 2025-04-03 |
Description
Performance is not great, but it works with standard attention and all 3 MLA options.

"Works" as in:
- `f16` KV cache works for all combinations of `fa` and `mla`
- I have allowed only `Q8_0` quantized cache
- Quantized cache only works with standard attention (`-mla 0`) without FA
- With FA, quantized cache kind of works, but we get messages such as `ggml_metal_get_buffer: error: tensor 'v-26' buffer is nil`. Not sure why. PPL is slightly higher than without FA.