mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-03-02 18:10:02 +00:00
* FlashMLA(CUDA): WIP to allow q8_0 quantized cache * WIP * FlashMLA(CUDA) - allow q8_0 for KV cache This works, and PP is not bad, but TG is still quite a bit slower. * FlashMLA(CUDA) - allow q8_0 for KV cache This is better. ~9% slower than f16 cache for short contexts, nearly on par at 16k tokens. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>