🔀 #265 - Allow q8_0 cache on the CPU for FlashMLA-2

Author ikawrakow
State Closed
Created 2025-03-18
Updated 2025-03-18

Description

Somehow I was under the impression that Q8_0 KV cache works for CPU-only inference with FlashMLA-2. It does work for prompt processing, but not for TG, as the two take different code paths. Clearly there are too many options, as I'm getting confused myself. Anyhow, this PR adds the missing Q8_0 -> Q8_0 contiguous transpose operation, so now we can use Q8_0 KV cache with FlashMLA-2 on the CPU as well.
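As a rough illustration of what a Q8_0 -> Q8_0 contiguous transpose involves (this is not the actual ggml/ik_llama.cpp code added by the PR), here is a minimal C++ sketch that transposes a Q8_0-quantized matrix by dequantizing to float, transposing, and re-quantizing each output row into contiguous blocks. The block layout assumed here (32 int8 quants plus one scale per block) follows the usual Q8_0 convention, but the struct name `BlockQ8_0` and all helper functions are hypothetical, introduced only for this sketch.

```cpp
// Illustrative sketch only, not the ik_llama.cpp implementation.
// Q8_0 stores blocks of 32 int8 quants plus one scale per block; a
// "contiguous transpose" here means producing the transposed matrix
// re-quantized row by row into contiguous Q8_0 blocks.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK8_0 = 32;  // values per Q8_0 block (ggml convention)

struct BlockQ8_0 {         // hypothetical stand-in for ggml's block_q8_0
    float  d;              // scale (ggml uses fp16; float kept for simplicity)
    int8_t qs[QK8_0];      // quantized values
};

// Dequantize one row of n values (n must be a multiple of QK8_0).
static void dequantize_row_q8_0(const BlockQ8_0 * x, float * y, int n) {
    for (int ib = 0; ib < n / QK8_0; ++ib)
        for (int j = 0; j < QK8_0; ++j)
            y[ib * QK8_0 + j] = x[ib].d * x[ib].qs[j];
}

// Quantize one row of n values into contiguous Q8_0 blocks.
static void quantize_row_q8_0(const float * x, BlockQ8_0 * y, int n) {
    for (int ib = 0; ib < n / QK8_0; ++ib) {
        float amax = 0.0f;
        for (int j = 0; j < QK8_0; ++j)
            amax = std::max(amax, std::fabs(x[ib * QK8_0 + j]));
        const float d  = amax / 127.0f;
        const float id = d > 0.0f ? 1.0f / d : 0.0f;
        y[ib].d = d;
        for (int j = 0; j < QK8_0; ++j)
            y[ib].qs[j] = (int8_t) std::lround(x[ib * QK8_0 + j] * id);
    }
}

// Transpose a rows x cols Q8_0 matrix (row-major) into a cols x rows Q8_0
// matrix via a float intermediate. Both dimensions must be multiples of
// QK8_0 in this simplified version.
static void transpose_q8_0(const BlockQ8_0 * src, BlockQ8_0 * dst, int rows, int cols) {
    std::vector<float> tmp((size_t) rows * cols);
    for (int r = 0; r < rows; ++r)
        dequantize_row_q8_0(src + (size_t) r * (cols / QK8_0),
                            tmp.data() + (size_t) r * cols, cols);

    std::vector<float> tmp_t((size_t) rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            tmp_t[(size_t) c * rows + r] = tmp[(size_t) r * cols + c];

    for (int c = 0; c < cols; ++c)
        quantize_row_q8_0(tmp_t.data() + (size_t) c * rows,
                          dst + (size_t) c * (rows / QK8_0), rows);
}
```

Going through a float intermediate and re-quantizing introduces a second rounding step, which is the usual trade-off when a quantized tensor has to change memory layout; whether the actual PR does it this way or works directly on the quantized blocks is not stated in the description above.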