mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-27 09:53:40 +00:00
q8_KV: be able to use it for K cache
This required quite a few fixes in ggml and llama.cpp:

* ggml: do not calculate row size as n/block_size*type_size. I had removed most of it when implementing the quants with per-row scale, but it was still lurking in ggml_copy. Not sure if these were the last remnants of ggml-style row sizes, or if there are still places left.
* llama.cpp: get rid of the 1D K-cache assumption. Create and manage the K-cache as a 2D tensor so we can have per-row metadata as needed by q8_KV.

Using q8_KV for the K-cache results in non-negligible performance gains. More details to follow, but for DeepSeek-Lite with MLA, we get an 18% speedup for PP-8192 compared to a q8_0 K-cache.
@@ -339,6 +339,9 @@ static ggml_type ggml_type_from_name(const std::string & s) {
     if (s == "q6_0") {
         return GGML_TYPE_Q6_0;
     }
+    if (s == "q8_KV") {
+        return GGML_TYPE_Q8_KV;
+    }
 
     return GGML_TYPE_COUNT;
 }