q8_KV: be able to use it for K cache

This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
  removed most of it when implementing the quants with per-row scale,
  but it was still lurking in ggml_copy. Not sure if these were the last
  remnants of ggml-style row sizes, or if there are still places left.
* llama.cpp: get rid of the 1D K-cache assumption. Create and manage
  the K-cache as a 2D tensor so we can have per-row metadata as needed
  by q8_KV.
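
The row-size point above can be sketched as follows. The block sizes, type size, and metadata size here are illustrative stand-ins, not ggml's actual q8_KV parameters; the point is only that a quant with a per-row scale has a row size that `n/block_size*type_size` cannot express:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative q8_KV-like parameters (NOT ggml's actual values):
// 32 elements per 32-byte block, plus one float scale stored once per row.
constexpr size_t kBlockSize = 32;            // elements per block
constexpr size_t kTypeSize  = 32;            // bytes per block
constexpr size_t kRowMeta   = sizeof(float); // per-row metadata bytes

// Old ggml-style row size: wrong for per-row-scale quants,
// because it ignores the per-row metadata.
size_t row_size_naive(size_t n) {
    return n / kBlockSize * kTypeSize;
}

// Correct row size: block data plus the per-row metadata.
size_t row_size_per_row(size_t n) {
    return n / kBlockSize * kTypeSize + kRowMeta;
}
```

Any code path that still computes the naive form (as ggml_copy did) under-allocates or mis-strides rows for such types.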

Using q8_KV for the K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA we get an
18% speedup for PP-8192 compared to a q8_0 K-cache.
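
The 2D K-cache layout described above can be sketched like this. All names and sizes are hypothetical, not llama.cpp's actual cache code; the sketch only shows why per-row metadata forces a 2D (row-granular) view rather than a flat 1D buffer of uniform elements:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical q8_KV-like row layout (illustrative, not ggml's):
// one float scale as per-row metadata, followed by n_embd_k int8 quants.
constexpr int    kNEmbdK   = 128;
constexpr int    kNCtx     = 4;
constexpr size_t kRowBytes = sizeof(float) + kNEmbdK;

// The K-cache as a 2D array: one row per cache slot, each row carrying
// its own scale, instead of one 1D buffer with a uniform element size.
uint8_t k_cache[kNCtx][kRowBytes];

// Byte offset of the row holding token position `pos`.
size_t k_row_offset(int pos) {
    return static_cast<size_t>(pos) * kRowBytes;
}
```

With a 1D cache the offset of position `pos` would be `pos * n_embd_k * element_size`, which has no slot for the per-row scale; the 2D view makes the row (data plus metadata) the unit of addressing.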
Iwan Kawrakow
2025-02-17 15:19:05 +02:00
parent a4ffe2e69e
commit 0280b8d52b
4 changed files with 38 additions and 14 deletions

@@ -2259,6 +2259,9 @@ static ggml_type kv_cache_type_from_str(const std::string & s) {
     if (s == "q6_0") {
         return GGML_TYPE_Q6_0;
     }
+    if (s == "q8_KV") {
+        return GGML_TYPE_Q8_KV;
+    }
     throw std::runtime_error("Invalid cache type: " + s);
 }
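
A self-contained sketch of what the patched parser does (the enum values and the `_sketch` names are stand-ins for ggml's real types, which are not reproduced here):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Simplified stand-in for ggml's type enum (values are illustrative).
enum ggml_type_sketch { QT_Q6_0, QT_Q8_0, QT_Q8_KV };

// Mirrors the patched kv_cache_type_from_str: "q8_KV" is now a valid
// K-cache type string; anything unknown still throws.
ggml_type_sketch kv_cache_type_from_str_sketch(const std::string & s) {
    if (s == "q6_0")  return QT_Q6_0;
    if (s == "q8_0")  return QT_Q8_0;
    if (s == "q8_KV") return QT_Q8_KV;
    throw std::runtime_error("Invalid cache type: " + s);
}
```

In llama.cpp the K-cache type string is typically supplied via the `--cache-type-k`/`-ctk` option, so presumably `-ctk q8_KV` selects this path after the change.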