mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-27 09:53:40 +00:00
q8_KV: be able to use it for K cache
This required quite a few fixes in ggml and llama.cpp:

* ggml: do not calculate row size as n/block_size*type_size. I had removed most of it when implementing the quants with per-row scale, but it was still lurking in ggml_copy. Not sure if these were the last remnants of ggml-style row sizes, or if there are still places left.
* llama.cpp: get rid of the 1D K-cache assumption. Create and manage the K-cache as a 2D tensor so we can have per-row metadata as needed by q8_KV.

Using q8_KV for the K-cache results in non-negligible performance gains. More details to follow, but for DeepSeek-Lite with MLA, we get an 18% speedup for PP-8192 compared to a q8_0 K-cache.
@@ -339,6 +339,9 @@ static ggml_type ggml_type_from_name(const std::string & s) {
     if (s == "q6_0") {
         return GGML_TYPE_Q6_0;
     }
+    if (s == "q8_KV") {
+        return GGML_TYPE_Q8_KV;
+    }
 
     return GGML_TYPE_COUNT;
 }