Nexes the Elder 9a63e768ea Legacy quants cpy_blck_q_f16 function for K cache (#1001)
Short fix for the bug: ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16), encountered when trying to use DeepSeek-V2 Lite with a quantized K cache. Note: I compile my ik_llama.cpp with GGML_CUDA_F16.
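
For context, the abort comes from the type dispatch in ggml_cuda_cpy_fn, which returns a copy kernel for each (src, dst) type pair and aborts on any pair it does not handle. As a hedged sketch (the wrapper and template names are assumptions modeled on the existing quant-to-f32 cases, not the literal diff), the missing q6_0 to f16 case would look roughly like:

    // Hypothetical dispatch case in ggml_cuda_cpy_fn (ggml-cuda/cpy.cu).
    // Without it, the function falls through to the
    // "unsupported type combination" abort quoted above.
    if (src0->type == GGML_TYPE_Q6_0 && src1->type == GGML_TYPE_F16) {
        // The cpy_q_f32 kernel wrapper indexes by the tensors' own strides,
        // so it can host an f16-writing block function as well.
        return (void*) cpy_q_f32<cpy_blck_q_f16<dequantize_q6_0, QK6_0>, QK6_0>;
    }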

To fix this, I added a cpy_blck_q_f16 function, devised by comparing cpy_blck_q8_0_f32 with cpy_blck_q8_0_f16 and transposing that difference onto the other legacy quants, using the existing cpy_blck_q_f32 function as the starting point. A "rule of three" of sorts.
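
To make that concrete, here is a minimal sketch of what such a cpy_blck_q_f16 template can look like, mirroring the shape of the upstream cpy_blck_q_f32 but storing half values; dfloat2 and dequantize_kernel_t are assumed to match the ggml-cuda headers, and the j / j + qk/2 interleaving follows the legacy-quant dequantize kernels:

    // Sketch, not the literal patch: dequantize one block of a legacy quant
    // (q4_0, q4_1, q5_0, q5_1, q6_0) and store it as f16 instead of f32.
    template <dequantize_kernel_t dequant, int qk>
    static __device__ void cpy_blck_q_f16(const char * cxi, char * cdsti) {
        half * cdsth = (half *) cdsti;

    #pragma unroll
        for (int j = 0; j < qk/2; j++) {
            dfloat2 dq;
            dequant(cxi, 0, j, dq); // yields the value pair (j, j + qk/2)
            cdsth[j]        = __float2half((float) dq.x);
            cdsth[j + qk/2] = __float2half((float) dq.y);
        }
    }

Each legacy quant then gets its own instantiation, e.g. cpy_blck_q_f16<dequantize_q4_0, QK4_0>, wired into the dispatch sketched above.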

Perplexity tests and inference now work consistently with -ctk q4_0, q4_1, q5_0, and q5_1 in that scenario, with the expected values and behavior.

The exception is q6_0, which sees its perplexity multiplied by 100. (I suspect the CUDA dequantize_q6_0 is incompatible with this PR for some reason, but that's beyond what I can fix.)

-ctk iq4_nl, which does not yet have a dequantize_iq4_nl function, is not usable this way for now.
2025-11-24 08:56:38 +01:00