Nexes the Elder 9a63e768ea Legacy quants cpy_blck_q_f16 function for K cache (#1001)
Short fix for the bug: ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16), encountered when trying to use DeepSeek-V2 Lite with a quantized K cache. Note: I compile my ik_llama.cpp with GGML_CUDA_F16.
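
For context, the abort comes from the type dispatch in ggml_cuda_cpy_fn, which returns a copy kernel for each (src, dst) type pair and aborts on any pair it does not handle. As a hedged sketch (the wrapper and template names are assumptions modeled on the existing quant-to-f32 cases, not the literal diff), the missing q6_0 to f16 case would look roughly like:

    // Hypothetical dispatch case in ggml_cuda_cpy_fn (ggml-cuda/cpy.cu).
    // Without it, the function falls through to the
    // "unsupported type combination" abort quoted above.
    if (src0->type == GGML_TYPE_Q6_0 && src1->type == GGML_TYPE_F16) {
        // The cpy_q_f32 kernel wrapper indexes by the tensors' own strides,
        // so it can host an f16-writing block function as well.
        return (void*) cpy_q_f32<cpy_blck_q_f16<dequantize_q6_0, QK6_0>, QK6_0>;
    }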

To fix this, I added a cpy_blck_q_f16 function, devised by comparing cpy_blck_q8_0_f32 with cpy_blck_q8_0_f16 and transposing that difference onto the other legacy quants, using the existing cpy_blck_q_f32 function as the starting point. A "rule of three" of sorts.
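
To make that concrete, here is a minimal sketch of what such a cpy_blck_q_f16 template can look like, mirroring the shape of the upstream cpy_blck_q_f32 but storing half values; dfloat2 and dequantize_kernel_t are assumed to match the ggml-cuda headers, and the j / j + qk/2 interleaving follows the legacy-quant dequantize kernels:

    // Sketch, not the literal patch: dequantize one block of a legacy quant
    // (q4_0, q4_1, q5_0, q5_1, q6_0) and store it as f16 instead of f32.
    template <dequantize_kernel_t dequant, int qk>
    static __device__ void cpy_blck_q_f16(const char * cxi, char * cdsti) {
        half * cdsth = (half *) cdsti;

    #pragma unroll
        for (int j = 0; j < qk/2; j++) {
            dfloat2 dq;
            dequant(cxi, 0, j, dq); // yields the value pair (j, j + qk/2)
            cdsth[j]        = __float2half((float) dq.x);
            cdsth[j + qk/2] = __float2half((float) dq.y);
        }
    }

Each legacy quant then gets its own instantiation, e.g. cpy_blck_q_f16<dequantize_q4_0, QK4_0>, wired into the dispatch sketched above.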

Perplexity tests and inference now work consistently with -ctk q4_0, q4_1, q5_0, and q5_1 in that scenario, with the expected values and behavior.

The exception is q6_0, which sees its perplexity multiplied by 100. (I suspect the CUDA dequantize_q6_0 is incompatible with this PR for some reason, but that's beyond what I can fix.)

-ctk iq4_nl, which does not yet have a dequantize_iq4_nl function, is not usable this way for now.
2025-11-24 08:56:38 +01:00