iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16
  (see the arithmetic sketch after this list).
* TG (token generation) on CUDA does not work. Johannes has changed the way
  i-quant dot products are done, so I need to sort out what he had in mind.
* iqk_mul_mat is not implemented.
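For reference, a small C check of the arithmetic behind the quoted
percentages; the fp16 and q4_K_S PPL values here are back-solved from the
stated ratios, not measured:

#include <stdio.h>

// iq4_k PPL = 6.5258 is quoted as 1.77% above fp16, so the implied
// fp16 baseline is 6.5258 / 1.0177; q4_K_S is quoted as 2.88% above it.
int main(void) {
    double ppl_iq4_k = 6.5258;
    double ppl_fp16  = ppl_iq4_k / 1.0177;   // ~ 6.4123 (derived)
    double ppl_q4_ks = ppl_fp16 * 1.0288;    // ~ 6.5970 (derived)
    printf("fp16 ~ %.4f, q4_K_S ~ %.4f\n", ppl_fp16, ppl_q4_ks);
    return 0;
}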
commit 8a2d43813d
parent f62615b44f
Author: Iwan Kawrakow
Date: 2024-07-27 17:05:31 +03:00
14 changed files with 427 additions and 4 deletions


@@ -170,6 +170,7 @@ extern "C" {
     LLAMA_FTYPE_MOSTLY_Q4_0_8_8 = 35, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_IQ1_BN = 36,
     LLAMA_FTYPE_MOSTLY_IQ2_BN = 37,
+    LLAMA_FTYPE_MOSTLY_IQ4_K = 38, // except 1d tensors
     LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
 };
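As a usage sketch (not part of this commit), the new ftype can be selected
through the public llama.cpp quantization API; the file names below are
placeholders:

#include "llama.h"

int main(void) {
    // Start from the default quantization params, then select the
    // IQ4_K ftype added by this commit.
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype = LLAMA_FTYPE_MOSTLY_IQ4_K;
    // llama_model_quantize returns 0 on success.
    return (int) llama_model_quantize("model-f16.gguf", "model-iq4_k.gguf",
                                      &params);
}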