iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16
  (see the arithmetic sketch after this list).
* TG (token generation) on CUDA does not work. Johannes has changed the way
  i-quant dot products are done, so I need to sort out what he had in mind.
* iqk_mul_mat is not implemented.
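For reference, a small C check of the arithmetic behind the quoted
percentages; the fp16 and q4_K_S PPL values here are back-solved from the
stated ratios, not measured:

#include <stdio.h>

// iq4_k PPL = 6.5258 is quoted as 1.77% above fp16, so the implied
// fp16 baseline is 6.5258 / 1.0177; q4_K_S is quoted as 2.88% above it.
int main(void) {
    double ppl_iq4_k = 6.5258;
    double ppl_fp16  = ppl_iq4_k / 1.0177;   // ~ 6.4123 (derived)
    double ppl_q4_ks = ppl_fp16 * 1.0288;    // ~ 6.5970 (derived)
    printf("fp16 ~ %.4f, q4_K_S ~ %.4f\n", ppl_fp16, ppl_q4_ks);
    return 0;
}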
commit 8a2d43813d
parent f62615b44f
Author: Iwan Kawrakow
Date: 2024-07-27 17:05:31 +03:00
14 changed files with 427 additions and 4 deletions


@@ -170,6 +170,7 @@ extern "C" {
     LLAMA_FTYPE_MOSTLY_Q4_0_8_8 = 35, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_IQ1_BN = 36,
     LLAMA_FTYPE_MOSTLY_IQ2_BN = 37,
+    LLAMA_FTYPE_MOSTLY_IQ4_K = 38, // except 1d tensors
     LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
 };
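As a usage sketch (not part of this commit), the new ftype can be selected
through the public llama.cpp quantization API; the file names below are
placeholders:

#include "llama.h"

int main(void) {
    // Start from the default quantization params, then select the
    // IQ4_K ftype added by this commit.
    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype = LLAMA_FTYPE_MOSTLY_IQ4_K;
    // llama_model_quantize returns 0 on success.
    return (int) llama_model_quantize("model-f16.gguf", "model-iq4_k.gguf",
                                      &params);
}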