iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.
This commit is contained in:
Iwan Kawrakow
2024-08-05 14:22:05 +03:00
parent b409c15363
commit 1b41d792ec
9 changed files with 157 additions and 3 deletions


@@ -174,6 +174,7 @@ extern "C" {
LLAMA_FTYPE_MOSTLY_IQ3_K = 39, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ4_K = 40, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ5_K = 41, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ2_TN = 42, // except 1d tensors
LLAMA_FTYPE_GUESSED = 1024, // not specified in the model file
};