ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-24 23:24:13 +00:00

Files

Iwan Kawrakow a4f1ac8da4 iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach was not (weight = 1). With the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of RMSE, so it does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that can be precomputed, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
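
To illustrate why the weight matters when comparing quantization types, here is a minimal sketch and not the code from this commit. It assumes sigma^2 is the mean square of the block being quantized, and the helper names (importance_weights, weighted_rmse, best_scale) are hypothetical. It shows a weight of the form sigma^2/4 + x^2, a weighted RMSE, and the least-squares scale d = sum(w*x*q) / sum(w*q^2).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Weight of the form quoted in the commit message: sigma^2/4 + x^2.
// Assumption: sigma^2 is the mean square of the values in the block.
inline std::vector<float> importance_weights(const std::vector<float>& x) {
    double sumx2 = 0.0;
    for (float v : x) sumx2 += double(v) * v;
    const double sigma2 = x.empty() ? 0.0 : sumx2 / double(x.size());
    std::vector<float> w(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        w[i] = float(0.25 * sigma2 + double(x[i]) * x[i]);
    }
    return w;
}

// Weighted RMSE between the original values x and a reconstruction y.
// With all weights equal to 1 this reduces to the plain RMSE.
inline double weighted_rmse(const std::vector<float>& x,
                            const std::vector<float>& y,
                            const std::vector<float>& w) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = double(x[i]) - double(y[i]);
        num += w[i] * d * d;
        den += w[i];
    }
    return den > 0.0 ? std::sqrt(num / den) : 0.0;
}

// Scale d that minimizes sum_i w_i * (x_i - d * q_i)^2 for fixed q.
// Note the denominator sum(w * q^2): with unit weights it is just sum(q^2).
inline double best_scale(const std::vector<float>& x,
                         const std::vector<float>& q,
                         const std::vector<float>& w) {
    double sumqx = 0.0, sumq2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumqx += double(w[i]) * x[i] * q[i];
        sumq2 += double(w[i]) * q[i] * q[i];
    }
    return sumq2 > 0.0 ? sumqx / sumq2 : 0.0;
}
```

Two reconstructions with the same plain RMSE can rank differently under this weight, because an error on a large-magnitude value is penalized more than the same error on a small one. That is why an RMSE measured with weight = 1 is not directly comparable to one measured with sigma^2/4 + x^2. The denominator in best_scale is also where a constant, precomputed sum(q^2) would save work. When it has to be accumulated for every candidate group, the search does more arithmetic per candidate, which is one plausible reading of the slowdown reported above.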

2024-11-21 08:16:40 +02:00

CMakeLists.txt

WIP

2024-11-21 08:16:40 +02:00

quantize-stats.cpp

iq2_kt: quantize / dequantize

2024-11-21 08:16:40 +02:00