Adding IQ2_KL (#602)

* Experiments for 2.6875 bpw quants

At least according to RMSE, this is significantly better than
q2_K, while using only 1/16 bpw more.
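
For reference, the arithmetic behind the 1/16 figure (q2_K's
super-block is 84 bytes per 256 weights):

    q2_K:  84 bytes * 8 / 256 weights = 2.625 bpw
    delta: 2.6875 - 2.625 = 0.0625 = 1/16 bpw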

* iq2_kl: basics

* iq2_kl: CUDA dequantize

* iq2_kl: small improvement in PPL

Also check the two values neighbouring the chosen block scale
and use the one that minimizes RMSE.
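
A minimal sketch of that search, with a hypothetical dequant callback
standing in for the actual block reconstruction (the real code lives in
the ggml quantization routines; range clamping of the scale index is
omitted):

    #include <cfloat>

    // Try the initial integer block scale ls0 and its two neighbours and
    // keep the one with the lowest squared error; minimizing the sum of
    // squared errors over the block also minimizes its RMSE.
    static int best_block_scale(const float * x, int n, int ls0,
                                float (*dequant)(int ls, int i)) {
        int   best_ls  = ls0;
        float best_err = FLT_MAX;
        for (int ls = ls0 - 1; ls <= ls0 + 1; ++ls) {
            float err = 0.0f;
            for (int i = 0; i < n; ++i) {
                // x[i] vs weight i reconstructed with scale index ls
                const float diff = x[i] - dequant(ls, i);
                err += diff * diff;
            }
            if (err < best_err) { best_err = err; best_ls = ls; }
        }
        return best_ls;
    }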

* iq2_kl: MMQ

Quite good: PP-512(L3-8B) = 8472 t/s.

* iq2_kl: MMVQ

We get TG-128(L3-8B) = 162 t/s, which is not quite as good as it
should be: q2_K, at (almost) the same bpw, gets 170 t/s.

* iq2_kl: Zen4 GEMM/GEMV

Not particularly fast. I may need to think about rearranging the bits.

* iq2_kl: better Zen4

* iq2_kl: convert/repack to q8_k_r8 (AVX2)

* iq2_kl: AVX2 GEMM/GEMV

* iq2_kl: WIP NEON

The compiler started crashing!!!

* iq2_kl: NEON

Had to work around a compiler crash when using vzip2q_u8 by using
vqtbl2q_u8 instead.
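
A sketch of the workaround, assuming standard NEON semantics:
vzip2q_u8(a, b) interleaves the upper 8 bytes of a and b, and the same
permutation can be expressed as a two-table byte lookup, which the
affected compiler accepted. The helper name is illustrative:

    #include <arm_neon.h>

    // vqtbl2q_u8 indices 0..15 select bytes from t.val[0] (= a) and
    // 16..31 from t.val[1] (= b), so interleaving the upper halves is
    // {8,24, 9,25, ..., 15,31}, exactly what vzip2q_u8 produces.
    static inline uint8x16_t zip2q_u8_via_tbl(uint8x16_t a, uint8x16_t b) {
        static const uint8_t k_idx[16] = {
             8, 24,  9, 25, 10, 26, 11, 27,
            12, 28, 13, 29, 14, 30, 15, 31,
        };
        const uint8x16x2_t t = { { a, b } };
        return vqtbl2q_u8(t, vld1q_u8(k_idx));
    }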

* iq2_kl: convert/repack to q8_k_r8 (NEON)

* iq2_kl: Metal dequantize

* iq2_kl: Metal GEMV - pretty slow

* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)

* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)

* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 476 t/s up from 466 t/s.

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 492 t/s up from 476 t/s.

* Add iq2_kl to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
commit f375799f17 (parent da8998c6c6)
Kawrakow, 2025-07-14 18:55:08 +02:00, committed by GitHub
24 changed files with 1819 additions and 12 deletions


@@ -76,6 +76,7 @@ static const std::vector<struct quant_option> QUANT_OPTIONS = {
{ "IQ2_K_R4", LLAMA_FTYPE_MOSTLY_IQ2_K_R4, "IQ2_K repacked",},
{ "IQ2_KS", LLAMA_FTYPE_MOSTLY_IQ2_KS, " 2.1875 bpw non-linear quantization",},
{ "IQ2_KT", LLAMA_FTYPE_MOSTLY_IQ2_KT, " 2.125 bpw trellis quantization", },
{ "IQ2_KL", LLAMA_FTYPE_MOSTLY_IQ2_KL, " 2.69 bpw non-linear quantization", },
{ "IQ3_KS", LLAMA_FTYPE_MOSTLY_IQ3_KS, " 3.19 bpw non-linear quantization", },
{ "IQ3_K", LLAMA_FTYPE_MOSTLY_IQ3_K, " 3.44 bpw non-linear quantization", },
{ "IQ3_K_R4", LLAMA_FTYPE_MOSTLY_IQ3_K_R4, "IQ3_K repacked", },