Files
ik_llama.cpp/include
Kawrakow 8b7536bda8 IQ1_S_R4: better 1.5 bpw quants (#185)
* iq1_s_r4: basics - quantize/dequantize

* iq1_s_r4: gemm/gemv works on AVX2/Zen4

* Don't forget to make sure we have a multiple of 4 rows per thread

* iq1_s_r4: this is better

* iq1_s_r4: fix Zen4 after AVX2 changes

* iq1_s_r4: NEON gemm/gemv

* iq1_s_r4: more bits for shared experts

With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.

On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.

* Forgotten counter increment

* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv

* Compiler warnings

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05 13:49:39 +02:00
..
2025-02-05 13:49:39 +02:00