ik_llama.cpp/ggml
Kawrakow c410cc72bb Much faster CPU prompt processing (part 3) (#534)
* Repack q4_0 and q8_0 to q8_0_R8

q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit
scale conversions.

* Change q8_2_x4 to store int16_t sums

With that q4_0 now works.
I need to check all quants that use q8_2_x4!
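
A minimal sketch of why an integer block sum is safer than a 16-bit float, assuming blocks of 32 int8 quants and bf16 as the 16-bit format. Both are assumptions made for illustration; the actual q8_2_x4 layout is not shown in this message, and the helper names below are hypothetical. The sum of 32 int8 values always lies in [-4096, 4064], so int16_t holds it exactly, while a bf16 round trip keeps only 8 significant bits.

```cpp
#include <cstdint>
#include <cstring>

constexpr int kBlockSize = 32;  // assumed block size, not taken from the source

// Hypothetical helper: float -> bf16 with round-to-nearest-even (NaN not handled).
static uint16_t f32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    u += 0x7fffu + ((u >> 16) & 1u);  // rounding bias, ties go to even
    return static_cast<uint16_t>(u >> 16);
}

// Hypothetical helper: bf16 -> float (exact, just restores the low 16 zero bits).
static float bf16_to_f32(uint16_t h) {
    uint32_t u = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Exact sum of one block of int8 quants; the result always fits in int16_t.
static int16_t block_sum_i16(const int8_t *q) {
    int s = 0;
    for (int i = 0; i < kBlockSize; ++i) s += q[i];
    return static_cast<int16_t>(s);
}
```

For example, a block sum of 3290 survives unchanged as int16_t, but becomes 3296 after a float -> bf16 -> float round trip. An error of that size in a per-block correction term is the kind of loss that could plausibly show up as a PPL increase.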

* q5_0, and use a dequantizing template

* q6_0

129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.

* iq4_nl

137 t/s -> 293 t/s. iq4_nl is at 251 t/s.

* q4_1: 135 t/s -> 262 t/s

* q5_1: 125 t/s -> 253 t/s

* iq4_xs

178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.

* q2_K

202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 15:30:56 +03:00