ik_llama.cpp/ggml
Kawrakow 58d2cbf948 Much faster prompt processing for k-quants (ARM_NEON) (#552)
* iq2_xxs

55.8 -> 167.5 t/s. iq2_xxs_r4 is at 93.7 t/s.

* iq2_xs

46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.

* iq2_s

42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.

* iq3_xxs

51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.

* iq3_s

46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s.

* q2_k

85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.

* q3_k

45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.

* q6_k

47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.

* q4_k

58.2 t/s -> 114.8 t/s. q4_k_r4 is at 130.9 t/s.

As I had to add a new implementation for q8_1-quantized
activations, token generation (TG) became slightly faster too
(25.1 -> 25.9 t/s).

* q5_k

54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.

* iq4_xs

71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.
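
The before/after figures in the list above are easier to compare as speedup ratios. The following is a minimal sketch that only does that arithmetic on the numbers quoted in this commit message; the table name `PP_THROUGHPUT` and the helper functions are illustrative and not part of the repository.

```python
# Prompt-processing throughput in t/s, copied from the list above:
# (before, after) for each quantization type touched by this commit.
PP_THROUGHPUT = {
    "iq2_xxs": (55.8, 167.5),
    "iq2_xs": (46.4, 166.6),
    "iq2_s": (42.8, 166.8),
    "iq3_xxs": (51.8, 165.6),
    "iq3_s": (46.0, 162.0),
    "q2_k": (85.7, 168.1),
    "q3_k": (45.7, 170.8),
    "q6_k": (47.7, 124.0),
    "q4_k": (58.2, 114.8),
    "q5_k": (54.9, 114.9),
    "iq4_xs": (71.2, 167.8),
}


def speedup(before, after):
    """Ratio of new to old throughput (values above 1 mean faster)."""
    return after / before


def speedups(table):
    """Map each quantization type to its speedup ratio."""
    return {name: speedup(b, a) for name, (b, a) in table.items()}


if __name__ == "__main__":
    ranked = sorted(speedups(PP_THROUGHPUT).items(), key=lambda kv: -kv[1])
    for name, ratio in ranked:
        print(f"{name:8s} {ratio:.2f}x")
```

With these figures the gains range from a bit under 2x (q2_k) to about 3.9x (iq2_s).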

---------
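
The q4_k entry mentions q8_1-quantized activations. As a rough illustration of what that format holds, here is a sketch of q8_1-style block quantization, assuming the conventional ggml layout: blocks of 32 values, one scale `d`, the precomputed value `s = d * sum(qs)`, and 32 signed 8-bit quants. It is not the NEON implementation added by this commit, which is not shown here and may differ. Scales stay ordinary Python floats rather than fp16, and all function names are hypothetical.

```python
import math

QK8_1 = 32  # assumed block size for q8_1


def round_half_away(v):
    """Round to nearest integer, ties away from zero (like C roundf)."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def quantize_block_q8_1(x):
    """Quantize one block of 32 floats to (d, s, qs)."""
    assert len(x) == QK8_1
    amax = max(abs(v) for v in x)
    d = amax / 127.0
    inv_d = 1.0 / d if d > 0 else 0.0
    qs = [round_half_away(v * inv_d) for v in x]
    s = d * sum(qs)  # stored so offset terms need no extra pass over qs
    return d, s, qs


def dequantize_block_q8_1(d, qs):
    """Recover approximate floats from a quantized block."""
    return [d * q for q in qs]


def dot_with_offset_weights(dw, m, qw, d, s, qs):
    """Dot product with weights of the form w_i = dw * qw_i + m.

    sum_i (dw*qw_i + m) * (d*qs_i) = dw*d*sum(qw_i*qs_i) + m*s,
    so the offset m only needs the precomputed s.
    """
    return dw * d * sum(a * b for a, b in zip(qw, qs)) + m * s
```

The stored sum `s` is what makes this layout a natural fit for weight formats that carry a per-block offset, since the offset contribution reduces to a single multiply.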

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-24 13:05:01 +02:00