Kawrakow fbf975741e R4 i-quants improvements (#157)
* Add nrc_y = 16 implementation.

Here just iq2_s on Zen4. PP-512 goes up from 148.5 t/s to
169.5 t/s. Since we know we will be multiplying by 16
columns, we can afford to spend the time to add the mins and make the
iq2_s quants unsigned.

* nrc_y = 16: AVX2 iq2_s

We go from 176.8 to 203.3 t/s.

* nrc_y = 16: NEON iq2_s

We go from 50.4 to 62.3 t/s.
We didn't need to do anything other than set func16 to
mul_mat_iq2_s_r4_q8_k<16>. Even though we have nowhere near
enough vector registers for all the accumulators, unpacking and preparing
the iq2_s quants is so expensive that we still gain ~23% in performance
by reusing the unpacked quants 16 times instead of just 8, despite
having to load/unload the accumulated results to/from the
available vector registers.
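The register-tiling trade-off above can be sketched with a scalar toy matmul: the "unpack" of a quantized LHS block is the expensive step, so each unpacked block is reused across nrc_y RHS columns, and doubling nrc_y from 8 to 16 halves the number of unpacks per output row. This is a hypothetical illustration of the tiling, not the real kernel.

```cpp
#include <cassert>

constexpr int kRows = 4, kCols = 16, kK = 8;

// Toy tiled matmul that counts how often the (expensive) LHS unpack
// runs: once per row per tile of nrc_y columns.
template <int nrc_y>
int matmul_count_unpacks(const int (&A)[kRows][kK],
                         const int (&B)[kK][kCols],
                         int (&C)[kRows][kCols]) {
    int unpacks = 0;
    for (int i = 0; i < kRows; ++i) {
        for (int j0 = 0; j0 < kCols; j0 += nrc_y) {
            ++unpacks; // "unpack" A[i] once, reuse for nrc_y columns
            for (int j = j0; j < j0 + nrc_y; ++j) {
                int acc = 0;
                for (int k = 0; k < kK; ++k) acc += A[i][k] * B[k][j];
                C[i][j] = acc;
            }
        }
    }
    return unpacks;
}
```

With nrc_y = 16 the unpack count is half of the nrc_y = 8 count, which is where the gain comes from as long as the unpack cost dominates the extra accumulator spills.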

* nrc_y = 16: NEON iq2_xxs, iq2_xs, iq3_xxs

iq2_xxs: 76.34 -> 85.33 t/s
iq2_xs:  54.13 -> 67.99 t/s
iq3_xxs: 67.45 -> 73.56 t/s

* nrc_y = 16: AVX2 iq2_xxs, iq2_xs, iq3_xxs

iq2_xxs: 195.7 -> 221.8 t/s
iq2_xs : 192.6 -> 220.6 t/s
iq3_xxs: 184.4 -> 206.9 t/s

* r4_nrcy_16: iq3_k_r4, iq4_k_r4, iq4_ks_r4, iq5_k_r4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-22 10:52:56 +01:00