Commit Graph

19 Commits

Author SHA1 Message Date
Iwan Kawrakow
2a72d9f978 iqk_mul_mat: better AVX2 implementation for iq2_xxs
From here on switching to GCC 12.

PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @  4 threads
          23.0 t/s @  8 threads
          25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
3a6e3943a8 iqk_mul_mat: better AVX2 implementation for iq2_xxs
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
60f050d610 iqk_mul_mat: AVX2 implementation for iq2_xxs
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
309e32405f iqk_mul_mat: AVX2 implementation for iq2_xs
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
8015edb3cc iqk_mul_mat: AVX2 implementation for iq2_s
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b0071de081 Separate templates for TG and PP for i-quants on AVX2 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
2c8c0d0a68 iqk_mul_mat: AVX2 implementation for iq3_xxs
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
34befcaf67 iqk_mul_mat: AVX2 implementation for iq3_s
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4f53915dcb Cleanup - Arm i-quants should be good now
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4b27ade2fb iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
221a2c3807 Simplify 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
7dcca6aea7 iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
effa4448d6 iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d2ee9ab95e iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
We get only 2.07X for PP-512 (31 t/s),
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
9ac9e928d5 Add Q8_0 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
3f996d0c70 Cosmetics 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d7ab97149f iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
We get ~5% speedup for TG-128, 3X for PP-512
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b51922530f iqk_mul_mat: faster q3_K TG
We get 31 t/s up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
19c578b413 iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00