Commit Graph

22 Commits

Author SHA1 Message Date
Kawrakow
2c8d3dad1f iqk_mul_mat: experimenting with zen4 (iq2_xs) 2024-06-22 12:02:49 +03:00
Kawrakow
0d9027fe74 iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m) 2024-06-22 12:02:49 +03:00
Kawrakow
ed8f1fe490 iqk_mul_mat: small improvement for iq3_s
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @  4 threads
         14.4 t/s @  8 threads
         16.3 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
01d55dcbf0 iqk_mul_mat: better AVX2 implementation for iq2_xxs
From here on switching to GCC 12.

PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @  4 threads
          23.0 t/s @  8 threads
          25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
d4e9e595f9 iqk_mul_mat: better AVX2 implementation for iq2_xxs
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads (22.65 t/s vs 26.3 t/s).
Very strange.
2024-06-22 12:02:49 +03:00
Kawrakow
41391ff4b0 iqk_mul_mat: AVX2 implementation for iq2_xxs
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Kawrakow
be132341f5 iqk_mul_mat: AVX2 implementation for iq2_xs
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
3c448906bf iqk_mul_mat: AVX2 implementation for iq2_s
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
f31200bde1 Separate templates for TG and PP for i-quants on AVX2 2024-06-22 12:02:49 +03:00
Kawrakow
3f90520d1f iqk_mul_mat: AVX2 implementation for iq3_xxs
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Kawrakow
24ccf42a4f iqk_mul_mat: AVX2 implementation for iq3_s
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
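The two commits above describe the same pattern for iq3_xxs and iq3_s: prompt processing (PP) goes through the new templated iqk_mul_mat kernel, while token generation (TG) falls back to the original llama.cpp implementation, which the template cannot match. A minimal sketch of that dispatch decision — all names here (`select_kernel`, the path labels, the type set) are illustrative assumptions, not the actual iqk_mul_mat API:

```python
# Hypothetical dispatch sketch: route PP (many rows, GEMM-like) through the
# templated kernel, and TG (a single row, GEMV-like) through the original
# llama.cpp dot-product path, for the types where the commits say the
# template loses to the special-purpose TG code.
TEMPLATED_PP_ONLY_TYPES = {"iq3_xxs", "iq3_s"}  # per the commits above

def select_kernel(quant_type: str, n_batch_rows: int) -> str:
    """Pick a matmul path: templated iqk kernel vs. llama.cpp vec_dot."""
    is_tg = n_batch_rows == 1  # token generation multiplies by one row at a time
    if quant_type in TEMPLATED_PP_ONLY_TYPES and is_tg:
        return "llama_cpp_vec_dot"
    return "iqk_template"
```

For other quant types in this log (e.g. iq2_xxs, iq2_xs) the template handles both PP and TG, with the mixed TG results noted in their commit messages.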
Kawrakow
32f20a1b9b Cleanup - Arm i-quants should be good now
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Kawrakow
7235135c3e iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Kawrakow
482dd30382 Simplify 2024-06-22 12:02:49 +03:00
Kawrakow
6aa7ac9cd3 iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
d041c81b1d iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
3fe4e1b27c iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Kawrakow
4c0920cb1b Add Q8_0 2024-06-22 12:02:49 +03:00
Kawrakow
62122c1950 Cosmetics 2024-06-22 12:02:49 +03:00
Kawrakow
fb8bc26dc5 iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
We get a ~5% speedup for TG-128 and 3X for PP-512.
2024-06-22 12:02:49 +03:00
Kawrakow
a18a564e54 iqk_mul_mat: faster q3_K TG
We get 31 t/s, up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
Kawrakow
d434b4751a iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00
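A quick sanity check on the PP-512 numbers quoted throughout this log: assuming each "NX" factor is measured against mainline llama.cpp on the same machine and thread count (an assumption — the commits do not state the baseline explicitly), the implied mainline throughput is simply the new t/s divided by the factor:

```python
# Implied mainline PP-512 baselines from the (t/s, speedup) pairs quoted
# in the AVX2 commits above. The "relative to mainline" interpretation of
# the NX factors is an assumption.
reported = {
    "iq2_xxs": (120.5, 2.41),
    "iq2_xs":  (118.9, 2.19),
    "iq2_s":   (107.0, 2.04),
    "iq3_xxs": (87.0,  2.3),
    "iq3_s":   (96.6,  3.14),
}
for name, (tps, factor) in reported.items():
    print(f"{name}: implied mainline PP-512 ~ {tps / factor:.1f} t/s")
```

For example, iq2_xxs at 120.5 t/s with a 2.41X factor implies a ~50 t/s baseline, and iq3_s at 96.6 t/s with 3.14X implies ~31 t/s.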