Iwan Kawrakow
2a72d9f978
iqk_mul_mat: better AVX2 implementation for iq2_xxs
...
From here on switching to GCC 12.
PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @ 4 threads
23.0 t/s @ 8 threads
25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
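(The thread-scaling numbers quoted in these commit messages can be sanity-checked with a small sketch; the helper names below are illustrative, not part of the repository.)

```python
# Hedged sketch: relate the throughput figures (t/s) quoted above.
# "2.41X for PP-512" style factors are simply new/old throughput.
def speedup(new_tps: float, old_tps: float) -> float:
    return new_tps / old_tps

# TG-128 thread scaling from the commit above (threads -> t/s):
tg = {4: 13.5, 8: 23.0, 16: 25.1}

# Parallel efficiency relative to the 4-thread run; the flattening
# from 8 to 16 threads shows TG becoming memory-bandwidth bound.
eff = {n: (tps / tg[4]) / (n / 4) for n, tps in tg.items()}
```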
Iwan Kawrakow
3a6e3943a8
iqk_mul_mat: better AVX2 implementation for iq2_xxs
...
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
60f050d610
iqk_mul_mat: AVX2 implementation for iq2_xxs
...
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
309e32405f
iqk_mul_mat: AVX2 implementation for iq2_xs
...
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
8015edb3cc
iqk_mul_mat: AVX2 implementation for iq2_s
...
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b0071de081
Separate templates for TG and PP for i-quants on AVX2
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
2c8c0d0a68
iqk_mul_mat: AVX2 implementation for iq3_xxs
...
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
34befcaf67
iqk_mul_mat: AVX2 implementation for iq3_s
...
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4f53915dcb
Cleanup - Arm i-quants should be good now
...
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4b27ade2fb
iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
...
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
221a2c3807
Simplify
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
7dcca6aea7
iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
...
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
effa4448d6
iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
...
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d2ee9ab95e
iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
...
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
9ac9e928d5
Add Q8_0
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
3f996d0c70
Cosmetics
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d7ab97149f
iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
...
We get ~5% speedup for TG-128, 3X for PP-512.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b51922530f
iqk_mul_mat: faster q3_K TG
...
We get 31 t/s up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
19c578b413
iqk_mul_mat for llama.cpp
2024-06-22 12:02:49 +03:00