Commit Graph

33 Commits

Author SHA1 Message Date
Iwan Kawrakow
f0a52f2fbb iqk_mul_mat: delete unused stuff 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
74b711c8fd iqk_mul_mat: add q8_0
It was actually ready but not turned on.
Having forgotten, I made a new implementation along the
lines of the fp16 implementation (i.e., using tiling).
That matched tinyBLAS performance. But the existing
implementation that I now turned on is faster:
PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS
TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
29164263f4 iqk_mul_mat: fp16 tweaks
Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers).
This works best for the Ryzen-5975WX.
2024-06-22 12:02:50 +03:00
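For illustration only, the 4x3 tiling mentioned in the commit above can be sketched in scalar Python. This is not the code from the commit: the function names `matmul_tiled` and `matmul_naive` and the list-of-lists layout are assumptions made for the example. The idea it shows is that each 4x3 output tile keeps 12 accumulators live at once; in a SIMD kernel each accumulator would presumably occupy one vector register, leaving 4 of the 16 AVX2 registers for loaded operands.

```python
def matmul_tiled(a, b, tile_m=4, tile_n=3):
    """Compute a @ b for lists of lists, one tile_m x tile_n output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        rows = range(i0, min(i0 + tile_m, m))
        for j0 in range(0, n, tile_n):
            cols = range(j0, min(j0 + tile_n, n))
            # One accumulator per output element of the tile (at most 4 * 3 = 12).
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                for i in rows:
                    a_ip = a[i][p]  # read once, reused for every column of the tile
                    for j in cols:
                        acc[(i, j)] += a_ip * b[p][j]
            # Write the finished tile back to the result matrix.
            for (i, j), value in acc.items():
                c[i][j] = value
    return c


def matmul_naive(a, b):
    """Reference triple loop, used only to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

Both functions sum over the inner dimension in the same order, so they should agree exactly; matrix sizes that are not multiples of the tile shape are handled by the `min(...)` clamps.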
Iwan Kawrakow
36c3f57b0a iqk_mul_mat: fp16 implementation cleanup
It turns out that using AVX512 is slower on my
Ryzen-7950X CPU.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
bc659e7de1 iqk_mul_mat: fp16 implementation for AVX2
This simple implementation beats jart's tinyBLAS by a
small margin (143 t/s vs 137 t/s for PP-512, TG is
4.75 t/s, so exactly the same as ggml).
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
667bd4759c iqk_mul_mat: make it independent of sgemm 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
2ee56b4f0d iqk_mul_mat: minor improvements
Current performance:
| model             |       size |  threads |    test |              t/s |
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ3_S    |   2.75 GiB |       16 |   pp512 |    100.21 ± 0.32 |
| llama 7B IQ3_XXS  |   2.41 GiB |       16 |   pp512 |    105.25 ± 0.75 |
| llama 7B IQ2_M    |   2.20 GiB |       16 |   pp512 |    117.88 ± 0.15 |
| llama 7B IQ2_XS   |   1.89 GiB |       16 |   pp512 |    136.38 ± 0.24 |
| llama 7B IQ2_XXS  |   1.73 GiB |       16 |   pp512 |    128.47 ± 0.39 |
                                                     mean: 117.64
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ2_XXS  |   1.73 GiB |        8 |   tg128 |     23.94 ± 0.04 |
| llama 7B IQ2_XS   |   1.89 GiB |        8 |   tg128 |     23.27 ± 0.03 |
| llama 7B IQ2_M    |   2.20 GiB |        8 |   tg128 |     18.88 ± 0.03 |
| llama 7B IQ3_XXS  |   2.41 GiB |        8 |   tg128 |     19.07 ± 0.04 |
| llama 7B IQ3_S    |   2.75 GiB |        8 |   tg128 |     15.44 ± 0.05 |
                                                     mean:  20.12
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
0ad646b9f0 iqk_mul_mat: no more templates in the IQ dequantizers
Also moved the quant specific code from the EvenSignHelper
into the corresponding dequantizers.

These two changes had a tiny performance benefit (much smaller
than what I was expecting/hoping for).
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
e35a14ff5f iqk_mul_mat: remove template on one of the prepare() functions 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
e67626533c iqk_mul_mat: experimenting with zen4
Nope, we cannot have good performance for iq2_xxs and
iq3_xxs at the same time. If I don't force inline
the sign functions, I get better performance for iq2_xxs
and bad performance for iq3_xxs. If I force inline them,
it is the other way around. Anyway, this is what we have
now on Zen4 for all quants with forced inline EvenSignHelper
methods:

| model            |       size | threads |   test |           t/s |
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |      16 |  pp512 | 100.91 ± 0.26 |
| llama 7B IQ3_XXS |   2.41 GiB |      16 |  pp512 | 106.08 ± 0.78 |
| llama 7B IQ2_M   |   2.20 GiB |      16 |  pp512 | 116.41 ± 0.25 |
| llama 7B IQ2_XS  |   1.89 GiB |      16 |  pp512 | 132.54 ± 1.07 |
| llama 7B IQ2_XXS |   1.73 GiB |      16 |  pp512 | 125.53 ± 0.06 |
                                    arithmetic mean: 116.29
                                    geometric  mean: 115.70
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S   |   2.75 GiB |       8 |  tg128 |  15.69 ± 0.04 |
| llama 7B IQ3_XXS |   2.41 GiB |       8 |  tg128 |  18.02 ± 0.04 |
| llama 7B IQ2_M   |   2.20 GiB |       8 |  tg128 |  18.94 ± 0.03 |
| llama 7B IQ2_XS  |   1.89 GiB |       8 |  tg128 |  23.29 ± 0.02 |
| llama 7B IQ2_XXS |   1.73 GiB |       8 |  tg128 |  22.96 ± 0.09 |
                                    arithmetic mean:  19.78
                                    geometric  mean:  19.56

Without force-inlining, PP(iq3_xxs) drops to 98 t/s while
PP(iq2_xxs) increases to 137 t/s.
2024-06-22 12:02:49 +03:00
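As a quick check, the arithmetic and geometric means quoted in the tables of the commit above can be recomputed from the per-quant numbers. The sketch below uses only the Python standard library; the helper name `summarize` is introduced here for illustration and is not part of the repository.

```python
from statistics import fmean, geometric_mean

# Throughput in t/s copied from the tables above,
# in the order IQ3_S, IQ3_XXS, IQ2_M, IQ2_XS, IQ2_XXS.
pp512 = [100.91, 106.08, 116.41, 132.54, 125.53]
tg128 = [15.69, 18.02, 18.94, 23.29, 22.96]


def summarize(values):
    """Return (arithmetic mean, geometric mean), each rounded to two decimals."""
    return round(fmean(values), 2), round(geometric_mean(values), 2)


print("pp512:", summarize(pp512))
print("tg128:", summarize(tg128))
```

The geometric mean is less sensitive to a single fast quant than the arithmetic mean, which is presumably why both are reported.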
Iwan Kawrakow
47ae12bbec iqk_mul_mat: experimenting with zen4 (iq2_xxs)
Observing again the weirdness of a performance drop
in one quant because of a change in another quant.
After I added FANCY_SIMD implementations for
iq3_s, iq2_s and iq2_xs, I'm observing that
iq2_xxs PP performance dropped to 130 t/s from 139 t/s.
Adding FANCY_SIMD implementation for applying the signs
brings it back to 137 t/s and gives a small boost
for TG as well (23.4 vs 23.0 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
dc96d5484f iqk_mul_mat: experimenting with zen4 (iq2_xs) 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
cb063a2a20 iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m) 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
61b8cc1ff6 iqk_mul_mat: small improvement for iq3_s
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @  4 threads
         14.4 t/s @  8 threads
         16.3 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
2a72d9f978 iqk_mul_mat: better AVX2 implementation for iq2_xxs
From here on switching to GCC 12.

PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @  4 threads
          23.0 t/s @  8 threads
          25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
3a6e3943a8 iqk_mul_mat: better AVX2 implementation for iq2_xxs
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
60f050d610 iqk_mul_mat: AVX2 implementation for iq2_xxs
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
309e32405f iqk_mul_mat: AVX2 implementation for iq2_xs
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
8015edb3cc iqk_mul_mat: AVX2 implementation for iq2_s
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b0071de081 Separate templates for TG and PP for i-quants on AVX2 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
2c8c0d0a68 iqk_mul_mat: AVX2 implementation for iq3_xxs
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
34befcaf67 iqk_mul_mat: AVX2 implementation for iq3_s
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4f53915dcb Cleanup - Arm i-quants should be good now
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
4b27ade2fb iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
221a2c3807 Simplify 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
7dcca6aea7 iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
effa4448d6 iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d2ee9ab95e iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
9ac9e928d5 Add Q8_0 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
3f996d0c70 Cosmetics 2024-06-22 12:02:49 +03:00
Iwan Kawrakow
d7ab97149f iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
We get ~5% speedup for TG-128, 3X for PP-512
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
b51922530f iqk_mul_mat: faster q3_K TG
We get 31 t/s up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
Iwan Kawrakow
19c578b413 iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00