Kawrakow
7cb77d7a67
iq1_bn(Metal): 64 -> 66.2 t/s for TG
2024-06-22 12:02:51 +03:00
Kawrakow
04fed5cd9f
iq1_bn(Metal): 60 -> 64 t/s for TG
2024-06-22 12:02:51 +03:00
Kawrakow
5d14a2243e
iq1_bn: very slightly better Metal dot product
2024-06-22 12:02:51 +03:00
Kawrakow
15e1aec7a5
iq1_bn: Metal now works
...
PP performance is decent (668 t/s vs 724 t/s for q4_0),
but TG is kind of low (60 t/s vs 81 t/s for q4_0).
2024-06-22 12:02:51 +03:00
Kawrakow
4b64224645
iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working
2024-06-22 12:02:51 +03:00
Kawrakow
77d8637925
iqk_mul_mat(iq1_bn): WIP NEON (not working)
2024-06-22 12:02:51 +03:00
Kawrakow
dfdc4dbee6
iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2
...
I now get PP-512 = 270 t/s on the Ryzen-5975WX
2024-06-22 12:02:51 +03:00
Kawrakow
dff96fb5f8
iqk_mul_mat: improve iq1_bn (bitnet) on AVX2
...
We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Kawrakow
b0967ffa79
bitnet: fix scalar dot product
...
I had forgotten to adjust for the change to q8_K64.
On the M2 I'm getting 10.8 t/s with the scalar version!
2024-06-22 12:02:51 +03:00
Kawrakow
88e98260bf
bitnet: scale is per row, not per tensor
2024-06-22 12:02:51 +03:00
Kawrakow
077270395b
iqk_mul_mat: add iq1_bn (bitnet)
...
We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Kawrakow
eecd48eab5
bitnet: CUDA, scalar, AVX2
2024-06-22 12:02:51 +03:00
Kawrakow
81576cdcac
bitnet: python + llama
2024-06-22 12:02:51 +03:00
Kawrakow
f9490aea46
iqk_mul_mat: cleanup
2024-06-22 12:02:50 +03:00
Kawrakow
389e6220e9
iqk_mul_mat: be independent of llamafile_sgemm
...
Verified that it works on AVX2.
Also turned on any combination of f16 and f32
(i.e., added f16 x f16 and f32 x f32).
2024-06-22 12:02:50 +03:00
Kawrakow
915a1b2665
iqk_mul_mat: be independent of llamafile_sgemm (WIP)
...
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat
It is marked WIP because only tested on __aarch64__
2024-06-22 12:02:50 +03:00
Kawrakow
cc628b2e39
Fix nb4
2024-06-22 12:02:50 +03:00
Kawrakow
d41aef5418
iqk_mul_mat: add ability to disable it
2024-06-22 12:02:50 +03:00
Kawrakow
154f56a8de
iqk_mul_mat: be able to handle any f16/f32 combination on AVX2
...
But only turning on f16 x f32 and f32 x f16 for now.
2024-06-22 12:02:50 +03:00
Kawrakow
1211a4b5d0
iqk_mul_mat: turn on AVX512
...
It makes no difference on my Ryzen-7950X, but perhaps
it will be beneficial for CPUs with real AVX512.
2024-06-22 12:02:50 +03:00
Kawrakow
dfcb8bebc5
iqk_mul_mat: slightly better fp16 with 16 vector registers
...
2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX
up from 138 t/s. We use Nx registers to preload the fp16 weights,
so total registers required is Nx * (Ny + 1), i.e., 15 in the case
of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare
register helps. But maybe it is just a matter of how things get
loaded into the cache. On the 7950X I did try 3 x 8 and it did
not perform as well as 5 x 5.
2024-06-22 12:02:50 +03:00
Kawrakow
9dba81ddf2
iqk_mul_mat: better fp16 for AVX2
...
Basically use what I did for Arm.
Improves PP performance to 141.7 t/s up from 136 t/s
on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling).
This is now 10% faster than tinyBLAS.
There is a minor improvement also on the Ryzen-5975WX
(16 vector registers, so we use 4x3 tiling): we get
138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.
2024-06-22 12:02:50 +03:00
Kawrakow
baf6aaa31b
iqk_mul_mat: fp16 for Arm
...
~2% slower than tinyBLAS - not sure why.
2024-06-22 12:02:50 +03:00
Kawrakow
6ec0fcc5c7
iqk_mul_mat: slightly faster FANCY_SIMD dot product
...
About 2% faster for q4_K.
2024-06-22 12:02:50 +03:00
Kawrakow
5812618409
iqk_mul_mat: fix q8_0
...
I was happily using _mm256_packs_epi32() to pack the
q8_0 x q8_0 dot products back to int16_t, and getting useful
results. But theoretically this can overflow, so it is
better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine
the 4 dot products using int32_t additions. This is (almost)
as fast, unlike _mm256_hadd_epi32(), which seems excessively
slow on the Ryzen-7950X.
2024-06-22 12:02:50 +03:00
Kawrakow
7f91151c2e
iqk_mul_mat: decouple from llamafile also in cmake
2024-06-22 12:02:50 +03:00
Kawrakow
8b03121c33
iqk_mul_mat: make it build with the Makefile
2024-06-22 12:02:50 +03:00
Kawrakow
c7870afaad
iqk_mul_mat: use block_q8_1_x4 also for AVX2
...
Here the performance gain is more significant. E.g., for q4_1,
PP-512 becomes 168 t/s up from 137 t/s.
Now the performance gap to q4_0 is so significant that I
wonder if I should change to using Q8_1 also for the
qX_0 legacy quants.
2024-06-22 12:02:50 +03:00
Kawrakow
5b19e5e4a9
iqk_mul_mat: use block_q8_0_x4 also for AVX2
2024-06-22 12:02:50 +03:00
Kawrakow
30a0bf30fa
iqk_mul_mat: delete unused stuff
2024-06-22 12:02:50 +03:00
Kawrakow
64da6f7a97
iqk_mul_mat: add q8_0
...
It was actually ready but not turned on.
Having forgotten, I made a new implementation along the
lines of the fp16 implementation (i.e., using tiling).
That matched tinyBLAS performance. But the existing
implementation that I now turned on is faster:
PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS
TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)
2024-06-22 12:02:50 +03:00
Kawrakow
f2ced256b4
iqk_mul_mat: fp16 tweaks
...
Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers).
This works best for the Ryzen-5975WX.
2024-06-22 12:02:50 +03:00
Kawrakow
b4ecd2dce6
iqk_mul_mat: fp16 implementation cleanup
...
It turns out on my Ryzen-7950X CPU using
AVX512 is slower.
2024-06-22 12:02:50 +03:00
Kawrakow
e0b52e14a6
iqk_mul_mat: fp16 implementation for AVX2
...
This simple implementation beats jart's tinyBLAS by a
small margin (143 t/s vs 137 t/s for PP-512, TG is
4.75 t/s, so exactly the same as ggml).
2024-06-22 12:02:50 +03:00
Kawrakow
2328da1aa7
iqk_mul_mat: multi-thread quantization also for MoE models
2024-06-22 12:02:50 +03:00
Kawrakow
ea239f8572
iqk_mul_mat: make it independent of sgemm
2024-06-22 12:02:50 +03:00
Kawrakow
5039ea8930
iqk_mul_mat: minor improvements
...
Current performance:
| model | size | threads | test | t/s |
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ3_S | 2.75 GiB | 16 | pp512 | 100.21 ± 0.32 |
| llama 7B IQ3_XXS | 2.41 GiB | 16 | pp512 | 105.25 ± 0.75 |
| llama 7B IQ2_M | 2.20 GiB | 16 | pp512 | 117.88 ± 0.15 |
| llama 7B IQ2_XS | 1.89 GiB | 16 | pp512 | 136.38 ± 0.24 |
| llama 7B IQ2_XXS | 1.73 GiB | 16 | pp512 | 128.47 ± 0.39 |
mean: 117.64
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ2_XXS | 1.73 GiB | 8 | tg128 | 23.94 ± 0.04 |
| llama 7B IQ2_XS | 1.89 GiB | 8 | tg128 | 23.27 ± 0.03 |
| llama 7B IQ2_M | 2.20 GiB | 8 | tg128 | 18.88 ± 0.03 |
| llama 7B IQ3_XXS | 2.41 GiB | 8 | tg128 | 19.07 ± 0.04 |
| llama 7B IQ3_S | 2.75 GiB | 8 | tg128 | 15.44 ± 0.05 |
mean: 20.12
2024-06-22 12:02:50 +03:00
Kawrakow
e85753e1ad
iqk_mul_mat: no more templates in the IQ dequantizers
...
Also moved the quant specific code from the EvenSignHelper
into the corresponding dequantizers.
These two changes had a tiny performance benefit (much too small
compared to what I was expecting/hoping for).
2024-06-22 12:02:50 +03:00
Kawrakow
b8556267cd
iqk_mul_mat: remove template on one of the prepare() functions
2024-06-22 12:02:49 +03:00
Kawrakow
44b1b4fb97
iqk_mul_mat: experimenting with zen4
...
Nope, we cannot have good performance for iq2_xxs and
iq3_xxs at the same time. If I don't force-inline
the sign functions, I get better performance for iq2_xxs
and bad performance for iq3_xxs. If I force-inline them,
it is the other way around. Anyway, this is what we have
now on Zen4 for all quants with forced inline EvenSignHelper
methods:
| model | size | threads | test | t/s |
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S | 2.75 GiB | 16 | pp512 | 100.91 ± 0.26 |
| llama 7B IQ3_XXS | 2.41 GiB | 16 | pp512 | 106.08 ± 0.78 |
| llama 7B IQ2_M | 2.20 GiB | 16 | pp512 | 116.41 ± 0.25 |
| llama 7B IQ2_XS | 1.89 GiB | 16 | pp512 | 132.54 ± 1.07 |
| llama 7B IQ2_XXS | 1.73 GiB | 16 | pp512 | 125.53 ± 0.06 |
arithmetic mean: 116.29
geometric mean: 115.70
| -----------------| ---------: | ------: | -----: | ------------: |
| llama 7B IQ3_S | 2.75 GiB | 8 | tg128 | 15.69 ± 0.04 |
| llama 7B IQ3_XXS | 2.41 GiB | 8 | tg128 | 18.02 ± 0.04 |
| llama 7B IQ2_M | 2.20 GiB | 8 | tg128 | 18.94 ± 0.03 |
| llama 7B IQ2_XS | 1.89 GiB | 8 | tg128 | 23.29 ± 0.02 |
| llama 7B IQ2_XXS | 1.73 GiB | 8 | tg128 | 22.96 ± 0.09 |
arithmetic mean: 19.78
geometric mean: 19.56
Without force-inlining, PP(iq3_xxs) drops to 98 t/s while
PP(iq2_xxs) increases to 137 t/s.
2024-06-22 12:02:49 +03:00
Kawrakow
eb9e2b628a
iqk_mul_mat: experimenting with zen4 (iq2_xxs)
...
Observing again the weirdness of a performance drop
in one quant because of a change to another quant.
After I added FANCY_SIMD implementations for
iq3_s, iq2_s and iq2_xs, I'm observing that
iq2_xxs PP performance dropped to 130 t/s from 139 t/s.
Adding FANCY_SIMD implementation for applying the signs
brings it back to 137 t/s and gives a small boost
for TG as well (23.4 vs 23.0 t/s).
2024-06-22 12:02:49 +03:00
Kawrakow
2c8d3dad1f
iqk_mul_mat: experimenting with zen4 (iq2_xs)
2024-06-22 12:02:49 +03:00
Kawrakow
0d9027fe74
iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m)
2024-06-22 12:02:49 +03:00
Kawrakow
ed8f1fe490
iqk_mul_mat: small improvement for iq3_s
...
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @ 4 threads
14.4 t/s @ 8 threads
16.3 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
01d55dcbf0
iqk_mul_mat: better AVX2 implementation for iq2_xxs
...
From here on switching to GCC 12.
PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @ 4 threads
23.0 t/s @ 8 threads
25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
d4e9e595f9
iqk_mul_mat: better AVX2 implementation for iq2_xxs
...
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
2024-06-22 12:02:49 +03:00
Kawrakow
41391ff4b0
iqk_mul_mat: AVX2 implementation for iq2_xxs
...
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Kawrakow
be132341f5
iqk_mul_mat: AVX2 implementation for iq2_xs
...
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
3c448906bf
iqk_mul_mat: AVX2 implementation for iq2_s
...
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
f31200bde1
Separate templates for TG and PP for i-quants on AVX2
2024-06-22 12:02:49 +03:00