Commit Graph

76 Commits

Author SHA1 Message Date
Iwan Kawrakow
e4dc3babb5 iq1bn(no lookup): somewhat better
We now have for Bitnet-3B:
| threads |          test |              t/s |
| ------: | ------------: | ---------------: |
|      16 |         pp512 |    308.97 ± 1.89 |
|      16 |         tg128 |     58.80 ± 0.07 |
|       8 |         tg128 |     49.79 ± 1.23 |
|       4 |         tg128 |     28.85 ± 0.02 |
|       2 |         tg128 |     15.39 ± 0.01 |
2024-07-15 13:46:07 +03:00
Iwan Kawrakow
a4bbd36905 iq1bn: attempt without a lookup table 2024-07-15 11:02:41 +03:00
Iwan Kawrakow
753dbaeeb0 bitnet: remove iq1_bn lookup table storing +/- signs
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign-based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.

With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.
2024-06-25 18:19:11 +03:00
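
For context, a scalar sketch of why an unsigned {0,1,2} table can stand in for the signed {-1,0,+1} one: shift the ternary weights by +1 and correct the dot product with the sum of the activations. This is a generic illustration of the trick with made-up names, not the actual iq1_bn kernel or the iq1bn_grid_u16 layout.

```cpp
#include <cstdint>

// Ternary dot product with weights stored as unsigned trits t in {0,1,2},
// i.e. w = t - 1 in {-1,0,+1}:  sum_i w_i*x_i = sum_i t_i*x_i - sum_i x_i.
// Generic scalar illustration; the real table and kernels are packed and vectorized.
static int ternary_dot(const uint8_t * t, const int8_t * x, int n) {
    int sum_tx = 0, sum_x = 0;
    for (int i = 0; i < n; ++i) {
        sum_tx += t[i] * x[i];  // only values 0,1,2 -- no sign handling needed
        sum_x  += x[i];
    }
    return sum_tx - sum_x;      // equals the signed {-1,0,+1} dot product
}
```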
Iwan Kawrakow
8b436a84c5 bitnet: simdify q8_K64 quantization on AVX
Doesn't make a real difference in performance.
2024-06-25 17:20:34 +03:00
Iwan Kawrakow
c906c4c4fe bitnet: NEON improvements for iq1_bn
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25 13:48:29 +02:00
Iwan Kawrakow
7de9559cf2 Bitnet: adapt NEON and Metal to the alternative grid 2024-06-25 11:16:13 +02:00
Iwan Kawrakow
aa14a06b44 Bitnet: trying an alternative iq1_bn grid
Faster on CUDA, and the scalar version is faster too.
The issue with CUDA is that I now see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time and 190 t/s another time, with
uncertainties of 1-2 t/s. Same for PP: results jump
back and forth between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but definitely faster than the previous version, which was
at around 170-180 t/s.
2024-06-25 11:32:48 +03:00
Iwan Kawrakow
3d61866f0a Bitnet: slightly faster 1.625 bpw variant for AVX512VL 2024-06-25 08:33:00 +03:00
Iwan Kawrakow
fc04994ebf iqk_mul_mat: add IQ4_NL also on NEON
PPL seems somewhat higher? For llama-v2-7B we are still
~0.04 higher compared to what we expect after ~30 batches.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
caa42ccc56 iqk_mul_mat: add IQ4_NL
I never use it, so I had completely forgotten about it.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
86dc8e5f8b bitnet(scale in a separate tensor): CPU tweaks
A somewhat nicer iq2_bn implementation on AVX2.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
729ba46f77 bitnet(scale in a separate tensor): CPU tweaks
I had ruined TG performance on AVX2 with the last commit.
I was just testing at 8 threads, where we are totally memory
bound. But at 4 threads we had regressed to 41 t/s on the Ryzen-7950X.
Back to 51 t/s with this commit.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f0325c5826 bitnet(scale in a separate tensor): more CPU improvements
It seems it is enough to have 4 scales per row for Q8.
I get PPL = 8.5470 with this, which is slightly higher than
the 8.5430 we get with 1 scale per 128 activations, but still
OK, I think.
With this, we get the following performance:

| System       | quant | PP-512 (t/s)  | TG-128 (t/s) | quant | PP-512 (t/s)  | TG-128 (t/s) |
| ------------ | ----- | ------------: | -----------: | ----- | ------------: | -----------: |
| M2 Max       | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03 |
| Ryzen-7950X  | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02 |
| Ryzen-5975WX | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10 |
2024-06-22 12:02:52 +03:00
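
As an aside, a minimal sketch of what "a few scales per row" activation quantization means, assuming a plain absmax int8 scheme. The function name and layout are illustrative only and do not correspond to the repo's actual Q8 block format.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize one activation row of n floats to int8 with 4 scales, one per
// quarter of the row (illustrative absmax scheme, not the repo's layout).
static void quantize_row_q8_4scales(const float * x, int n, int8_t * q, float * scales) {
    const int chunk = n/4;
    for (int c = 0; c < 4; ++c) {
        const float * xc = x + c*chunk;
        float amax = 0.f;
        for (int i = 0; i < chunk; ++i) amax = std::max(amax, std::fabs(xc[i]));
        const float d  = amax/127.f;              // scale for this chunk
        const float id = d > 0 ? 1.f/d : 0.f;
        scales[c] = d;
        for (int i = 0; i < chunk; ++i) q[c*chunk + i] = (int8_t)std::lround(xc[i]*id);
    }
}
```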
Iwan Kawrakow
e05cca9ef6 bitnet(scale in a separate tensor): CPU improvements
Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat
to deal with that. This improves PP speed by a few percent.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
58d9e8f1d2 bitnet: put the scale in a separate tensor
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
2024-06-22 12:02:52 +03:00
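
A hypothetical sketch of the shape of this change in a ggml graph. The commit mentions an extra ggml_mul_mat op; the fragment below uses an element-wise ggml_mul to apply the per-row scales, so it illustrates the idea (weights and scales as separate tensors, one extra op after the mat-mul) rather than the exact op the commit adds.

```cpp
#include "ggml.h"

// Hypothetical helper (not from the repo): a bitnet linear layer whose row
// scales live in a separate tensor, applied with one extra op after the mat-mul.
static struct ggml_tensor * bitnet_linear(struct ggml_context * ctx,
                                          struct ggml_tensor  * w_quant,  // ternary weights, scale not baked in
                                          struct ggml_tensor  * w_scale,  // one scale per output row
                                          struct ggml_tensor  * x) {      // activations
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w_quant, x);
    return ggml_mul(ctx, cur, w_scale);  // broadcast the scales over the token dimension
}
```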
Iwan Kawrakow
927e251a12 Bitnet(1.75 bpw): higher precision fp8 scale
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as fp16 (though the previous
version with 4 bits each for the exponent and the mantissa was
good enough for any practical purpose).
2024-06-22 12:02:52 +03:00
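
For illustration, a possible e3m5 codec for a positive scale (3 exponent bits, 5 mantissa bits, no sign bit). The bias and bit layout are guesses for the sketch and need not match the commit's actual format.

```cpp
#include <cmath>
#include <cstdint>

// Encode a positive float scale as fp8 with 3 exponent and 5 mantissa bits.
// Illustrative only: the bias (4) and layout (eee mmmmm) are assumptions.
static uint8_t fp8_e3m5_encode(float x) {
    if (!(x > 0.f)) return 0;                        // smallest code; this toy format has no exact zero
    int e; float m = std::frexp(x, &e);              // x = m * 2^e, m in [0.5, 1)
    int mant = (int)std::lround((2.f*m - 1.f)*32.f); // 5 mantissa bits of 1.mmmmm
    int exp  = e - 1 + 4;                            // bias 4
    if (mant == 32) { mant = 0; ++exp; }             // rounding carried into the exponent
    if (exp < 0) { exp = 0; mant = 0; }              // underflow: clamp (no subnormals here)
    if (exp > 7) { exp = 7; mant = 31; }             // overflow: clamp to the largest value
    return (uint8_t)((exp << 5) | mant);
}

static float fp8_e3m5_decode(uint8_t q) {
    int exp  = (q >> 5) & 7;
    int mant =  q       & 31;
    return (1.f + mant/32.f) * std::ldexp(1.f, exp - 4);
}
```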
Iwan Kawrakow
2998ca9b14 Bitnet(2.25 bpw): NEON
We get PP-512 = 192 t/s, TG-128 = 72 t/s
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
8c6276f6a1 Bitnet: 2.25 bpw version
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
1de6476d75 bitnet 2 bpw: NEON implementation
We get PP-512 = 190 t/s and TG-128 = 75 t/s.
2 bpw TG on the CPU beats 1.75 bpw on the GPU!
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f97a329638 Removed extra column 2024-06-22 12:02:52 +03:00
Iwan Kawrakow
6616985135 bitnet 2 bpw: AVX2 implementation
We get PP-512 = 322 t/s.
TG is already 51.6 t/s at 4 threads, then it saturates and
starts going down for more than 8 threads.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
d82e5db6e5 iqk_mul_mat(bitnet): fix typo
With the last change (which introduced this typo) now fixed, I'm
getting PP-512 = 300 t/s on the Ryzen-5975WX.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
ddea72453b iqk_mul_mat(bitnet): slightly faster AVX2
We now get 214 t/s on the Ryzen-7950X
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
30a771bd6b iq1_bn: better NEON implementation
PP is decent with 131 t/s (q4_0 has 150 t/s).
TG is better than the last commit but still bad at 33.1 t/s
(in comparison q4_0 gets 52.3 t/s).

I had to go to the (0, 1, 2) table. Apple Silicon clearly
does not like operations with signs.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
8222c9f3d1 iq1_bn(NEON): works now, but very slow
Basically 2X slower than q4_0.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
0c5a353ebd iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
bf22b701f4 iqk_mul_mat(iq1_bn): WIP NEON (not working) 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
29d9bf65f3 iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2
I now get PP-512 = 270 t/s on the Ryzen-5975WX
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
91ec824f2d iqk_mul_mat: improve iq1_bn (bitnet) on AVX2
We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
4fcfcd05d1 bitnet: scale is per row, not per tensor 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
7f8901dca1 iqk_mul_mat: add iq1_bn (bitnet)
We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
58756ef03f iqk_mul_mat: cleanup 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
7501184eb4 iqk_mul_mat: be independent of llamafile_sgemm
Verified that it works on AVX2.
Also turned on any combination of f16 and f32
(i.e., added f16 x f16 and f32 x f32).
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
ad53eabf87 iqk_mul_mat: be independent of llamafile_sgemm (WIP)
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat

It is marked WIP because only tested on __aarch64__
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
81cf6990f5 iqk_mul_mat: be able to handle any f16/f32 combination on AVX2
But only turning on f16 x f32 and f32 x f16 for now.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
b2acd81c75 iqk_mul_mat: turn on AVX512
It makes no difference on my Ryzen-7950X, but perhaps
it will be beneficial for CPUs with real AVX512.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
9e3dc8c432 iqk_mul_mat: slightly better fp16 with 16 vector registers
2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX
up from 138 t/s. We use Nx registers to preload the fp16 weights,
so the total number of registers required is Nx * (Ny + 1): 15 in the case
of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare
register helps. But maybe it is just a matter of how things get
loaded into the cache. On the 7950X I did try 3 x 8 and it did
not perform as well as 5 x 5.
2024-06-22 12:02:50 +03:00
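
A sketch of the register tiling being discussed, assuming an fp16-weight x fp32-activation micro-kernel with an Nx x Ny tile of AVX2 accumulators plus Nx preloaded weight vectors (hence the Nx * (Ny + 1) register count). Function and parameter names are invented for the illustration; the real iqk_mul_mat kernels differ in detail.

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustrative Nx x Ny register-tiled micro-kernel: C[i][j] = dot(A_row_i, B_col_j)
// over K fp16 weights / fp32 activations (K assumed to be a multiple of 8).
template <int Nx, int Ny>
static void mul_tile(const uint16_t * A, long strideA,   // fp16 weights, Nx rows
                     const float    * B, long strideB,   // fp32 activations, Ny columns
                     float * C, long strideC, int K) {
    __m256 acc[Nx][Ny];
    for (int i = 0; i < Nx; ++i) for (int j = 0; j < Ny; ++j) acc[i][j] = _mm256_setzero_ps();
    for (int k = 0; k < K; k += 8) {
        __m256 a[Nx];  // the Nx preloaded weight vectors
        for (int i = 0; i < Nx; ++i) {
            a[i] = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(A + i*strideA + k)));
        }
        for (int j = 0; j < Ny; ++j) {
            __m256 b = _mm256_loadu_ps(B + j*strideB + k);
            for (int i = 0; i < Nx; ++i) acc[i][j] = _mm256_fmadd_ps(a[i], b, acc[i][j]);
        }
    }
    for (int i = 0; i < Nx; ++i) for (int j = 0; j < Ny; ++j) {
        // horizontal sum of the 8 accumulator lanes
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc[i][j]), _mm256_extractf128_ps(acc[i][j], 1));
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_movehdup_ps(s));
        C[i*strideC + j] = _mm_cvtss_f32(s);
    }
}
```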
Iwan Kawrakow
ae1e77c5de iqk_mul_mat: better fp16 for AVX2
Basically use what I did for Arm.
Improves PP performance to 141.7 t/s up from 136 t/s
on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling).
This is now 10% faster than tinyBLAS.
There is a minor improvement also on the Ryzen-5975WX
(16 vector registers, so we use 4x3 tiling): we get
138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
9386b49918 iqk_mul_mat: fp16 for Arm
~2% slower than tinyBLAS - not sure why.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
09d86e5876 iqk_mul_mat: slightly faster FANCY_SIMD dot product
About 2% faster for q4_K.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
8a80a31ddd iqk_mul_mat: fix q8_0
I was happily using _mm256_packs_epi32() to pack the
q8_0 x q8_0 dot products back to int16_t, and getting useful
results. But theoretically this can overflow, so it is
better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine
the 4 dot products using int32_t additions. This is (almost)
as fast, unlike _mm256_hadd_epi32(), which seems excessively
slow on the Ryzen-7950X.
2024-06-22 12:02:50 +03:00
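
A sketch of the safer combination described above: reducing four __m256i accumulators of int32 dot products to four totals with unpack + 32-bit adds, which cannot saturate the way _mm256_packs_epi32 can. This illustrates the pattern, not the exact code in iqk_mul_mat.

```cpp
#include <immintrin.h>

// Reduce four __m256i int32 accumulators (one per dot product) to a __m128i
// holding the four horizontal sums, using only 32-bit additions.
static inline __m128i hsum_4x8_i32(__m256i p0, __m256i p1, __m256i p2, __m256i p3) {
    // interleave and add within each 128-bit lane
    __m256i s01 = _mm256_add_epi32(_mm256_unpacklo_epi32(p0, p1), _mm256_unpackhi_epi32(p0, p1));
    __m256i s23 = _mm256_add_epi32(_mm256_unpacklo_epi32(p2, p3), _mm256_unpackhi_epi32(p2, p3));
    // after this step each 128-bit lane holds [sum(p0), sum(p1), sum(p2), sum(p3)] over that lane's half
    __m256i s   = _mm256_add_epi32(_mm256_unpacklo_epi64(s01, s23), _mm256_unpackhi_epi64(s01, s23));
    // add the two halves to get the sums over all 8 elements
    return _mm_add_epi32(_mm256_castsi256_si128(s), _mm256_extracti128_si256(s, 1));
}
```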
Iwan Kawrakow
cd3d8ae0e7 iqk_mul_mat: use block_q8_1_x4 also for AVX2
Here the performance gain is more significant. E.g., for q4_1,
PP-512 becomes 168 t/s up from 137 t/s.
Now the performance gap to q4_0 is so significant that I
wonder if I should change to using Q8_1 also for the
qX_0 legacy quants.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
299c7f6e89 iqk_mul_mat: use block_q8_0_x4 also for AVX2 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
f0a52f2fbb iqk_mul_mat: delete unused stuff 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
74b711c8fd iqk_mul_mat: add q8_0
It was actually ready but not turned on.
Having forgotten that, I made a new implementation along the
lines of the fp16 implementation (i.e., using tiling).
That matched tinyBLAS performance. But the existing
implementation that I now turned on is faster:
PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS
TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
29164263f4 iqk_mul_mat: fp16 tweaks
Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers).
This works best for the Ryzen-5975WX.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
36c3f57b0a iqk_mul_mat: fp16 implementation cleanup
It turns out that on my Ryzen-7950X CPU using
AVX512 is slower.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
bc659e7de1 iqk_mul_mat: fp16 implementation for AVX2
This simple implementation beats jart's tinyBLAS by a
small margin (143 t/s vs 137 t/s for PP-512, TG is
4.75 t/s, so exactly the same as ggml).
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
667bd4759c iqk_mul_mat: make it independent of sgemm 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
2ee56b4f0d iqk_mul_mat: minor improvements
Current performance:
| model             |       size |  threads |    test |              t/s |
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ3_S    |   2.75 GiB |       16 |   pp512 |    100.21 ± 0.32 |
| llama 7B IQ3_XXS  |   2.41 GiB |       16 |   pp512 |    105.25 ± 0.75 |
| llama 7B IQ2_M    |   2.20 GiB |       16 |   pp512 |    117.88 ± 0.15 |
| llama 7B IQ2_XS   |   1.89 GiB |       16 |   pp512 |    136.38 ± 0.24 |
| llama 7B IQ2_XXS  |   1.73 GiB |       16 |   pp512 |    128.47 ± 0.39 |
                                                     mean: 117.64
| ----------------- | ---------: | -------: | ------: | ---------------: |
| llama 7B IQ2_XXS  |   1.73 GiB |        8 |   tg128 |     23.94 ± 0.04 |
| llama 7B IQ2_XS   |   1.89 GiB |        8 |   tg128 |     23.27 ± 0.03 |
| llama 7B IQ2_M    |   2.20 GiB |        8 |   tg128 |     18.88 ± 0.03 |
| llama 7B IQ3_XXS  |   2.41 GiB |        8 |   tg128 |     19.07 ± 0.04 |
| llama 7B IQ3_S    |   2.75 GiB |        8 |   tg128 |     15.44 ± 0.05 |
                                                     mean:  20.12
2024-06-22 12:02:50 +03:00