Commit Graph

3292 Commits

Author SHA1 Message Date
Iwan Kawrakow
caa42ccc56 iqk_mul_mat: add IQ4_NL
I never use it, so I had completely forgotten about it.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
86dc8e5f8b bitnet(scale in a separate tensor): CPU tweaks
A somewhat nicer iq2_bn implementation on AVX2.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
729ba46f77 bitnet(scale in a separate tensor): CPU tweaks
I had ruined TG performance on AVX2 with the last commit.
I was only testing at 8 threads, where we are totally memory
bound; at 4 threads we had regressed to 41 t/s on the Ryzen-7950X.
Back to 51 t/s with this commit.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f0325c5826 bitnet(scale in a separate tensor): more CPU improvements
It seems it is enough to have 4 scales per row for Q8.
I get PPL = 8.5470 with this, which is slightly higher than
the 8.5430 we get with 1 scale per 128 activations, but still
OK, I think.
With this, we get the following performance:

System    | quant |    PP-512     |    TG-128    | quant |    PP-512     |    TG-128
M2 Max    | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03
Ryzen7950 | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02
Ryzen5975 | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10
2024-06-22 12:02:52 +03:00
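The 4-scales-per-row Q8 scheme described above can be illustrated with a minimal sketch (plain Python; function names and the per-quarter grouping are illustrative assumptions, not the actual iqk-quantize.cpp code):

```python
def quantize_q8_4scales(row):
    """Toy illustration: quantize a float row to int8 with 4 scales per row,
    one scale per quarter of the row."""
    n = len(row)
    assert n % 4 == 0
    group = n // 4
    scales, quants = [], []
    for g in range(4):
        block = row[g * group:(g + 1) * group]
        amax = max(abs(x) for x in block) or 1.0
        d = amax / 127.0                     # one fp scale per quarter-row
        scales.append(d)
        quants.extend(max(-127, min(127, round(x / d))) for x in block)
    return scales, quants

def dequantize_q8_4scales(scales, quants):
    group = len(quants) // 4
    return [quants[i] * scales[i // group] for i in range(len(quants))]
```

With only 4 scales per row the metadata overhead is tiny, which is the trade-off against the slightly better PPL of one scale per 128 activations.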
Iwan Kawrakow
e05cca9ef6 bitnet(scale in a separate tensor): CPU improvements
Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat
to deal with that. This improves PP speed by a few percent.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
36374ab37d bitnet(scale in a separate tensor): mul -> scale on the CPU 2024-06-22 12:02:52 +03:00
Iwan Kawrakow
e73ae1f6d3 bitnet(scale in a separate tensor): mul -> scale on CUDA
On CUDA we do not have access to the tensor data until we
hit the kernel, hence this hack.
In any case, iq2_bn goes back up to 228 t/s, which is close
to the 234 t/s we have without the extra scale operation.
PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s
we get without making the mul -> scale replacement.
2024-06-22 12:02:52 +03:00
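The mul -> scale replacement discussed in these commits amounts to rewriting a graph node that multiplies by a one-element tensor into a scale op. A toy illustration (hypothetical Node class; the real code operates on ggml graph nodes and, per the commit, must defer reading the value on CUDA):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                   # "leaf", "mul", "scale", ...
    src: list = field(default_factory=list)   # source nodes
    data: list = field(default_factory=list)  # leaf payload
    scale: float = 1.0                        # used by "scale" nodes

def replace_mul_with_scale(node: Node) -> Node:
    """If a node multiplies by a one-element leaf tensor, turn it into a
    scale op so the backend runs a cheap scalar multiply instead of a full
    element-wise mul (toy version of the rewrite described above)."""
    if node.op == "mul" and len(node.src) == 2:
        a, b = node.src
        if b.op == "leaf" and len(b.data) == 1:
            return Node(op="scale", src=[a], scale=b.data[0])
    return node
```

The catch the CUDA commit describes is that this rewrite needs the scalar's value at graph-build time, which is exactly when the tensor data is not yet available on the device.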
Iwan Kawrakow
7f968d51b4 bitnet(scale in a separate tensor): mul -> scale on Metal
Do the mul -> scale replacement on the fly in the Metal backend.
This recovers the PP performance and cuts the TG performance
degradation in half.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
d08ff0df43 Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale"
This reverts commit f83381371b61e0863b55c60e5f5df139126a496d.
When using CUDA, the tensor contents have not been loaded yet,
so we crash when trying to access the scale when building the
graph. There must be a better way.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
ad60fb3567 bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale
This recovers part of the performance loss. On Metal TG-128 is now
92 t/s, still short of the ~100 t/s with scales applied on the fly.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
257fa74014 bitnet(scale in a separate tensor): Metal
iq2_bn TG-128 drops to 84 t/s, while I see in the logs
that we had 97 t/s. If true, that's a pretty massive
performance penalty for TG. Let me guess: ggml_mul is not
exactly the most performant operation on Metal.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
a2e43b83c9 bitnet(scale in a separate tensor): CUDA 2024-06-22 12:02:52 +03:00
Iwan Kawrakow
58d9e8f1d2 bitnet: put the scale in a separate tensor
and correspondingly add an extra ggml_mul operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
2024-06-22 12:02:52 +03:00
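Moving the scale into a separate tensor means the matmul runs on pure ternary data and the result is multiplied by the scale afterwards, as a separate op. A minimal sketch of the idea (plain Python; names are illustrative, not the ggml API):

```python
def ternary_matvec(weights, x):
    """weights: rows of ternary values in {-1, 0, +1}, scale factored out."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def bitnet_forward(weights, scale, x):
    # y = (W @ x) * scale, with the scale applied as a separate
    # multiply, mirroring the extra op this commit adds.
    y = ternary_matvec(weights, x)
    return [v * scale for v in y]
```

Factoring the scale out keeps the weight tensor purely ternary, at the cost of the extra multiply that the following commits then try to make cheap.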
Iwan Kawrakow
927e251a12 Bitnet(1.75 bpw): higher precision fp8 scale
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as fp16 (though the previous
version, with 4 bits each for the exponent and mantissa, was
good enough for any practical purposes).
2024-06-22 12:02:52 +03:00
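A minimal sketch of an unsigned e3m5 encoding like the one described (the exponent bias and normalized-only layout are assumptions; the commit only fixes the 3-bit-exponent / 5-bit-mantissa split):

```python
import math

BIAS = 4  # assumed exponent bias; representable range ~ [2**-4, 15.75]

def fp8_e3m5_encode(x: float) -> int:
    """Pack a positive scale into 3 exponent + 5 mantissa bits (no sign)."""
    assert x > 0
    e = math.floor(math.log2(x))
    e = max(-BIAS, min(7 - BIAS, e))          # clamp exponent to 3 bits
    m = round((x / 2.0**e - 1.0) * 32.0)      # 5-bit mantissa, implicit 1
    if m == 32:                               # mantissa overflow from rounding
        if e < 7 - BIAS:
            e += 1
            m = 0
        else:
            m = 31                            # saturate at top of range
    m = max(0, min(31, m))
    return ((e + BIAS) << 5) | m

def fp8_e3m5_decode(b: int) -> float:
    e = (b >> 5) - BIAS
    m = b & 31
    return (1.0 + m / 32.0) * 2.0**e
```

Round-trip error is at most half a mantissa ULP (~1.6% relative) inside the exponent range, versus ~3.2% with a 4-bit mantissa, which is why the extra mantissa bit closes the gap to fp16 PPL.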
Iwan Kawrakow
181fd9c56e Bitnet(1.75 bpw): slightly faster CUDA dot product
We get 205 t/s, so ~13% slower than 2 bit.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
fece7e1db7 Bitnet(2.25 bpw): faster Metal dot product
With this we get TG-128 = 97 t/s.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
4f51348d3d Bitnet(2.25 bpw): Metal
We get PP-512 = 702 t/s, TG-128 = 84 t/s.
This is almost on par with q4_0, which is rare on Metal
(not to say nonexistent).
For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet.
TG is kind of funny because we hit 72 t/s on the CPU.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
01ea9a862d Bitnet(2.25 bpw): CUDA
We get PP-512 = 9600 t/s, TG-128 = 234 t/s
(but we need to use 8 CPU threads, else results are lower,
so clearly there is something being computed on the CPU).
PP-512 is very close to PP-512(fp16) = 9800 t/s
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
2998ca9b14 Bitnet(2.25 bpw): NEON
We get PP-512 = 192 t/s, TG-128 = 72 t/s
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
8c6276f6a1 Bitnet: 2.25 bpw version
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
1de6476d75 bitnet 2 bpw: NEON implementation
We get PP-512 = 190 t/s and TG-128 = 75 t/s.
2 bpw TG on the CPU beats 1.75 bpw on the GPU!
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f97a329638 Removed extra column 2024-06-22 12:02:52 +03:00
Iwan Kawrakow
6616985135 bitnet 2 bpw: AVX2 implementation
We get PP-512 = 322 t/s.
TG is already 51.6 t/s at 4 threads, then it saturates and
starts going down for more than 8 threads.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f6863cfa1b bitnet: add 2 bpw quantization
The scalar dot product already achieves 37 t/s for TG!
2024-06-22 12:02:51 +03:00
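The 2 bpw scheme stores each ternary weight {-1, 0, +1} in 2 bits, four weights per byte. A toy pack/unpack sketch (illustrative layout only; the real iq2_bn block format differs in detail and also carries scales):

```python
def pack_ternary_2bpw(weights):
    """Pack ternary weights {-1, 0, +1} as 2-bit codes {0, 1, 2},
    four per byte, low bits first."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)        # -1, 0, +1  ->  0, 1, 2
        out.append(byte)
    return bytes(out)

def unpack_ternary_2bpw(packed, n):
    """Recover the first n ternary weights from the packed bytes."""
    return [((b >> (2 * j)) & 3) - 1 for b in packed for j in range(4)][:n]
```

The unsigned {0, 1, 2} codes also matter for speed: a later commit in this series notes that the NEON implementation had to switch to the (0, 1, 2) table because Apple Silicon does not like operations with signs.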
Iwan Kawrakow
765622ff8f Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
d82e5db6e5 iqk_mul_mat(bitnet): fix typo
The last change had introduced the typo; with it fixed, I'm now
getting PP-512 = 300 t/s on the Ryzen-5975WX.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
ddea72453b iqk_mul_mat(bitnet): slightly faster AVX2
We now get 214 t/s on the Ryzen-7950X
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
30a771bd6b iq1_bn: better NEON implementation
PP is decent with 131 t/s (q4_0 has 150 t/s).
TG is better than last commit but still bad at 33.1 t/s
(in comparison q4_0 gets 52.3 t/s).

I had to go to the (0, 1, 2) table. Apple Silicon clearly
does not like operations with signs.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
8222c9f3d1 iq1_bn(NEON): works now, but very slow
Basically 2X slower than q4_0.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
2f403d4c93 iq1_bn(Metal): 66.2 -> 67.1 t/s 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
d42e9e2922 iq1_bn(Metal): 64 -> 66.2 t/s for TG
This should be good enough. One cannot ask
Apple Silicon to do too much work.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
9d58489c33 iq1_bn(Metal): 64 -> 66.2 t/s for TG 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
f1d9c42f77 iq1_bn(Metal): 60 -> 64 t/s for TG 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
a35330eb5c iq1_bn: very slightly better Metal dot product 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
d9fb92b710 iq1_bn: Metal now works
PP performance is decent (668 t/s v 724 t/s for q4_0),
but TG is kind of low (60 t/s vs 81 t/s for q4_0).
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
0c5a353ebd iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
bf22b701f4 iqk_mul_mat(iq1_bn): WIP NEON (not working) 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
29d9bf65f3 iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2
I now get PP-512 = 270 t/s on the Ryzen-5975WX
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
91ec824f2d iqk_mul_mat: improve iq1_bn (bitnet) on AVX2
We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
d1c40ff7e2 bitnet: fix scalar dot product
I had forgotten to adjust for the change to q8_K64.
On the M2 I'm getting 10.8 t/s with the scalar version!
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
4fcfcd05d1 bitnet: scale is per row, not per tensor 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
7f8901dca1 iqk_mul_mat: add iq1_bn (bitnet)
We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
0f53bc30bb bitnet: CUDA, scalar, AVX2 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
f20b28558b bitnet: python + llama 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
58756ef03f iqk_mul_mat: cleanup 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
7501184eb4 iqk_mul_mat: be independent of llamafile_sgemm
Verified that it works on AVX2.
Also turned on any combination of f16 and f32
(i.e., added f16 x f16 and f32 x f32).
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
ad53eabf87 iqk_mul_mat: be independent of llamafile_sgemm (WIP)
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat

It is marked WIP because only tested on __aarch64__
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
3593891f39 Fix nb4 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
9593e163db iqk_mul_mat: add ability to disable it 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
81cf6990f5 iqk_mul_mat: be able to handle any f16/f32 combination on AVX2
But only turning on f16 x f32 and f32 x f16 for now.
2024-06-22 12:02:50 +03:00