Commit Graph

3318 Commits

Author SHA1 Message Date
Iwan Kawrakow
ba00f23ea1 iq1bn: adjust scalar dot product and some cleanup 2024-07-17 08:44:46 +02:00
Iwan Kawrakow
873a790ee2 iq1bn(no lookup): better version
We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16'th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we are dequantizing 8 quants there, and we have
different encoding for the 1st and 2nd group of 8 in a group of 16).
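
A minimal Python sketch of the layout described above, not the actual kernel code. It assumes ternary quants in {-1, 0, 1}, plain base-3 packing of 5 digits per byte (3^5 = 243 fits in 8 bits), and a hypothetical byte order of 12 group bytes followed by 1 byte for the four leftover 16th quants. The helper names are illustrative only. The block comes out at 13 bytes per 64 quants, i.e. 13 * 8 / 64 = 1.625 bpw.

```python
# Sketch only: digit order, byte order and function names are assumptions,
# not taken from the iq1bn implementation.

def pack5(trits):
    """Pack 5 base-3 digits (each 0, 1 or 2) into one byte (0..242)."""
    assert len(trits) == 5 and all(t in (0, 1, 2) for t in trits)
    value = 0
    for t in reversed(trits):
        value = value * 3 + t
    return value

def unpack5(byte):
    """Inverse of pack5: recover the 5 base-3 digits, lowest digit first."""
    out = []
    for _ in range(5):
        out.append(byte % 3)
        byte //= 3
    return out

def pack_block64(quants):
    """Pack 64 quants in {-1, 0, 1} into 13 bytes."""
    assert len(quants) == 64
    trits = [q + 1 for q in quants]          # map -1,0,1 -> 0,1,2
    out = bytearray()
    leftovers = []
    for g in range(4):                       # 4 groups of 16
        group = trits[16 * g:16 * g + 16]
        for k in range(3):                   # 3 groups of 5, one byte each
            out.append(pack5(group[5 * k:5 * k + 5]))
        leftovers.append(group[15])          # the 16th quant of this group
    out.append(pack5(leftovers + [0]))       # 4 leftovers + one padding digit
    return bytes(out)

def unpack_block64(data):
    """Inverse of pack_block64."""
    assert len(data) == 13
    leftovers = unpack5(data[12])
    trits = []
    for g in range(4):
        for k in range(3):
            trits.extend(unpack5(data[3 * g + k]))
        trits.append(leftovers[g])
    return [t - 1 for t in trits]

BITS_PER_WEIGHT = 13 * 8 / 64                # 1.625
```

Decoding with `%` and `//` is the simplest "no lookup" variant; a real kernel would likely replace the divisions with multiplications and shifts.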

This achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:

| model            |       size |     params | backend    | threads |          test |              t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         pp512 |  9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         tg128 |    229.85 ± 0.33 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         pp512 |    322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         tg128 |     59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       8 |         tg128 |     57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       4 |         tg128 |     33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       2 |         tg128 |     18.30 ± 0.01 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         pp512 |    698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         tg128 |     68.88 ± 0.24 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         pp512 |    196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         tg128 |     51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       4 |         tg128 |     30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       2 |         tg128 |     16.89 ± 0.01 |

It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
2024-07-17 08:54:11 +03:00
Iwan Kawrakow
52a25e307c iq1bn(no lookup): Metal
In summary, compared to lookup, the multiplication based approach is
* Much better on AVX2
* Slightly better on CUDA
* Slightly worse on Metal
* Much worse on NEON
2024-07-16 09:12:15 +02:00
Iwan Kawrakow
6393e26827 iq1bn(no lookup): NEON attempts
We are at TG-128 = 25.7 t/s, which is quite a bit worse than
lookup.
2024-07-16 08:32:15 +02:00
Iwan Kawrakow
26a1a689c6 iq1bn(no lookup): NEON
Pretty bad.
2024-07-15 20:40:14 +02:00
Iwan Kawrakow
ef39ca6a2c iq1bn(no lookup): CUDA
Not good. We only get ~160 t/s.
2024-07-15 19:56:51 +03:00
Iwan Kawrakow
e4dc3babb5 iq1bn(no lookup): somewhat better
We now have for Bitnet-3B:
| threads |          test |              t/s |
| ------: | ------------: | ---------------: |
|      16 |         pp512 |    308.97 ± 1.89 |
|      16 |         tg128 |     58.80 ± 0.07 |
|       8 |         tg128 |     49.79 ± 1.23 |
|       4 |         tg128 |     28.85 ± 0.02 |
|       2 |         tg128 |     15.39 ± 0.01 |
2024-07-15 13:46:07 +03:00
Iwan Kawrakow
a4bbd36905 iq1bn: attempt without a lookup table 2024-07-15 11:02:41 +03:00
Iwan Kawrakow
01397535b3 Remove all workflows 2024-06-27 09:45:56 +03:00
Iwan Kawrakow
0a3a2c4cd4 imatrix: be able to specify the name of the output tensor
For some models the same tensor is used for token embeddings and
output. This tensor tends to be named token_embedding.weight rather
than output.weight, which prevents us from collecting imatrix data
for this tensor. With this commit we can tell the name of the
output tensor to the imatrix tool.
2024-06-26 17:38:18 +03:00
Iwan Kawrakow
71725a918f bitnet: fold V scale into rms_norm 2024-06-26 12:05:57 +02:00
Iwan Kawrakow
641dd6bc68 RoPE(Neox, Metal): don't use power functions in a loop
Speeds up Bitnet by ~2% on Metal.
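
The general idea of avoiding a power function inside the RoPE loop can be sketched in Python as follows. This is an illustration, not the Metal kernel from this commit: the function names and the default `base` are assumptions. Since the angle for dimension pair `i` is `pos * base^(-2i/n_dims)`, consecutive angles differ by a constant factor, so a single power computed once outside the loop is enough.

```python
# Sketch only: names and the default base are illustrative assumptions.

def rope_angles_pow(pos, n_dims, base=10000.0):
    """Reference version: one power evaluation per dimension pair."""
    return [pos * base ** (-2.0 * i / n_dims) for i in range(n_dims // 2)]

def rope_angles_incremental(pos, n_dims, base=10000.0):
    """Same angles, with a single power evaluation hoisted out of the loop."""
    theta_scale = base ** (-2.0 / n_dims)    # constant ratio between angles
    theta = float(pos)
    angles = []
    for _ in range(n_dims // 2):
        angles.append(theta)
        theta *= theta_scale                 # multiply instead of pow
    return angles
```

The incremental version accumulates a small rounding error per step, which is negligible for typical head sizes.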
2024-06-26 11:22:47 +02:00
Iwan Kawrakow
767bce7caf Typo 2024-06-25 19:17:14 +03:00
Iwan Kawrakow
753dbaeeb0 bitnet: remove iq1_bn lookup table storing +/- signs
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.

With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.
2024-06-25 18:19:11 +03:00
Iwan Kawrakow
8b436a84c5 bitnet: simdify q8_K64 quantization on AVX
Doesn't make a real difference in performance.
2024-06-25 17:20:34 +03:00
Iwan Kawrakow
c906c4c4fe bitnet: NEON improvements for iq1_bn
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25 13:48:29 +02:00
Iwan Kawrakow
49bacf2288 bitnet: remove the now unused iq1bn_grid_u16 2024-06-25 12:41:43 +02:00
Iwan Kawrakow
7de9559cf2 Bitnet: adapt NEON and Metal to the alternative grid 2024-06-25 11:16:13 +02:00
Iwan Kawrakow
aa14a06b44 Bitnet: trying an alternative iq1_bn grid
Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertainties of 1-2 t/s. Same for PP, results are
jumping back and forth between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.
2024-06-25 11:32:48 +03:00
Iwan Kawrakow
cc44d4a5c3 bitnet: fix scalar dot product for 1.625 bpw
I had not adjusted after going to 4 q8 scales per row.
2024-06-25 08:31:12 +02:00
Iwan Kawrakow
3d61866f0a Bitnet: slightly faster 1.625 bpw variant for AVX512VL 2024-06-25 08:33:00 +03:00
Iwan Kawrakow
707d087927 Bitnet: tiny bit faster 1.625 bpw variant on Metal
We get 70.7 t/s for TG-128 vs 69.5 t/s before.
2024-06-24 16:42:30 +02:00
Iwan Kawrakow
49822f84a9 Adding add_4, mul_4, div_4 kernels to Metal
This gives ~2% speedup for Bitnet on Metal
2024-06-24 10:22:10 +02:00
Iwan Kawrakow
b747093582 bitnet: qnfs tests
Q8_0 fails because as per design the reference quantization
is different from the vecdot quantization.
2024-06-22 12:02:53 +03:00
Iwan Kawrakow
8c936e3d65 bitnet: replace ggml_mul with ggml_scale to apply the scales
Also save one scale operation in the ffn network by adjusting
rms_eps. We gain up to 3% in performance by doing this, but it
is a bit of a hack (we store the tensor scales in op_params
while loading the model).
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
fc04994ebf iqk_mul_mat: add IQ4_NL also on NEON
PPL seems somewhat higher? For llama-v2-7B we are still
~0.04 higher compared to what we expect after ~30 batches.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
caa42ccc56 iqk_mul_mat: add IQ4_NL
I never use it, so I had completely forgotten about it.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
86dc8e5f8b bitnet(scale in a separate tensor): CPU tweaks
A somewhat nicer iq2_bn implementation on AVX2.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
729ba46f77 bitnet(scale in a separate tensor): CPU tweaks
I had ruined TG performance on AVX2 with the last commit.
Was just testing at 8 threads and there we are totally memory
bound. But at 4 threads we had regressed to 41 t/s on the Ryzen7950.
Back to 51 t/s with this commit.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f0325c5826 bitnet(scale in a separate tensor): more CPU improvements
It seems it is enough to have 4 scales per row for Q8.
I get PPL = 8.5470 with this, which is slightly higher than
the 8.5430 we get with 1 scale per 128 activations, but still
OK, I think.
With this, we get the following performance:

| System    | quant |        PP-512 |       TG-128 | quant |        PP-512 |       TG-128 |
| --------- | ----- | ------------: | -----------: | ----- | ------------: | -----------: |
| M2 Max    | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03 |
| Ryzen7950 | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02 |
| Ryzen5975 | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10 |
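
As a rough illustration of "4 scales per row" for the Q8 activations, here is a Python sketch. It assumes the row is split into four contiguous quarters, each with its own absmax-based scale and values rounded to the int8 range [-127, 127]. How the real q8_K64 quantization assigns scales to elements is not stated in this log, so the layout and the function names are hypothetical.

```python
# Sketch only: the contiguous-quarter layout and all names are assumptions.

def quantize_row_4scales(row):
    """Quantize a row of floats to int8-range values with 4 scales per row."""
    n = len(row)
    assert n % 4 == 0
    chunk = n // 4
    scales, quants = [], []
    for c in range(4):
        part = row[c * chunk:(c + 1) * chunk]
        amax = max(abs(x) for x in part)
        d = amax / 127.0 if amax > 0 else 1.0   # scale for this quarter
        scales.append(d)
        quants.extend(round(x / d) for x in part)
    return scales, quants

def dequantize_row_4scales(scales, quants):
    """Inverse mapping: multiply each quant by the scale of its quarter."""
    chunk = len(quants) // 4
    return [q * scales[i // chunk] for i, q in enumerate(quants)]
```

Fewer scales per row means less bookkeeping in the dot product, at the cost of slightly coarser quantization, which matches the small PPL increase reported above.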
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
e05cca9ef6 bitnet(scale in a separate tensor): CPU improvements
Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat
to deal with that. This improves PP speed by a few percent.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
36374ab37d bitnet(scale in a separate tensor): mul -> scale on the CPU 2024-06-22 12:02:52 +03:00
Iwan Kawrakow
e73ae1f6d3 bitnet(scale in a separate tensor): mul -> scale on CUDA
On CUDA we do not have access to the tensor data until we
hit the kernel. That's why this hack.
In any case, iq2_bn goes back up to 228 t/s, which is close
to the 234 t/s we have without the extra scale operation.
PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s
we get without making the mul -> scale replacement.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
7f968d51b4 bitnet(scale in a separate tensor): mul -> scale on Metal
Do the mul -> scale replacement on the fly in the Metal backend.
This recovers the PP performance and cuts the TG performance
degradation in half.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
d08ff0df43 Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale"
This reverts commit f83381371b61e0863b55c60e5f5df139126a496d.
When using CUDA, the tensor contents have not been loaded yet,
so we crash when trying to access the scale when building the
graph. There must be a better way.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
ad60fb3567 bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale
This recovers part of the performance loss. On Metal TG-128 is now
92 t/s, still short of the ~100 t/s with scales applied on the fly.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
257fa74014 bitnet(scale in a separate tensor): Metal
iq2_bn TG-128 drops to 84 t/s, while I see in the logs
that we had 97 t/s. If true, that's a pretty massive
performance penalty for TG. Let me guess: ggml_mul is not
exactly the most performant operation on Metal.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
a2e43b83c9 bitnet(scale in a separate tensor): CUDA 2024-06-22 12:02:52 +03:00
Iwan Kawrakow
58d9e8f1d2 bitnet: put the scale in a separate tensor
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
927e251a12 Bitnet(1.75 bpw): higher precision fp8 scale
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as with fp16 (but the previous
version with 4 bits for the exponent and mantissa was
good enough for any practical purposes).
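
For illustration, an 8-bit scale with 3 exponent bits and 5 mantissa bits could look like the Python sketch below. The commit only states the 3/5 bit split. Everything else here is an assumption: no sign bit (the scale is taken to be positive), an implicit leading one, no subnormals, a hypothetical exponent bias of 7, and the function names.

```python
import math

# Sketch only: bias, lack of sign bit and of subnormals are assumptions.
E3M5_BIAS = 7

def encode_e3m5(x, bias=E3M5_BIAS):
    """Encode a positive float as eeemmmmm, clamping to the representable range."""
    assert x > 0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= m < 1
    exp = e - 1 + bias                   # exponent of the 1.f * 2**k form
    frac = round((2.0 * m - 1.0) * 32)   # 5-bit mantissa
    if frac == 32:                       # rounding carried into the exponent
        frac = 0
        exp += 1
    if exp < 0:
        return 0x00                      # clamp to the smallest value
    if exp > 7:
        return 0xFF                      # clamp to the largest value
    return (exp << 5) | frac

def decode_e3m5(byte, bias=E3M5_BIAS):
    """Decode eeemmmmm back to a float."""
    exp = byte >> 5
    frac = byte & 0x1F
    return (1.0 + frac / 32.0) * 2.0 ** (exp - bias)
```

With 5 mantissa bits the relative rounding error inside the representable range is at most 1/64, compared with 1/32 for a 4-bit mantissa; the price is a narrower exponent range.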
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
181fd9c56e Bitnet(1.75 bpw): slightly faster CUDA dot product
We get 205 t/s, so ~13% slower than 2 bit.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
fece7e1db7 Bitnet(2.25 bpw): faster Metal dot product
With this we get TG-128 = 97 t/s.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
4f51348d3d Bitnet(2.25 bpw): Metal
We get PP-512 = 702 t/s, TG-128 = 84 t/s.
This is almost on par with q4_0, which is rare on Metal
(to not say it does not exist).
For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet.
TG is kind of funny because we hit 72 t/s on the CPU.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
01ea9a862d Bitnet(2.25 bpw): CUDA
We get PP-512 = 9600 t/s, TG-128 = 234 t/s
(but we need to use 8 CPU threads, else results are lower,
so clearly there is something being computed on the CPU).
PP-512 is very close to PP-512(fp16) = 9800 t/s
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
2998ca9b14 Bitnet(2.25 bpw): NEON
We get PP-512 = 192 t/s, TG-128 = 72 t/s
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
8c6276f6a1 Bitnet: 2.25 bpw version
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
1de6476d75 bitnet 2 bpw: NEON implementation
We get PP-512 = 190 t/s and TG-128 = 75 t/s.
2 bpw TG on the CPU beats 1.75 bpw on the GPU!
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f97a329638 Removed extra column 2024-06-22 12:02:52 +03:00
Iwan Kawrakow
6616985135 bitnet 2 bpw: AVX2 implementation
We get PP-512 = 322 t/s.
TG is already 51.6 t/s at 4 threads, then it saturates and
starts going down for more than 8 threads.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f6863cfa1b bitnet: add 2 bpw quantization
The scalar dot product already achieves 37 t/s for TG!
2024-06-22 12:02:51 +03:00