I had ruined TG performance on AVX2 with the last commit.
Was just testing at 8 threads and there we are totally memory
bound. But at 4 threads we had regressed to 41 t/s on the Ryzen7950.
Back to 51 t/s with this commit.
It seems it is enough to have 4 scales per row for Q8.
I get PPL = 8.5470 with this, which is slightly higher than
the 8.5430 we get with 1 scale per 128 activations, but still
OK, I think.
With this, we get the following performance:
Systema | quant | PP-512 | TG-128a | quant | PP-512 | TG-12s |
M2 Max | iq2bn 229.02 ± 0.37 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 33.12 ± 0.03
Ryzen7950| iq2bn 379.36 ± 1.03 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 32.80 ± 0.02
Ryzen5975| iq2bn 465.28 ± 0.57 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 26.60 ± 0.10
On CUDA we do not have access to the tensor data until we
hit the kernel. That's why this hack.
In any case, iq2_bn goes back up to 228 t/s, which is close
to the 234 t/s we have without the extra scale operation.
PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s
we get without making the mul -> scale replacement.
This reverts commit f83381371b61e0863b55c60e5f5df139126a496d.
When using CUDA, the tensor contents have not been loaded yet,
so we crash when trying to access the scale when building the
graph. There must be a better way.
iq2_bn TG-128 drops to 84 t/s, while I see in the logs
that we had 97 t/s. If true, that's a pretty massive
performance penalty for TG. Let me guess: ggml_mul is not
exactly the most performant operation on Metal.
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Commiting so I can go and test on othe platforms.
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL to be the same as fp16 (but the previous
version with 4 bits for the exponent and mantissa was
good enough for any practical purposes).
We get PP-512 = 702 t/s, TG-128 = 84 t/s.
This is almost on par with q4_0, which is rare on Metal
(to not say it does not exist).
For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet.
TG is kind of funny because we hit 72 t/s on the CPU.
We get PP-512 = 9600 t/s, TG-128 = 234 t/s
(but we need to use 8 CPU threads, else results are lower,
so clearly there is something being computed on the CPU).
PP-512 is very close to PP-512(fp16) = 9800 t/s
Just scaler and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzn-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
PP is decent with 131 t/s (q4_0 has 150 t/s).
TG is better than last commit but still bad at 33.1 t/s
(in comparison q4_0 gets 52.3 t/s).
I had to go to the (0, 1, 2) table. Apple Silicon clearly
does not like operations with signs.