Commit Graph

3330 Commits

Author SHA1 Message Date
Kawrakow
005674cecc Fix "make it work for row sizes that are multiple of 4 on NEON" 2024-07-24 08:04:47 +02:00
Kawrakow
847588cc92 Update README.md 2024-07-23 18:05:05 +02:00
Kawrakow
97680f602c Update README.md 2024-07-23 12:23:06 +02:00
Kawrakow
8bf126c1d6 When tokenizer info is missing in the model, use llama3 by default 2024-07-19 12:29:01 +03:00
Kawrakow
6a94ca46ad iqk_mul_mat(f16): make it work for row sizes that are multiple of 4 on NEON
Here the performance gain is more modest compared to AVX2: we get
PP-512 = 200 t/s up from 190 t/s for iq1_bn-quantized Bitnet-3B
running on M2 Max.
2024-07-18 13:55:51 +02:00
Kawrakow
4d1e83f8b8 iqk_mul_mat: attention matrix multiplications
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications.
Before this PR, this meant n_head calls to iqk_mul_mat to perform
n_kv_embed x n_token 2D multiplications, each using nth threads.
Instead, in this PR, if n_head is a multiple of nth, each thread
does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices.
This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from
409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B,
we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from
139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
2024-07-18 14:00:56 +03:00
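A minimal sketch of the per-thread head splitting described in the commit above (the names and the mul_mat_2d callback are illustrative, not the actual iqk_mul_mat interface): when n_head is a multiple of the thread count nth, each thread processes whole heads on its own instead of all nth threads cooperating on every head.

```cpp
#include <functional>

// Illustrative callback: multiply one n_kv_embed x n_token 2D slice (one head).
// thread_index/num_threads describe which threads cooperate on that slice.
using MulMat2D = std::function<void(int head, int thread_index, int num_threads)>;

// Sketch: distribute the n_head 2D multiplications of K*Q / KQ*V over nth threads;
// ith is the index of the calling thread.
void mul_mat_attention(int n_head, int nth, int ith, const MulMat2D & mul_mat_2d) {
    if (n_head % nth == 0) {
        // Each thread performs n_head/nth complete 2D multiplications on its own.
        const int per_thread = n_head / nth;
        for (int h = ith * per_thread; h < (ith + 1) * per_thread; ++h) {
            mul_mat_2d(h, 0, 1);            // this thread owns the whole head
        }
    } else {
        // Previous behaviour: all nth threads cooperate on each of the n_head slices.
        for (int h = 0; h < n_head; ++h) {
            mul_mat_2d(h, ith, nth);
        }
    }
}
```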
Kawrakow
c14a6a6862 iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2
I was trying to understand where the Bitnet bottleneck is, and at
some point noticed the Q*K matrix multiplication where Q and K
have the shape 100 x n_token x 32 x 1. The existing iqk_mul_mat for
floats requires that the row size is a multiple of the SIMD vector size
(so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this
matrix multiplication was getting done with ggml. Changing the iqk_mul_mat
float kernel to handle row sizes that are a multiple of 4 (via __m128
for the last values in a row) resulted in nearly a 20% performance boost
for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance
increases by nearly 70%!
2024-07-18 11:39:32 +03:00
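A minimal sketch of the tail handling mentioned above, assuming a plain float dot product (the real iqk_mul_mat float kernel is tiled and more involved): the bulk of the row is processed 8 floats at a time with __m256 FMA, and a trailing group of 4 goes through __m128, so rows only need to be a multiple of 4.

```cpp
#include <immintrin.h>

// Dot product of two float rows whose length is a multiple of 4 (AVX2 + FMA).
static float dot_mul4(const float * x, const float * y, int n) {
    const int n8 = n & ~7;                 // largest multiple of 8 <= n
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n8; i += 8) {
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i), acc);
    }
    // Fold the 256-bit accumulator down to 128 bits.
    __m128 acc4 = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    if (n - n8 == 4) {                     // leftover group of 4 handled via __m128
        acc4 = _mm_fmadd_ps(_mm_loadu_ps(x + n8), _mm_loadu_ps(y + n8), acc4);
    }
    // Horizontal sum of the 4 remaining lanes.
    acc4 = _mm_add_ps(acc4, _mm_movehl_ps(acc4, acc4));
    acc4 = _mm_add_ss(acc4, _mm_shuffle_ps(acc4, acc4, 0x55));
    return _mm_cvtss_f32(acc4);
}
```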
Kawrakow
d556b1d809 Fix Makefile, add GGML_USE_IQK_MULMAT ifdefs to iqk-quantize 2024-07-17 16:51:34 +03:00
Kawrakow
6f0805a3c7 iq1bn: faster scalar dot product
At the end of the day, lookup is still better when not using SIMD.
This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X
with 16 threads (up from 10.5 t/s).
2024-07-17 16:09:01 +03:00
Kawrakow
02dc036187 iq1bn: fix scalar dot product
The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s)
but slower on the M2 (6.8 t/s vs 8.6 t/s before).
2024-07-17 13:37:18 +03:00
Kawrakow
04decf3fc5 iq1bn: faster AVX2
Instead of shuffling quant data into a 128-bit register containing
8-bit ints, and then converting to 16 bit, we directly shuffle into
a 256-bit register containing 16 bit ints.

TG-128 @ 2 threads goes from 18.3 to 21.6 t/s.
TG-128 performance now saturates already at 8 threads getting 60.4 t/s.
There is almost no impact on PP-512 (322 -> 323 t/s). I guess
we amortize dequantization cost pretty well, so we don't gain much
there.

We get close to 100 GB/s single-threaded float32 throughput:

./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
iq1_bn
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      3.87
      avg cycles/32 vals   :      4.40
      float32 throughput   :     98.27 GB/s
      quantized throughput :      4.99 GB/s
2024-07-17 10:17:05 +03:00
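A minimal sketch of the shuffle change described in the commit above (the masks and the surrounding bit extraction of the actual iq1_bn kernel differ): since _mm256_shuffle_epi8 works per 128-bit lane, the source bytes are broadcast to both lanes, and a mask that places each selected byte into the low byte of a 16-bit slot (with 0x80 zeroing the high byte) yields 16-bit ints directly, replacing the 128-bit byte shuffle followed by _mm256_cvtepi8_epi16.

```cpp
#include <immintrin.h>

// Old path (sketch): shuffle 16 bytes, then widen to 16-bit ints.
static inline __m256i widen_old(__m128i bytes, __m128i shuf16) {
    return _mm256_cvtepi8_epi16(_mm_shuffle_epi8(bytes, shuf16));
}

// New path (sketch): one 256-bit shuffle that lands each selected byte in the
// low byte of a 16-bit slot; 0x80 mask entries zero the high bytes, so the
// result already holds 16-bit ints (iq1_bn quant values are small and non-negative).
static inline __m256i widen_new(__m128i bytes, __m256i shuf32) {
    return _mm256_shuffle_epi8(_mm256_broadcastsi128_si256(bytes), shuf32);
}
```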
Kawrakow
2d4fee2312 Remove the no longer used iq1bn_grid_u16 2024-07-17 10:16:50 +03:00
Kawrakow
0194639b6b iq1bn: adjust scalar dot product and some cleanup 2024-07-17 08:44:46 +02:00
Kawrakow
2881bdf220 iq1bn(no lookup): better version
We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we are dequantizing 8 quants there, and we have
different encoding for the 1st and 2nd group of 8 in a group of 16).

This achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:

| model            |       size |     params | backend    | threads |          test |              t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         pp512 |  9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         tg128 |    229.85 ± 0.33 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         pp512 |    322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         tg128 |     59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       8 |         tg128 |     57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       4 |         tg128 |     33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       2 |         tg128 |     18.30 ± 0.01 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         pp512 |    698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         tg128 |     68.88 ± 0.24 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         pp512 |    196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         tg128 |     51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       4 |         tg128 |     30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       2 |         tg128 |     16.89 ± 0.01 |

It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
2024-07-17 08:54:11 +03:00
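As a rough illustration of why a group of 5 fits into 8 bits: 3^5 = 243 <= 256, so five ternary digits pack into one byte, and 13 such bytes per block of 64 quants gives exactly 1.625 bpw. A minimal decoding sketch under that assumption (the actual iq1_bn kernels extract the digits differently):

```cpp
#include <cstdint>

// Decode one byte holding 5 base-3 digits into ternary weights -1/0/+1.
// Plain base-3 extraction for clarity; the real kernels use lookups,
// shuffles or multiplies to obtain the same digits.
static void decode_group_of_5(uint8_t byte, int8_t weights[5]) {
    for (int i = 0; i < 5; ++i) {
        weights[i] = int8_t(byte % 3) - 1;   // stored digit 0,1,2 -> weight -1,0,+1
        byte /= 3;
    }
}
// Per block of 64 quants: 4 groups of 16, each using 3 such bytes for its
// first 15 quants, plus one shared byte for the four remaining 16th quants
// (13 bytes * 8 bits / 64 quants = 1.625 bpw).
```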
Kawrakow
d84748b71b iq1bn(no lookup): Metal
In summary, compared to lookup, the multiplication based approach is
* Much better on AVX2
* Slightly better on CUDA
* Slightly worse on Metal
* Much worse on NEON
2024-07-16 09:12:15 +02:00
Kawrakow
d0f9d146b8 iq1bn(no lookup): NEON attempts
We are at TG-128 = 25.7 t/s, which is quite a bit worse than
lookup.
2024-07-16 08:32:15 +02:00
Kawrakow
597ea12970 iq1bn(no lookup): NEON
Pretty bad.
2024-07-15 20:40:14 +02:00
Kawrakow
cd8fffc3cd iq1bn(no lookup): CUDA
Not good. We only get ~160 t/s.
2024-07-15 19:56:51 +03:00
Kawrakow
1f3dbbcc19 iq1bn(no lookup): somewhat better
We now have for Bitnet-3B:
| threads |          test |              t/s |
| ------: | ------------: | ---------------: |
|      16 |         pp512 |    308.97 ± 1.89 |
|      16 |         tg128 |     58.80 ± 0.07 |
|       8 |         tg128 |     49.79 ± 1.23 |
|       4 |         tg128 |     28.85 ± 0.02 |
|       2 |         tg128 |     15.39 ± 0.01 |
2024-07-15 13:46:07 +03:00
Kawrakow
98be184c23 iq1bn: attempt without a lookup table 2024-07-15 11:02:41 +03:00
Kawrakow
43f4c58376 Remove all workflows 2024-06-27 09:45:56 +03:00
Kawrakow
aaec3c1f60 imatrix: be able to specify the name of the output tensor
For some models the same tensor is used for token embeddings and
output. This tensor tends to be named token_embedding.weight rather
than output.weight, which prevents us from collecting imatrix data
for this tensor. With this commit we can tell the name of the
output tensor to the imatrix tool.
2024-06-26 17:38:18 +03:00
Kawrakow
be36ca872f bitnet: fold V scale into rms_norm 2024-06-26 12:05:57 +02:00
Kawrakow
6467358fd4 RoPE(Neox, Metal): don't use power functions in a loop
Speeds up Bitnet by ~2% on Metal.
2024-06-26 11:22:47 +02:00
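A minimal sketch of the kind of change implied above, written in plain C++ rather than Metal (illustrative only): the per-pair rotation angle is updated multiplicatively across the loop, so pow() is called once per row instead of once per rotated pair.

```cpp
#include <cmath>

// Apply NeoX-style RoPE to one head of dimension n_dims at position pos.
// Illustrative: avoids calling pow() inside the loop by updating theta multiplicatively.
static void rope_neox_sketch(float * x, int n_dims, int pos, float freq_base) {
    const float theta_scale = std::pow(freq_base, -2.0f / n_dims); // one pow, outside the loop
    float theta = (float) pos;
    for (int i = 0; i < n_dims / 2; ++i) {
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[i], x1 = x[i + n_dims / 2];   // NeoX pairs (i, i + n_dims/2)
        x[i]              = x0 * c - x1 * s;
        x[i + n_dims / 2] = x0 * s + x1 * c;
        theta *= theta_scale;    // instead of pos * pow(freq_base, -2*i/n_dims) each iteration
    }
}
```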
Kawrakow
d280bf30c4 Typo 2024-06-25 19:17:14 +03:00
Kawrakow
9918542658 bitnet: remove iq1_bn lookup table storing +/- signs
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.

With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.
2024-06-25 18:19:11 +03:00
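A scalar sketch of why the 0,1,2 table works (the AVX2 kernel does the equivalent with SIMD; names are illustrative): with table entries e in {0,1,2} the true weight is e - 1, so the dot product is sum(q8*e) minus the plain sum of the activations, and that sum can be computed once per block.

```cpp
#include <cstdint>

// Dot product of n ternary weights stored as 0,1,2 with int8 activations.
// weight = entry - 1, hence sum(q8*w) = sum(q8*entry) - sum(q8).
static int32_t dot_iq1bn_sketch(const uint8_t * entries, const int8_t * q8, int n) {
    int32_t sum_e = 0, sum_q = 0;
    for (int i = 0; i < n; ++i) {
        sum_e += q8[i] * entries[i];
        sum_q += q8[i];           // can be precomputed once per block of activations
    }
    return sum_e - sum_q;
}
```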
Kawrakow
12e97f1f1f bitnet: simdify q8_K64 quantization on AVX
Doesn't make a real difference in performance.
2024-06-25 17:20:34 +03:00
Kawrakow
cb12b6f253 bitnet: NEON improvements for iq1_bn
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25 13:48:29 +02:00
Kawrakow
636dbd03c5 bitnet: remove the now unused iq1bn_grid_u16 2024-06-25 12:41:43 +02:00
Kawrakow
cd2f60c89a Bitnet: adapt NEON and Metal to the alternative grid 2024-06-25 11:16:13 +02:00
Kawrakow
ef16135920 Bitnet: trying an alternative iq1_bn grid
Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertainties of 1-2 t/s. Same for PP: results are
jumping back and forth between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.
2024-06-25 11:32:48 +03:00
Kawrakow
90a6071a93 bitnet: fix scalar dot product for 1.625 bpw
I had not adjusted after going to 4 q8 scales per row.
2024-06-25 08:31:12 +02:00
Kawrakow
ee6565fdeb Bitnet: slightly faster 1.625 bpw variant for AVX512VL 2024-06-25 08:33:00 +03:00
Kawrakow
8542b4f359 Bitnet: tiny bit faster 1.625 bpw variant on Metal
We get 70.7 t/s for TG-128 vs 69.5 t/s before.
2024-06-24 16:42:30 +02:00
Kawrakow
f2a82090df Adding add_4, mul_4, div_4 kernels to Metal
This gives ~2% speedup for Bitnet on Metal
2024-06-24 10:22:10 +02:00
Kawrakow
c9ddaf2fa3 bitnet: qnfs tests
Q8_0 fails because as per design the reference quantization
is different from the vecdot quantization.
2024-06-22 12:02:53 +03:00
Kawrakow
b1fb7df6a5 bitnet: replace ggml_mul with ggml_scale to apply the scales
Also save one scale operation in the ffn network by adjusting
rms_eps. We gain up to 3% in performance by doing this, but it
is a bit of a hack (we store the tensor scales in op_params
while loading the model).
2024-06-22 12:02:52 +03:00
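A minimal sketch of the graph-building substitution, assuming the tensor scale is already available as a float at graph-build time (as the op_params hack above provides) and the ggml_scale(ctx, tensor, float) form of the API; scaled_mul_mat is an illustrative helper, not code from this repo:

```cpp
#include "ggml.h"

// Apply a per-tensor scale to the result of a matmul while building the graph.
static struct ggml_tensor * scaled_mul_mat(struct ggml_context * ctx,
                                           struct ggml_tensor  * w,
                                           struct ggml_tensor  * x,
                                           float                 tensor_scale) {
    struct ggml_tensor * cur = ggml_mul_mat(ctx, w, x);
    // Instead of ggml_mul(ctx, cur, scale_tensor) with a 1-element tensor,
    // use ggml_scale with the float directly: one op, no extra tensor to read.
    return ggml_scale(ctx, cur, tensor_scale);
}
```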
Kawrakow
0fe0d54be6 iqk_mul_mat: add IQ4_NL also on NEON
PPL seems somewhat higher? For llama-v2-7B we are still
~0.04 higher compared to what we expect after ~30 batches.
2024-06-22 12:02:52 +03:00
Kawrakow
32ec107237 iqk_mul_mat: add IQ4_NL
I never use it, so I had completely forgotten about it.
2024-06-22 12:02:52 +03:00
Kawrakow
912d6d9ce1 bitnet(scale in a separate tensor): CPU tweaks
A somewhat nicer iq2_bn implementation on AVX2.
2024-06-22 12:02:52 +03:00
Kawrakow
f53d89dd53 bitnet(scale in a separate tensor): CPU tweaks
I had ruined TG performance on AVX2 with the last commit.
I was just testing at 8 threads, and there we are totally memory
bound. But at 4 threads we had regressed to 41 t/s on the Ryzen-7950X.
Back to 51 t/s with this commit.
2024-06-22 12:02:52 +03:00
Kawrakow
52ad5764dd bitnet(scale in a separate tensor): more CPU improvements
It seems it is enough to have 4 scales per row for Q8.
I get PPL = 8.5470 with this, which is slightly higher than
the 8.5430 we get with 1 scale per 128 activations, but still
OK, I think.
With this, we get the following performance:

| System      | quant |        PP-512 |       TG-128 | quant |        PP-512 |       TG-128 |
| ----------- | ----- | ------------: | -----------: | ----- | ------------: | -----------: |
| M2 Max      | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03 |
| Ryzen-7950X | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02 |
| Ryzen-5975  | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10 |
2024-06-22 12:02:52 +03:00
Kawrakow
167489ef6c bitnet(scale in a separate tensor): CPU improvements
Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat
to deal with that. This improves PP speed by a few percent.
2024-06-22 12:02:52 +03:00
Kawrakow
8b31c14e0d bitnet(scale in a separate tensor): mul -> scale on the CPU 2024-06-22 12:02:52 +03:00
Kawrakow
e423af855f bitnet(scale in a separate tensor): mul -> scale on CUDA
On CUDA we do not have access to the tensor data until we
hit the kernel, hence this hack.
In any case, iq2_bn goes back up to 228 t/s, which is close
to the 234 t/s we have without the extra scale operation.
PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s
we get without making the mul -> scale replacement.
2024-06-22 12:02:52 +03:00
Kawrakow
f72db4769b bitnet(scale in a separate tensor): mul -> scale on Metal
Do the mul -> scale replacement on the fly in the Metal backend.
This recovers the PP performance and cuts the TG performance
degradation in half.
2024-06-22 12:02:52 +03:00
Kawrakow
30fc9b5753 Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale"
This reverts commit f83381371b61e0863b55c60e5f5df139126a496d.
When using CUDA, the tensor contents have not been loaded yet,
so we crash when trying to access the scale when building the
graph. There must be a better way.
2024-06-22 12:02:52 +03:00
Kawrakow
f024804b9a bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale
This recovers part of the performance loss. On Metal TG-128 is now
92 t/s, still short of the ~100 t/s with scales applied on the fly.
2024-06-22 12:02:52 +03:00
Kawrakow
3c5cd34a05 bitnet(scale in a separate tensor): Metal
iq2_bn TG-128 drops to 84 t/s, while I see in the logs
that we had 97 t/s. If true, that's a pretty massive
performance penalty for TG. Let me guess: ggml_mul is not
exactly the most performant operation on Metal.
2024-06-22 12:02:52 +03:00
Kawrakow
14081ee2ef bitnet(scale in a separate tensor): CUDA 2024-06-22 12:02:52 +03:00