Commit Graph

3346 Commits

Author SHA1 Message Date
Iwan Kawrakow
db6b0f6dab Update README with the new CUDA/Metal performance 2024-07-26 09:06:22 +02:00
Iwan Kawrakow
fbafe0989f bitnet: put token embeddings on the GPU 2024-07-26 09:50:52 +03:00
Iwan Kawrakow
c2158c15d9 iqk_mul_mat(NEON): adding forgotten fp16 matrix x vector implementation 2024-07-25 08:37:13 +02:00
Kawrakow
28fb349db4 Update README.md 2024-07-24 19:55:06 +02:00
Kawrakow
eb246cd0ae Update README.md
Trying to avoid line breaks in table
2024-07-24 19:44:52 +02:00
Kawrakow
fc07ca7847 Update README.md 2024-07-24 19:20:46 +02:00
Iwan Kawrakow
770f3585c2 Add copyright notices
Only on the files where I have contributed in a significant way,
or the files I wrote myself.
2024-07-24 20:11:42 +03:00
Iwan Kawrakow
9eee03f4ee Remove unused file 2024-07-24 19:33:19 +03:00
Iwan Kawrakow
3d83f58654 Remove security 2024-07-24 19:25:21 +03:00
Iwan Kawrakow
b64275ca4e Correct spelling in README 2024-07-24 19:22:43 +03:00
Kawrakow
4192244242 Update README.md
Adding some more details
2024-07-24 17:38:37 +02:00
Kawrakow
47c1243e3c Update README.md
Adding MoE and Bitnet performance tables
2024-07-24 16:49:00 +02:00
Kawrakow
8fe7e04456 Update README.md
I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so I removed the test column from the performance tables.
2024-07-24 11:18:50 +02:00
Kawrakow
a5c39e9476 Update README.md
Added performance comparison tables
2024-07-24 11:01:16 +02:00
Iwan Kawrakow
6b4167164c iqk_mul_mat(NEON): special case for n not divisible by 8
Else fp16 PP performance drops by nearly a factor of 2 compared to
what we had before.
2024-07-24 08:04:47 +02:00
Iwan Kawrakow
2e49f0172f ggml: thread synchronization on Arm
For x86 slaren was generous enough to add _mm_pause() to the busy
spin wait loop in ggml_barrier(), but everything else just busy
spins, loading an atomic int on every iteration, thus forcing cache
sync between the cores. This results in a massive drop in performance
on my M2-Max laptop when using 8 threads. The closest approximation
to _mm_pause() on Arm seems to be
     __asm__ __volatile__("isb\n");
After adding this to the busy spin loop, performance for 8 threads
recovers back to expected levels.
2024-07-24 08:04:47 +02:00
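
A minimal sketch of the spin-wait pattern described above, assuming C11 atomics; the function and names are illustrative, not the actual ggml_barrier() code:

    #include <stdatomic.h>

    #if defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>
    #define spin_relax() _mm_pause()                      // x86: pause hint in the busy loop
    #elif defined(__aarch64__)
    #define spin_relax() __asm__ __volatile__("isb\n")    // Arm: closest equivalent to _mm_pause()
    #else
    #define spin_relax() ((void)0)
    #endif

    // Illustrative barrier-style wait: spin until the shared counter reaches
    // the target value, yielding a little on every iteration.
    static void spin_wait(atomic_int * counter, int target) {
        while (atomic_load_explicit(counter, memory_order_seq_cst) != target) {
            spin_relax();
        }
    }
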
Iwan Kawrakow
abb740c9a4 Fix "make it work for row sizes that are multiple of 4 on NEON" 2024-07-24 08:04:47 +02:00
Kawrakow
0117e386b3 Update README.md 2024-07-23 18:05:05 +02:00
Kawrakow
11e2472c64 Update README.md 2024-07-23 12:23:06 +02:00
Iwan Kawrakow
99119ec29c When tokenizer info is missing in the model, use llama3 by default 2024-07-19 12:29:01 +03:00
Iwan Kawrakow
30b8bcf1a3 iqk_mul_mat(f16): make it work for row sizes that are multiple of 4 on NEON
Here the performance gain is more modest compared to AVX2: we get
PP-512 = 200 t/s up from 190 t/s for iq1_bn-quantized Bitnet-3B
running on M2 Max.
2024-07-18 13:55:51 +02:00
Iwan Kawrakow
8db01c0804 iqk_mul_mat: attention matrix multiplications
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications.
Before this PR, this meant n_head calls to iqk_mul_mat to perform
n_kv_embed x n_token 2D multiplications, each using nth threads.
Instead, in this PR, if n_head is a multiple of nth, each thread
does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices.
This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from
409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B,
we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from
139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
2024-07-18 14:00:56 +03:00
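
A hedged sketch of the scheduling change; mul_mat_one_head() is a hypothetical stand-in for one n_kv_embed x n_token 2D multiplication, not the actual iqk_mul_mat code:

    // Hypothetical stand-in for one head's n_kv_embed x n_token 2D multiplication,
    // performed by thread ith out of nth cooperating threads.
    static void mul_mat_one_head(int head, int ith, int nth) {
        (void)head; (void)ith; (void)nth;   // real work omitted in this sketch
    }

    // Before: n_head sequential multiplications, each split across all nth threads.
    static void attn_mul_mat_before(int n_head, int ith, int nth) {
        for (int h = 0; h < n_head; ++h) {
            mul_mat_one_head(h, ith, nth);
        }
    }

    // After: when n_head is a multiple of nth, each thread takes whole heads and
    // performs its n_head/nth 2D multiplications on its own.
    static void attn_mul_mat_after(int n_head, int ith, int nth) {
        for (int h = ith; h < n_head; h += nth) {
            mul_mat_one_head(h, /*ith=*/0, /*nth=*/1);
        }
    }
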
Iwan Kawrakow
744eb9ffa9 iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2
I was trying to understand where the Bitnet bottleneck is, and at
some point noticed the Q*K matrix multiplication where Q and K
have the shape of 100 x n_token x 32 x 1. The existing iqk_mul_mat for
floats requires that the row size is a multiple of the SIMD vector size
(so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this
matrix multiplication was getting done with ggml. Changing the iqk_mul_mat
float kernel to handle row sizes that are a multiple of 4 (via __m128
for the last values in a row) resulted in nearly a 20% performance boost
for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance
increases by nearly 70%!
2024-07-18 11:39:32 +03:00
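
A minimal sketch of the idea, assuming AVX2 with FMA: the bulk of a row is processed 8 floats at a time and a trailing group of 4 is handled with __m128. This only illustrates the tail handling, it is not the actual iqk_mul_mat float kernel:

    #include <immintrin.h>

    // Dot product for rows whose size n is a multiple of 4 but not necessarily of 8.
    static float dot_mul4(const float * x, const float * y, int n) {
        __m256 acc = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i), acc);
        }
        // fold the 256-bit accumulator into 128 bits
        __m128 sum = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
        if (i < n) {   // leftover group of 4 floats
            sum = _mm_fmadd_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i), sum);
        }
        // horizontal sum of the 4 partial sums
        sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));
        sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 1));
        return _mm_cvtss_f32(sum);
    }
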
Iwan Kawrakow
6a132862fd Fix Makefile, add GGML_USE_IQK_MULMAT ifdefs to iqk-quantize 2024-07-17 16:51:34 +03:00
Iwan Kawrakow
a4017cc047 iq1bn: faster scalar dot product
At the end of the day, lookup is still better when not using simd.
This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X
with 16 threads (up from 10.5 t/s).
2024-07-17 16:09:01 +03:00
Iwan Kawrakow
a0df4002fc iq1bn: fix scalar dot product
The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s)
but slower on the M2 (6.8 t/s vs 8.6 t/s before).
2024-07-17 13:37:18 +03:00
Iwan Kawrakow
7024ecfeb4 iq1bn: faster AVX2
Instead of shuffling quant data into a 128-bit register containing
8-bit ints, and then converting to 16 bit, we directly shuffle into
a 256-bit register containing 16 bit ints.

TG-128 @ 2 threads goes from 18.3 to 21.6 t/s.
TG-128 performance now saturates already at 8 threads getting 60.4 t/s.
There is almost no impact on PP-512 (322 -> 323 t/s). I guess
we amortize dequantization cost pretty well, so we don't gain much
there.

We get close to 100 GB/s single-threaded float32 throughput:

./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn
iq1_bn
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      3.87
      avg cycles/32 vals   :      4.40
      float32 throughput   :     98.27 GB/s
      quantized throughput :      4.99 GB/s
2024-07-17 10:17:05 +03:00
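
A rough sketch of the difference described above, assuming the iq1_bn values come from a small lookup table; the helper names and the exact table layout are illustrative, not the actual kernel:

    #include <immintrin.h>

    // Before: pick 8-bit table entries, then widen them to 16 bits.
    static __m256i lookup_then_widen(__m128i table8, __m128i idx) {
        return _mm256_cvtepi8_epi16(_mm_shuffle_epi8(table8, idx));
    }

    // After: keep 16-bit table entries duplicated in both 128-bit lanes and
    // shuffle their bytes directly; idx_bytes holds the byte pair
    // (2*i, 2*i + 1) for each wanted entry, since the shuffle works per lane.
    static __m256i lookup_16bit(__m256i table16, __m256i idx_bytes) {
        return _mm256_shuffle_epi8(table16, idx_bytes);
    }
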
Iwan Kawrakow
febb8bbea0 Remove the no longer used iq1bn_grid_u16 2024-07-17 10:16:50 +03:00
Iwan Kawrakow
ba00f23ea1 iq1bn: adjust scalar dot product and some cleanup 2024-07-17 08:44:46 +02:00
Iwan Kawrakow
873a790ee2 iq1bn(no lookup): better version
We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we dequantize 8 quants there, and the encoding differs
between the 1st and 2nd group of 8 in a group of 16).

This achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:

| model            |       size |     params | backend    | threads |          test |              t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         pp512 |  9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         tg128 |    229.85 ± 0.33 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         pp512 |    322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         tg128 |     59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       8 |         tg128 |     57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       4 |         tg128 |     33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       2 |         tg128 |     18.30 ± 0.01 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         pp512 |    698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         tg128 |     68.88 ± 0.24 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         pp512 |    196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         tg128 |     51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       4 |         tg128 |     30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       2 |         tg128 |     16.89 ± 0.01 |

It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
2024-07-17 08:54:11 +03:00
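
On the "3 groups of 5, each using 8 bits" part: one plausible reading is plain base-3 packing, since 3^5 = 243 fits in a byte. The sketch below only illustrates that arithmetic; the actual iq1_bn bit layout may differ:

    #include <stdint.h>

    // Pack 5 ternary digits (each 0, 1, or 2) into one byte (value < 243).
    static uint8_t pack5(const uint8_t q[5]) {
        uint8_t b = 0;
        for (int i = 4; i >= 0; --i) b = 3*b + q[i];
        return b;
    }

    // Recover the 5 ternary digits from the packed byte.
    static void unpack5(uint8_t b, uint8_t q[5]) {
        for (int i = 0; i < 5; ++i) { q[i] = b % 3; b /= 3; }
    }
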
Iwan Kawrakow
52a25e307c iq1bn(no lookup): Metal
In summary, compared to lookup, the multiplication based approach is
* Much better on AVX2
* Slightly better on CUDA
* Slightly worse on Metal
* Much worse on NEON
2024-07-16 09:12:15 +02:00
Iwan Kawrakow
6393e26827 iq1bn(no lookup): NEON attempts
We are at TG-128 = 25.7 t/s, which is quite a bit worse than
lookup.
2024-07-16 08:32:15 +02:00
Iwan Kawrakow
26a1a689c6 iq1bn(no lookup): NEON
Pretty bad.
2024-07-15 20:40:14 +02:00
Iwan Kawrakow
ef39ca6a2c iq1bn(no lookup): CUDA
Not good. We only get ~160 t/s.
2024-07-15 19:56:51 +03:00
Iwan Kawrakow
e4dc3babb5 iq1bn(no lookup): somewhat better
We now have for Bitnet-3B:
| threads |          test |              t/s |
| ------: | ------------: | ---------------: |
|      16 |         pp512 |    308.97 ± 1.89 |
|      16 |         tg128 |     58.80 ± 0.07 |
|       8 |         tg128 |     49.79 ± 1.23 |
|       4 |         tg128 |     28.85 ± 0.02 |
|       2 |         tg128 |     15.39 ± 0.01 |
2024-07-15 13:46:07 +03:00
Iwan Kawrakow
a4bbd36905 iq1bn: attempt without a lookup table 2024-07-15 11:02:41 +03:00
Iwan Kawrakow
01397535b3 Remove all workflows 2024-06-27 09:45:56 +03:00
Iwan Kawrakow
0a3a2c4cd4 imatrix: be able to specify the name of the output tensor
For some models the same tensor is used for token embeddings and
output. This tensor tends to be named token_embedding.weight rather
than output.weight, which prevents us from collecting imatrix data
for this tensor. With this commit we can tell the imatrix tool the
name of the output tensor.
2024-06-26 17:38:18 +03:00
Iwan Kawrakow
71725a918f bitnet: fold V scale into rms_norm 2024-06-26 12:05:57 +02:00
Iwan Kawrakow
641dd6bc68 RoPE(Neox, Metal): don't use power functions in a loop
Speeds up Bitnet by ~2% on Metal.
2024-06-26 11:22:47 +02:00
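
The commit applies this in the Metal kernel; a CPU-style sketch of the same idea, assuming a NeoX-style frequency schedule, replaces a per-iteration pow() call with a running multiplication by a constant factor (the names are illustrative):

    #include <math.h>

    // Compute the rotation angles for position p without calling powf() in the loop:
    // theta_i = p * freq_base^(-2*i/n_dims) is obtained by repeated multiplication
    // with theta_scale = freq_base^(-2/n_dims).
    static void rope_angles(float * angles, int n_dims, float p, float freq_base) {
        const float theta_scale = powf(freq_base, -2.0f/n_dims);
        float theta = p;
        for (int i = 0; i < n_dims/2; ++i) {
            angles[i] = theta;   // previously: p * powf(freq_base, -2.0f*i/n_dims)
            theta    *= theta_scale;
        }
    }
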
Iwan Kawrakow
767bce7caf Typo 2024-06-25 19:17:14 +03:00
Iwan Kawrakow
753dbaeeb0 bitnet: remove iq1_bn lookup table storing +/- signs
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign-based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.

With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.
2024-06-25 18:19:11 +03:00
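
One reason a {0,1,2} table can stand in for a signed table: for ternary weights w = t - 1 with t in {0,1,2}, the dot product equals sum(t*q) - sum(q), and sum(q) only needs to be computed once per row. A scalar sketch of that identity (not the actual AVX2 kernel):

    #include <stdint.h>

    // Weights stored as t in {0,1,2}; the real ternary weight is t - 1.
    static int dot_ternary(const uint8_t * t, const int8_t * q, int n) {
        int sum_tq = 0, sum_q = 0;
        for (int i = 0; i < n; ++i) {
            sum_tq += t[i]*q[i];
            sum_q  += q[i];
        }
        return sum_tq - sum_q;   // equals the sum over i of (t[i] - 1)*q[i]
    }
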
Iwan Kawrakow
8b436a84c5 bitnet: simdify q8_K64 quantization on AVX
Doesn't make a real difference in performance.
2024-06-25 17:20:34 +03:00
Iwan Kawrakow
c906c4c4fe bitnet: NEON improvements for iq1_bn
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25 13:48:29 +02:00
Iwan Kawrakow
49bacf2288 bitnet: remove the now unused iq1bn_grid_u16 2024-06-25 12:41:43 +02:00
Iwan Kawrakow
7de9559cf2 Bitnet: adapt NEON and Metal to the alternative grid 2024-06-25 11:16:13 +02:00
Iwan Kawrakow
aa14a06b44 Bitnet: trying an alternative iq1_bn grid
Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertainties of 1-2 t/s. Same for PP: results jump
back and forth between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.
2024-06-25 11:32:48 +03:00
Iwan Kawrakow
cc44d4a5c3 bitnet: fix scalar dot product for 1.625 bpw
I had not adjusted after going to 4 q8 scales per row.
2024-06-25 08:31:12 +02:00
Iwan Kawrakow
3d61866f0a Bitnet: slightly faster 1.625 bpw variant for AVX512VL 2024-06-25 08:33:00 +03:00
Iwan Kawrakow
707d087927 Bitnet: tiny bit faster 1.625 bpw variant on Metal
We get 70.7 t/s for TG-128 vs 69.5 t/s before.
2024-06-24 16:42:30 +02:00