Files
ik_llama.cpp/ggml-cuda
Kawrakow 2881bdf220 iq1bn(no lookup): better version
We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16'th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we are dequantizing 8 quants there, and we have
different encoding for the 1st and 2nd group of 8 in a group of 16).

Ths achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:

| model            |       size |     params | backend    | threads |          test |              t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         pp512 |  9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         tg128 |    229.85 ± 0.33 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         pp512 |    322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         tg128 |     59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       8 |         tg128 |     57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       4 |         tg128 |     33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       2 |         tg128 |     18.30 ± 0.01 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         pp512 |    698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         tg128 |     68.88 ± 0.24 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         pp512 |    196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         tg128 |     51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       4 |         tg128 |     30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       2 |         tg128 |     16.89 ± 0.01 |

It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
2024-07-17 08:54:11 +03:00
..
2024-07-17 08:54:11 +03:00
2024-06-05 16:53:00 +02:00
2024-03-29 17:45:46 +02:00
2024-04-30 12:16:08 +03:00
2024-06-22 12:02:52 +03:00
2024-06-05 11:29:20 +03:00
2024-06-17 00:23:04 +02:00
2024-06-17 00:23:04 +02:00
2024-07-17 08:54:11 +03:00