Commit Graph

3453 Commits

Author SHA1 Message Date
Iwan Kawrakow
a553eb191a Make the entire project c++17 2024-10-04 14:23:21 +03:00
Iwan Kawrakow
84ed711eec Slightly better 2024-10-04 14:18:44 +03:00
Kawrakow
0bf4d99774 Do not quantize activations if not necessary (#79)
* Do not quantize activations if not necessary

* Do not quantize activations if not necessary also for MoE models

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 11:22:57 +03:00
Kawrakow
ba392802ef q6_0: Slightly faster Zen4/AVX2 (#78)
* Faster q6_0 on AVX2

PP-512 goes up by 3.4%.

* q6_0: this is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 18:09:47 +03:00
Kawrakow
50b5e90112 Fused unary(x)*y (#70)
* Adding fused y*unary(x) op

* Fused y*unary(x) op: CUDA

* Fused y*unary(x) op: dedicated CPU implementation for silu and gelu

* Fused y*unary(x) op: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 17:05:56 +03:00
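For reference, a minimal scalar sketch of the fused y*unary(x) idea from #70, shown for silu and assuming contiguous float rows; the function name and layout are illustrative, not the actual ggml/iqk kernels:

```cpp
#include <cmath>
#include <cstddef>

// Fused y * silu(x) over contiguous float buffers: one pass, no intermediate
// tensor holding silu(x).  silu(x) = x / (1 + exp(-x)).
static void fused_mul_silu_f32(const float * x, const float * y, float * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float s = x[i] / (1.0f + std::exp(-x[i]));
        dst[i] = y[i] * s;
    }
}
```

The gain comes from skipping the separate unary op and its temporary tensor, which is what the dedicated CPU, CUDA and Metal paths above specialize.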
Kawrakow
cce49832c1 Adding Q6_0 (#77)
* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LLaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 15:22:13 +03:00
Kawrakow
d6909ed6f0 iq4_nl: faster quantization (#76)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 08:17:00 +03:00
Kawrakow
0999f77e5b Fix Q5_0 flash attention (#75)
When I changed iqk_mul_mat to use type-1 dot products for type-0
legacy quants, I forgot to also change the vec_dot_type when
the dot product is done via ggml as in flash attention.
This commit fixes it.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 15:52:35 +03:00
Iwan Kawrakow
970df4b467 Fix last commit
Did not re-check on AVX2/Zen4 after NEON related changes and,
sure enough, I broke AVX2/Zen4.
2024-10-01 14:48:44 +03:00
Kawrakow
e7f5a86a41 IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON) (#74)
* Be able to use IQ4_NL for KV cache on AVX2/Zen4

* Be able to use IQ4_NL for KV cache on ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 14:46:40 +03:00
Kawrakow
8457a26f83 CUDA: faster float -> iq4_nl conversion (#73)
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up
from 133.2 t/s.

* Speed up float -> iq4_nl conversion on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 12:28:29 +03:00
Kawrakow
c2ff4f936a iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 (#72)
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up
from 133.2 t/s.

* Fix AVX2

In addition to fixing iq4_nl, it seems I never adjusted the AVX2
implementation for iq2_tn after the block-scale removal.
This commit also fixes that.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 10:56:50 +03:00
Kawrakow
8cba4789da iqk_mul_mat: better strategy when nrc_y is not divisible by ny (#71)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 08:57:34 +03:00
Kawrakow
fd20638bbc Allow bf16 kv-cache (#69)
On the CPU I get the exact same PPL with and without FA
using bf16 for kv-cache. But on CUDA the bf16 kv-cache
result is about the same as the fp16 kv-cache CPU result,
so I'm missing some conversion somewhere.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-29 09:03:52 +03:00
Kawrakow
1b789c983a Time to fix replace_all (#68)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 17:59:47 +03:00
Kawrakow
7abcc6cc0b CUDA non-contiguous RoPE (#66)
In this way we can avoid the Q, K, V copies being made
after multiplication with the QKV tensor in, e.g., Phi-3.5-mini.
This results in a 6-7% speedup of PP-512(Phi-3.5-mini)
on CUDA (RTX-4080).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 17:41:21 +03:00
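A rough sketch of what applying RoPE to a non-contiguous view means, assuming one row per token and the interleaved pair convention; the row stride lets it operate directly inside a fused QKV buffer (illustrative code, not the CUDA kernel from #66):

```cpp
#include <cmath>
#include <cstddef>

// Apply rotary position embedding in place to n_rows rows of length n_dims
// (n_dims even), where consecutive rows are row_stride floats apart.  Because
// only the stride matters, the function can work on a Q or K view that lives
// inside a larger fused QKV tensor, so no contiguous copy is required.
static void rope_noncontig_f32(float * data, size_t n_rows, size_t n_dims,
                               size_t row_stride, int pos0,
                               float theta_base = 10000.0f) {
    for (size_t r = 0; r < n_rows; ++r) {
        float * row = data + r * row_stride;
        const float pos = float(pos0) + float(r);      // assumes one row per token
        for (size_t i = 0; i < n_dims; i += 2) {
            const float theta = pos * std::pow(theta_base, -float(i) / float(n_dims));
            const float c = std::cos(theta), s = std::sin(theta);
            const float x0 = row[i], x1 = row[i + 1];
            row[i]     = x0 * c - x1 * s;
            row[i + 1] = x0 * s + x1 * c;
        }
    }
}
```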
Kawrakow
737514fd81 Adding SWIGLU unary op (#65)
* Adding GGML_UNARY_OP_SWIGLU

This commit implements the ggml op and CPU compute
forward. I see ~3-4% speedup of PP-512 for Phi-3.5-mini.

* GGML_UNARY_OP_SWIGLU: CUDA implementation

I observe ~12% speedup for PP-512(Phi-3.5-mini).

* GGML_UNARY_OP_SWIGLU: Metal implementation

We get ~2% speedup for PP-512(Phi-3.5-mini).

* GGML_UNARY_OP_SWIGLU: minor improvement on Metal

* GGML_UNARY_OP_SWIGLU: cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 13:37:25 +03:00
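Unlike the two-input fused multiply sketched earlier, the unary SWIGLU op acts on a single tensor. A minimal sketch, under the assumption that each row holds the gate and up halves concatenated (layout and name are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// SwiGLU over one row that holds both halves of the combined gate/up
// projection, [g_0..g_{n-1} | u_0..u_{n-1}]:  dst_i = silu(g_i) * u_i.
static void swiglu_f32(const float * x, float * dst, size_t n) {
    const float * g = x;      // gate half
    const float * u = x + n;  // up half
    for (size_t i = 0; i < n; ++i) {
        const float s = g[i] / (1.0f + std::exp(-g[i]));
        dst[i] = s * u[i];
    }
}
```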
Kawrakow
1f61e91862 Better sub-3-bit quantization mixes with a qkv tensor (#64)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 08:17:19 +03:00
Kawrakow
6dec4af4b6 Adding ability to have meta data per tensor row (#61)
* POC: per row scale

This is a POC of how to work around opinionated ggml to
have scales per row rather than per block.
Only implemented for Zen4 and only for iq2_tn.

* POC per row scale: iq2_tn on NEON

* POC per row scale: iq2_tn on Metal

* Per row scale Metal templates

* iq1_tn: shrink to 1.625 bpw (NEON and Metal)

* POC per row scale: CUDA

* POC per row scale: add CUDA TODOs

There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.

* Per row scales - CUDA

The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.

* Update IQ1_TN and IQ2_TN bpw shown to user

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27 08:16:06 +03:00
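The row-size assumption the commit talks about can be illustrated with a small helper: with per-row meta data, a row is no longer just a whole number of blocks, which is what breaks the type_size * n_per_row / block_size shortcut (the struct and values below are made up for illustration, not the real iq2_tn/iq1_tn layouts):

```cpp
#include <cstddef>

// Bytes per tensor row for a block-quantized type, with an optional fixed
// per-row header (e.g. a single row scale) in front of the packed blocks.
struct quant_type_info {
    size_t block_size;       // elements per block
    size_t bytes_per_block;  // packed bytes per block
    size_t row_meta_bytes;   // 0 for classic types, >0 when a row scale is stored
};

static size_t row_size_bytes(const quant_type_info & t, size_t n_per_row) {
    return t.row_meta_bytes + (n_per_row / t.block_size) * t.bytes_per_block;
}
```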
Kawrakow
546f3ef349 Use fp32 for K*Q in Metal FA implementation (#62)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-25 13:08:55 +03:00
Iwan Kawrakow
be57912955 Minor 2024-09-19 11:14:53 +03:00
Kawrakow
12bbdb8ce7 Fix compiler warnings (#58)
* Fix C++ compilation warnings caused by ggml-common.h

* Disable c99-extensions warning

I get tons of those on macOS due to the arm_neon.h header.

* Disable c99-extensions warning only for APPLE

* Fix warnings in iqk_quantize.cpp

Also add GGML_ABORT when implementation is missing.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 14:31:29 +03:00
Kawrakow
4ee889f158 BF16 support on Metal (#56)
* BF16 support on Metal

* Faster BF16 Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 10:54:42 +03:00
Kawrakow
2874b98400 iqk_mul_mat(ARM_NEON): adding bf16 support (#41)
It looks like the ARMv8 ISA has support for bf16, but my M2 Max
does not have it, so I'm resorting to bf16 -> f32 conversion and
doing the computations in f32. This is 2x slower than f16, but 8x
better than what I get if I try to run a bf16 model on the M2
(NEON and Metal).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-16 16:47:36 +03:00
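The bf16 -> f32 fallback mentioned here is cheap because bf16 is just the top half of an IEEE-754 float; a self-contained sketch of the conversion:

```cpp
#include <cstdint>
#include <cstring>

// Widen a bf16 value to f32 by shifting its raw bits into the upper 16 bits
// of a 32-bit float.  This is the conversion used when the CPU lacks native
// bf16 arithmetic and the math is carried out in f32.
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = uint32_t(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```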
Iwan Kawrakow
20f3e6fd2d Minor 2024-09-15 12:59:14 +03:00
Kawrakow
6f11c95994 Adding bf16 support to CUDA (#40)
* Adding bf16 support to CUDA - matrix multiplications

* Adding bf16 support to CUDA - cleanup

* Adapt to latest master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 20:02:32 +03:00
Kawrakow
76be98fdec Improve Q5_0 performance (#55)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 19:47:26 +03:00
Kawrakow
064b99365c Improve Q4_0 and Q8_0 performance on AVX2/Zen4 (#54)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 13:53:50 +03:00
Kawrakow
43b934b19f Quantization mixes tweaks (#53)
* Some tweaks for i-quants

Improve Gemma2 PPL while reducing size

* Some tweaks for iq2_k and iq3_k

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 10:29:44 +03:00
Iwan Kawrakow
ec1cbc8884 Minor 2024-09-13 15:46:36 +03:00
Kawrakow
f853f6c6a5 Fix bug and D < 128 case for Q8_0 k-cache (#52)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-13 07:19:47 +03:00
Kawrakow
5017f8b3f0 Quantized Flash Attention for all supported CPU platforms (#51)
* NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1

* NEON Flash Attention: quantized K*Q for q4_0

I could finally take advantage of the matrix multiplication
templates. We get quite a bit of speedup that way for q4_0:
For Gemma-2b using mul_mat_qX_0_q8_0<DequantizerQ40, q_step>
results in PP-2048 = 287 t/s vs 268 t/s when converting the
q4_0 k-cache and Q to fp16 and using fp16 multiplication.

* NEON Flash Attention: quantized K*Q for q4_1

* NEON Flash Attention: quantized K*Q for q8_0

This makes quite a bit of difference:
For Gemma2-2b PP-8192 is 228 t/s with quantized K*Q vs
178 t/s when converting things to fp16 and using fp16
matrix multiplication.
We have PP-512 = 307 t/s, so PP-8192 is now ~75% of the
performance of PP-512. In contrast, llama.cpp with Q8_0
cache is 38% of PP-512.

* Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0

* AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0

* Tidy up FlashMS

* Delete no longer used stuff

With the use of quantized matrix multiplications for
the quantized k- and/or v-cache, we no longer need the
helper methods that load entire rows.

* Disallow mixing bf16 with other types for kv caches

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-12 19:03:20 +03:00
Kawrakow
c920195edd AVX2 Flash Attention 2 (#50)
* AVX2 Flash Attention: add ability to use Q8_0 for kv-cache

* AVX2 Flash Attention: add ability to use Q4_0 for kv-cache

* AVX2 Flash Attention: add ability to use Q4_1 for kv-cache

* Fix Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-11 19:55:42 +03:00
Kawrakow
d98a6753a6 ARM_NEON Flash Attention (#49)
* NEON Flash Attention - first working version

Simply reuse the Zen4/AVX2 implementation, but use
f16 for the K*Q multiplication and V*softmax(K*Q) accumulation.
This makes the FlashMS portion somewhat awkward because we
do not have fast f16 implementations for expf (and tanh when
softcap is enabled), so we need to convert back and forth
to f32.

FA is slightly faster than no-FA for the 4B TriLM model,
but slightly slower for Gemma-2b.

* NEON Flash Attention - convert Q to f16 before computing Q*K

* NEON Flash Attention - use fp32 for K*Q operations

Else I get wrong results for LLaMA-3.1-8B (but it works for
Gemma-2b).

* Delete commented out stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-11 10:26:49 +03:00
Kawrakow
72f5dfe12a AVX2 Flash Attention (#48)
* First version of AVX2 Flash attention

I simply took the Zen4 implementation and converted the
platform-specific stuff to methods of a struct providing
data loading/storing, conversions, multiply, add, etc.

Most likely not optimal, as the Zen4 strategy was designed
around having 32 512-bit registers, so we can keep 4x more
data in vector registers compared to AVX2 with its
16 x 256-bit registers.

It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.

* Fix Zen4 parts broken by the AVX2 change

* Try smaller q_step - no improvement

* Fix ARM_NEON

I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-10 19:17:04 +03:00
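The "struct providing data loading/storing, conversions, multiply, add" approach can be pictured like this: a scalar stand-in replaces the real AVX2/AVX-512 intrinsics, and the kernel is written once against the trait type (names are illustrative, not the actual iqk code):

```cpp
// One trait struct per platform; the flash-attention kernel only sees Ops.
struct ScalarOps {
    using vec_t = float;
    static constexpr int step = 1;                         // lanes per vector
    static vec_t load (const float * p)           { return *p; }
    static void  store(float * p, vec_t v)        { *p = v; }
    static vec_t madd (vec_t a, vec_t b, vec_t c) { return a * b + c; }  // a*b + c
};

// Platform-independent inner loop, instantiated with ScalarOps, Avx2Ops, ...
template <typename Ops>
static void accumulate_row(const float * x, const float * y, float * acc, int n) {
    for (int i = 0; i < n; i += Ops::step) {
        auto v = Ops::madd(Ops::load(x + i), Ops::load(y + i), Ops::load(acc + i));
        Ops::store(acc + i, v);
    }
}
```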
Kawrakow
d17d0c4426 iq2_tn: slightly better performance on AVX2 (#47)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-10 16:21:57 +03:00
Kawrakow
a1f7a03f50 IQ1_TN Metal implementation (#46)
* iq1_tn: Metal implementation

Requires changing the get_rows and matrix multiplication kernels
to use a dequantizer type rather than a dequantization function.
But once this is done, we can simply reuse the iq1_bn implementation.
This change will also allow adding other quantization types that
have meta data (such as a row scale) stored at the beginning of
a row (or changing existing quantization types to row-wise scales).

* Some cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-10 09:43:05 +03:00
Kawrakow
918ada20fa Add CUDA support for IQ1_TN (#45)
* iq1_tn: adding CUDA dequantize

* iq1_tn: adding CUDA dot product

* Delete commented out stuff

* Delete forgotten TODO

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09 21:17:17 +03:00
Kawrakow
8c86231f93 Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models

* iq1_tn: NEON

* iq1_tn: faster NEON

* iq2_bn: improve performance on NEON

We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!

* iq1_tn: improve AVX2

PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq1_tn: improve Zen4

PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq2_bn: improve on Zen4

We now get PP-512 = 614 t/s up from 542 t/s

* iq2_bn: improve AVX2 implementation

We now get PP-512 = 753 t/s up from 680 t/s.

* Remove unnecessary barrier in ggml_compute_forward_mul_mat

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09 14:56:34 +03:00
Kawrakow
bf4b19b474 iq2_tn: slightly faster PP (#43)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-08 12:41:44 +03:00
Kawrakow
6136a4b803 Adding fused rms_norm (#42)
* Fused rms_norm: works on the CPU

* Fused rms_norm WIP

* Fused rms_norm WIP

* Fused rms_norm WIP

* Fused rms_norm WIP

* Fused rms_norm WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-08 10:19:21 +03:00
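The commit body does not spell out what is fused; assuming the usual rms_norm + weight-multiplication fusion, the operation looks roughly like this single-pass sketch:

```cpp
#include <cmath>
#include <cstddef>

// dst = (x / rms(x)) * w in one op, instead of a normalization op followed
// by a separate element-wise multiplication with the norm weights.
static void fused_rms_norm_f32(const float * x, const float * w, float * dst,
                               size_t n, float eps = 1e-6f) {
    float sumsq = 0.0f;
    for (size_t i = 0; i < n; ++i) sumsq += x[i] * x[i];
    const float scale = 1.0f / std::sqrt(sumsq / float(n) + eps);
    for (size_t i = 0; i < n; ++i) dst[i] = x[i] * scale * w[i];
}
```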
Kawrakow
0087008d29 Add support for bf16 to iqk_mul_mat (#39)
* WIP: adding BF16 support to iqk_mul_mat

* Minor

* Improve TG speed (when not memory bound)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-05 07:48:27 +03:00
Kawrakow
7b1b2b2c06 Zen4 Flash Attention - bf16 support (#38)
* Zen4 Flash Attention: WIP bf16

* Zen4 Flash Attention: bf16 seems to be working

* Zen4 Flash Attention: improving bf16

* Zen4 Flash Attention: improving bf16

It is better (slightly faster) to first convert Q
to bf16 before processing each block of q_step rows.
This requires D*q_step*sizeof(bf16) bytes, at most
4 kB for the head sizes we support, so we can just
allocate on the stack instead of reserving and
passing a work buffer in ggml.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-05 07:46:47 +03:00
Kawrakow
f17d0d72f5 Performance improvements for legacy quants on ARM_NEON (#37)
* WIP: trying to improve legacy quants

* WIP: trying to improve legacy quants

With this commit PP-512 for LLaMA-3.1-8B goes from
72 t/s to 87.2 t/s for q4_0, and from 61.5 t/s to 73.9 t/s
for q4_1, so a 20+% improvement for both.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-04 07:24:04 +03:00
Kawrakow
8c94dcd433 Zen4 Flash Attention 2 (#36)
* Zen4 Flash Attention: WIP generalize to other types

Now loading of data from K and V is done via a template parameter,
so this should make it easy to generalize to types other than
F16 for the K and V cache.

* Zen4 Flash Attention: it works for q4_0 and q8_0

* Zen4 Flash Attention: small q8_0 performance improvement

* Zen4 Flash Attention: add q4_1

* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-04 07:20:55 +03:00
Kawrakow
9b53c2533f Fix Zen4 Flash Attention (#35)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-02 15:54:24 +03:00
Kawrakow
5518e24be8 Do not process prompts containing binary data for escapes (#33)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-02 09:18:48 +03:00
Kawrakow
dc023bc3be Zen4 Flash Attention (#32)
* Zen4 flash attention: moving useful parts from the kq_fused_softmax branch

* Add flash attention with soft-cap and fix D = 256 case

* Flash attention refinements

* Update FlashAttn comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-01 16:08:21 +03:00
Kawrakow
dbb1db9899 Fix build when iqk_mul_mat is disabled (#31)
Ref #29

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-31 09:11:42 +03:00
Kawrakow
c7e99c88a2 Faster Gemma2 (#27)
* soft_cap_max: initial CPU version of fused softcap + soft_max

With this vanilla CPU implementation I'm already getting a ~3% speedup
for Gemma-2-9b and a prompt of 8192 tokens.

* soft_cap_max: WIP - something is wrong with CUDA

* soft_cap_max: looks good on CPU and CUDA

* Add softcap to flash attention

Just CPU and CUDA for now (but, as we know, flash attention
on the CPU is useless in llama.cpp).

On CUDA this improves PP performance quite a bit, especially for
long contexts. E.g., for PP-16384, I now get 3777 t/s.
Without this change, one cannot use FA, and one gets 2300 t/s
(after fusing softcap and softmax), or 2000 t/s without the
fused softcap+softmax.

In comparison, mainline llama.cpp has PP-16384 = 1549 t/s before
PR-8542 (where Johannes Gaessler has also added softcap to FA),
and PP-16384 = 3097 t/s after this PR.

* soft_cap_max: Metal

* Flash attention with softcap: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-27 17:40:59 +03:00
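For reference, a scalar sketch of the fused softcap + softmax described in #27: the Gemma-2 style cap c*tanh(x/c) is applied while scanning the row for its maximum, so no separate capped tensor is materialized (illustrative only, not the ggml/CUDA kernels):

```cpp
#include <cmath>
#include <cstddef>

// In-place softcap followed by softmax over one row of attention scores.
static void softcap_softmax_f32(float * x, size_t n, float cap) {
    float max_val = -INFINITY;
    for (size_t i = 0; i < n; ++i) {
        x[i] = cap * std::tanh(x[i] / cap);   // softcap
        if (x[i] > max_val) max_val = x[i];
    }
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {          // exponentiate and accumulate
        x[i] = std::exp(x[i] - max_val);
        sum += x[i];
    }
    const float inv_sum = 1.0f / sum;
    for (size_t i = 0; i < n; ++i) x[i] *= inv_sum;  // normalize
}
```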