Commit Graph

3478 Commits

Kawrakow
cd96f6c4e5 Bitnet: use the fused mul-silu in the FFN network (#110)
I had forgotten that build_bitnet() does not use the standard
llm_build_ffn function, so the fused mul-silu didn't get used
for Bitnet when I added it to llm_build_ffn.

This gives us another ~1% speedup for TG-128.
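
For reference, the fused op computes y * silu(x) in a single pass; a minimal scalar sketch of the semantics (the real kernels are vectorized per backend):

```cpp
#include <cmath>

// Reference semantics of fused mul-silu: out[i] = y[i] * silu(x[i]) with
// silu(x) = x / (1 + exp(-x)). One fused pass avoids materializing the
// intermediate silu(x) tensor between two separate ggml ops.
static void fused_mul_silu_ref(const float * x, const float * y,
                               float * out, int n) {
    for (int i = 0; i < n; ++i) {
        const float s = x[i] / (1.0f + std::exp(-x[i]));
        out[i] = y[i] * s;
    }
}
```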

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 17:40:32 +02:00
Kawrakow
8ccd9bc7e5 Bitnet CUDA improvements (#109)
* iq1_bn: improve CUDA TG

On RTX-3080, TG-128(Bitnet-1.58b-3B) goes from 318 t/s to 340 t/s.
The front page still lists 301 t/s, so this is a pretty nice
improvement since then.

* iq2_bn(CUDA): quants are not 4-byte aligned

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 16:26:04 +02:00
Kawrakow
3b5fa426f1 Improve Bitnet PP on Metal (#108)
iq1_bn goes from 702 t/s to 716 t/s
iq2_bn goes from 714 t/s to 743 t/s

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 15:13:45 +02:00
Kawrakow
fdfbd98022 Faster IQ1_BN Metal implementation (#107)
* iq1_bn: faster Metal dot product

82 t/s -> 87.9 t/s

* iq1_bn(Metal): 87.9 -> 89.0 t/s for TG-128

* iq1_bn(Metal): 89.0 -> 94.7 t/s for TG-128

So, total improvement is ~15%. Not bad.

* iq1_bn(Metal): 686 -> 702 t/s for PP-512

* iq2_bn(Metal): 710 -> 714 t/s for PP-512

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-26 10:59:59 +02:00
Kawrakow
856376a9af Remove forgotten IQ1_TN, IQ2_TN enum values 2024-10-25 14:14:56 +03:00
Kawrakow
4b35340f45 Bitnet changes (#106)
* Adapting iq2_bn to work without separate scale tensors

Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I think it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

Dequantize works, but there is still something wrong
with the dot products.

* WIP

I absolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

My main motivation was to enable FA. But FA does not work anyway
because the head size is 100 for the Bitnet ternary models
(and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (#98)"

This reverts commit f2d315b46f.
As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25 13:08:43 +02:00
Kawrakow
b535dcd416 Fix quantized k-cache without FA (#105)
* Added Johannes' changes, still getting NaNs with quantized k-cache.

Also getting NaNs on Johannes' mainline branch.

* This fixes it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-24 12:20:30 +02:00
Kawrakow
33a582466d Add support for Granite and GraniteMoE models (#102)
* Add Granite and GraniteMoE models

* Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication
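
The NaNs presumably come from fp16 intermediates overflowing in the K*Q product when the attention scale is applied afterwards; a hedged sketch of the reordering (function and variable names are illustrative, not taken from the PR):

```cpp
#include "ggml.h"

// Sketch: fold the attention scale into Q *before* the K*Q matmul rather
// than scaling the K*Q product afterwards, so fp16 intermediates on CUDA
// stay in range. Names (build_kq, kq_scale) are illustrative.
static struct ggml_tensor * build_kq(struct ggml_context * ctx,
                                     struct ggml_tensor * K,
                                     struct ggml_tensor * Q,
                                     float kq_scale) {
    struct ggml_tensor * Qs = ggml_scale(ctx, Q, kq_scale); // scale first
    return ggml_mul_mat(ctx, K, Qs);                        // no post-scale
}
```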

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-22 17:28:14 +02:00
Kawrakow
0f3a424166 Enable q6_0 for flash attention (#101)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-22 11:34:49 +02:00
Kawrakow
7c5a91daf1 Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99)
* Enable IQ4_NL for V-cache in token generation

* We don't need these

* Update printout of allowed quantized KV-cache combinations

* Add IQ4_NL + IQ4_NL to FA

This is a better alternative than Q4_0 + Q4_0 for the VRAM-poor.

* Remove file added by mistake

* Fix typo, which is not really a bug

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-21 12:16:54 +02:00
agray3
d336410509 Avoid rebuild of GGML graph for each token (#98)
Introduces caching of the GGML graph to avoid an unnecessary full rebuild
between each token. KV cache parameters, which change with each token, are
updated directly in the cached GGML graph. Can be disabled with the
GGML_DISABLE_GRAPH_CACHING environment variable.
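
A hedged sketch of that control flow (all names are hypothetical; only the GGML_DISABLE_GRAPH_CACHING variable is from the PR):

```cpp
#include <cstdlib>

// All names below are hypothetical stand-ins; this only illustrates the
// control flow described in the commit, not the actual ik_llama.cpp code.
struct cgraph;                        // stand-in for ggml_cgraph
struct ctx {
    cgraph * cached_graph = nullptr;
    bool     shape_changed = false;   // e.g. batch size changed
};

cgraph * build_graph_full(ctx &);           // expensive: full rebuild
void     update_kv_params(cgraph *, ctx &); // cheap: patch positions/offsets

cgraph * get_graph(ctx & c) {
    const bool off = std::getenv("GGML_DISABLE_GRAPH_CACHING") != nullptr;
    if (off || !c.cached_graph || c.shape_changed) {
        c.cached_graph = build_graph_full(c);  // rebuild from scratch
    } else {
        update_kv_params(c.cached_graph, c);   // reuse the cached graph
    }
    return c.cached_graph;
}
```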
2024-10-20 08:36:16 +02:00
Kawrakow
b091a3513e Bitnet: make the scale tensors optional (#97)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-19 18:52:58 +02:00
Nexes the Elder
b94179b741 Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96)
* attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S

Pattern worth testing on more quants and on L3 8B.
PPL-512: -0.024 for 70B; -0.005 for 8B
Size: -640 MiB for 70B; -64 MiB for 8B

70B Q5_K_S now beats Q5_K_M by 0.012 PPL.

I suspect the same holds for L3, which was quite insensitive to attn_q quantization.

* indent
2024-10-19 17:24:43 +02:00
Kawrakow
a049537904 Attempt to blindly fix Windows build failure (#93)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-19 11:43:04 +02:00
Nexes the Elder
2b1af6bade CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)
To complement the token_embd.weight and output.weight:

attn_v.weight
attn_k.weight
attn_q.weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up
2024-10-18 09:48:15 +02:00
Kawrakow
f369c6f921 Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 15:18:26 +03:00
Kawrakow
a09de6eaef iq4_ks: faster dot product on Metal (#90)
TG-128(LLaMA-3.1-8B) goes to 52.5 t/s, up from 48.4 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 14:13:03 +03:00
Kawrakow
1882040c70 Minor iq3_k tweak 2024-10-14 18:13:11 +03:00
Kawrakow
250c325e7e iq3_k: fix and optimize Metal dot product (#87)
* iq3_k: fix Metal dot product

I was accessing the scales as 4-byte aligned, but iq3_k is
not 4-byte aligned. Instead of throwing an error (as happens
on CUDA when one makes this mistake), Metal silently accepts
the misaligned access and we get garbage (see the sketch after
this list).

* iq3_k: slightly faster Metal dot product
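
The class of bug in miniature, as a host-side C++ analogue (the real code is a Metal kernel):

```cpp
#include <cstdint>
#include <cstring>

// The hazard: p points at scale bytes with no 4-byte alignment guarantee
// (iq3_k rows are not 4-byte aligned). Casting to a wider type assumes
// alignment - CUDA traps on the misaligned load, Metal returns garbage.
static uint32_t load_scales_misaligned(const uint8_t * p) {
    return *(const uint32_t *)p;      // UB: assumes 4-byte alignment
}

// Alignment-safe variant: byte-wise copy, valid for any address.
static uint32_t load_scales_safe(const uint8_t * p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```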

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-14 10:46:41 +03:00
Kawrakow
f61bd33a04 Fix and optimize iq2k Metal implementation (#86)
* I somehow broke iq2_k on Metal? - fix dequantize

* I somehow broke iq2_k on Metal? - fix dot product

* iq2_k: optimize Metal dot product

42.6 t/s -> 46.2 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 14:30:30 +03:00
Kawrakow
67817fb5b9 IQ2_KS: 2.1875 bpw non-linear quantization (#85)
* Experimenting

* iq2k: Try make_qx_quants for the scale

Slightly better for LLaMA-3.1 and Gemma-2, slightly worse for
Qwen2.5.

* iq2k with make_qx_quants: adjust scale

* iq2ks: basics

* iq2_ks: CUDA works

* iq2_ks: WIP

* iq2_ks: WIP

* iq2_ks: Zen4

* iq2_ks: AVX2

* iq2_ks: scalar dot product

* iq2_ks: ARM_NEON

* iq2_ks: Metal

* iq2_ks: faster Metal

LLaMA-3.1-8B:
PP-512 = 475.22 ± 0.37 t/s
TG-128 =  45.32 ± 0.03 t/s
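
For scale: at 2.1875 bpw a 256-weight super-block occupies 256 × 2.1875 / 8 = 70 bytes. A natural split would be 64 bytes of 2-bit packed values plus 6 bytes of scales and other metadata; the 64 + 6 split is an assumption, but the 70-byte total follows directly from the bpw.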

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 13:34:30 +03:00
Kawrakow
c4c70af543 Minor: printf -> LLAMA_LOG_INFO 2024-10-11 12:49:47 +03:00
Kawrakow
6a16fe2f4e Better model info (#84)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-10 18:21:24 +03:00
Kawrakow
a10ccd65f3 New SOTA quantization: 4.25 bpw IQ4_KS (#83)
* iq4_k_xxs: basics

* WIP + adding iq3_kl quantization mix

* iq4_xxs: this looks very viable compared to iq4_xs

At the same 4.25 bpw, PPL is always better, for some models
significantly better. I'll rename it to iq4_ks and keep it.

* iq4_xxs: CUDA dot product

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

* iq4_xxs: scalar CPU dot product

Also fix the breakage I caused in the dedicated work-buffer
quantization path used when the multiplication is not done
via iqk_mul_mat.

* iq4_xxs: Zen4

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
It is the same mistake again: packing int32_t sums back into
int16_t, which overflows occasionally (only occasionally, which
is why the results don't look completely wrong and I didn't
notice; see the overflow sketch after this list).

* Fix iq4_xs (Zen4)

* iq4_xxs: AVX2

* iq4_xxs: ARM_NEON

* iq4_xxs: Metal

* iq4_xxs: slightly faster TG on Metal

* iq4_xxs: rename to iq4_ks

After all, it is a smaller variant of iq4_k.

* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs
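
Why the packing overflows, with the usual ranges: 4-bit quants span [-8, 7] and int8 activations span [-128, 127], so a sum over a 32-element sub-block can already reach 8 × 128 × 32 = 32768, one past INT16_MAX. A minimal demonstration:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    int32_t acc = 0;
    for (int i = 0; i < 32; ++i) acc += (-8) * (-128); // worst-case q4*q8 products
    printf("sum = %d, as int16_t = %d\n", acc, (int16_t)acc);
    // sum = 32768, as int16_t = -32768: an occasional sign flip on
    // worst-case inputs, so results look mostly right - which is
    // exactly why the bug went unnoticed.
    return 0;
}
```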

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09 12:54:40 +03:00
Kawrakow
6648952ed8 Fix compiler warnings 2024-10-04 16:17:36 +03:00
Kawrakow
65575488d9 Move scale fudge factors to quantization (#81)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 16:16:01 +03:00
Kawrakow
d2b53228f5 Move to c++17 projectwide (#80)
* Slightly better

* Make the entire project c++17

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 14:43:26 +03:00
Kawrakow
f1066edc4e Do not quantize activations if not necessary (#79)
* Do not quantize activations if not necessary

* Do not quantize activations if not necessary also for MoE models

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 11:22:57 +03:00
Kawrakow
b44d05dbe0 q6_0: Slightly faster Zen4/AVX2 (#78)
* Faster q6_0 on AVX2

PP-512 goes up by 3.4%.

* q6_0: this is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 18:09:47 +03:00
Kawrakow
4390096212 Fused unary(x)*y (#70)
* Adding fused y*unary(x) op

* Fused y*unary(x) op: CUDA

* Fused y*unary(x) op: dedicated CPU implementation for silu and gelu

* Fused y*unary(x) op: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 17:05:56 +03:00
Kawrakow
104e7e26c4 Adding Q6_0 (#77)
* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LLaMA-3.2-1B, a q6_0 K- and V-cache
gives about the same PPL as a q8_0 K-cache with a q4_0 V-cache,
while needing exactly the same RAM.
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 15:22:13 +03:00
Kawrakow
6dec4112a1 iq4_nl: faster quantization (#76)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 08:17:00 +03:00
Kawrakow
d2c74a369b Fix Q5_0 flash attention (#75)
When I changed iqk_mul_mat to use type-1 dot products for type-0
legacy quants, I forgot to also change the vec_dot_type when
the dot product is done via ggml, as in flash attention.
This commit fixes it.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 15:52:35 +03:00
Kawrakow
5e0614f9e5 Fix last commit
I did not re-check on AVX2/Zen4 after the NEON-related changes
and, sure enough, I broke AVX2/Zen4.
2024-10-01 14:48:44 +03:00
Kawrakow
6123e1700b IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON) (#74)
* Be able to use IQ4_NL for KV cache on AVX2/Zen4

* Be able to use IQ4_NL for KV cache on ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 14:46:40 +03:00
Kawrakow
7d9d275fdd CUDA: faster float -> iq4_nl conversion (#73)
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s, up
from 133.2 t/s.

* Speed up float -> iq4_nl conversion on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 12:28:29 +03:00
Kawrakow
ee274a4148 iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 (#72)
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s, up
from 133.2 t/s.

* Fix AVX2

In addition to fixing iq4_nl, it seems I never adjusted the AVX2
implementation of iq2_tn to the block-scale removal.
This commit fixes that as well.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 10:56:50 +03:00
Kawrakow
480a405a9c iqk_mul_mat: better strategy when nrc_y not divisible by ny (#71)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 08:57:34 +03:00
Kawrakow
cd7e7b6bbc Allow bf16 kv-cache (#69)
On the CPU I get exactly the same PPL with and without FA
using bf16 for the KV cache. But on CUDA the bf16 KV cache
result is about the same as the fp16 KV cache CPU result,
so I'm missing a conversion somewhere.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-29 09:03:52 +03:00
Kawrakow
f55789e50a Time to fix replace_all (#68)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 17:59:47 +03:00
Kawrakow
54b1c97878 CUDA non-contiguous RoPE (#66)
In this way we can avoid the Q, K, V copies that were being made
after the multiplication with the QKV tensor in, e.g., Phi-3.5-mini.
This results in a 6-7% speedup of PP-512(Phi-3.5-mini)
on CUDA (RTX-4080).
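
The shape of the change, sketched against ggml's public view API (sizes, offsets, and the mode constant are illustrative, not taken from the PR): once RoPE accepts non-contiguous inputs, Q and K can stay strided views into the fused QKV output instead of being copied into contiguous tensors first.

```cpp
#include "ggml.h"

// Sketch only: qkv is the output of the fused QKV matmul. With
// non-contiguous RoPE, Q can remain a strided view into qkv and be
// roped directly - no ggml_cont()/copy into a contiguous tensor.
static struct ggml_tensor * rope_q_view(struct ggml_context * ctx,
                                        struct ggml_tensor * qkv,
                                        struct ggml_tensor * positions,
                                        int n_embd_head, int n_head, int n_tokens) {
    struct ggml_tensor * q = ggml_view_3d(ctx, qkv,
            n_embd_head, n_head, n_tokens,
            ggml_element_size(qkv) * n_embd_head,  // nb1: stride between heads
            qkv->nb[1],                            // nb2: stride between tokens
            0);                                    // Q starts at offset 0
    return ggml_rope(ctx, q, positions, n_embd_head, 2 /* NeoX-style mode */);
}
```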

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 17:41:21 +03:00
Kawrakow
947a348990 Adding SWIGLU unary op (#65)
* Adding GGML_UNARY_OP_SWIGLU

This commit implements the ggml op and its CPU compute
forward. I see a ~3-4% speedup of PP-512 for Phi-3.5-mini
(a scalar sketch of the op follows this list).

* GGML_UNARY_OP_SWIGLU: CUDA implementation

I observe ~12% speedup for PP-512(Phi-3.5-mini).

* GGML_UNARY_OP_SWIGLU: Metal implementation

We get ~2% speedup for PP-512(Phi-3.5-mini).

* GGML_UNARY_OP_SWIGLU: minor improvement on Metal

* GGML_UNARY_OP_SWIGLU: cleanup
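
A scalar reference for what a SWIGLU unary op plausibly computes, assuming the input row stores the gate and up halves concatenated (an assumption on my part; Phi-3.5's fused ffn_up tensor suggests this layout). A sketch of the semantics, not the repo's kernel:

```cpp
#include <cmath>

// Assumed layout: each input row of length 2*n holds [gate | up].
// The op writes one row of length n: out[i] = silu(gate[i]) * up[i].
// Fusing this saves a split + unary + mul chain per FFN layer.
static void swiglu_row_ref(const float * x, float * out, int n) {
    const float * gate = x;
    const float * up   = x + n;
    for (int i = 0; i < n; ++i) {
        const float s = gate[i] / (1.0f + std::exp(-gate[i]));
        out[i] = s * up[i];
    }
}
```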

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 13:37:25 +03:00
Kawrakow
843de005d6 Better sub-3-bit quantization mixes with a qkv tensor (#64)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 08:17:19 +03:00
Kawrakow
733660accd Adding ability to have meta data per tensor row (#61)
* POC: per row scale

This is a POC of how to work around opinionated ggml to
have scales per row rather than per block (a layout sketch
follows this list).
Only implemented for Zen4 and only for iq2_tn.
* POC per row scale: iq2_tn on NEON

* POC per row scale: iq2_tn on Metal

* Per row scale Metal templates

* iq1_tn: shrink to 1.625 bpw (NEON and Metal)

* POC per row scale: CUDA

* POC per row scale: add CUDA TODOs

There are two places left in ggml-cuda.cu where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.

* Per row scales - CUDA

The only place left where unnecessary assumptions are made
is the Flash Attention code. As we are not using any per-row-scale
quants for the quantized KV cache, it should be OK for now.

* Update IQ1_TN and IQ2_TN bpw shown to user
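
One way to picture a per-row-scale layout (hypothetical structs, not the repo's): the row scale sits in front of the packed blocks, so the row size is no longer type_size * n_per_row / block_size, which is exactly the assumption flagged in the CUDA TODOs above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration of a per-row layout. A row of n weights is
// stored as one f16 scale followed by the packed blocks; the blocks
// themselves carry no scale fields anymore.
struct block_packed { uint8_t qs[64]; };   // 256 2-bit weights per block

static size_t row_size_with_row_scale(int n_per_row) {
    const int n_blocks = n_per_row / 256;
    return sizeof(uint16_t) /* f16 row scale */
         + (size_t)n_blocks * sizeof(block_packed);
}
```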

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27 08:16:06 +03:00
Kawrakow
1cdb6993ee Use fp32 for K*Q in Metal FA implementation (#62)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-25 13:08:55 +03:00
Kawrakow
15b231cb74 Minor 2024-09-19 11:14:53 +03:00
Kawrakow
d373900e99 Fix compiler warnings (#58)
* Fix C++ compilation warnings caused by ggml-common.h

* Disable c99-extensions warning

I get tons of those on macOS due to the arm_neon.h header.

* Disable c99-extensions warning only for APPLE

* Fix warnings in iqk_quantize.cpp

Also add GGML_ABORT when implementation is missing.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 14:31:29 +03:00
Kawrakow
bd4243bfbf BF16 support on Metal (#56)
* BF16 support on Metal

* Faster BF16 Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 10:54:42 +03:00
Kawrakow
fc8920282f iqk_mul_mat(ARM_NEON): adding bf16 support (#41)
It looks like the ARMv8 ISA has support for bf16, but my M2 Max
does not have it, so I resort to bf16 -> f32 conversion and
computation in f32. This is 2x slower than f16, but 8x better
than what I get if I try to run a bf16 model as-is on the M2
(NEON and Metal).
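
The conversion is cheap because bf16 is just the upper half of an IEEE-754 f32; a scalar version (on NEON one can widen four lanes at a time the same way with a 16-bit shift-left-long):

```cpp
#include <cstdint>
#include <cstring>

// bf16 keeps the sign, exponent, and top 7 mantissa bits of an f32, so
// converting to f32 is a 16-bit left shift into an empty low half.
static inline float bf16_to_f32(uint16_t h) {
    const uint32_t u = (uint32_t)h << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```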

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-16 16:47:36 +03:00
Kawrakow
2d532f85d6 Minor 2024-09-15 12:59:14 +03:00