Commit Graph

32 Commits

Author SHA1 Message Date
Iwan Kawrakow
f1fb59b44b iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
435eb9bdd3 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
08503cec7d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
977f94b3e0 Forgotten change 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4774788136 Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b3dfe9984b iq2_kt - even better
Re-quantize after determining block scales
(at the epxense of much longer quantization time).
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
36e9c922b8 iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group od 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
a4f1ac8da4 iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
2024-11-21 08:16:40 +02:00
Kawrakow
6b968f3894 Bitnet changes (#106)
* Adapting iq2_bn to work without separate scale tensors

Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I thnk it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

Dequantize works, but there is still something wrong
with the dot products.

* WIP

Absoolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Botnet ternary models
(and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (#98)"

This reverts commit f2d315b46f.
As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25 13:08:43 +02:00
Kawrakow
76b97c8064 Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 15:18:26 +03:00
Iwan Kawrakow
ff23008ed4 Minor iq3_k tweak 2024-10-14 18:13:11 +03:00
Kawrakow
910a134094 IQ2_KS: 2.1875 bpw non-linear quantization (#85)
* Experimenting

* iq2k: Try make_qx_quants for the scale

Slightly better for LLaMA-3.1, Gemma-2, slightly worse for
Qwen2.5

* iq2k with make_qx_quants: adjust scale

* iq2ks: basics

* iq2_ks: CUDA works

* iq2_ks: WIP

* iq2_ks: WIP

* iq2_ks: Zen4

* iq2_ks: AVX2

* iq2_ks: scalar dot product

* iq2_ks: ARM_NEON

* iq2_ks: Metal

* iq2_ks: faster Metal

LLaMA-3.1-8B:
PP-512 = 475.22 ± 0.37 t/s
TG-128 =  45.32 ± 0.03 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 13:34:30 +03:00
Kawrakow
b30c9e10d8 New SOTA quantization: 4.25 bpw IQ4_KS (#83)
* iq4_k_xxs: basics

* WIP + adding iq3_kl quantization mix

* iq4_xxs: this looks very viable compared to iq4_xs

At the same 4.25 bpw PPL is always better, for some models
significantly better. I'll rename to iq4_ks and keep it.

* iq4_xxs: CUDA dot product

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

* iq4_xxs: scalar CPU dot product

Also fix the breakage I caused with the dedicated work buffer
quantization portion when the multiplication is not done
via iqk_mul_mat.

* iq4_xxs: Zen4

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
Again the same mistake of packing int32_t back to int16_t,
which overflows occasionally (just occasionally, that's why the
result doesn't look completely wrong, so I didn't notice).

* Fix iq4_xs (Zen4)

* iq4_xxs: AVX2

* iq4_xxs: ARM_NEON

* iq4_xxs: Metal

* iq4_xxs: slightly faster TG on Metal

* iq4_xxs: rename to iq4_ks

After all, tt is a smaller variant of iq4_k.

* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09 12:54:40 +03:00
Kawrakow
fe36930c8b Move scale fudge factors to quantization (#81)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 16:16:01 +03:00
Kawrakow
6dec4af4b6 Adding ability to have meta data per tensor row (#61)
* POC: per row scale

This is a POC how to work around opinionated ggml to
have scales per row rather than per block.
Only implemened for Zen4 and only for iq2_tn.

* POC per row scale: iq2_tn on NEON

* POC per row scale: iq2_tn on Metal

* Per row scale Metal templates

* iq1_tn: shrink to 1.625 bpw (NEON and Metal)

* POC per row scale: CUDA

* POC per row scale: add CUDA TODOs

There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.

* Per row scales - CUDA

The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.

* Update IQ1_TN and IQ2_TN bpw shown to user

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27 08:16:06 +03:00
Kawrakow
12bbdb8ce7 Fix compiler warnings (#58)
* Fix C++ compilation warnings caused by ggml-common.h

* Disable c99-extensions warning

I get tons of those on macOS due to the arm_neon.h header.

* Disable c99-extensions warning only for APPLE

* Fix warnings in iqk_quantize.cpp

Also add GGML_ABORT when implementation is missing.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 14:31:29 +03:00
Kawrakow
8c86231f93 Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models

* iq1_tn: NEON

* iq1_tn: faster NEON

* iq2_bn: improve performance on NEON

We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!

* iq1_tn: improve AVX2

PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq1_tn: improve Zen4

PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq2_bn: improve on Zen4

We now get PP-512 = 614 t/s up from 542 t/s

* iq2_bn: improve AVX2 implementation

We now get PP-512 = 753 t/s up from 680 t/s.

* Remove unnecessary barrier in ggml_compute_forward_mul_mat

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09 14:56:34 +03:00
Kawrakow
dbb1db9899 Fix build when iqk_mul_mat is disabled (#31)
Ref #29

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-31 09:11:42 +03:00
Kawrakow
a73702d93b AVX2 quantization for Q8_K (#22)
It has been there for a while, but forgot to add here.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 15:33:27 +03:00
Iwan Kawrakow
f5d1af61d7 Fix Makefile
I always use cmake, so had forgotten to pay attention to the
Makefile.
2024-08-09 16:31:04 +02:00
Iwan Kawrakow
f0d7a0d53b Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c3f5e4d9a7 iq6_k: CUDA dequantize
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
a9b3f4a54b iq6_k: WIP (quantize/dequantize) 2024-08-09 16:00:31 +02:00
Iwan Kawrakow
cfb0410067 iq6_k: WIP (nothing works) 2024-08-09 16:00:31 +02:00
Kawrakow
a9f302ebe2 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B with any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn ant 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bintnet-3B quantized with iq2_bn.
Bintnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
Iwan Kawrakow
6901b3bf14 iq3_k, iq5_k: faster quantization
Just use the same trick as iq4_k
2024-08-05 07:18:18 +02:00
Iwan Kawrakow
e830f4a5f7 iq4_k: speedup quantization by a factor of ~2 2024-08-03 18:38:39 +02:00
Iwan Kawrakow
4f237d44f6 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
5d341757bc iq5_k: Basics
Quantize/dequantize, CUDA dequantize
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c85e139c68 iq2_k: Basics
Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00
Kawrakow
291066e6df IQ4_K: SOTA 4-bit quantization (#6)
* iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
  products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.

* iq4_k: TG now works on CUDA

* iq4_k: AVX512 implementation

For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.

* iq4_k: AVX2 implementation

For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.

* iq4_k: NEON implementation

For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

* iq4_k: Metal implementation

For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.

* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-28 12:11:59 +02:00
Kawrakow
154e0d75fc Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00